[Prevision.io R SDK] Basic data ingestion

 
by Florian Laroumagne

Hey folks, if you are here, you already have Prevision.io’s R SDK installed and are ready to go. If not, please check our first blog post about setting up your environment. 

In this blog post, we are going to see how we can easily push local data into Prevision.io thanks to the R SDK. The first thing to consider is that datasets, just like any resource involved in a Machine Learning project, belong to - guess what - a project! 🙃

 

Authentication to Prevision.io’s instance

First, we will make sure that you have loaded the SDK and established the connection to your Prevision.io instance. To do so, please type:

library(previsionio)
pio_url = "https://<INSTANCE_NAME>.prevision.io"
pio_tkn = "<TOKEN>"
pio_init(pio_tkn, pio_url)

Replace <INSTANCE_NAME> and <TOKEN> with the appropriate values. If you’re not sure what this means, feel free to refresh your memory with the first blog post in this series.

 

Project creation

Once done, we can create our project. Let’s name it “Electricity Forecast”. To do so, type the following:

project = create_project(name = "Electricity Forecast", description = "R SDK Demo project")

We can verify that everything is fine by going to Prevision.io’s UI or by using the get_projects() function.
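
For instance, here is a quick way to check from R; this sketch assumes get_projects() returns a list of project objects exposing the same name and `_id` fields we use elsewhere in this post:

projects = get_projects()
# Print the name of each project visible from your account
sapply(projects, function(p) p$name)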

Fresh project created

If, by any chance, you want to share your project with someone on your Prevision.io instance, feel free to do it! We do offer collaboration capabilities and rights management.

To share your project with a mate, simply type the following:

create_project_user(project_id = project$`_id`,
                    user_mail = "email@email.com",
                    user_role = "admin")

Make sure to fill in your mate’s email and specify the user_role from the following choices:

  • admin, for complete access

  • contributor, read & write access, but can’t demote admins

  • viewer, read-only access

     

Dataset import

As of now, the project is totally empty; we need to fill it with some data to move forward. To make this tutorial easier, I have already prepared a training and a validation (holdout) dataset for you. Here they are:

  • Training dataset: link

  • Validation dataset: link
     

The training dataset covers the electricity consumption of France at a 30-minute time step, starting on 2014-01-01 and ending on 2020-12-31. The validation dataset starts on 2021-01-01 and ends on 2021-09-30.

Each dataset has 7 features:

  • TS, the time stamp

  • TARGET, the actual consumption of electricity (in MW)

  • PUBLIC_HOLIDAY, boolean, 1 if (French) public holiday, 0 otherwise

  • TEMPERATURE, mean of temperature across France in °C

  • LAG_1, 1-day lag of the TARGET value

  • LAG_7, 7-day lag of the TARGET value

  • fold, a technical identifier used for the cross-validation strategy, based on the year of TS
     

Because this kind of use case is sensitive to temperature as well as to special days, we have a good starting point here, even if we could gather more features to obtain a better model. The point of this tutorial is to keep things easy 🤓 (even if the final app I’ll showcase is based on a slightly more complex model with more features involved).

So, what should you do with these datasets? Well, you can first load them into your R environment using your favorite library (I love data.table personally) and explore them. For instance, here is a very quick plot I made with plotly that displays the electricity consumption over time on the holdout dataset:

library(data.table)
library(plotly)

train = fread("C:/Users/Florian/Desktop/elec_train.csv") # Make this match the filepath
valid = fread("C:/Users/Florian/Desktop/elec_valid.csv") # Make this match the filepath

# Plot the actual consumption over time on the validation set
plot_ly(valid, x = ~ TS) %>%
  add_trace(
    y = ~ TARGET,
    name = 'ACTUAL CONSUMPTION',
    type = 'scatter',
    mode = 'lines',
    line = list(color = '#19293C', width = 1),
    showlegend = TRUE
  ) %>%
  layout(
    xaxis = list(
      title = "Time",
      gridcolor = 'rgb(255,255,255)',
      zeroline = FALSE
    ),
    yaxis = list(
      title = "Consumption (MW)",
      gridcolor = 'rgb(255,255,255)',
      zeroline = FALSE
    ),
    legend = list(x = 0.92, y = 0.1),
    title = 'Electricity consumption (MW) in France on the validation data'
  )

Electricity consumption (MW) in France on the validation data
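
Beyond plotting, a couple of quick console checks can confirm that the data matches the description above; a small sketch using base R:

str(train)             # 7 columns: TS, TARGET, PUBLIC_HOLIDAY, TEMPERATURE, LAG_1, LAG_7, fold
summary(valid$TARGET)  # consumption summary (in MW) on the holdout
range(train$TS)        # should span 2014-01-01 to 2020-12-31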

I’ll let you play around with them for a bit, but if you want to leverage the Prevision.io platform, then you will need to import the data into the freshly created project. To do so, execute the following:

pio_train = create_dataset_from_dataframe(project_id = project$`_id`,
                                          dataset_name = "train",
                                          dataframe = train,
                                          zip = TRUE)


pio_valid = create_dataset_from_dataframe(project_id = project$`_id`,
                                          dataset_name = "valid",
                                          dataframe = valid,
                                          zip = FALSE)

This function will, as its name suggests, create a dataset in Prevision.io from an R data frame loaded in memory, with some options:

  • project_id, the id of the project the dataset should belong to. Keep in mind that if you have forgotten the project_id, you can still retrieve it thanks to the get_project_id_from_name() function, which takes a project’s name and returns the corresponding id (see the sketch after this list) 🤓

  • dataset_name, the desired name of the dataset you are importing

  • dataframe, the R data frame to upload

  • zip, a boolean that, if TRUE, zips the data frame before sending it
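
As promised, here is a minimal sketch of recovering a lost project id from its name; I pass the name positionally because the exact argument name is an assumption on my part:

# Recover the id of the project we created earlier by its name
project_id = get_project_id_from_name("Electricity Forecast")
identical(project_id, project$`_id`)  # should be TRUE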
     

After a while, you will see in Prevision.io’s UI that your project has 2 fresh datasets. Please note that this step will take some time to complete because:

  • The data frame is compressed locally before being uploaded to the server (this can be disabled by setting the zip argument to FALSE)

  • The zip will then be uploaded to the platform (time will depend on your connection, hence zipping might be a good idea for big datasets, especially if you have slow internet speeds)

  • Datasets in Prevision.io are parsed and automatically analysed, and statistics on them are pre-computed. This step is clearly the longest one, especially for big datasets
     

Alternatively, we could have used the create_dataset_from_file() function if we wanted to avoid loading the dataset into the R environment first, which again is convenient for high-volume data.
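
Here is a minimal sketch of that route, assuming create_dataset_from_file() accepts a file path alongside the project id and dataset name (the file_path argument name is my guess, so double-check the SDK documentation):

# Upload the training CSV straight from disk, skipping the in-memory load
pio_train = create_dataset_from_file(project_id = project$`_id`,
                                     dataset_name = "train",
                                     file_path = "C:/Users/Florian/Desktop/elec_train.csv")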

Imported & parsed datasets into my own environment

If you want to see this list directly from your R environment, call the following function:

ds = get_datasets(project_id = project$`_id`)
 

One last thing to keep in mind: if you want to retrieve information about a specific dataset, you can get it with the get_dataset_info() function, which expects a dataset_id. That id can be obtained through get_datasets() [see above] or, more easily, through the get_dataset_id_from_name() function.
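
Putting both together, a quick sketch (arguments passed positionally, since the exact argument names are an assumption on my part):

# Look up the "train" dataset by its name, then fetch its details
train_id = get_dataset_id_from_name(project$`_id`, "train")
info = get_dataset_info(train_id)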
 

Now that the datasets are imported & parsed into Prevision.io, you can access statistics directly from the UI, or just move on to the next blog post in this series, in which we will build Machine Learning models on them 🧐


 

ABOUT PREVISION.IO

Prevision.io brings powerful AI management capabilities to data science users so more AI projects make it into production and stay in production. Our purpose-built AI Management platform was designed by data scientists for data scientists and citizen data scientists to scale their value, domain expertise, and impact. The platform manages the hidden complexities and burdensome tasks that get in the way of realizing the tremendous productivity and performance gains AI can deliver across your business.

If you want to find the other articles of this series: 

- Part 1 - Introduction & Setup
- Part 2 - Basic Data Ingestion
- Part 3 - Experiment tracking using AutoML
- Part 4 - Model Deployment
- Part 5 - Pipelines Overview
- Part 6 - Apps Deployment
- Part 7 - Model Lifecycle & Conclusion


 


Follow us on Twitter! Follow us on LinkedIn! Follow us on Instagram!