Load Forecasting Tutorial (part 1): Data Preparation

Welcome to the first part of the blog series about load forecasting. In this series of tutorials, I will guide you through the whole load forecasting workflow, from preparing the data to building a machine learning model. Along the way I will share a lot of tips and tricks that I have found useful over time.

The tutorial consists of the following blogs:

Data preparation
Exploratory Data Analysis
How to develop a benchmark model?
How to evaluate model performance?
How to improve a benchmark model?
How to develop a Neural Network model?

A common rule of thumb says that about 80 % of the time in data science projects is spent on data preparation and only 20 % on machine learning modeling. I think this depends heavily on the industry, but for the electrical energy industry I can confidently say that it is true. A lot of utilities around the world have still not implemented data processing systems. This means that you will have to gather the data first (usually from different data sources) and clean it before using it.

These are common anomalies found in energy datasets:

Strange behavior of the load (e.g. due to temporary supply reconfiguration)
Outliers
Missing values
Duplicated timestamps
Missing timestamps
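The last three anomalies in the list are mechanical enough to check in code. Here is a minimal sketch with pandas, assuming an hourly series. The column names `timestamp` and `load_mw` and the sample values are made up for illustration. It only flags the problems and does not decide how to fix them.

```python
import pandas as pd

# Hypothetical hourly load readings with one duplicated timestamp,
# one missing timestamp (03:00) and one missing value (02:00).
raw = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 00:00",
                "2020-01-01 01:00",
                "2020-01-01 01:00",  # duplicated timestamp
                "2020-01-01 02:00",
                "2020-01-01 04:00",  # 03:00 is absent
                "2020-01-01 05:00",
            ]
        ),
        "load_mw": [10.0, 11.0, 11.0, None, 12.5, 13.0],
    }
)

# Duplicated timestamps: rows that repeat an earlier timestamp.
duplicated = raw[raw["timestamp"].duplicated(keep="first")]

# Keep the first record per timestamp and index the series by time.
series = (
    raw.drop_duplicates(subset="timestamp", keep="first")
    .set_index("timestamp")["load_mw"]
)

# Missing timestamps: compare against the full expected hourly range.
expected = pd.date_range(
    series.index.min(), series.index.max(), freq=pd.Timedelta(hours=1)
)
missing_timestamps = expected.difference(series.index)

# Missing values: timestamps that exist but carry no reading.
missing_values = series[series.isna()].index

print("duplicated rows:", len(duplicated))
print("missing timestamps:", list(missing_timestamps))
print("missing values at:", list(missing_values))
```

Keeping the first record of a duplicated timestamp is just one possible choice here. Whether that is right for your data depends on why the duplicates exist.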

The figure below shows an example of a supply reconfiguration between March 15 and March 23. There are various causes for this such as disconnection of a larger industrial consumer, reconfiguration of a feeder supply (from one substation to another substation), etc.

In the figure below you can see a spike (outlier) and an example of missing observations.

I will not go into the details of how to handle all the aforementioned anomalies, because it depends on a lot of factors and there is no general rule on how to do it (maybe I will write a detailed blog about outlier detection and missing value imputation later).

One thing that I have to emphasize is: Beware of data leakage.

Data leakage occurs when you use knowledge from the “future” (the test set, which represents unseen data) to create the model. This results in outstanding model performance on the test set, while the performance in a production environment will be worse.

If you clean your whole dataset first and then apply machine learning, you are cheating! Why? Any cleaning step that learns something from the data (for example an imputation value or an outlier threshold) has then already seen the test set. Data processing has to be embedded in a pipeline together with the machine learning algorithms. This pipeline has to be built on the train set and evaluated on the test set. It sounds obvious, but a lot of people make this mistake without even noticing it.
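To make this concrete, here is a small sketch using a scikit-learn pipeline. Everything in it is an assumption for illustration: the data is synthetic, the single `temperature` feature is hypothetical, and median imputation with linear regression stands in for whatever cleaning and model you actually use. The point is only that the imputation statistic is learned from the train set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data: a hypothetical temperature feature and a load target.
n = 200
temperature = rng.normal(10.0, 5.0, size=n)
load = 50.0 - 1.5 * temperature + rng.normal(0.0, 1.0, size=n)

# Knock out some feature values to mimic missing observations.
X = temperature.reshape(-1, 1).copy()
X[rng.choice(n, size=20, replace=False), 0] = np.nan

# Chronological split: first 80 % for training, last 20 % for testing.
split = int(n * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = load[:split], load[split:]

# Imputation is a pipeline step, so its median is computed during fit()
# from the train set only and merely applied to the test set.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)

test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {test_mae:.2f}")
```

The leaky version would compute the median on all of `X` before splitting. With this toy data the difference is tiny, but with real load data and more aggressive cleaning it can be large.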

This can happen faster than you think, especially when you are working with people in the industry who are trying to help you but do not have experience with building machine learning models.

Remember, people tend to overfit!

The forecasting workflow consists of the following steps:

Gathering the data
Data preparation
Exploratory data analysis
Feature engineering
Tuning machine learning algorithms
Evaluating the performance
Putting the model into the production environment

A tip that will make your life easier

First focus on coding the whole workflow: gathering and preparing the data, applying simple machine learning algorithms and evaluating the performance (more about general machine learning workflow here). This will give you a bigger picture of your problem and let you focus on the things that seem most important for improving the performance, whether this is data preparation, tuning machine learning algorithms or something else.

At the beginning of your project use simple data preparation (such as dropping missing or anomalous observations) and basic algorithms – make predictions as soon as possible. Another very useful approach is to analyze train and validation errors to spot if something strange is going on.
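As a rough sketch of what such a first pass could look like, the example below drops missing observations, fits a very basic model (the average load per hour of day) and compares train and validation errors. The data is synthetic and the name `load_mw` and the cutoff date are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic hourly load with a daily pattern (illustrative only).
rng = np.random.default_rng(42)
index = pd.date_range("2020-01-01", periods=24 * 28, freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()
load = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, len(index))
series = pd.Series(load, index=index, name="load_mw")
series.iloc[[50, 51, 300]] = np.nan  # simulated missing observations

# Simple data preparation: drop missing observations.
clean = series.dropna()

# Chronological split: first three weeks for training, last week for validation.
cutoff = pd.Timestamp("2020-01-22")
train = clean[clean.index < cutoff]
valid = clean[clean.index >= cutoff]

# Basic model: average load per hour of day, learned on the train set only.
hourly_profile = train.groupby(train.index.hour).mean()


def predict(idx):
    """Look up the learned hourly average for each timestamp in idx."""
    return hourly_profile.reindex(idx.hour).to_numpy()


train_mae = np.mean(np.abs(train.to_numpy() - predict(train.index)))
valid_mae = np.mean(np.abs(valid.to_numpy() - predict(valid.index)))
print(f"train MAE: {train_mae:.2f}, validation MAE: {valid_mae:.2f}")
```

If the validation error is far above the train error, or either one is surprisingly low, that is usually a hint to go back and look at the data or at the split before trying a more complex model.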

The main suggestion: do not spend too much time on any of the aforementioned steps at the beginning, until you find out which parts are most important. Then go back to the identified bottlenecks and improve what seems most important. Remember that machine learning modeling is an iterative process.

Conclusion

In this blog, I didn’t go into the details of data preparation. Instead, I wanted to point out a few topics that I find very important to understand in order to build a good data preparation pipeline.

The next blog is about Exploratory Data Analysis, which is a crucial topic, since it enables a better understanding of the data and consequently better feature engineering in machine learning modeling.

If you find this blog useful, please share it with others and let me know your thoughts on LinkedIn!

If you want to connect with other experts working in this field, join my LinkedIn group AI in Smart Grids where I post about this topic!