Load Forecasting Tutorial (part 2): Exploratory Data Analysis
Updated: May 28, 2020
Welcome to the second part of the blog series about load forecasting. In this series of tutorials, I will guide you through the whole load forecasting workflow, from preparing the data to building a machine learning model. I will share many tips and tricks that I have found useful over time.
The dataset, together with a Jupyter notebook, is available here.
Exploratory data analysis (EDA) is one of the most important parts of a machine learning workflow, since it allows you to understand your data. According to Wikipedia, EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. This means looking at the data from various perspectives and trying to understand it. Later, you can use this knowledge to build better machine learning models through good feature engineering.
Our input data consist of the load, which is also the target variable that we want to predict, together with the temperature, which represents an external feature.
Input data description:
Load time series in MW for 5 years with an hourly resolution for a substation that supplies approx. 10,000 small consumers together with a few larger industrial consumers.
Temperature from the same area and for the same period of time.
Electricity load depends on three major factors:
Calendar (e.g. consumption depends on the hour of the day, day of the week and time of the year).
Weather (especially temperature, this is due to heating and cooling devices).
Economic growth (load generally grows every year).
In the following steps, I will explain the major visualizations that are crucial for understanding the data before machine learning modeling.
Plot no. 1: just do it!
First, just make a simple graph by plotting the load time series against time, and you will quickly get a sense of what is going on. This gives you an overview, and you can spot strange behavior such as temporary supply reconfigurations, spikes or drops right away. It is also useful to plot a 30-day moving average to observe seasonal variations and the trend.
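A minimal sketch of the moving-average step, assuming the load is held in a pandas Series with an hourly DatetimeIndex. The synthetic series and the names `load` and `load_ma30` are placeholders for illustration, not the tutorial's dataset or notebook code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly load series in MW (not the tutorial's dataset).
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
day_of_year = index.dayofyear.to_numpy()
hour = index.hour.to_numpy()
annual = 10 * np.cos(2 * np.pi * day_of_year / 365)  # higher load in winter
daily = 5 * np.sin(2 * np.pi * (hour - 6) / 24)      # swing within the day
load = pd.Series(50 + annual + daily + rng.normal(0, 1, len(index)),
                 index=index, name="load_mw")

# 30-day moving average; a time-based window needs a sorted DatetimeIndex.
load_ma30 = load.rolling("30D").mean()

# Both series share one index, so they can be drawn on the same time axis.
overview = pd.DataFrame({"load_mw": load, "ma_30d": load_ma30})
print(overview.describe().loc[["min", "mean", "max"]])
```

Calling `overview.plot()` (with a plotting backend installed) would draw the raw series and the smoothed curve together. The moving average hides the daily swing, so the seasonal shape and any trend are easier to see.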
Plot no. 2: dive in!
One mistake I often made at the beginning was that, once I had learned all these fancy plots in Python, I sometimes got stuck in them. I was more focused on coding and making fancy visualizations than on the actual data.
Therefore, I now try to dive in as soon as possible and explore the data manually, just to get a feeling for it, and only later make various kinds of fancy visualizations. I suggest using the Plotly library for exploring time series data plotted against time.
A typical weekly load profile for the analyzed substation is shown in the graph below. The load is usually higher during working days and lower during the weekends due to the mix of industrial and residential consumers.
Plot no. 3: it’s all about daily profiles!
Before modeling the day type in our machine learning model, it is crucial to understand daily load behavior. As I have mentioned before, the load depends on the hour of the day, the day of the week and the time of the year. To understand how the load behaves on a daily basis, I suggest plotting mean daily profiles separately for every day of the week and for every season (or month, depending on your data), as shown in the graphs below. This lets you observe how the days of the week differ from or resemble each other.
In the figures below, we can observe slightly different behavior between seasons, thus it will be crucial to incorporate this in our machine learning model. The load is generally higher in wintertime and the shape of the daily profiles also changes slightly, since in Slovenia a lot of people use natural gas for heating in winter and electricity for cooling in summer.
Next, it can be seen that Mondays have a lower load in the morning than other working days and that Saturdays and Sundays also differ from other days. Therefore, modeling Monday, Saturday and Sunday separately will be crucial, whereas the other working days can be grouped together.
We can also see that the load is low during the nighttime, when people sleep. During working days, a morning peak occurs at around 8 am, when people wake up and go to work. An evening peak occurs when people come home from work and start cooking, at 6 pm or 7 pm (daylight saving time is not taken into account, so the actual peak in summer occurs at 8 pm local time).
One interesting fact about this substation is that the load during working days is quite flat. This is due to the mix of bigger industrial consumers, which have a larger load during the daytime, and households, which have a larger load in the morning and the evening.
In general, the load is lower during the weekends, because most industrial consumers do not operate. This also leads to a different shape of the daily profile on weekends, which is not as flat during the day. Morning peaks during weekends occur later because people wake up later, and we can also see a bigger drop in the load in the middle of the day, since there are no industrial consumers drawing electricity.
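The mean daily profiles described above can be computed with a pivot table. This is only a sketch on synthetic data: the column names and the month-to-season mapping (meteorological seasons) are my assumptions, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic hourly load for one year (placeholder for the real series).
rng = np.random.default_rng(1)
index = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
hour = index.hour.to_numpy()
load = 50 + 5 * np.sin(2 * np.pi * (hour - 6) / 24) + rng.normal(0, 1, len(index))
df = pd.DataFrame({"load_mw": load}, index=index)

# Assumed mapping of months to meteorological seasons.
season_of_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}
df["season"] = [season_of_month[m] for m in index.month]
df["weekday"] = index.dayofweek.to_numpy()  # 0 = Monday, 6 = Sunday
df["hour"] = hour

# One mean 24-hour profile per (season, weekday) combination.
profiles = df.pivot_table(index="hour", columns=["season", "weekday"],
                          values="load_mw", aggfunc="mean")

# The seven winter profiles, one column per day of the week.
winter_profiles = profiles["winter"]
print(winter_profiles.round(1).head())
```

Each column of `winter_profiles` is one line in a daily-profile plot, so plotting the frame gives the seven weekday curves for that season on one chart.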
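One way to encode this grouping as a feature is sketched below. The function name `day_type` and the label strings are hypothetical, chosen only for illustration.

```python
import pandas as pd


def day_type(timestamps: pd.DatetimeIndex) -> pd.Series:
    """Label each timestamp as monday, tue_fri, saturday or sunday."""
    special = {0: "monday", 5: "saturday", 6: "sunday"}  # dayofweek: 0 = Monday
    weekday = pd.Series(timestamps.dayofweek.to_numpy(), index=timestamps)
    # Tuesday to Friday are not in the mapping, so they become NaN and are
    # filled with the shared label.
    return weekday.map(special).fillna("tue_fri")


# 2020-05-25 is a Monday, so this covers one full week.
week = pd.date_range("2020-05-25", periods=7, freq="D")
labels = day_type(week)
print(labels.tolist())

# One-hot encoding, a common input format for regression models.
dummies = pd.get_dummies(labels, prefix="day")
print(sorted(dummies.columns))
```

Keeping four day types instead of seven reduces the number of model parameters while preserving the differences visible in the profiles.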
Plot no. 4: load vs. temperature
Weather (especially temperature) is one of the major factors affecting the load, thus it is crucial to analyze it. First, I suggest plotting the load and temperature against time (plot below). To see the data clearly and observe seasonal variations, plot the weekly peak load and the mean weekly temperature instead of using the original resolution (1 hour).
Next, let’s look at the correlation in the figure below. The left side shows the load plotted against the temperature at hourly resolution, colored by day type. It can be seen that the correlation is quite poor, which is due to a different relationship between the load and the temperature during different hours. Thus, the data can be resampled to daily resolution in order to average out the variations between hours, which is shown on the right side of the figure. Now the double hockey stick can be seen, which is the consequence of different loads on working days and weekends at the same temperature. Throughout the winter there is a negative correlation between daily peak loads and mean daily temperatures due to heating. This changes in summer, when there is a positive correlation due to cooling.
The non-linear relationship between load and temperature is well known in practice. Here, I want to emphasize that when using Multiple Linear Regression for load forecasting, the load-temperature relationship has to be modeled with a 3rd-degree polynomial because, unlike a 2nd-degree polynomial, it is not symmetric. With a 3rd-degree polynomial, it is also possible to capture the saturation at the leftmost and rightmost ends (the lowest and the highest temperatures). When temperatures are low, the load usually increases, but below a certain threshold it starts to saturate. This happens because there is a finite number of heating devices in buildings. Therefore, below, let’s say, -10 degrees we turn everything on at maximum, and it doesn’t matter whether temperatures decrease even more, because we are unable to heat more.
Our machine learning model will have to predict the load for every hour, thus the correlation between the daily load and the daily temperature cannot be used (but it is still good for understanding the data). We have to investigate the correlation using an hourly resolution.
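A sketch of the weekly aggregation, assuming both series sit in one DataFrame with an hourly DatetimeIndex. The data are synthetic and the column names `load_mw` and `temp_c` are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperature and load (placeholders for the real data).
rng = np.random.default_rng(2)
index = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
day_of_year = index.dayofyear.to_numpy()
temp = 10 - 12 * np.cos(2 * np.pi * day_of_year / 365) + rng.normal(0, 2, len(index))
load = 60 - 0.8 * temp + rng.normal(0, 1, len(index))
df = pd.DataFrame({"load_mw": load, "temp_c": temp}, index=index)

# Weekly peak load and mean weekly temperature in one pass.
weekly = (
    df.resample("W")
      .agg({"load_mw": "max", "temp_c": "mean"})
      .rename(columns={"load_mw": "peak_load_mw", "temp_c": "mean_temp_c"})
)
print(weekly.head())
```

Since the two columns have different units, they are usually drawn on two y-axes when plotted against time.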
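To illustrate the difference between a 2nd-degree and a 3rd-degree fit, here is a small sketch on synthetic data. The load-temperature curve below is an assumption made up for the example: heating that saturates below -10 degrees and a weaker cooling effect above 20 degrees. It is not fitted to the substation data.

```python
import numpy as np

# Synthetic, asymmetric load-temperature relationship (illustrative only).
rng = np.random.default_rng(3)
temp = rng.uniform(-15, 35, 2000)
heating = 1.5 * np.clip(15 - temp, 0, 25)    # capped, so it saturates below -10
cooling = 0.8 * np.clip(temp - 20, 0, None)  # weaker slope on the warm side
load = 40 + heating + cooling + rng.normal(0, 1.5, temp.size)


def fit_rss(degree):
    """Least-squares polynomial fit; returns coefficients and residual sum of squares."""
    coeffs = np.polyfit(temp, load, degree)
    residuals = load - np.polyval(coeffs, temp)
    return coeffs, float(np.sum(residuals ** 2))


coeffs2, rss2 = fit_rss(2)
coeffs3, rss3 = fit_rss(3)
print(f"RSS, 2nd degree: {rss2:.0f}")
print(f"RSS, 3rd degree: {rss3:.0f}")
```

The 3rd-degree fit can never have a larger residual sum of squares than the 2nd-degree fit on the same data, because the quadratic is a special case of the cubic. How much it improves depends on how asymmetric the real curve is.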
Further, the load-temperature correlation is analyzed for every season, day type and hour separately.
The graph below shows the correlations for 9 pm. When observing the data for a chosen hour, the correlation is much stronger, which leads to better machine learning model performance if incorporated appropriately into the model.
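The effect can be reproduced on synthetic data. In this sketch the load has a strong daily profile plus a linear heating response to temperature; both are assumptions for the example. Grouping only by hour is a simplification, and the same pattern extends to season and day type by adding more grouping keys.

```python
import numpy as np
import pandas as pd

# Synthetic data: daily load profile plus a linear (heating-only) temperature effect.
rng = np.random.default_rng(4)
index = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
hour = index.hour.to_numpy()
day_of_year = index.dayofyear.to_numpy()
temp = (5 - 8 * np.cos(2 * np.pi * day_of_year / 365)
        + 3 * np.sin(2 * np.pi * (hour - 9) / 24)
        + rng.normal(0, 2, len(index)))
load = (50 + 15 * np.sin(2 * np.pi * (hour - 6) / 24)
        - 0.8 * temp + rng.normal(0, 1, len(index)))
df = pd.DataFrame({"load_mw": load, "temp_c": temp}, index=index)

# Correlation over all hours pooled together.
pooled = df["load_mw"].corr(df["temp_c"])

# Correlation computed separately for each hour of the day.
corr_by_hour = pd.Series(
    {h: group["load_mw"].corr(group["temp_c"])
     for h, group in df.groupby(df.index.hour)}
)

print(f"pooled correlation: {pooled:.2f}")
print(f"correlation at 21:00: {corr_by_hour[21]:.2f}")
```

In this synthetic setup the daily profile dilutes the pooled correlation, while each single hour shows the underlying temperature dependence much more clearly.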
In this blog, I have guided you through the most essential visualizations that should be made in order to understand the data before machine learning modeling. You can try exploring further the correlation between the load and a moving average of the temperature, such as the average temperature of the last day or the last week. Low temperatures lasting for longer periods are usually the main cause of the annual peak load. Explore it further and let me know in the comments below what you have learned.
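As a starting point for that exploration, the moving-average temperatures can be built with time-based rolling windows. This is a sketch on synthetic data, and the feature names `temp_avg_1d` and `temp_avg_7d` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperature (placeholder for the real series).
rng = np.random.default_rng(5)
index = pd.date_range("2015-01-01", periods=24 * 60, freq="h")
temp = pd.Series(rng.normal(0, 5, len(index)), index=index, name="temp_c")

# Trailing averages over the last day and the last week. Each window ends at,
# and includes, the current hour.
features = pd.DataFrame({
    "temp_c": temp,
    "temp_avg_1d": temp.rolling("1D").mean(),
    "temp_avg_7d": temp.rolling("7D").mean(),
})
print(features.tail(3))
```

These columns can then be correlated with the load in the same way as the raw temperature, for example with `features.corrwith(load)` once a load series on the same index is available.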
The next blog is about developing a load forecasting benchmark and understanding the intuition behind Multiple Linear Regression.
If you find this blog useful, please share it with others and let me know your thoughts on LinkedIn!
If you want to connect with other experts working in this field, join my LinkedIn group AI in Smart Grids where I post about this topic!