Welcome to the Virtual Sensor example

One of the most interesting industrial AI solutions is virtual sensing. It can be seen as the beginning of a digital twin: if you can predict behavior virtually in more and more respects, you can build up your digital twin step by step. The resulting applications and benefits are many, including shorter time-to-market cycles, product cost savings, reduced maintenance risks, etc.

Within this example, I will investigate a dataset shared here. The dataset contains 11 variables of a gas turbine, collected over five years. The data is provided at an hourly sampling rate as hourly averages or sums.

Data Exploration

First of all, I want to get an overview of the dataset, so I take a look at the statistics of the entire dataset. For that, I concatenated the yearly folders into one big table and added a Year variable:

Image_Caption
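As a reference, here is a minimal sketch of how such a concatenation could look in pandas. The file names (gt_2011.csv, ..., gt_2015.csv) are an assumption about the dataset layout:

```python
import pandas as pd

# Assumed layout: one CSV file per year, e.g. gt_2011.csv ... gt_2015.csv
years = [2011, 2012, 2013, 2014, 2015]
frames = []
for year in years:
    df = pd.read_csv(f"gt_{year}.csv")
    df["Year"] = year  # add the Year variable
    frames.append(df)

data = pd.concat(frames, ignore_index=True)

# Overview statistics of the entire dataset
print(data.describe())
# Sanity check: count missing entries per variable
print(data.isna().sum())
```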

The provided dataset is very clean and well prepared. It contains no missing (NaN) or infinity entries, and all 11 variables have the same length. Let's explore the variables further. To understand each variable and its unit, we can read the dataset provider's documentation, where we find the following abbreviations and units:

Abbreviation   Variable                         Unit
AT             Ambient temperature              °C
AP             Ambient pressure                 mbar
AH             Ambient humidity                 %
AFDP           Air filter difference pressure   mbar
GTEP           Gas turbine exhaust pressure     mbar
TIT            Turbine inlet temperature        °C
TAT            Turbine outlet temperature       °C
TEY            Turbine energy yield             MWh
CDP            Compressor discharge pressure    mbar
CO             Carbon monoxide                  mg/m^3
NOX            Nitrogen oxides                  mg/m^3

The following image shows where each sensor is placed.

Image_Caption

Furthermore, it is noticeable that the measurements do not have a date, but from the data description, it is clear that the data are in chronological order. However, the length of each year does not match 8760 hours (365 days x 24 h), so I cannot reconstruct the DateTime retrospectively. On average, 1412 hours (about 59 days) of data per year are missing. I could not find any explanation for that, but in a real-world problem this is crucial information to consider when building valid models. Still, the seasonal effects are clearly visible in ambient temperature, both globally (summer/winter) and locally (day/night):
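The missing hours can be quantified with a quick check; a small sketch, reusing the `data` frame from above:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

# Samples per year compared against a full year of hourly data
counts = data.groupby("Year").size()
missing_hours = HOURS_PER_YEAR - counts
print(missing_hours)                 # missing hours per year
print(missing_hours.mean() / 24)     # average missing days per year (~59)
```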

Image_Caption

Image_Caption

Now let's compare the different years with each other. Since we want to train models for the CO and NOX emissions, let's discuss the following pictures first:

Image_Caption

Image_Caption

In these diagrams, you can see the median, the 25% quantile, and the 75% quantile of CO/NOX over the years. Ideally, we wouldn't see a clearly distinguishable difference between the years. It is wise to split the train and test datasets so that the test set is not significantly different from the train set. In particular, the NOX production dropped significantly from 2013 to 2014: most of the orange and blue boxes of 2014-2015 lie almost entirely below those of the earlier years, and the median value dropped from around 67 mg/m^3 to 58 mg/m^3. A similar effect can be seen in the ambient humidity in 2015 (see appendix). Therefore, we will use 2011-2015, excluding 2012, as the train set, and 2012 alone as the manual test set; a minimal split is sketched below.
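A minimal sketch of this year-based split, assuming the concatenated `data` frame from above:

```python
# Hold out 2012 as the manual test set; train on 2011 and 2013-2015
train = data[data["Year"] != 2012].copy()
test = data[data["Year"] == 2012].copy()

features = ["AT", "AP", "AH", "AFDP", "GTEP", "TIT", "TAT", "TEY", "CDP"]
targets = ["CO", "NOX"]

X_train, y_train = train[features], train[targets]
X_test, y_test = test[features], test[targets]
```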

Distributions:

None of the variables is normally distributed; CO and NOX in particular are highly skewed.

Image_Caption

After performing a Box-Cox transformation, the variables look much more like normal distributions, which allows us to apply a wider range of algorithms and models. As an example, see the following distributions:

Image_Caption
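A minimal sketch of the transformation with SciPy; the shift is an assumption to guard against non-positive values, since Box-Cox requires strictly positive inputs:

```python
from scipy import stats

# Box-Cox requires strictly positive inputs; shift if necessary
co = data["CO"]
shift = 0.0 if co.min() > 0 else 1e-6 - co.min()
co_boxcox, lmbda = stats.boxcox(co + shift)

print(f"fitted lambda: {lmbda:.3f}")
# co_boxcox is now much closer to a normal distribution than the raw CO values
```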

Let's take a look at the Pearson correlation:

Image_Caption
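The matrix can be reproduced directly from the concatenated frame; a small sketch:

```python
# Pearson correlation matrix of the 11 variables
corr = data.drop(columns="Year").corr(method="pearson")

# Correlation of each feature with the two emissions
emissions = corr[["CO", "NOX"]].drop(index=["CO", "NOX"])
print(emissions.sort_values("CO", ascending=False))
```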

We can see that TEY, TIT, CDP, and GTEP are highly correlated with each other. Most important are the correlations with the emissions, which I visualize in a table.

Image_Caption

CO is highly correlated with CDP, TEY, TAT, TIT, GTEP, and AFDP – all of them process variables, AFDP being the surprising one. I assume AFDP could indicate low-oxygen combustion: the turbine needs more air, but the filter cannot provide a high enough airflow. NOX behaves differently and is only highly correlated with AT. These correlations can also be seen in the correlation viewer.

Image_Caption

Image_Caption

Image_Caption

Data Modelling and Test

I will train models on the datasets of 2011 and 2013-2015 and apply the trained models to the unseen dataset of the year 2012. I will track the test set's MAE, RMSE, and R^2 values. I also run a benchmark with the following preparation options:

  1. Training with no preparation at all
  2. Box-Cox transform of CO and NOX
  3. Box-Cox transform of all variables, except AT and AP, which are standardized
  4. Box-Cox transform of CO & NOX, standardization of all features
  5. No transformation at all, but feature selection: a) for CO, I used TIT, AFDP, TEY, GTEP, and CDP; b) for NOX, I used all features but TAT

At the same time, I will always train three models, KNeighbors, Support Vector Regression, and LightGBM, together with three ensemble strategies: stacking, voting, and blending. During training, I take 60% of the data as the train set, 15% as the validation set, and another 15% as the test set, following a 5-fold cross-validation methodology. The models' results and performance will only be calculated on the unseen dataset of 2012. So, in summary, the dataset from the years 2011 and 2013-2015 is split again into train, validation, and test sets during the cross-fold training period. This makes sure we avoid the common pitfall called "overfitting". A minimal sketch of such an ensemble setup follows below.
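Here is a minimal sketch of the voting and stacking setups with scikit-learn and LightGBM, using one target (CO) for brevity; blending has no built-in scikit-learn estimator and is omitted here, and all hyperparameters are illustrative assumptions:

```python
from lightgbm import LGBMRegressor
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# The three base models: KNeighbors, SVR, and LightGBM
# (in practice, SVR benefits strongly from standardized features)
base_models = [
    ("knn", KNeighborsRegressor(n_neighbors=10)),
    ("svr", SVR(C=10.0)),
    ("lgbm", LGBMRegressor(n_estimators=500)),
]

voting = VotingRegressor(estimators=base_models)
stacking = StackingRegressor(estimators=base_models, final_estimator=Ridge())

# 5-fold cross-validation on the train years (2011, 2013-2015)
for name, model in [("voting", voting), ("stacking", stacking)]:
    scores = cross_val_score(model, X_train, y_train["CO"], cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```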

Here are the final results:

Image_Caption

Whereas the voting model with the fourth preparation performs best for NOX, the choice for CO is not 100% clear. Depending on which metric is more relevant, the user should choose the first preparation and then LGBM, stacking, or blending. The fifth option combined with a blending strategy leads to the lowest MAE, but its R^2 performance is too poor.
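For reference, a minimal sketch of how the three tracked metrics can be computed on the unseen 2012 data with scikit-learn, shown here for the CO target:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Fit on 2011/2013-2015 and evaluate on the held-out year 2012
stacking.fit(X_train, y_train["CO"])
pred = stacking.predict(X_test)

mae = mean_absolute_error(y_test["CO"], pred)
rmse = np.sqrt(mean_squared_error(y_test["CO"], pred))
r2 = r2_score(y_test["CO"], pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```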

The following diagrams show the best results for NOX and CO (LGBM chosen) as time series and scatter plots.

Image_Caption

Image_Caption

Image_Caption

Image_Caption

We can also see visually that the model performs much better for NOX than for CO. The yearly seasonal behavior, the distribution, and the diagonal pattern in the scatter plot look good given the dataset.

The data preparation, model application, and postprocessing are collected into a story. We can publish and automate the application and use it as virtual sensors for NOX and CO.

Appendix

Image_Caption

Image_Caption

Image_Caption

Image_Caption

Image_Caption

Image_Caption

Image_Caption

Image_Caption