
Wednesday 9 May 2018

Azure IoT and IoT Edge - Part 2 (Building a Machine Learning model using generated data)


This post is part 2 of the Azure IoT Edge series. Please see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html if you have not read part 1.

In this post I will cover how we can build a logistic regression model in R using the data captured in table storage via IoT Hub.

We could run the simulated devices (all three at once) and wait for data to be generated and saved to table storage. For simplicity, however, I have created an R script to generate the data so that I can build the model and deploy it to IoT Edge. The Edge device can then apply the machine learning model to the data it receives from the downstream devices.

I am using exactly the same minimum temperature, pressure and humidity values as our simulated device used (see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html). Here are a few lines of the R script:
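As a rough illustration of what such a generator could look like, here is a minimal sketch in base R. The minimum values, the sample size, the size of the shift applied to defective readings and the column names are all assumptions made for this example; substitute the values from part 1.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical minimums; replace with the values used by the simulated device in part 1
min_temperature <- 20
min_pressure    <- 1
min_humidity    <- 60

n <- 300  # assumed number of readings

# Flag roughly one reading in three as defective
defective <- runif(n) < 1 / 3

# Defective readings get an (assumed) upward shift in temperature and pressure
devices <- data.frame(
  Temperature = min_temperature + runif(n, 0, 10) + ifelse(defective, 5, 0),
  Pressure    = min_pressure    + runif(n, 0, 2)  + ifelse(defective, 1, 0),
  Humidity    = min_humidity    + runif(n, 0, 20),
  Status      = factor(ifelse(defective, "bad", "good"), levels = c("bad", "good"))
)

head(devices)
```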




Let’s plot the data and see what it looks like. There are only 3 fields/features, so I will plot Temperature vs Pressure using ggplot2:
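A scatter plot of this kind can be sketched with ggplot2 as follows. The helper make_devices() and its values are hypothetical stand-ins for the generated data, not the script used for this post.

```r
library(ggplot2)

set.seed(42)

# Hypothetical stand-in for the generated data (values are assumptions)
make_devices <- function(n = 300) {
  defective <- runif(n) < 1 / 3
  data.frame(
    Temperature = 20 + runif(n, 0, 10) + ifelse(defective, 5, 0),
    Pressure    = 1  + runif(n, 0, 2)  + ifelse(defective, 1, 0),
    Humidity    = 60 + runif(n, 0, 20),
    Status      = factor(ifelse(defective, "bad", "good"), levels = c("bad", "good"))
  )
}
devices <- make_devices()

# Temperature vs Pressure, coloured by device status
p <- ggplot(devices, aes(x = Temperature, y = Pressure, colour = Status)) +
  geom_point() +
  labs(title = "Temperature vs Pressure by device status")

print(p)
```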







Output of above R commands:

















We can see that as temperature and pressure increase, the device drifts away from the good devices and becomes bad. For simplicity, the simulation generates higher temperature and pressure values when a device is flagged as defective.

Now let’s build a simple logistic regression model to find the probability of a device being defective.
I am using the caret package to build the model. Here is the code to split the data into training and test sets:








The proportion of good vs bad in the original data is 66% (good) / 33% (bad), so we make sure the split does not introduce skewness in the data.
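The split and the proportion check can be sketched with caret's createDataPartition, which samples within each class so the class balance is roughly preserved. The 70/30 ratio and the make_devices() helper are assumptions for this example.

```r
library(caret)

set.seed(42)

# Hypothetical stand-in for the generated data (values are assumptions)
make_devices <- function(n = 300) {
  defective <- runif(n) < 1 / 3
  data.frame(
    Temperature = 20 + runif(n, 0, 10) + ifelse(defective, 5, 0),
    Pressure    = 1  + runif(n, 0, 2)  + ifelse(defective, 1, 0),
    Humidity    = 60 + runif(n, 0, 20),
    Status      = factor(ifelse(defective, "bad", "good"), levels = c("bad", "good"))
  )
}
devices <- make_devices()

# Stratified split on Status: 70% training, 30% test (the ratio is an assumption)
train_idx <- createDataPartition(devices$Status, p = 0.7, list = FALSE)
train_set <- devices[train_idx, ]
test_set  <- devices[-train_idx, ]

# Compare the class proportions in the full, training and test data
prop_all   <- prop.table(table(devices$Status))
prop_train <- prop.table(table(train_set$Status))
prop_test  <- prop.table(table(test_set$Status))
print(rbind(all = prop_all, train = prop_train, test = prop_test))
```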







Now we apply the glm function to the data using the R script shown below:
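A minimal sketch of such a fit is shown here. It assumes Status is a factor with levels "bad" and "good", in which case glm with the binomial family models the probability of the second level ("good"). The data helper and the plain random split are stand-ins to keep the example self-contained.

```r
set.seed(42)

# Hypothetical stand-in for the generated data (values are assumptions)
make_devices <- function(n = 300) {
  defective <- runif(n) < 1 / 3
  data.frame(
    Temperature = 20 + runif(n, 0, 10) + ifelse(defective, 5, 0),
    Pressure    = 1  + runif(n, 0, 2)  + ifelse(defective, 1, 0),
    Humidity    = 60 + runif(n, 0, 20),
    Status      = factor(ifelse(defective, "bad", "good"), levels = c("bad", "good"))
  )
}
devices <- make_devices()

# Plain random 70/30 split, standing in for the caret split to keep the sketch short
train_idx <- sample(seq_len(nrow(devices)), size = round(0.7 * nrow(devices)))
train_set <- devices[train_idx, ]

# Logistic regression: with levels c("bad", "good") the model estimates P(Status == "good")
model <- glm(Status ~ Temperature + Pressure + Humidity,
             data = train_set, family = binomial)

# Coefficients, standard errors and p-values
print(summary(model))
```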







Here is the summary of the model:

















We can see from the above output that pressure is not statistically significant. The aim of this post, however, is simply to have a model that we can use in the IoT Edge device.

Let’s test this model on the test data set and find the best threshold to separate the bad devices from the good ones. I could have used cross-validation to find the best threshold. In general, use a cross-validation set to fine-tune parameters (e.g. the threshold, or lambda if ridge regression is used).

Below is the confusion matrix when I use a threshold of 0.5:







Let’s construct a data frame which contains the actual value, the predicted value and the calculated probability using the code below:



And view the first and last 5 records:
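One way to build and inspect such a data frame is sketched below. The column names Actual, Predicted and Probability, the data helper and the random split are assumptions for this example.

```r
set.seed(42)

# Hypothetical stand-in for the generated data (values are assumptions)
make_devices <- function(n = 300) {
  defective <- runif(n) < 1 / 3
  data.frame(
    Temperature = 20 + runif(n, 0, 10) + ifelse(defective, 5, 0),
    Pressure    = 1  + runif(n, 0, 2)  + ifelse(defective, 1, 0),
    Humidity    = 60 + runif(n, 0, 20),
    Status      = factor(ifelse(defective, "bad", "good"), levels = c("bad", "good"))
  )
}
devices   <- make_devices()
train_idx <- sample(seq_len(nrow(devices)), size = round(0.7 * nrow(devices)))
train_set <- devices[train_idx, ]
test_set  <- devices[-train_idx, ]

model <- glm(Status ~ Temperature + Pressure + Humidity,
             data = train_set, family = binomial)

# type = "response" returns probabilities; here that is P(Status == "good")
probability <- predict(model, newdata = test_set, type = "response")

# Classify as good when the probability exceeds the 0.5 threshold
predicted <- factor(ifelse(probability > 0.5, "good", "bad"),
                    levels = levels(test_set$Status))

results <- data.frame(
  Actual      = test_set$Status,
  Predicted   = predicted,
  Probability = round(probability, 4)
)

print(head(results, 5))
print(tail(results, 5))
```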

















The closer the probability is to 1, the more likely the device is good.

With a threshold of 0.65, the confusion matrix looks like this:








So we can see from the above two confusion matrices that the best threshold is 0.50, as it misclassifies only 4 instances, whereas 0.65 misclassifies 5 instances.
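The comparison of thresholds can be sketched as a small helper that builds the confusion matrix with base R's table() and counts the off-diagonal entries. The helper name misclassified() and the stand-in data are hypothetical, so the counts it prints will differ from the ones reported in this post.

```r
set.seed(42)

# Hypothetical stand-in for the generated data (values are assumptions)
make_devices <- function(n = 300) {
  defective <- runif(n) < 1 / 3
  data.frame(
    Temperature = 20 + runif(n, 0, 10) + ifelse(defective, 5, 0),
    Pressure    = 1  + runif(n, 0, 2)  + ifelse(defective, 1, 0),
    Humidity    = 60 + runif(n, 0, 20),
    Status      = factor(ifelse(defective, "bad", "good"), levels = c("bad", "good"))
  )
}
devices   <- make_devices()
train_idx <- sample(seq_len(nrow(devices)), size = round(0.7 * nrow(devices)))
train_set <- devices[train_idx, ]
test_set  <- devices[-train_idx, ]

model <- glm(Status ~ Temperature + Pressure + Humidity,
             data = train_set, family = binomial)
probability <- predict(model, newdata = test_set, type = "response")

# Print the confusion matrix for a threshold and return the number of misclassified rows
misclassified <- function(threshold) {
  predicted <- factor(ifelse(probability > threshold, "good", "bad"),
                      levels = levels(test_set$Status))
  cm <- table(Predicted = predicted, Actual = test_set$Status)
  print(cm)
  sum(cm) - sum(diag(cm))  # off-diagonal cells are the errors
}

thresholds <- c(0.50, 0.65)
errors <- sapply(thresholds, misclassified)
names(errors) <- thresholds
print(errors)
```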

The final model is given below:
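For reference, the general form of a logistic regression model with these three features is written out here. This assumes all three predictors are kept; the fitted coefficient values are not reproduced, so the betas stay symbolic.

```latex
P(\text{good}) = \frac{1}{1 + e^{-z}}, \qquad
z = \beta_0 + \beta_1\,\text{Temperature} + \beta_2\,\text{Pressure} + \beta_3\,\text{Humidity}
```

With a threshold of 0.5, a device is classified as good when P(good) > 0.5, which is the same as z > 0.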










The model is now built. I will use it in an IoT Edge module to make the Edge device intelligent. I will post about that soon, so stay tuned and happy IoTing :)