This blog is part 2 of the Azure IoT Edge series. Please see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html
if you have not read part 1.
In this blog I will cover how we can build a logistic regression model in R using
the data captured in table storage via IoT Hub.
We can run the simulated devices (all three at once) and wait for data to be generated
and saved to table storage. But for simplicity I have created an R script
to generate the data, so that I can build the model and deploy it to IoT Edge.
We can then use this Edge device to apply the machine learning model
to the data it receives from the downstream devices.
I am using exactly the same minimum temperature, pressure and humidity values as our
simulated device was using (please see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html). Here
are a few lines of the R script:
You can download the whole script from https://gist.github.com/mmasooddatascience/2e9807421f147f8abd6b00988eeba37f
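For readers who just want the flavour of the idea, here is a minimal sketch of generating such data in base R. It is not the script from the gist: the baseline values (20, 1, 60), the ranges, the offsets for defective devices and the column names are all assumptions made for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

n <- 300
# Every third device is flagged as defective (a 2:1 good/bad mix)
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)

# Hypothetical baselines and ranges; the real minimums come from the part 1 simulator
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),  # defective devices run hotter
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),   # and at higher pressure
  humidity    = 60 + runif(n, 0, 20),                       # humidity is unaffected here
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)

print(table(devices$status))
```

Shifting only temperature and pressure for defective devices mirrors the simplification described in this post; humidity carries no signal in this sketch.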
Let’s plot the data and see what it looks
like. There are only 3 fields/features, so I will plot Temperature vs Pressure using ggplot2:
Output of the above R commands:
We can see that as the temperature and pressure
increase, the device turns bad, moving away from the cluster of good devices. For simplicity, the simulation
generates higher values for temperature and pressure if a device is flagged as
defective.
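A scatter plot like the one described above could be produced along these lines. This is only a sketch on hypothetical synthetic data; the column names, value ranges and labels are assumptions, not the original script.

```r
library(ggplot2)

set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)

# Temperature vs pressure, coloured by device status
p <- ggplot(devices, aes(x = temperature, y = pressure, colour = status)) +
  geom_point() +
  labs(x = "Temperature", y = "Pressure", colour = "Device status")

print(p)
```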
Now let’s build a simple logistic
regression model to find the probability of a device being defective.
I am using the caret package to build the
model. Here is the code to split the data into training and test sets:
The proportion of good vs bad devices in the original
data is 66% (good) / 33% (bad), so we make sure the split does not skew
this balance.
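One way to get such a split with caret is `createDataPartition`, which samples within each class and so keeps the good/bad proportion roughly intact. The sketch below runs on hypothetical synthetic data, and the 70/30 ratio is an assumption, since this post does not state the ratio used.

```r
library(caret)

set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)

# Stratified split on the outcome; p = 0.7 is an assumed ratio
train_idx <- createDataPartition(devices$status, p = 0.7, list = FALSE)
train <- devices[train_idx, ]
test  <- devices[-train_idx, ]

# Check that both sets keep roughly the original class proportions
print(prop.table(table(train$status)))
print(prop.table(table(test$status)))
```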
Now we apply the glm function to the data using the R
script shown below:
Here is the summary of the model:
We can see from the above output that pressure
is not statistically significant. However, the aim of this post is simply to have a model that
we can use on the IoT Edge device.
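Fitting and summarising a logistic regression with `glm` might look like the sketch below. The data is hypothetical, a plain random split stands in for the caret split to keep the example dependency-free, and the formula with all three features is an assumption. Because the data is made up, the coefficients and p-values will not match the summary discussed above.

```r
set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)

# Simple random 70/30 split (assumed ratio)
train_idx <- sample(nrow(devices), round(0.7 * nrow(devices)))
train <- devices[train_idx, ]
test  <- devices[-train_idx, ]

# With a factor response, glm models the probability of the second level ("good" here)
model <- glm(status ~ temperature + pressure + humidity,
             data = train, family = binomial)

print(summary(model))
```

The `Pr(>|z|)` column of the summary is where the significance of each feature is read from.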
Let’s test this model on the test data set and
find the best threshold for separating bad devices from good ones. I could have used
cross-validation to find the best threshold: a cross-validation set is normally used to fine-tune parameters (e.g. the threshold, or lambda if ridge regression is used).
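As a rough illustration of the cross-validation alternative mentioned above (not something done in this post), the sketch below uses 5-fold cross-validation to pick a threshold. The data, the number of folds and the candidate threshold grid are all assumptions.

```r
set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold label per row
oof <- numeric(n)                          # out-of-fold P(good) for every row

for (i in 1:k) {
  held <- folds == i
  fit <- glm(status ~ temperature + pressure + humidity,
             data = devices[!held, ], family = binomial)
  oof[held] <- predict(fit, newdata = devices[held, ], type = "response")
}

# Misclassification rate for each candidate threshold
thresholds <- seq(0.30, 0.70, by = 0.05)
cv_error <- sapply(thresholds, function(th) {
  mean((oof > th) != (devices$status == "good"))
})

best_threshold <- thresholds[which.min(cv_error)]
print(data.frame(threshold = thresholds, cv_error = cv_error))
print(best_threshold)
```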
Below is the confusion matrix when I use
a threshold of 0.5:
Let’s construct a data frame that contains the
actual class, the predicted class and the calculated probability, using the code below:
And view the first and last 5 records:
The higher the probability (the closer to 1),
the more likely the device is good.
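A data frame of this kind could be assembled as in the sketch below, again on hypothetical synthetic data with assumed column names. Note that `glm` models the probability of the second factor level, so with levels `bad`/`good` a probability close to 1 means "good".

```r
set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)
train_idx <- sample(n, round(0.7 * n))
train <- devices[train_idx, ]
test  <- devices[-train_idx, ]
model <- glm(status ~ temperature + pressure + humidity,
             data = train, family = binomial)

# Predicted P(good) for each test row
prob <- predict(model, newdata = test, type = "response")

results <- data.frame(
  actual      = test$status,
  predicted   = factor(ifelse(prob > 0.5, "good", "bad"), levels = c("bad", "good")),
  probability = prob
)

print(head(results, 5))
print(tail(results, 5))
```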
With a threshold of 0.65, the confusion matrix
looks like this:
So, comparing the two confusion
matrices above, 0.50 is the better threshold: it misclassifies only 4
instances, whereas 0.65 misclassifies 5.
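The comparison of the two thresholds can be sketched with base R's `table`. The data here is hypothetical, so the error counts will differ from the 4 and 5 reported above; `confusion_at` and `errors` are helper names invented for this example.

```r
set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)
train_idx <- sample(n, round(0.7 * n))
train <- devices[train_idx, ]
test  <- devices[-train_idx, ]
model <- glm(status ~ temperature + pressure + humidity,
             data = train, family = binomial)
prob <- predict(model, newdata = test, type = "response")  # P(good)

# Confusion matrix (actual vs predicted) at a given threshold
confusion_at <- function(threshold) {
  predicted <- factor(ifelse(prob > threshold, "good", "bad"),
                      levels = c("bad", "good"))
  table(actual = test$status, predicted = predicted)
}

# Off-diagonal cells are the misclassified instances
errors <- function(cm) sum(cm) - sum(diag(cm))

cm_050 <- confusion_at(0.50)
cm_065 <- confusion_at(0.65)
print(cm_050)
print(cm_065)
print(c(errors_050 = errors(cm_050), errors_065 = errors(cm_065)))
```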
The final model is given below:
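In general, a fitted logistic regression boils down to its coefficients plus the logistic function, which is what makes it easy to carry over to an Edge module. The sketch below illustrates this on hypothetical data; the coefficients it prints are not those of the model in this post, and `score_device` is a made-up helper name.

```r
set.seed(42)
n <- 300
bad <- rep(c(FALSE, FALSE, TRUE), length.out = n)
# Hypothetical synthetic data; all parameter values are assumptions
devices <- data.frame(
  temperature = 20 + runif(n, 0, 15) + ifelse(bad, 10, 0),
  pressure    = 1  + runif(n, 0, 3)  + ifelse(bad, 2, 0),
  humidity    = 60 + runif(n, 0, 20),
  status      = factor(ifelse(bad, "bad", "good"), levels = c("bad", "good"))
)
model <- glm(status ~ temperature + pressure + humidity,
             data = devices, family = binomial)

b <- coef(model)
print(b)

# Score a reading by hand: linear combination passed through the logistic function
score_device <- function(temperature, pressure, humidity) {
  z <- b[["(Intercept)"]] +
    b[["temperature"]] * temperature +
    b[["pressure"]]    * pressure +
    b[["humidity"]]    * humidity
  plogis(z)  # 1 / (1 + exp(-z)), i.e. P(good)
}

# The hand-written score matches predict() on the same rows
manual <- score_device(devices$temperature, devices$pressure, devices$humidity)
print(all.equal(manual, unname(predict(model, type = "response"))))
```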
So far I have built the model. I will use it in an IoT Edge module, which will
make the Edge device intelligent. I will post that soon, so stay tuned and happy IoTing :)