In this exploration I will share what I have learnt so far about using R with Spark. Spark, as you all know, is a distributed computing framework. It allows you to program in Scala, Python, Java, and now R to perform distributed computation.
I implemented gradient descent in Hadoop to understand how we are going to parallelize the gradient computation. Please have a read about it at http://blog.mmasood.com/2016/08/implementing-gradient-decent-algorithm.html to understand the mathematics behind it.
Now I am implementing the same gradient descent algorithm in SparkR, using the Databricks community edition. You might be wondering why I am implementing it again :)
I always start with the knowledge I have right now and use it to learn a new language. For this purpose the gradient descent algorithm is the best fit. Also, we learn a couple of things while implementing GD, such as:
- How to break a big loop across a cluster of computers
- How to transform data in parallel
- How to share/send common variables and values to the worker nodes
- How to aggregate the results from the worker nodes
- Finally, how to combine those results
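To make these steps concrete, here is a plain-R sketch (no Spark involved) that simulates partitions with a list, computes a partial result on each "worker", and aggregates on the "driver". All names and the toy data here are illustrative, not from the original post:

```r
# Simulate the parallel pattern in plain R: split data into "partitions",
# compute a partial gradient sum on each, then aggregate the results.
set.seed(1)
data <- data.frame(x = runif(100), y = runif(100))
partitions <- split(data, rep(1:4, length.out = nrow(data)))  # 4 fake workers

theta <- c(0, 0)                       # shared values sent to every worker
partials <- lapply(partitions, function(part) {
  err <- (theta[1] + theta[2] * part$x) - part$y
  c(g0 = sum(err), g1 = sum(err * part$x), n = nrow(part))    # partial sums
})
totals <- Reduce(`+`, partials)        # aggregate results from the workers
gradient <- totals[c("g0", "g1")] / totals["n"]               # combine
```

In Spark the `lapply` over partitions becomes work shipped to the cluster, but the shape of the computation stays the same.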
If your algorithm has to iterate over millions of records or more, it is worth parallelizing. Almost any computation you do will involve the same sort of steps outlined above, and the same high-level tasks can be used to build complex machine-learning models, such as ensembles or stacked models.
Please write in the comments if you have items other than the ones I listed above; I would like to learn from you as well.
You can download the data from https://github.com/schneems/Octave/blob/master/mlclass-ex1/ex1data1.txt
Now I have talked too much; let's do some coding :)
You need to sign up at https://databricks.com/ first. Once you have done that, you can follow along.
Now, navigate to the Databricks community edition home page, as shown below:
First, create a cluster: click on Clusters > Create Cluster. Use Spark 2.1 (Auto-updating, Scala 2.10).
Next, upload your data to the cluster. To do this, click on Tables > Create Table; you will be presented with a screen like the one below:
Click on the “Drop file or click here to upload” section and upload your file. Once the file is uploaded, it will show you the path. Make a note of that path; you will need it when loading the data.
Now, create a notebook: navigate to Workspace, click on the dropdown, and select Create > Notebook, as shown below:
Provide a name for the notebook; I called mine “SparkR-GradientDescent”.
Make sure you have selected R as the language, then click the Create button to create the notebook. Now navigate to your newly created notebook and start writing R code :)
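We now need to load the data. Remember that we are running R code in Spark, so we use read.df (from the SparkR package), which gives us a SparkDataFrame rather than a base-R data.frame. A minimal sketch (the file path and the column names x/y are assumptions; use the path Databricks showed after your upload):

```r
# Load the uploaded CSV into a SparkDataFrame (not a base-R data.frame).
# The path below is an assumption -- substitute the path from the upload step.
df <- read.df("/FileStore/tables/ex1data1.txt", source = "csv",
              schema = structType(structField("x", "double"),
                                  structField("y", "double")))
head(df)    # SparkR's head(): returns the first rows as a local data.frame
count(df)   # row count, computed on the cluster
```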
Note that all the above methods look similar to their base-R counterparts, but they come from the SparkR package and understand SparkDataFrame objects. Let's run the code below to see the structure of the object:
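What that inspection might look like (df is an assumed name for the loaded SparkDataFrame):

```r
# Inspect the distributed object: it is a SparkDataFrame, not a data.frame.
class(df)          # reports the SparkDataFrame class from the SparkR package
printSchema(df)    # the schema lives on the Spark side
```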
Now run the code below:
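Assuming df is the SparkDataFrame from earlier, the comparison could look like:

```r
# collect() brings the distributed data back to the driver
# as a base-R data.frame.
local_df <- collect(df)
class(local_df)    # a plain "data.frame"
str(local_df)      # base-R str() works on the local copy
```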
You can see that the two are different kinds of object. Now I define a function that calculates a partial gradient, so that it can be computed on the worker nodes and the results returned to the driver program. Here is the code:
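The original snippet is not reproduced here; a minimal sketch of such a function, assuming single-feature linear regression with columns x and y and theta = c(intercept, slope), could be:

```r
# Computes the partial gradient for one partition of the data.
# part:  a base-R data.frame holding this partition's rows
# theta: current coefficients c(theta0, theta1), shared from the driver
# Returns the per-partition sums the driver needs to update theta.
partial_gradient <- function(part, theta) {
  h <- theta[1] + theta[2] * part$x      # predictions on this partition
  err <- h - part$y                      # residuals
  data.frame(g0 = sum(err),              # sum for the intercept gradient
             g1 = sum(err * part$x),     # sum for the slope gradient
             n  = nrow(part))            # row count, so the driver can average
}
```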
Now we write the code that has the worker nodes calculate partial gradients on each partition, collects the results, and updates our thetas:
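A hedged sketch of what that driver loop might look like; df, the column names x/y, alpha, and the iteration count are all assumptions, not the post's original values:

```r
# Schema of the data.frame each partition returns from dapply.
schema <- structType(structField("g0", "double"),
                     structField("g1", "double"),
                     structField("n",  "double"))

theta <- c(0, 0)   # shared values: shipped to every worker each iteration
alpha <- 0.01      # learning rate (illustrative)
cache(df)          # keep the data in memory across iterations

for (i in 1:100) {
  # Run the partial-gradient computation on every partition on the workers.
  parts <- dapply(df, function(part) {
    h   <- theta[1] + theta[2] * part$x
    err <- h - part$y
    data.frame(g0 = sum(err), g1 = sum(err * part$x), n = nrow(part))
  }, schema)

  g <- collect(parts)                    # bring partial sums to the driver
  m <- sum(g$n)                          # total number of rows
  theta <- theta - (alpha / m) * c(sum(g$g0), sum(g$g1))  # update step
}
theta  # estimated coefficients after the loop
```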
Here is the result of the above code:
A few things to note in the above code:
- We cache the data in memory (using cache(data)) so that Spark does not need to reload it from storage on every iteration.
- We define a schema because dapply needs it to transform the R data.frame returned by our function into a SparkDataFrame.
- We perform the partial-gradient calculation on each partition using dapply; that is, we tell Spark to run the given function on each partition residing on the worker nodes.
- Each worker node receives a copy of the shared variables. In Spark with Scala we would have had to broadcast them explicitly.
- We collect the results from the worker nodes as an R data.frame using the collect method.
- We update theta on the driver, and the new value is available to each worker in the next iteration.
You can view the available functions in the SparkR package at https://docs.databricks.com/spark/latest/sparkr/index.html
Finally, we can validate our estimated coefficients using R's lm function (running locally on my machine).
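A sketch of that local check (plain R, no Spark; the file name and column names are assumptions matching the data linked earlier):

```r
# Fit the same data locally with lm() and compare coefficients.
local_df <- read.csv("ex1data1.txt", header = FALSE,
                     col.names = c("x", "y"))
fit <- lm(y ~ x, data = local_df)
coef(fit)  # compare these with the thetas from the gradient descent loop
```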
We can validate our calculation on a small sample of the data so that it is easy to debug. We can see that the estimated coefficients are close to what the lm model gave me; if we increase the number of iterations, we get thetas even closer to it.
I hope this post helps you understand SparkR. Please provide your feedback if I missed anything. That's it for now. Enjoy coding :)