Navigation Bar

Thursday, 18 May 2017

Exploring SparkR using Databricks environment

In this exploration I will share what I have learnt so far R with Spark. Spark, as you all know, is a distributed computing framework. It allows you to program in Scala/python/Java and now in R for performing distributed computation.

I implemented gradient descent in Hadoop to understand how we are going to parallelize gradient computation. Please have a read about it at http://blog.mmasood.com/2016/08/implementing-gradient-decent-algorithm.html for understanding mathematics behind it.

Now I am implementing same gradient descent algorithm in SparkR using Databricks community edition. You might be wondering why I am implementing it again J

I always start with the knowledge I have right now then I use those knowledge to learn new language. For this instance Gradient Descent algorithm bets fit here. Also we learn couple of things while implementing GD like:

  •      How we break big loop into cluster of computers
  •      How we are transforming data in parallel
  •      How we share/send common variables/values to worker nodes
  •      How we are aggregating results from worker nodes.
  •      Finally combining those results


If your algorithm has to iterate over millions or more records then it is worth parallelizing it. Any computation you do, you will almost be doing same sort of things as I outlined above. I can use above high-level tasks mentioned above to build a complex Machine Learning model like ensembling models or model stacking etc.

Please write in comments if you have other items than I have listed above
J to learn from you as well.


Now I have talked too much, let's do some coding J

You need to sign-up at https://databricks.com/ first. Once you have done it you can follow it.
Now, navigate to databricks community edition home page like shown below:





First you need to create a cluster first, click on Clusters > Create Cluster to create a cluster. Use Spark 2.1 (Auto-updating, Scala 2.10)

Next, upload your data to cluster. To do this, click on Tables > Create Table you will be presented like below screen:



Click on “Drop file or click here to upload” section and upload your file. Once you have uploaded the file it will show you the path. Note that path to somewhere.



Now, create a notebook by navigating to Workspace and click on dropdown and select Create > Notebook like shown below:




And provide the name for the notebook. I called “SparkR-GradientDescent”



Make sure you have selected R as language. Click on Create button to create the notebook. Now navigate to your newly created notebook and start writing R code J

We now need to load the data. Remember that we are running R code in Spark so we need to use read.df (from SparkR package) to load data into a SparkDataFrame (not data.frame).



Note that all above methods are similar but they are from SparkR package. All these methods understand SparkDataFrame object. Let's run below code to see the structure of the object:





Now run below code:



You can see both are two different object.

Now, I define a method that calculates partial gradient so that we can compute it on worker nodes and get the result back to driver program.

Here is the code:
 

Now, we write code that initiate worker nodes to calculate partial gradients on each partition, collect those calculated data and update our thetas using below codes:



Here is the result of above code:



Few things to note in above code:

  1. We are caching (using cache(data)) data in memory so that in each iteration Spark does not need to load data from storage.
  2. We are defining schema because dapply needs to transform an r data.frame object to SparkDataFrame with provide schema
  3. We are performing some calculations (partial gradient) on each partition using dapply.  So we are telling spark to run given function on each partition residing on worker nodes.
  4. Each worker nodes are getting a shared variable/object. In Spark-scala we had to broadcast the variable.
  5. We are collecting data from worker nodes as r data.frame object using collect method.
  6. Updating theta and that will be available to each worker in next iteration.

You can view available functions in SparkR package at https://docs.databricks.com/spark/latest/sparkr/index.html
Finally we can validate our estimated coefficient using lm package in R (running locally on my machine)





We can validate our calculation on sample data so that we can debug it easily. We can see that estimated coefficients are close to what lm model gave me. If we increase number of iterations we can get thetas close to it.

I hope that this post will help you understanding SparkR. Please provide your feedback if I missed anything.

That’s it for now. Enjoy coding :)