
Saturday 27 August 2016

Implementing Gradient Descent Algorithm in Hadoop for large scale data

In this post I will be exploring how we can use MapReduce to implement the Gradient Descent algorithm in Hadoop for large-scale data. As we know, Hadoop is capable of handling petabyte-scale data.

In this article I will be using the following concepts:
  • Gradient Descent algorithm
  • Hadoop MapReduce job
  • Writable
  • Reading from HDFS via the Hadoop API
  • Hadoop Counters

Before starting, we first need to understand what Gradient Descent is and where we can use it. Below is an excerpt from Wikipedia:
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals.

If you look at the algorithm, it is an iterative optimisation algorithm. So if we are talking about millions of observations, then each iteration needs to pass over those millions of observations and adjust our parameters (theta).

Some mathematical notations (the standard formulation for a simple linear regression with two parameters, which matches the sum0/sum1 partial sums emitted by the mapper below):

The hypothesis is

    h(x) = theta_0 + theta_1 * x

and each iteration of gradient descent updates both parameters simultaneously:

    theta_0 := theta_0 - alpha * (1/m) * Σ_{i=1..m} ( h(x_i) - y_i )
    theta_1 := theta_1 - alpha * (1/m) * Σ_{i=1..m} ( h(x_i) - y_i ) * x_i

where m is the number of observations and alpha is the learning rate.

And the cost function being minimised is

    J(theta) = (1/2m) * Σ_{i=1..m} ( h(x_i) - y_i )^2
Now, the question is: how can we leverage Hadoop to distribute the workload of minimising the cost function and finding the theta parameters?

The MapReduce programming model comprises two phases, Map and Reduce, shown in the picture below. Hadoop lets the programmer focus only on the map and reduce phases; the rest of the workload (splitting the data, scheduling tasks, shuffling intermediate results) is taken care of by Hadoop, so programmers do not need to think about how the data is going to be split.

Please visit https://en.wikipedia.org/wiki/MapReduce to know about MapReduce framework.
[Figure: overview of the MapReduce programming model (Map and Reduce phases)]

When a user uploads data to HDFS, the data is split and stored across various data nodes.
Now, we know Hadoop will provide a subset of the data to each mapper, so we can program our mapper to emit a PartialGradientDescent object (a custom Writable) for each observation. For instance, if one split has 50 observations, then that mapper will emit 50 partial gradient descent objects.

One more thing: there is only ONE reducer in this example, so the reducer will receive the whole lot of data. It would be better to introduce a combiner so that the reducer receives far fewer PartialGradientDescent objects, or you can apply the in-memory combining design pattern for MapReduce, which I will cover in the next post.

Now let’s get into the Java MapReduce program. Before reading further, it would be better if you understand the Writable concept in Hadoop and some matrix algebra.

Mapper code:
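A minimal sketch of such a mapper, assuming each input line holds one observation as comma-separated "x,y" values and that the current theta values are passed in through the job Configuration (the PartialGradientDescent fields and accessors are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GradientDescentMapper
        extends Mapper<LongWritable, Text, IntWritable, PartialGradientDescent> {

    private static final IntWritable SINGLE_KEY = new IntWritable(1);
    private double theta0;
    private double theta1;

    @Override
    protected void setup(Context context) {
        // Current parameter values, set by the driver before each iteration.
        theta0 = context.getConfiguration().getDouble("gd.theta0", 0.0);
        theta1 = context.getConfiguration().getDouble("gd.theta1", 0.0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double x = Double.parseDouble(parts[0]);
        double y = Double.parseDouble(parts[1]);

        // Prediction error for this single observation.
        double error = (theta0 + theta1 * x) - y;

        // sum0 and sum1 are this observation's contribution to the gradients of
        // theta0 and theta1; the trailing 1 is the observation count.
        context.write(SINGLE_KEY, new PartialGradientDescent(error, error * x, 1L));
    }
}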

We can see that the map task emits a PartialGradientDescent object carrying the information the reducer will need to update theta: sum0, sum1 and a count of 1.

Now let's have a look at reducer code:
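A minimal sketch of the reducer under the same assumptions; it also increments a Hadoop counter with the number of observations it has seen (counters are one of the concepts listed above), and alpha and the current thetas again come from the job Configuration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GradientDescentReducer
        extends Reducer<IntWritable, PartialGradientDescent, NullWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<PartialGradientDescent> values,
                          Context context) throws IOException, InterruptedException {
        double sum0 = 0.0;
        double sum1 = 0.0;
        long count = 0;

        // Sum up all the partial gradients produced by the mappers (or a combiner).
        for (PartialGradientDescent partial : values) {
            sum0 += partial.getSum0();
            sum1 += partial.getSum1();
            count += partial.getCount();
        }
        context.getCounter("GradientDescent", "OBSERVATIONS").increment(count);

        // Apply the gradient descent update rule to get the new parameters.
        Configuration conf = context.getConfiguration();
        double alpha = conf.getDouble("gd.alpha", 0.01);
        double theta0 = conf.getDouble("gd.theta0", 0.0) - alpha * sum0 / count;
        double theta1 = conf.getDouble("gd.theta1", 0.0) - alpha * sum1 / count;

        // The driver reads this line back from HDFS after the job finishes.
        context.write(NullWritable.get(), new Text(theta0 + "," + theta1));
    }
}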


We can see from the reducer code that we are summing up all the given partial gradients. This can be improved by supplying a combiner that does a partial sum before the data reaches the reducer: for instance, if we have 50 mappers, the combiner running after each mapper will sum that mapper's output before sending it on, so the reducer will only receive 50 partial gradient objects.
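A sketch of such a combiner, reusing the illustrative PartialGradientDescent fields assumed above; it simply merges each mapper's output into a single object (the driver would register it with job.setCombinerClass):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class GradientDescentCombiner
        extends Reducer<IntWritable, PartialGradientDescent, IntWritable, PartialGradientDescent> {

    @Override
    protected void reduce(IntWritable key, Iterable<PartialGradientDescent> values,
                          Context context) throws IOException, InterruptedException {
        double sum0 = 0.0;
        double sum1 = 0.0;
        long count = 0;
        for (PartialGradientDescent partial : values) {
            sum0 += partial.getSum0();
            sum1 += partial.getSum1();
            count += partial.getCount();
        }
        // One merged partial gradient per mapper instead of one per observation.
        context.write(key, new PartialGradientDescent(sum0, sum1, count));
    }
}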

And the custom Writable (i.e. PartialGradientDescent):
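A minimal sketch of that writable, assuming the three fields used above: the two partial gradient sums and the number of observations they cover.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PartialGradientDescent implements Writable {

    private double sum0;
    private double sum1;
    private long count;

    // Hadoop requires a no-argument constructor to deserialise instances.
    public PartialGradientDescent() {
    }

    public PartialGradientDescent(double sum0, double sum1, long count) {
        this.sum0 = sum0;
        this.sum1 = sum1;
        this.count = count;
    }

    public double getSum0() { return sum0; }
    public double getSum1() { return sum1; }
    public long getCount() { return count; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(sum0);
        out.writeDouble(sum1);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum0 = in.readDouble();
        sum1 = in.readDouble();
        count = in.readLong();
    }
}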


And the last piece of the puzzle is the driver program, which triggers one Hadoop job per iteration for as many iterations as you need.
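A minimal sketch of such a driver, assuming the classes above; after each job it reads the updated thetas back from the reducer's output file on HDFS via the Hadoop FileSystem API (the paths, configuration keys and fixed alpha are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GradientDescentDriver {

    public static void main(String[] args) throws Exception {
        // args: <input path> <output base path> <number of iterations>
        int iterations = Integer.parseInt(args[2]);
        double theta0 = 0.0;
        double theta1 = 0.0;

        for (int i = 0; i < iterations; i++) {
            Configuration conf = new Configuration();
            conf.setDouble("gd.alpha", 0.01);
            conf.setDouble("gd.theta0", theta0);
            conf.setDouble("gd.theta1", theta1);

            Job job = Job.getInstance(conf, "gradient-descent-" + i);
            job.setJarByClass(GradientDescentDriver.class);
            job.setMapperClass(GradientDescentMapper.class);
            job.setReducerClass(GradientDescentReducer.class);
            job.setNumReduceTasks(1);
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(PartialGradientDescent.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            Path outputDir = new Path(args[1] + "/iteration-" + i);
            FileOutputFormat.setOutputPath(job, outputDir);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // Read the updated parameters back from the single reducer's output.
            FileSystem fs = FileSystem.get(conf);
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    fs.open(new Path(outputDir, "part-r-00000")), StandardCharsets.UTF_8))) {
                String[] parts = reader.readLine().split(",");
                theta0 = Double.parseDouble(parts[0]);
                theta1 = Double.parseDouble(parts[1]);
            }
        }

        System.out.println("Final parameters: theta0=" + theta0 + ", theta1=" + theta1);
    }
}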

That's it for now. Stay tuned.

Tuesday 15 March 2016

Timer job – from on premise to Cloud World (Azure) using WebJob

The SharePoint Timer service runs in the background to do long-running tasks. It performs some important SharePoint clean-up tasks but can also be used to provide useful functional tasks. For instance, there may be a situation where you want to send newsletters to your users on a regular basis, or want to keep your customers up to date with some regularly timed information.

This is part two of the series "From SharePoint On-Prem to Office365". Please read the first post at http://blog.mmasood.com/2015/04/from-sharepoint-on-premise-to-office365.html.

For this demo I will be using a timer job to send an email to newly registered customers/users. The newly registered customers/users are stored in a SharePoint list, with a status field capturing whether an email has been sent or not.
There are several implementation choices when developing such a timer job:
  1. Azure WebJob
  2. Azure Worker Role
  3. Windows Service (can be hosted on premise or on a VM in the cloud)
  4. Task Scheduler (hosted on premise)
I am choosing a WebJob as it is free of cost and I can leverage my console application as a WebJob. Please check http://www.troyhunt.com/2015/01/azure-webjobs-are-awesome-and-you.html for reasons to choose WebJobs.

An Azure WebJob does not live on its own; it sits under an Azure Web App. For this purpose I am going to create a dummy web app and host my WebJob under it. I will be hosting all my CSOM code in this WebJob.

There are two types of WebJobs:
  • Continuous: best fit for queueing applications that keep receiving messages from a queue.
  • On Demand: can be scheduled to run hourly, weekly, monthly, etc.
The WebJob is used to host and execute the CSOM code that gets information about the users/customers from SharePoint and sends the email. The following code snippets show what the WebJob is doing.

Querying SharePoint using CSOM and CAML Query:
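A minimal sketch of that query, assuming a "Customers" list and an "EmailSent" yes/no field (the site URL, list and field names are illustrative):

using (var context = new ClientContext("https://tenant.sharepoint.com/sites/demo"))
{
    context.Credentials = new SharePointOnlineCredentials(userName, securePassword);

    // Fetch customers that have not been emailed yet.
    List customersList = context.Web.Lists.GetByTitle("Customers");
    var query = new CamlQuery
    {
        ViewXml = @"<View><Query><Where>
                        <Eq><FieldRef Name='EmailSent'/><Value Type='Boolean'>0</Value></Eq>
                    </Where></Query></View>"
    };

    ListItemCollection newCustomers = customersList.GetItems(query);
    context.Load(newCustomers);
    context.ExecuteQuery();
}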


Sending email using Office 365 Exchange Web Services:
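A minimal sketch using the EWS Managed API (Microsoft.Exchange.WebServices.Data); the mailbox, credentials and recipient variables are illustrative:

var service = new ExchangeService(ExchangeVersion.Exchange2013)
{
    Credentials = new WebCredentials("noreply@tenant.onmicrosoft.com", password),
    Url = new Uri("https://outlook.office365.com/EWS/Exchange.asmx")
};

var email = new EmailMessage(service)
{
    Subject = "Welcome!",
    Body = new MessageBody(BodyType.HTML, htmlBody)
};
email.ToRecipients.Add(customerEmail);
email.SendAndSaveCopy();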

Composing the email body using the RazorEngine templating engine:
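A minimal sketch using the RazorEngine NuGet package (the exact API depends on the package version; the template text and model are illustrative):

// using RazorEngine;
// using RazorEngine.Templating;
string template = "<p>Hi @Model.FirstName, welcome aboard!</p>";
var model = new { FirstName = "Jane" };

// Compiles the template on first use and caches it under the given key.
string htmlBody = Engine.Razor.RunCompile(template, "welcome-email", null, model);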

And finally, updating the SharePoint list item using CSOM:
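A minimal sketch, flagging the item so it is not emailed again (the field name is illustrative and matches the query above):

ListItem customer = newCustomers[0];
customer["EmailSent"] = true;
customer.Update();
context.ExecuteQuery();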

You can download the full source code from CodePlex: https://webjobforsptimer.codeplex.com/
When writing a WebJob, the following points should be considered to make it diagnosable and reusable:
  1. Do not swallow exceptions. Handle them first, then rethrow, so the WebJob knows something went wrong.
  2. Try to use interfaces so that dependencies can be mocked for unit testing.
  3. Always log major steps and errors, using Console.WriteLine etc.
  4. Structure your code so that it can also run as a console application and therefore be used with Task Scheduler.
  5. Try to avoid hardcoding; maximise the use of configuration, which can also be supplied from the Azure portal.
It is time to publish this WebJob. There are lots of articles out there on how to create a schedule for a WebJob; I will simply be using Visual Studio to create the schedule before publishing it. In Visual Studio, right-click the project and click “Publish as Azure WebJob…”, and it will launch a UI where you can specify your schedule, as shown below:
[Screenshot: Schedule settings]
That’s it. Happy SharePointing :)