In this post I will explore how we can use MapReduce to implement the Gradient Descent algorithm in Hadoop for large-scale data. As we know, Hadoop is capable of handling data at petabyte scale.
In this article I will be using the following concepts:
- Gradient Descent algorithm
- Hadoop Map Reduce Job
- Writable
- Reading from HDFS via Hadoop API
- Hadoop Counter
Before starting, we first need to understand what Gradient Descent is and where we can use it. Below is an excerpt from Wikipedia:
Gradient descent is a first-order iterative optimization algorithm.
To find a local minimum of a function using gradient
descent, one takes steps proportional to the negative of the gradient (or
of the approximate gradient) of the function at the current point. If instead
one takes steps proportional to the positive of the gradient, one
approaches a local maximum of that function; the
procedure is then known as gradient ascent.
Gradient descent is also known as steepest
descent, or the method of steepest descent. Gradient descent should not be
confused with the method of steepest descent for
approximating integrals.
If you look at the algorithm, it is an iterative
optimisation algorithm. So if we are talking about millions of observations,
then on every iteration we need to go over those millions of observations and adjust our parameters
(theta).
Some mathematical notation: on every iteration, each parameter $\theta_j$ is updated as

$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}$$

where $h_\theta(x) = \theta_0 + \theta_1 x$ is the hypothesis and $\alpha$ is the learning rate,
and $m$ is the number of observations in the training set.
Andrew Ng explains this well in his Coursera lecture on map-reduce and data parallelism: https://www.coursera.org/learn/machine-learning/lecture/10sqI/map-reduce-and-data-parallelism
Now, the question is: how can we leverage Hadoop to
distribute the workload in order to minimize the cost function and find the theta
parameters?
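The answer comes from the shape of the update rule itself: the summation decomposes into partial sums, one per input split, so it can be computed in parallel (a sketch of the algebra, using the notation above):

$$\sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)} \;=\; \underbrace{\sum_{i \in \text{split}_1} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}}_{\text{mapper 1}} \;+\; \cdots \;+\; \underbrace{\sum_{i \in \text{split}_k} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}}_{\text{mapper } k}$$

Each mapper computes the partial sum over its own split, and a single reducer adds the partial sums together and applies the update. Only the small partial results travel over the network, not the raw observations.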
The MapReduce programming model comprises two phases, 1. Map and 2. Reduce, as shown in the picture below. Hadoop lets the programmer focus only on the map and reduce phases; the rest of the workload is taken care of by Hadoop, so programmers do not need to think about how the data will be split, and so on.
Please visit https://en.wikipedia.org/wiki/MapReduce to learn more about the MapReduce framework.
When a user uploads data to HDFS, the data is split into blocks and stored across various data nodes.
Now we know Hadoop will provide a subset of the data to each mapper, so we can program our mapper to emit PartialGradientDescent serializable objects. For instance, if one split has 50 observations, that mapper will emit 50 partial gradient descent objects.
One more thing: there is only ONE reducer in this example, so the reducer will
receive the whole lot of data. It would be better to introduce a combiner so that the reducer
gets a smaller number of PartialGradientDescent
objects, or you can apply the in-memory
combining design pattern for MapReduce, which I will cover in the next post.
Now let’s get into the Java MapReduce program. Before reading further, it would be better if you understand the Writable
concept in Hadoop and some matrix algebra.
Mapper code: We can see that the map task emits a PartialGradientDescent object carrying the information the reducer will need to update theta: sum0, sum1 and a count of 1.
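Here is a minimal sketch of such a mapper, assuming the input lines are of the form "x,y" for simple linear regression and that the current theta values are passed to the tasks through the job Configuration. The class name GradientDescentMapper and the gd.* property keys are illustrative, not necessarily the ones used in the original code:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper that emits one partial gradient per observation.
public class GradientDescentMapper
        extends Mapper<LongWritable, Text, IntWritable, PartialGradientDescent> {

    private static final IntWritable ONE_KEY = new IntWritable(1); // single key -> single reduce group
    private double theta0;
    private double theta1;

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        theta0 = conf.getDouble("gd.theta0", 0.0);   // current parameters for this iteration (hypothetical keys)
        theta1 = conf.getDouble("gd.theta1", 0.0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double x = Double.parseDouble(parts[0]);
        double y = Double.parseDouble(parts[1]);

        double error = (theta0 + theta1 * x) - y;    // h(x) - y for this observation

        // Hadoop Counter: makes the number of processed observations visible in the job output.
        context.getCounter("gradient-descent", "observations").increment(1);

        // sum0 = error, sum1 = error * x, count = 1 for a single observation.
        context.write(ONE_KEY, new PartialGradientDescent(error, error * x, 1));
    }
}
```

Using a single constant key means every partial gradient ends up in the same reduce group, which is exactly what we want since there is only one set of parameters to update.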
Now let's have a look at the reducer code:
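A sketch of the reducer, under the same assumptions as the mapper sketch above (single key, theta and the learning rate read from the Configuration, the new theta written out as a plain text line):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a reducer that sums all partial gradients and applies the update rule.
public class GradientDescentReducer
        extends Reducer<IntWritable, PartialGradientDescent, NullWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<PartialGradientDescent> values, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        double theta0 = conf.getDouble("gd.theta0", 0.0);   // hypothetical keys, as in the mapper sketch
        double theta1 = conf.getDouble("gd.theta1", 0.0);
        double alpha  = conf.getDouble("gd.alpha", 0.01);   // learning rate

        double sum0 = 0.0, sum1 = 0.0;
        long m = 0;
        for (PartialGradientDescent pgd : values) {         // aggregate the partial sums
            sum0 += pgd.getSum0();
            sum1 += pgd.getSum1();
            m    += pgd.getCount();
        }

        // Standard batch gradient descent update for both parameters.
        double newTheta0 = theta0 - alpha * sum0 / m;
        double newTheta1 = theta1 - alpha * sum1 / m;

        context.write(NullWritable.get(), new Text(newTheta0 + "," + newTheta1));
    }
}
```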
We can see from the reducer code that we sum up all of the partial gradients it receives. This can be improved by supplying a combiner that does a partial sum before the data reaches the reducer. For instance, if we have 50 mappers, each mapper's output is combined locally first, so the reducer receives only 50 partial gradient objects instead of one per observation.
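A combiner for this job could look like the following sketch; note that a combiner's input and output types must match the mapper's output types, so it emits one summed-up PartialGradientDescent per map task rather than the updated theta:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a combiner that collapses the per-record partial gradients of one
// map task into a single PartialGradientDescent object.
public class GradientDescentCombiner
        extends Reducer<IntWritable, PartialGradientDescent, IntWritable, PartialGradientDescent> {

    @Override
    protected void reduce(IntWritable key, Iterable<PartialGradientDescent> values, Context context)
            throws IOException, InterruptedException {
        double sum0 = 0.0, sum1 = 0.0;
        long count = 0;
        for (PartialGradientDescent pgd : values) {
            sum0 += pgd.getSum0();
            sum1 += pgd.getSum1();
            count += pgd.getCount();
        }
        context.write(key, new PartialGradientDescent(sum0, sum1, count));
    }
}
```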
And the custom Writable (i.e. PartialGradientDescent):
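A sketch of what such a Writable could look like; the field names mirror the sum0/sum1/count values mentioned above:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch of a custom Writable carrying the partial sums needed to update
// theta0 and theta1, plus the number of observations they cover.
public class PartialGradientDescent implements Writable {

    private double sum0;   // partial sum of (h(x) - y)        -> gradient w.r.t. theta0
    private double sum1;   // partial sum of (h(x) - y) * x    -> gradient w.r.t. theta1
    private long count;    // number of observations in this partial sum

    public PartialGradientDescent() {}   // no-arg constructor required by Hadoop for deserialization

    public PartialGradientDescent(double sum0, double sum1, long count) {
        this.sum0 = sum0;
        this.sum1 = sum1;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialize fields in a fixed order
        out.writeDouble(sum0);
        out.writeDouble(sum1);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialize in the same order
        sum0 = in.readDouble();
        sum1 = in.readDouble();
        count = in.readLong();
    }

    public double getSum0() { return sum0; }
    public double getSum1() { return sum1; }
    public long getCount()  { return count; }
}
```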
And the last piece of the puzzle is the driver program, which triggers one Hadoop job per iteration, for as many iterations as you need.
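A sketch of such a driver, which runs one job per iteration and reads the updated theta back from HDFS via the Hadoop API before launching the next one; the output paths, the gd.* property keys and the part-r-00000 file name are assumptions of this sketch:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver that chains one MapReduce job per gradient descent iteration.
public class GradientDescentDriver {

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);                      // training data on HDFS
        int iterations = Integer.parseInt(args[1]);          // number of gradient descent iterations

        double theta0 = 0.0, theta1 = 0.0, alpha = 0.01;     // initial parameters and learning rate
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        for (int i = 0; i < iterations; i++) {
            conf.setDouble("gd.theta0", theta0);             // broadcast current parameters to the tasks
            conf.setDouble("gd.theta1", theta1);
            conf.setDouble("gd.alpha", alpha);

            Path output = new Path("gd-output-" + i);
            Job job = Job.getInstance(conf, "gradient-descent-iteration-" + i);
            job.setJarByClass(GradientDescentDriver.class);
            job.setMapperClass(GradientDescentMapper.class);
            job.setCombinerClass(GradientDescentCombiner.class);
            job.setReducerClass(GradientDescentReducer.class);
            job.setNumReduceTasks(1);                        // single reducer computes the update
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(PartialGradientDescent.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // Read the new "theta0,theta1" line back from HDFS via the Hadoop API.
            try (FSDataInputStream in = fs.open(new Path(output, "part-r-00000"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String[] parts = reader.readLine().split(",");
                theta0 = Double.parseDouble(parts[0]);
                theta1 = Double.parseDouble(parts[1]);
            }
        }

        System.out.println("Final theta: " + theta0 + ", " + theta1);
    }
}
```

With setNumReduceTasks(1) there is exactly one reducer, which matches the discussion above, and the combiner keeps the amount of data that single reducer receives small.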
That's it for now. Stay tuned.