Slides Youtube

MapReduce

  • MapReduce is a programming model and software system developed by Google .
  • Characters: client-server architecture, message-passing communication, and bulk synchronous parallel.
  • Apache Hadoop is an open-source implementation of MapReduce.
  • Apache Spark is an improved open-source MapReduce.

Broadcase

mapreduce-1

Map

mapreduce-2

Reduce

mapreduce-3

Data Parallelism

Partition the data among worker nodes. (A node has a subset of data.)

data_parallelism_1

Parallel Gradient Descent Using MapReduce

  • Broadcast: Server broadcast the up-to-date parameters wtw_t o workers.
  • Map: Workers do computation locally.
    • Map (xi,yi,wt)(x_i,y_i,w_t) to gi=(xiTwtyi)xig_i=(x_i^T w_t-yi)xi.
    • Obtain nn vectors: g1,g2,g3,...,gng_1, g_2,g_3,...,g_n
  • Reduce: Compute the sum: g=i=1ngig=\sum_{i=1}^{n}g_i
  • Every worker sums all the gi{g_i} stored in its local memory to get a vector.
  • Then, the server sums the resulting m vectors. (There are m workers.)
  • Server updates the parameters: wt+1=ttαgw_{t+1}=t_t-\alpha \cdot g

Parallel_Gradient_Descent_Using_MapReduce_1

Speedup Ratio

speedup_ratio_1

Communication Cost

  • Communication complexity: How many words are transmitted between server and workers.
    • Proportional to number of parameters.
    • Grow with number of worker nodes.
  • Latency: How much time it takes for a packet of data to get from one point to another. (Determined by the compute network.)
  • Communication time: comlexitybancwith+latency\frac{comlexity}{bancwith}+latency

Bulk Synchronous

Bulk_Synchronous_1

Synchronization Cost

Question: What if a node fails and then restart?

  • This node will be much slower than all the others.
  • It is called straggler.
  • Straggler effect:
    • The wall-clock time is determined by the slowest node.
    • It is a consequence of synchronization.

Footnote