spark-user mailing list archives

From Or Raz <>
Subject Dataflow of Spark/Hadoop in steps
Date Sun, 23 Oct 2016 11:00:01 GMT
Suppose I have 100 GB of data and I want to find the most common word: what actually happens in my cluster (let's say one master node and 6 workers), step by step? (1)
What does the master do (2)? Does it start the MapReduce job, monitor the traffic, and return the result? The same goes for the mappers and reducers: what exactly do they do, and can they run on different nodes/workers? (3)
Do the reducers always wait for all the mappers to finish before they start? (4) And who combines/attaches the final output?
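To make the steps I am asking about concrete, here is a plain-Python sketch (my own illustration, not Spark or Hadoop API code) of what I understand the map and shuffle phases to do; all function names here are mine:

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in its input split."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, num_reducers):
    """Shuffle: route each key by hash so that every occurrence of the
    same word ends up at the same reducer."""
    partitions = defaultdict(list)
    for word, count in pairs:
        partitions[hash(word) % num_reducers].append((word, count))
    return partitions

# One input split; in a real cluster each worker would map its own split.
split = ["the dog ate the banana", "the dog ran"]
partitions = shuffle(map_phase(split), num_reducers=4)
```

In a real cluster the splits, mappers, and reducers live on different machines; this sketch only shows the data flow I have in mind.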

For example:

This is the input to the reducers, as tuples (I chose an easy example where each word appears at only one reducer, i.e. the shuffle step has already been done correctly):

reducer 1: {dog,1} ,{banana,1},{oreo,6} 
reducer 2: {peach,1} ,{mesut,5},{ozil,10} 
reducer 3: {I,4} ,{witch,2} 
reducer 4: {fear,1} ,{goal,6},{arsenal,3}

The output of each reducer should be:

reducer 1: {oreo,6} 
reducer 2: {ozil,10} 
reducer 3: {I,4} 
reducer 4: {goal,6}
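What I picture each reducer doing with its partition, sketched in plain Python (my own illustrative code, not a framework API): first sum the counts per word, then keep the local maximum.

```python
from collections import Counter

def reduce_phase(pairs):
    """Reducer: sum the counts for each of its keys, then keep the
    key with the largest total (the local 'most common word')."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return max(totals.items(), key=lambda kv: kv[1])

reducer_1 = [("dog", 1), ("banana", 1), ("oreo", 6)]
print(reduce_phase(reducer_1))  # -> ('oreo', 6)
```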

Now we need to combine the results: to whom do we send the output, and who does the final sort and aggregation (the master?) (5)? And in these steps and the ones before, where are the I/O calls? (6) (When is the data stored on local disk, and when on ...)
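For question (5), the mental model I want confirmed is that the per-reducer maxima are tiny, so a single process (the master/driver, in my understanding) can collect and combine them; a minimal sketch, using the example values above:

```python
# Per-reducer local maxima from the example above.
local_maxima = [("oreo", 6), ("ozil", 10), ("I", 4), ("goal", 6)]

# The final combine is trivial: take the global maximum over a
# handful of pairs, one per reducer.
most_common = max(local_maxima, key=lambda kv: kv[1])
print(most_common)  # -> ('ozil', 10)
```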

In addition, in Hadoop, as far as I know, we need to deploy the Map and Reduce functions to the matching workers. Can we change the functions at run time? (7) And if we have finished a MapReduce job and want to run another pass (go over our results), do we have to split the data again and send it to each worker?
P.S. I have numbered the questions so it is clear where each one is.

Any further comments or pointers would be appreciated :)
