spark-user mailing list archives

From Puneet Lakhina <>
Subject Text file and shuffle
Date Sun, 18 May 2014 02:41:59 GMT

I'm new to Spark and I wanted to understand a few things conceptually so that I can optimize
my Spark job. I have a large text file (~14G, 200k lines) that is available on each worker
node of my Spark cluster. The job calls sc.textFile(...).flatMap(...); the function I pass
into flatMap splits each line of the file into a key and a value. I have another text file
that is smaller in size (~1.5G) but has many more lines, because it has more than one value
per key, spread across multiple lines. I call the same textFile and flatMap functions on the
other file and then call groupByKey so that all values for a key are available as a list.
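For concreteness, the two ingestion steps can be emulated in plain Python to show the intended semantics (the tab-separated line format and the sample data here are assumptions, not the actual files; the real job uses sc.textFile and flatMap):

```python
from collections import defaultdict

# Hypothetical line format: "key<TAB>value" in both files.
large_lines = ["k1\tv1", "k2\tv2"]          # stands in for the ~14G file
small_lines = ["k1\ta", "k1\tb", "k2\tc"]   # stands in for the ~1.5G file

def parse(line):
    """flatMap-style parser: emit zero or one (key, value) pair per line."""
    parts = line.split("\t", 1)
    return [(parts[0], parts[1])] if len(parts) == 2 else []

# flatMap equivalent over the large file: one (key, value) pair per line.
large_pairs = [kv for line in large_lines for kv in parse(line)]

# groupByKey equivalent over the small file: collect all values per key.
grouped = defaultdict(list)
for k, v in (kv for line in small_lines for kv in parse(line)):
    grouped[k].append(v)

print(large_pairs)    # [('k1', 'v1'), ('k2', 'v2')]
print(dict(grouped))  # {'k1': ['a', 'b'], 'k2': ['c']}
```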

Having done this, I then cogroup these two RDDs. I have the following questions:

1. Is this sequence of steps the best way to achieve what I want, i.e. a join across the two
data sets?
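As a point of reference for what cogroup produces: for each key, it pairs up the full collection of values from each side. A minimal sketch of that semantics in plain Python (toy data, not the actual files):

```python
from collections import defaultdict

left = [("k1", "v1"), ("k2", "v2")]            # pairs from the large file
right = [("k1", ["a", "b"]), ("k2", ["c"])]    # grouped pairs from the small file

def cogroup(a, b):
    """For each key, return (values-from-a, values-from-b), like RDD.cogroup."""
    out = defaultdict(lambda: ([], []))
    for k, v in a:
        out[k][0].append(v)
    for k, v in b:
        out[k][1].append(v)
    return dict(out)

print(cogroup(left, right))
# {'k1': (['v1'], [['a', 'b']]), 'k2': (['v2'], [['c']])}
```

Note that Spark's join is itself defined in terms of cogroup, producing one (k, (v, w)) pair per matching value pair; whether the intermediate groupByKey is needed depends on whether the values really need to be materialized as a list first.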

2. I have an 8-node cluster (25 GB memory each). The large file's flatMap spawns about 400-odd
tasks, whereas the small file's flatMap spawns only about 30-odd. The large file's flatMap
takes about 2-3 minutes, and during this time it appears to do about 3 GB of shuffle write.
I want to understand whether this shuffle write is something I can avoid. From what I have
read, the shuffle write is a disk write. Is that correct? And is the reason for the shuffle
write that the partitioner ends up having to redistribute the data across the cluster?
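The redistribution in question can be illustrated with hash partitioning, a simplification of what Spark's default HashPartitioner does (the record data and partition count below are made up for the sketch):

```python
# Each (key, value) record is routed to partition hash(key) % num_partitions.
# Records for the same key that were produced on different nodes must end up
# in the same partition, so each node writes its outgoing records to local
# disk (the "shuffle write") before the next stage fetches them over the network.
num_partitions = 4
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k3", 4)]

def partition_of(key, n):
    return hash(key) % n

by_partition = {}
for k, v in records:
    by_partition.setdefault(partition_of(k, num_partitions), []).append((k, v))

# All records for "k1" land in the same partition, regardless of origin node.
k1_partitions = {partition_of(k, num_partitions) for k, v in records if k == "k1"}
print(len(k1_partitions))  # 1
```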

Please let me know if I haven't provided enough information. I'm new to Spark, so if you see
anything fundamental that I don't understand, please feel free to just point me to a link
with more detailed information.
