spark-user mailing list archives

From jritz
Subject Sorting a Sequence File
Date Thu, 02 Oct 2014 19:32:17 GMT

I am having trouble sorting a sequence file.  My sequence file is
(Text, Text), and when I try to sort it, Spark complains that it cannot
because Text is not serializable.  To get around this, I map over the
sequence file to turn it into (String, String), perform the sort, and then
write the result back out to HDFS as a sequence file.

My issue is that this solution does not scale.  I can run this code on a
32GB file without issues.  When I run it on a 500GB file, it runs some of
the data nodes out of physical disk space.  It spills heavily (usually 2-3
times the size of the original data); for example, my 32GB file spills 74GB.
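
As a rough back-of-envelope check on those numbers, here is a minimal Python
sketch that extrapolates the observed spill to the 500GB input.  It assumes
spill grows linearly with input size, which is only an assumption and may not
hold in practice; estimate_spill_gb is a hypothetical helper for illustration,
not part of Spark.

```python
# Back-of-envelope spill estimate.
# Assumption: spill volume grows linearly with input size.

observed_input_gb = 32   # size of the file that sorts without issues
observed_spill_gb = 74   # spill observed for that file

# Ratio of spilled data to input data for the observed run.
spill_ratio = observed_spill_gb / observed_input_gb


def estimate_spill_gb(input_gb, ratio=spill_ratio):
    """Projected spill in GB for a given input size (hypothetical helper)."""
    return input_gb * ratio


projected_spill_gb = estimate_spill_gb(500)

print(f"spill ratio: {spill_ratio:.2f}x")
print(f"projected spill for a 500GB input: {projected_spill_gb:.0f}GB")
```

Under that assumption the 500GB job would spill more than a terabyte across
the cluster, which is in line with data nodes running out of local disk.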

I suspect there is a better way to get the data into a form that sort will
accept.  Is there an alternative to mapping the key and value to Strings?


