spark-user mailing list archives

From chuwiey <ben.fona...@gmail.com>
Subject Re: PySpark, ResultIterable and taking a list and saving it into different parquet files
Date Mon, 23 Mar 2015 16:14:37 GMT
In case anyone wants to learn about my solution for this:
groupByKey is highly inefficient because it shuffles elements between
partitions and requires enough memory on each worker to hold all the
elements of a group. So instead of using groupByKey, I took the flatMap
result and used subtractByKey to build multiple RDDs, each including only
one of the keys I wanted. Now I can iterate over each RDD independently and
write a separate parquet file for each.

I'm thinking of submitting a splitByKeys() pull request that would take an
array of keys and an RDD, and return an array of RDDs, each containing only
one of the keys. Any thoughts on this?
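
To make the idea concrete, here is a rough sketch of what such a helper could
look like in PySpark. The name split_by_keys and its signature are
hypothetical (nothing like it exists in Spark), and it uses one filter() per
key rather than the subtractByKey approach described above, purely because
that is the shortest way to show the intended result. It assumes rdd is a
pair RDD of (key, value) tuples.

```python
def split_by_keys(rdd, keys):
    """Return one filtered RDD per key, in the same order as `keys`.

    Hypothetical helper, not part of Spark. `rdd` is assumed to be a
    pair RDD of (key, value) tuples; only its filter() method is used.
    """
    def only(key):
        # Build the predicate inside a function so each lambda is bound
        # to its own key instead of the last value of a loop variable.
        return rdd.filter(lambda kv: kv[0] == key)

    return [only(key) for key in keys]
```

Each returned RDD is lazy and re-scans the parent when it is evaluated, so
calling rdd.cache() on the parent before splitting is probably worthwhile
when there are many keys. A key with no matching elements yields an empty
RDD.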

Thanks



