Thanks, Ben, for your answer. I’ll explore what happens under the hood in a DataFrame.
Regarding the ability to split a large RDD into n RDDs without requiring n passes over the large RDD: can partitionBy() help? If I partition by a key that corresponds to the split criterion (i.e. client ID) and then cache each of those RDDs,
will that lessen the effect of repeated large traversals (since Spark will figure out that, for each smaller RDD, it only needs to traverse a subset of the partitions)?
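To make this concrete, here is a minimal PySpark sketch of the approach I have in mind (the toy records, partition count, and client IDs are placeholders, not my actual job):

    from pyspark import SparkContext

    sc = SparkContext(appName="split-by-client")

    # Toy (client_id, record) pairs standing in for the large RDD
    records = [("c1", {"x": 1}), ("c2", {"x": 2}), ("c1", {"x": 3})]
    big_rdd = sc.parallelize(records)

    # Partition by the split key so each client's rows co-locate, then cache
    partitioned = big_rdd.partitionBy(4).cache()

    # One "smaller RDD" per client, carved out of the cached, partitioned RDD
    client_rdds = {cid: partitioned.filter(lambda kv, cid=cid: kv[0] == cid)
                   for cid in ["c1", "c2"]}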
Hi Maximo —
This is a relatively naive answer, but I would consider structuring the RDD into a DataFrame, then saving the 'splits' with something like df.write.partitionBy('client').parquet(hdfs_path). You could then read a DataFrame from each resulting
Parquet directory and do your per-client work from these. You mention re-using the splits, so this solution might be worth the file-writing time.
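A rough sketch of what I mean, assuming a Spark 2.x-style SparkSession and placeholder names (hdfs_path, the 'client' column, the toy rows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-by-client").getOrCreate()

    # Toy DataFrame standing in for the big one, with a 'client' column
    big_df = spark.createDataFrame(
        [("c1", 1), ("c2", 2), ("c1", 3)], ["client", "value"])

    # Write one Parquet directory per distinct client value
    hdfs_path = "hdfs:///tmp/splits_by_client"  # placeholder path
    big_df.write.partitionBy("client").parquet(hdfs_path, mode="overwrite")

    # Read back a single client's split; partition pruning means only
    # that client's directory is scanned
    c1_df = spark.read.parquet(hdfs_path).where("client = 'c1'")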
Does anyone know of a method that gets a collection of DataFrames — one for each partition, in the partitionBy('client') sense — from a 'big' DataFrame? Basically, the equivalent of writing by partition and creating a DataFrame for each result,
but skipping the HDFS step.
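(The naive in-memory equivalent I can picture is one filter per distinct key on a cached DataFrame, sketched below with the same placeholder names as above; I'm asking whether something better-supported exists.)

    # Continuing from the toy big_df above: cache it, then carve out one
    # filtered DataFrame per distinct client value
    big_df.cache()
    clients = [r["client"] for r in big_df.select("client").distinct().collect()]
    per_client_dfs = {c: big_df.where(big_df["client"] == c) for c in clients}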