spark-issues mailing list archives

From "Ilya Ganelin (JIRA)" <>
Subject [jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
Date Thu, 11 Dec 2014 18:35:13 GMT


Ilya Ganelin commented on SPARK-3533:

I am looking into a solution for this.

> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>                 Key: SPARK-3533
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
> Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear whether it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method, or something similar, to RDDs that makes it easy to save an RDD out to multiple locations at once.
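To make the requested behavior concrete, here is a minimal pure-Python sketch of the semantics a {{saveAsTextFileByKey()}} method would have: group the pairs by key and write each group under its own directory. This is only an illustration of the desired output layout, not the Spark implementation (which does not exist yet); the function name and single part-file per key are assumptions for the sketch.

```python
import os
from collections import defaultdict


def save_as_text_file_by_key(pairs, prefix):
    """Sketch of the requested saveAsTextFileByKey() semantics, without
    Spark: write the values for each key to <prefix>/<key>/part-00000.

    A real RDD implementation would produce one part- file per partition;
    this sketch writes a single part file per key for simplicity.
    Returns a dict mapping each key to the path of its part file.
    """
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)

    written = {}
    for key, values in by_key.items():
        out_dir = os.path.join(prefix, str(key))
        os.makedirs(out_dir, exist_ok=True)
        part_path = os.path.join(out_dir, "part-00000")
        with open(part_path, "w") as f:
            f.write("\n".join(values) + "\n")
        written[key] = part_path
    return written
```

For the example RDD above, calling this with {{prefix='/path/prefix'}} would yield the {{/path/prefix/B}}, {{/path/prefix/F}}, and {{/path/prefix/N}} directories described in the issue.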

This message was sent by Atlassian JIRA

