spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: spark streaming multiple file output paths
Date Thu, 07 Aug 2014 17:04:22 GMT
The problem boils down to how to write an RDD that way. You could use
the Hadoop FileSystem API to write each partition's records directly.

pairRDD.groupByKey().foreachPartition { iterator =>
   iterator.foreach { case (key, values) =>
      // Open an output stream to the destination file:
      //   <base-path>/key/<whatever>
      // Write the values to the file
      // Close the file
   }
}
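Filling in those comments, a minimal sketch might look like the following. This is an illustration, not a drop-in implementation: `basePath` and the `part-<uuid>` file naming are assumptions, records are written one per line via `toString`, and the Hadoop configuration is created fresh on each executor.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

pairRDD.groupByKey().foreachPartition { iterator =>
  // Obtain a FileSystem handle on the executor, once per partition
  val fs = FileSystem.get(new Configuration())
  iterator.foreach { case (key, values) =>
    // basePath and the part-file name are assumptions for this sketch
    val dest = new Path(s"$basePath/$key/part-${java.util.UUID.randomUUID()}")
    val out  = fs.create(dest)
    try {
      values.foreach(v => out.writeBytes(v.toString + "\n"))
    } finally {
      out.close()
    }
  }
}
```

Note that `foreach` (not `map`) is used inside the partition: `map` on an iterator is lazy, so its side effects would never run.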

You can even go fancier by writing to a temp file and then moving the file
to the final location. This tolerates failures in the middle of writing
(saveAsTextFile does this underneath).
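The temp-file-then-move idea could be sketched like this; again the `_tmp` directory layout and path names are assumptions for illustration. A rename within the same HDFS filesystem is atomic, so readers never observe a half-written file.

```scala
import org.apache.hadoop.fs.Path

// Write to a temporary path first; _tmp and the part-file name are
// assumptions for this sketch
val tmp  = new Path(s"$basePath/_tmp/$key/${java.util.UUID.randomUUID()}")
val dest = new Path(s"$basePath/$key/part-${java.util.UUID.randomUUID()}")

val out = fs.create(tmp)
try {
  values.foreach(v => out.writeBytes(v.toString + "\n"))
} finally {
  out.close()
}

// Move the finished file into place; rename is atomic on HDFS
// within a single filesystem, so a crash mid-write leaves only
// debris under _tmp, never a partial file at the destination
fs.mkdirs(dest.getParent)
fs.rename(tmp, dest)
```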

TD


On Thu, Aug 7, 2014 at 8:39 AM, Chen Song <chen.song.82@gmail.com> wrote:

> In Spark Streaming, is there a way to write output to different paths
> based on the partition key? The saveAsTextFiles method writes all output to
> the same directory.
>
> For example, if the partition key has an hour/day column, I want to
> separate the DStream output into different directories by hour/day.
>
> --
> Chen Song
>
>
