spark-user mailing list archives

From Michael Albert <>
Subject Reading one partition at a time
Date Mon, 05 Jan 2015 00:30:53 GMT

I would like to know whether the code below will read "one-partition-at-a-time",
and whether I am reinventing the wheel.

To explain: upstream code has (I hope) saved an RDD such that each partition
file (e.g., part-r-00000, part-r-00001) contains exactly the data subset that
I would like to repackage into a file of a non-Hadoop format. So what I want
to do is something like "mapPartitionsWithIndex" on this data (which is stored
in sequence files, SNAPPY compressed). However, if I simply open the data set
with "sequenceFile()", the data is re-partitioned and I lose the partitioning
I want. My intention is that, in the closure passed to mapPartitionsWithIndex,
I'll open an HDFS file and write the data from that partition in my desired
format, one output file per input partition.

The code below seems to work, I think. Have I missed something bad?


    // Prevent Hadoop from splitting each sequence file, so that one input
    // file maps to exactly one partition.
    class NonSplittingSequenceFileInputFormat[K, V]
        //extends org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat[K,V] // XXX
        extends org.apache.hadoop.mapred.SequenceFileInputFormat[K, V] {
      // The new-API (mapreduce) signature would instead be:
      //   override def isSplitable(
      //       context: org.apache.hadoop.mapreduce.JobContext,
      //       path: org.apache.hadoop.fs.Path) = false
      override protected def isSplitable(
          fs: org.apache.hadoop.fs.FileSystem,
          filename: org.apache.hadoop.fs.Path) = false
    }


    classOf[NonSplittingSequenceFileInputFormat[K, V]],
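For context, the fragment above would presumably sit inside a `sc.hadoopFile`
call. A minimal sketch of how the whole pipeline might fit together follows;
the names `sc`, `inPath`, `outDir`, and the `NullWritable`/`BytesWritable`
key and value types are assumptions for illustration, not from the original
post:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, NullWritable}

// Read the sequence files with the non-splitting format, so each
// part-r-NNNNN file becomes exactly one partition.
val rdd = sc.hadoopFile(
  inPath,
  classOf[NonSplittingSequenceFileInputFormat[NullWritable, BytesWritable]],
  classOf[NullWritable],
  classOf[BytesWritable])

rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Inside the closure: open one HDFS output file per input partition
  // and write the records in the desired (non-Hadoop) format.
  // A fresh Configuration is created here because the driver's
  // sc.hadoopConfiguration is not available inside the executor closure.
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(s"$outDir/part-$idx"))
  try {
    iter.foreach { case (_, v) => out.write(v.copyBytes()) }
  } finally {
    out.close()
  }
  Iterator.empty
}.count() // an action is needed to force the writes to happen
```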
