spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Define size partitions
Date Fri, 30 Jan 2015 19:22:43 GMT
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.

[1] http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
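A minimal sketch of what that could look like, assuming the record layout you described (the field sizes, struct format, and HDFS path here are just placeholders for illustration):

```python
import struct

# Record layout from the header described in the question (sizes assumed):
# Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte) -> 29 bytes
RECORD_FMT = ">q16sic"                     # big-endian, no padding
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 29 bytes per record

def parse(raw):
    # Unpack one fixed-size record into its fields
    num1, text, num2, flag = struct.unpack(RECORD_FMT, raw)
    return (num1, text.rstrip(b"\x00"), num2, flag)

if __name__ == "__main__":
    from pyspark import SparkContext
    sc = SparkContext(appName="FixedSizeRecords")

    # binaryRecords hands you each record as exactly RECORD_SIZE bytes,
    # so no record is ever split across a partition boundary.
    records = sc.binaryRecords("hdfs:///path/to/data.bin", RECORD_SIZE)
    parsed = records.map(parse)
```

Since binaryRecords guarantees whole records, you don't need a custom InputFormat for this; the framework does the alignment for you.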

Davies

On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> Hi,
>
> I want to process some files; they're kind of big, dozens of
> gigabytes each. I get them as an array of bytes, and there's a
> structure inside them.
>
> I have a header which describes the structure. It could be like:
> Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ......
> This structure appears N times in the file.
>
> So I know the size of each block, since it's fixed. There's no
> separator between one block and the next.
>
> If I were doing this with MapReduce, I could implement a new
> RecordReader and InputFormat to read each block, because I know the
> size of the blocks, and I'd set the split size in the driver (blockX1000
> for example). That way, each split for each mapper would contain
> complete blocks, with no piece of the last block spilling into the
> next split.
>
> Spark works with RDDs and partitions. How could I resize each
> partition to do that? Is it possible? I guess that Spark doesn't use
> the RecordReader and those classes for these tasks.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

