spark-user mailing list archives

From Guillermo Ortiz <>
Subject Define size partitions
Date Fri, 30 Jan 2015 14:50:42 GMT

I want to process some files; they're kind of big, dozens of
gigabytes each. I read them as an array of bytes, and there's a
structure inside them.

I have a header which describes the structure. It could be like:
Number (8 bytes), Char (16 bytes), Number (4 bytes), Char (1 byte), ...
This structure repeats N times in the file.

So I know the size of each block, since it's fixed. There's no
separator between one block and the next.
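To make the layout concrete, here is a small sketch using Python's standard `struct` module. The field widths match the header described above; the exact format string (`RECORD_FMT`) and the 1000-records-per-split figure are just illustrative assumptions.

```python
import struct

# Hypothetical layout matching the header above:
# Number (8 bytes), Char (16 bytes), Number (4 bytes), Char (1 byte).
# '<' disables alignment padding so calcsize equals the on-disk size.
RECORD_FMT = "<q16si1s"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 8 + 16 + 4 + 1 = 29 bytes

def parse_record(block: bytes):
    """Unpack one fixed-size block into its four fields."""
    return struct.unpack(RECORD_FMT, block)

# A split sized to hold exactly 1000 complete records, so no record
# is cut in half at a split boundary (the blockX1000 idea below).
SPLIT_SIZE = RECORD_SIZE * 1000
```

Because every block is exactly `RECORD_SIZE` bytes with no separator, any split whose length is a multiple of `RECORD_SIZE` contains only whole records.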

If I were doing this with MapReduce, I could implement a new
RecordReader and InputFormat to read each block, because I know their
size, and I'd fix the split size in the driver (blockX1000, for
example). That way I'd know that each split given to each mapper
contains only complete blocks, with no piece of the last block
spilling into the next split.

Spark works with RDDs and partitions. How could I resize each
partition to do that? Is it possible? I guess that Spark doesn't use
the RecordReader and those classes for these tasks.
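For what it's worth, Spark can in fact reuse Hadoop InputFormats (via `newAPIHadoopFile`), and for this exact fixed-length case Spark 1.2+ exposes `SparkContext.binaryRecords`, which partitions a flat binary file on record boundaries. A minimal PySpark sketch, assuming that API and a hypothetical HDFS path:

```python
from pyspark import SparkContext

RECORD_SIZE = 29  # 8 + 16 + 4 + 1 bytes, per the header above

sc = SparkContext(appName="fixed-records")  # hypothetical app name

# Sketch, assuming Spark >= 1.2: binaryRecords splits the file into
# fixed-length records, so each RDD element is exactly one 29-byte
# record and no record straddles a partition boundary.
records = sc.binaryRecords("hdfs:///path/to/file", recordLength=RECORD_SIZE)
```

Note that a plain `repartition` would not help here, since it redistributes existing elements rather than controlling where the raw byte stream is cut.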

