spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)
Date Fri, 15 Jul 2016 23:01:54 GMT
I am not sure I fully understand your use case, but for my Hadoop/Spark input format that reads
the Bitcoin blockchain I extend FileInputFormat and use the default split mechanism.
This can mean that a split boundary falls in the middle of a Bitcoin block, which is not a problem:
the reader for the first split can read beyond the split's nominal end (in which case the remaining
data may be fetched from a remote node), and the reader for the second split seeks forward to the
start of the next block.
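
To illustrate the boundary rule, here is a minimal Python sketch (not my actual Hadoop code). It
assumes a simplified, hypothetical record layout of a 4-byte marker, a 4-byte little-endian length
and then the payload, and it assumes the marker never occurs inside a payload. The names
encode_records and read_split are made up for this example. A record belongs to the split in which
its first byte lies, so a reader may read past its split's end but never starts a record there.

```python
import struct

# Bitcoin mainnet magic bytes, used here only as a sync marker.
MAGIC = b"\xf9\xbe\xb4\xd9"


def encode_records(payloads):
    """Build a byte string of marker + length + payload records (hypothetical layout)."""
    return b"".join(MAGIC + struct.pack("<I", len(p)) + p for p in payloads)


def read_split(data, start, end):
    """Return the payloads of all records whose first byte lies in [start, end).

    Assumes MAGIC never appears inside a payload; a real reader would have
    to validate each candidate record instead of trusting the marker.
    """
    records = []
    # Seek forward to the first record that starts at or after `start`.
    pos = data.find(MAGIC, start)
    while pos != -1 and pos < end:
        (length,) = struct.unpack_from("<I", data, pos + 4)
        body_start = pos + 8
        # The payload may extend beyond `end`; it is still read in full.
        records.append(data[body_start:body_start + length])
        pos = data.find(MAGIC, body_start + length)
    return records


payloads = [b"a" * 10, b"b" * 25, b"c" * 7, b"d" * 40]
data = encode_records(payloads)

# Fixed-size splits that ignore record boundaries, like the default mechanism.
split_size = 32
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]
recovered = [rec for s, e in splits for rec in read_split(data, s, e)]
```

With this rule every record is read exactly once, no matter where the split boundaries fall.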

However, the main difference from your case is that my blocks are of similar size. Your
block size can vary a lot, which means that one task could be busy with a small block while
another works on a very big one, so parallel processing might be suboptimal. Here it
depends on what you plan to do with the blocks afterwards.
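
One way to limit that imbalance is to cap the size of each data split, as you suggest in (1). The
following Python sketch only shows the offset arithmetic. It assumes the sub-header and data
lengths have already been parsed from the headers; Segment, Split and plan_splits are hypothetical
names, not Hadoop classes. Small segments still produce small splits, so this bounds the largest
task without evening out the small ones.

```python
from typing import List, NamedTuple

TILE = 1024 ** 2        # record (tile) size from the original post
MAX_SPLIT = 128 * TILE  # 128 MB cap, a whole number of tiles


class Segment(NamedTuple):
    name: str
    subheader_len: int  # bytes, assumed to be known from the parsed headers
    data_len: int       # bytes, assumed to be known from the parsed headers


class Split(NamedTuple):
    kind: str     # "header" or "data"
    segment: str
    offset: int   # absolute byte offset in the file
    length: int


def plan_splits(file_header_len: int, segments: List[Segment],
                max_split: int = MAX_SPLIT) -> List[Split]:
    """One split per header or sub-header, and data splits of at most max_split bytes."""
    splits = [Split("header", "file", 0, file_header_len)]
    pos = file_header_len
    for seg in segments:
        splits.append(Split("header", seg.name, pos, seg.subheader_len))
        pos += seg.subheader_len
        done = 0
        while done < seg.data_len:
            length = min(max_split, seg.data_len - done)
            splits.append(Split("data", seg.name, pos + done, length))
            done += length
        pos += seg.data_len
    return splits


# Made-up sizes, only for illustration.
segments = [
    Segment("A-1", 2048, 10 * 1024),
    Segment("A-2", 4096, 300 * TILE),
    Segment("B-1", 1024, 5 * TILE + 123),
]
plan = plan_splits(8192, segments)
```

Because the cap is a multiple of the tile size, a record reader working tile by tile inside a data
split only meets a partial tile at the end of a segment.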

> On 15 Jul 2016, at 19:31, jtgenesis <jtgenesis@gmail.com> wrote:
> 
> I'm working with a single image file that consists of headers and a multitude
> of different data segment types (each data segment having its own
> sub-header that contains metadata). I'm currently using Hadoop's HDFS.
> 
> Example file layout:
> 
> | Header | Seg A-1 Sub-Header | Seg A-1 Data | Seg A-2 SubHdr | Seg A-2 Data
> | Seg B-1 Subhdr | Seg B-1 Data | Seg C-1 SubHdr | Seg C-1 Data | etc....
> 
> The headers will vary from 1-10 KB in size and each Data segment size will
> vary anywhere from 10KB - 10GB. The headers are represented as characters
> and the data is represented as binary. The headers include some useful
> information like number of segments, size of subheaders and segment data
> (I'll need this to create my splits).
> 
> To digest it all, I'm wondering if it's best to create a custom InputFormat
> inheriting from (1) FileInputFormat or (2) SequenceFileInputFormat.
> 
> If I go with (1), I will create HeaderSplits and DataSplits (each data split
> will be equivalent to the 128 MB block size). I would also create a custom
> RecordReader for the DataSplits, where each record is one tile of
> 1024^2 bytes. In the record reader, I will just read one tile at a
> time. For the headers, each split will contain one record.
> 
> If I go with (2), I believe the bulk of my work would be in converting my
> image file to a SequenceFile. I would create a key/value pair for each
> header/subheader, and a key/value pair for every 1024^2 bytes of my segment data.
> Once I do that, I would have to create a custom SequenceFileInputFormat that
> will also split the headers from the partitioned data segments. I read that
> SequenceFiles are great for dealing with the "large # of small files"
> problem, but I'm dealing with just 1 image file (although with possibly many
> different data segments).
> 
> I also noticed that SequenceFileInputFormat uses FileInputFormat's getSplits
> implementation. I'm assuming I would have to modify it to get the kinds of
> splits that I want. (Extract the Header key/value pair and parse/extract
> size info, etc).
> 
> Is one approach better than the other? I feel (1) would be a simpler task,
> but (2) has a lot of nice features. Is there a better way? 
> 
> This is probably more of a Hadoop question, but I was curious if anyone had
> experience with this.
> 
> Thank you in advance!
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-InputFormat-SequenceFileInputFormat-vs-FileInputFormat-tp27344.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 


