spark-user mailing list archives

From jtgenesis <jtgene...@gmail.com>
Subject Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)
Date Fri, 15 Jul 2016 17:31:44 GMT
I'm working with a single image file that consists of headers and many
different data segment types (each data segment has its own sub-header
containing metadata). The file is currently stored in HDFS.

Example file layout:

| Header | Seg A-1 SubHdr | Seg A-1 Data | Seg A-2 SubHdr | Seg A-2 Data
| Seg B-1 SubHdr | Seg B-1 Data | Seg C-1 SubHdr | Seg C-1 Data | etc...

The headers vary from 1 KB to 10 KB in size, and each data segment varies
anywhere from 10 KB to 10 GB. The headers are stored as character data and
the segment data as binary. The headers include useful information such as
the number of segments and the sizes of the sub-headers and segment data
(I'll need this to create my splits).
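
To make the layout concrete, here is a rough sketch in plain Python (not Hadoop API code) of how byte offsets could be derived once the sizes have been read from the headers. The Segment class, the layout_segments function, and the example sizes are all hypothetical; the real header format is not described here, so parsing it is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """Byte ranges of one segment inside the image file (hypothetical model)."""
    name: str
    subheader_offset: int
    subheader_size: int
    data_offset: int
    data_size: int


def layout_segments(header_size: int,
                    sizes: List[Tuple[str, int, int]]) -> List[Segment]:
    """Turn (name, subheader_size, data_size) triples into absolute offsets.

    Assumes segments are stored back to back right after the main header,
    as in the example file layout.
    """
    segments = []
    pos = header_size  # first sub-header starts where the main header ends
    for name, sub_size, data_size in sizes:
        segments.append(Segment(name, pos, sub_size, pos + sub_size, data_size))
        pos += sub_size + data_size
    return segments


# Made-up sizes, only to show the arithmetic.
example = layout_segments(4096, [("A-1", 512, 1000), ("A-2", 256, 2000)])
```

With offsets like these in hand, both approaches below reduce to deciding how the data ranges are cut into splits and records.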

To digest it all, I'm wondering if it's best to create a custom InputFormat
inheriting from (1) FileInputFormat or (2) SequenceFileInputFormat.

If I go with (1), I will create HeaderSplits and DataSplits (DataSplits
will match the 128 MB block size). I would also write a custom
RecordReader for the DataSplits, where each record is one tile of
1024^2 bytes, so the record reader just reads one tile at a time.
For the headers, each split will contain a single record.
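
The split and record arithmetic for option (1) can be sketched in plain Python as follows. This only models the byte ranges; it is not a FileInputFormat or RecordReader implementation, and the function names are hypothetical. It assumes splits start at the beginning of a segment's data rather than on HDFS block boundaries, so a split may span two blocks.

```python
from typing import Iterator, List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB data split, as proposed above
TILE_SIZE = 1024 ** 2           # one record = one tile of 1024^2 bytes


def data_splits(data_offset: int, data_size: int,
                split_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Cut one segment's data range into (start, length) splits."""
    splits = []
    pos = data_offset
    end = data_offset + data_size
    while pos < end:
        length = min(split_size, end - pos)  # last split may be short
        splits.append((pos, length))
        pos += length
    return splits


def tile_records(split_start: int, split_length: int,
                 tile_size: int = TILE_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield the (start, length) of each tile a record reader would read."""
    pos = split_start
    end = split_start + split_length
    while pos < end:
        length = min(tile_size, end - pos)  # last tile may be short
        yield (pos, length)
        pos += length
```

Because 128 MB is a whole multiple of the tile size, no tile straddles two splits of the same segment; only the last tile of a segment can be shorter than 1024^2 bytes.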

If I go with (2), I believe the bulk of my work would be in converting my
image file to a SequenceFile. I would create a key/value pair for each
header/sub-header, and a key/value pair for every 1024^2 bytes of segment
data. Once I do that, I would have to create a custom SequenceFileInputFormat
that also separates the headers from the partitioned data segments. I read
that SequenceFiles are great for dealing with the "large # of small files"
problem, but I'm dealing with just 1 image file (although with possibly many
different data segments).
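
To give a feel for what the conversion in option (2) would produce, here is a plain-Python sketch that enumerates one key per tile for a segment. The key scheme ("<segment>/tile-<index>") and the tile_keys function are assumptions made for illustration only; nothing here writes an actual SequenceFile.

```python
from typing import Iterator, Tuple

TILE_SIZE = 1024 ** 2  # value payload per key/value pair, as described above


def tile_keys(segment_name: str, data_size: int,
              tile_size: int = TILE_SIZE) -> Iterator[Tuple[str, int, int]]:
    """Yield (key, offset_in_segment, length) for each tile-sized value."""
    tile_count = -(-data_size // tile_size)  # ceiling division
    for i in range(tile_count):
        offset = i * tile_size
        length = min(tile_size, data_size - offset)  # last value may be short
        yield (f"{segment_name}/tile-{i:06d}", offset, length)


# A 10 GiB segment (taking "10GB" as 10 * 1024^3 bytes) becomes 10240 pairs.
pairs_in_largest_segment = sum(1 for _ in tile_keys("B-1", 10 * 1024 ** 3))
```

So the largest segments would turn into on the order of ten thousand key/value pairs each, plus one pair per header or sub-header.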

I also noticed that SequenceFileInputFormat uses FileInputFormat's getSplits
implementation. I'm assuming I would have to modify it to get the kinds of
splits that I want (extract the header key/value pair, parse out the
size info, etc.).

Is one approach better than the other? I feel (1) would be a simpler task,
but (2) has a lot of nice features. Is there a better way? 

This is probably more of a Hadoop question, but I was curious whether anyone
has experience with this.

Thank you in advance!




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-InputFormat-SequenceFileInputFormat-vs-FileInputFormat-tp27344.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
