hadoop-mapreduce-dev mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: How to handle multiline record for inputsplit?
Date Tue, 21 May 2013 06:28:29 GMT
If your records span a variable number of lines, then the newline character
logically cannot serve as your "record delimiter". Use the character or byte
sequence that actually marks a record boundary, and read based on that.
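
For example, if each record ends with a known marker (I'll use a hypothetical
"##END##" followed by a newline), recent Hadoop releases let you keep the
stock TextInputFormat and simply override the delimiter through the
textinputformat.record.delimiter property. A minimal driver sketch under
those assumptions:

// Sketch: treat "##END##\n" (an assumed marker, not something Hadoop defines)
// as the record delimiter, so each map input value is one whole multi-line record.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiLineJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Everything up to (and excluding) the marker is handed to the mapper as
    // one value, no matter how many newlines it contains.
    conf.set("textinputformat.record.delimiter", "##END##\n");

    Job job = Job.getInstance(conf, "multi-line records");
    job.setJarByClass(MultiLineJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Mapper/Reducer setup omitted; with the defaults this just re-emits records.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}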

The same logic described at
http://wiki.apache.org/hadoop/HadoopMapReduce for newline-delimited records
applies to your files as well.
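
Concretely, the boundary handling from that page carries over to a custom
delimiter like this: a reader whose split does not start at byte 0 discards
everything up to the first delimiter (that partial record belongs to the
previous split), and every reader keeps reading past its split's end to
finish the record it started, so no record is ever cut in half. A rough
sketch of such a RecordReader, assuming the same hypothetical "##END##\n"
marker, uncompressed input, and a Hadoop release whose LineReader accepts a
custom delimiter:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
  private static final byte[] DELIMITER = "##END##\n".getBytes(); // assumed marker

  private long start, pos, end;
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();
    Path file = split.getPath();
    FSDataInputStream fileIn = file.getFileSystem(conf).open(file);
    fileIn.seek(start);
    in = new LineReader(fileIn, conf, DELIMITER);
    pos = start;
    // Not the first split: the bytes before the first delimiter are the tail
    // of a record owned by the previous split, so throw them away.
    if (start != 0) {
      pos += in.readLine(new Text());
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Read records that *start* at or before the split's end; the last one is
    // allowed to run past 'end', which is how straddling records stay whole.
    if (pos > end) {
      return false;
    }
    key.set(pos);
    int consumed = in.readLine(value);
    if (consumed == 0) {
      return false; // end of file
    }
    pos += consumed;
    return true;
  }

  @Override
  public LongWritable getCurrentKey() { return key; }

  @Override
  public Text getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    return (end == start) ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }

  @Override
  public void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }
}

You would return this from the createRecordReader() of a small FileInputFormat
subclass; the framework handles the split generation itself.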


On Tue, May 21, 2013 at 11:37 AM, Darpan R <darpanbe@gmail.com> wrote:

> Hi folks,
> I have a huge text file (terabytes in size) with multi-line records, and we
> are not told how many lines each record occupies. One record may take 5
> lines, another 6, another 4; the number of lines varies from record to
> record.
> Since we cannot use the default TextInputFormat, we have written our own
> InputFormat and a custom RecordReader, but the confusion is:
>
> "When splits happen, there is no guarantee that each split will contain a
> full record. Part of a record can end up in split 1 and the rest in split 2."
> This is not what we want.
>
> So, can anyone suggest how to handle this scenario so that we can guarantee
> that one full record goes into a single InputSplit?
> Any workaround or hint would be really helpful.
>
> Thanks in advance.
>  DR
>



-- 
Harsh J
