hadoop-mapreduce-dev mailing list archives

From Darpan R <darpa...@gmail.com>
Subject How to handle multiline record for inputsplit?
Date Tue, 21 May 2013 06:07:54 GMT
Hi folks,
I have a huge text file, terabytes in size, made up of multiline records. We
are not told how many lines each record spans: one record may be 5 lines,
another 6, another 4. The number of lines can vary from record to record.
Since we cannot use the default TextInputFormat, we have written our own
InputFormat and a custom RecordReader, but our confusion is this:

"When the file is split, there is no guarantee that each split contains only
whole records. Part of a record can land in split 1 and the rest in split 2."
This is not what we want.
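
To make the concern concrete, here is a minimal plain-Java sketch. It uses no
Hadoop classes, and the class and method names (SplitBoundaryDemo, cutPoints)
are made up for illustration. It assumes splits are cut at fixed byte offsets
with no knowledge of where records begin, and that the record contents are
known up front, which is only true for this toy data. It reports the offsets
at which a split boundary lands in the middle of a record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain Java, not a Hadoop InputFormat or RecordReader.
class SplitBoundaryDemo {

    // Returns every split boundary (byte offset) that does NOT coincide with
    // the start of a record, i.e. every place where a record would be cut.
    static List<Long> cutPoints(String[] records, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        // Byte offset at which each record starts in the concatenated file.
        Set<Long> recordStarts = new HashSet<>();
        long offset = 0;
        for (String record : records) {
            recordStarts.add(offset);
            offset += record.getBytes(StandardCharsets.UTF_8).length;
        }
        long fileLength = offset;

        // Assumed split layout: boundaries at splitSize, 2 * splitSize, ...
        List<Long> cuts = new ArrayList<>();
        for (long boundary = splitSize; boundary < fileLength; boundary += splitSize) {
            if (!recordStarts.contains(boundary)) {
                cuts.add(boundary);
            }
        }
        return cuts;
    }

    public static void main(String[] args) {
        // Three toy records of 5, 6 and 4 lines (15, 18 and 12 bytes).
        String[] records = {
            "a1\na2\na3\na4\na5\n",
            "b1\nb2\nb3\nb4\nb5\nb6\n",
            "c1\nc2\nc3\nc4\n"
        };
        System.out.println(cutPoints(records, 15)); // prints [30]
        System.out.println(cutPoints(records, 20)); // prints [20, 40]
    }
}
```

With a split size of 15 bytes the first boundary happens to fall on a record
start, but the second one (offset 30) cuts the 6-line record in two. With 20
bytes both boundaries fall inside a record.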

So, can anyone suggest how to handle this scenario so that one full record is
guaranteed to go into a single InputSplit?
Any workaround or hint would be really useful.

Thanks in advance.
