hadoop-mapreduce-dev mailing list archives

From "cloudyarea (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6598) LineReader enhancement to support text records containing "\n"
Date Tue, 05 Jan 2016 11:25:39 GMT
cloudyarea created MAPREDUCE-6598:

             Summary: LineReader enhancement to support text records containing "\n"
                 Key: MAPREDUCE-6598
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6598
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv2
    Affects Versions: 2.6.0
         Environment: RHEL 7, Spark 1.3.1, Hadoop 2.6.0
            Reporter: cloudyarea
            Priority: Minor

We have billions of XML message records stored in text files that need to be parsed in parallel
by Spark. By default, Spark opens a Hadoop text file using LineReader, which provides a single
line of text as a record.

The XML messages contain "\n", and I believe this is a common scenario: many users have records
that span multiple lines. Currently, the solution is to extend the interface RecordReader.

To reduce this repeated work, I wrote a class named MessageRecordReader that extends the
interface RecordReader. The user can set a string as the record delimiter, and
MessageRecordReader then provides each multi-line record to the user as a single record.
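
The delimiter-based idea described above can be illustrated with a minimal sketch in plain
Java. This is not the MessageRecordReader from this issue and it does not use the Hadoop
RecordReader API; the class name DelimitedRecordSketch, the method readRecords, and the "||"
delimiter in the usage note are all hypothetical. The sketch also ignores input splits,
which a real Hadoop reader would have to handle at split boundaries.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: splits a character stream into records
// on a user-supplied delimiter string instead of on "\n".
class DelimitedRecordSketch {

    // Returns the records found in `in`, with the delimiter removed.
    // A trailing record without a closing delimiter is returned as well.
    static List<String> readRecords(Reader in, String delimiter) throws IOException {
        if (delimiter == null || delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must be non-empty");
        }
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            current.append((char) c);
            // A record is complete once the buffer ends with the delimiter.
            if (endsWith(current, delimiter)) {
                current.setLength(current.length() - delimiter.length());
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Emit whatever is left after the last delimiter, if anything.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    // True if the last suffix.length() characters of sb equal suffix.
    private static boolean endsWith(StringBuilder sb, String suffix) {
        int start = sb.length() - suffix.length();
        return start >= 0 && sb.indexOf(suffix, start) == start;
    }
}
```

For example, calling readRecords on a StringReader over two XML messages separated by "||"
would return two records, each keeping its internal "\n" characters, whereas a line-based
reader would return one record per line.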

I would like to contribute the code to the community. Please let me know if you are interested
in this simple but useful implementation.

Thank you very much and happy new year!

This message was sent by Atlassian JIRA
