hadoop-mapreduce-dev mailing list archives

From "cloudyarea (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6598) LineReader enhancement to support text records containing "\n"
Date Tue, 05 Jan 2016 11:25:39 GMT
cloudyarea created MAPREDUCE-6598:
-------------------------------------

             Summary: LineReader enhancement to support text records containing "\n"
                 Key: MAPREDUCE-6598
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6598
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv2
    Affects Versions: 2.6.0
         Environment: RHEL 7, Spark 1.3.1, Hadoop 2.6.0
            Reporter: cloudyarea
            Priority: Minor


We have billions of XML message records stored in text files that need to be parsed in
parallel by Spark. By default, Spark opens a Hadoop text file using LineReader, which
provides a single line of text as a record.

The XML messages contain "\n", and I believe this is a common scenario: many users have
records that span multiple lines. Currently, the solution is to extend the interface RecordReader.

To reduce repeated work, I wrote a class named MessageRecordReader that extends the interface
RecordReader. The user can set a string as the record delimiter, and MessageRecordReader then
provides a multi-line record to the user.
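
To illustrate the delimiter-based idea, here is a minimal Java sketch. It is not the
MessageRecordReader mentioned above: the class name DelimitedRecordReader and its methods
are hypothetical, it reads from a plain java.io.Reader rather than implementing Hadoop's
RecordReader, and it ignores input splits, which a real Hadoop reader would have to handle
for records that cross split boundaries.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: splits a character stream into records on a
 * user-supplied delimiter string, so a record may contain "\n".
 */
final class DelimitedRecordReader {
    private final Reader in;
    private final String delimiter;

    DelimitedRecordReader(Reader in, String delimiter) {
        if (delimiter == null || delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must be non-empty");
        }
        this.in = in;
        this.delimiter = delimiter;
    }

    /** Returns the next record without its delimiter, or null at end of input. */
    String next() throws IOException {
        StringBuilder record = new StringBuilder();
        boolean readAny = false;
        int c;
        while ((c = in.read()) != -1) {
            readAny = true;
            record.append((char) c);
            if (endsWith(record, delimiter)) {
                // Drop the trailing delimiter and hand back the record.
                record.setLength(record.length() - delimiter.length());
                return record.toString();
            }
        }
        // End of input: return a final unterminated record, if there is one.
        return readAny ? record.toString() : null;
    }

    /** True if the buffer currently ends with the given suffix. */
    private static boolean endsWith(StringBuilder sb, String suffix) {
        int offset = sb.length() - suffix.length();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (sb.charAt(offset + i) != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Convenience helper: reads every record from an in-memory string. */
    static List<String> readAll(String text, String delimiter) throws IOException {
        DelimitedRecordReader reader =
                new DelimitedRecordReader(new StringReader(text), delimiter);
        List<String> records = new ArrayList<>();
        String record;
        while ((record = reader.next()) != null) {
            records.add(record);
        }
        return records;
    }
}
```

For example, DelimitedRecordReader.readAll(text, "||") would return each "||"-terminated
message as one record, with any embedded "\n" characters preserved. The "||" delimiter is
only an assumption for the example; the user would pick a string that cannot occur inside
a message.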

I would like to contribute the code to the community. Please let me know if you are interested
in this simple but useful implementation.

Thank you very much and happy new year!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
