hadoop-mapreduce-dev mailing list archives

From Till Schäfer (JIRA) <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6891) TextInputFormat: duplicate records with custom delimiter
Date Mon, 22 May 2017 16:44:04 GMT
Till Schäfer created MAPREDUCE-6891:

             Summary: TextInputFormat: duplicate records with custom delimiter
                 Key: MAPREDUCE-6891
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6891
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.2.0
            Reporter: Till Schäfer

When a custom delimiter is used with TextInputFormat, the resulting records are incorrect under
some circumstances: the total number of records is wrong and some entries
are duplicated.

I have created a reproducible test case: 

Generate a file:
for i in $(seq 1 10000000); do 
  echo -n $i >> long_delimiter-1to10000000-with_newline.txt;
  echo "--------------------------------------------" >> long_delimiter-1to10000000-with_newline.txt;
done


Java test to reproduce the error:
public static void longDelimiterBug(JavaSparkContext sc) {
	Configuration hadoopConf = new Configuration();
	String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
	// Custom record delimiter: the dash run written by the generator plus the trailing newline
	hadoopConf.set("textinputformat.record.delimiter", "--------------------------------------------\n");
	JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(delimitedFile, TextInputFormat.class,
			LongWritable.class, Text.class, hadoopConf);

	// Keep only the record text; the key is the byte offset
	List<String> values = input.map(t -> t._2.toString()).collect();

	Assert.assertEquals(10000000, values.size());
	// Record i (0-based) is expected to contain the number i + 1
	for (int i = 0; i < 10000000; i++) {
		boolean correct = values.get(i).equals(Integer.toString(i + 1));
		if (!correct) {
			logger.error("Wrong value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
		} else {
			logger.info("Correct value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
		}
	}
}

This example fails with the error 
java.lang.AssertionError: expected:<10000000> but was:<10042616>

When the assertion on the size of the collection is commented out, my log output ends like this:

[main] INFO  edu.udo.cs.schaefer.testspark.Main  - Correct value for index 663244: expected 663245 -> got 663245
[main] ERROR edu.udo.cs.schaefer.testspark.Main  - Wrong value for index 663245: expected 663246 -> got 660111

After the wrong value at index 663245, the values are in ascending order again, continuing with
660112, 660113, ....
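
The log pattern above (a correct run, then a jump back to an earlier value) can be located without reading the whole log. The following is a minimal sketch in plain Java, not part of the original test; the class name RecordCheck and both helper methods are hypothetical, and the toy list in main only mimics the reported pattern on a small scale. In the real run the surplus would be expected to match the difference from the assertion error, 10042616 - 10000000 = 42616, assuming no records are missing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for inspecting the collected records; not part of Hadoop or Spark.
class RecordCheck {

    // Index of the first entry that differs from (index + 1), or -1 if none does.
    static int firstMismatch(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            if (!values.get(i).equals(Integer.toString(i + 1))) {
                return i;
            }
        }
        return -1;
    }

    // Number of entries whose value already occurred earlier in the list.
    static int surplusCount(List<String> values) {
        Set<String> seen = new HashSet<>();
        int surplus = 0;
        for (String v : values) {
            if (!seen.add(v)) { // add() returns false for a value seen before
                surplus++;
            }
        }
        return surplus;
    }

    public static void main(String[] args) {
        // Toy data: records 3, 4 and 5 are delivered a second time after 5.
        List<String> values = Arrays.asList("1", "2", "3", "4", "5", "3", "4", "5", "6");
        int i = firstMismatch(values);
        System.out.println("first mismatch at index " + i
                + ": expected " + (i + 1) + " -> got " + values.get(i));
        System.out.println("surplus records: " + surplusCount(values));
    }
}
```

On the toy data this prints a first mismatch at index 5 (expected 6, got 3) and 3 surplus records. Applied to the collected values list from the test, the same two calls would report where the duplication starts and how many records were read more than once; holding ten million strings in a HashSet needs a correspondingly large heap.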

The error is not reproducible with \n as the delimiter, i.e. when no custom delimiter is used.
