spark-user mailing list archives

From: Mark Hamstra <m...@clearstorydata.com>
Subject: Re: Spark incorrectly collects data from Hadoop SequenceFile
Date: Mon, 30 Sep 2013 22:37:41 GMT
https://groups.google.com/forum/?fromgroups=#!searchin/spark-users/reuse/spark-users/ztODmLlUlwc/Se2MTK5IU3EJ


On Mon, Sep 30, 2013 at 3:06 PM, Sergey Parhomenko <sparhomenko@gmail.com> wrote:

> Hi,
>
> We tried to use *JavaPairRDD.sortByKey()* and were not able to get
> correct results. I'm not fully sure whether this is a bug or we are
> using the APIs incorrectly, so I would like to crosscheck on the
> mailing list first. The unit test is attached. Essentially, we create a
> Hadoop sequence file and write different key/value pairs to it. Then we
> use *JavaSparkContext.sequenceFile().collect()* to read the same pairs
> back. The data we get, however, is not the data we wrote - we get the
> same row over and over again. That seems to be caused by the code in
> *HadoopRDD.compute()*, which creates the mutable key and value objects
> once and reuses them for every tuple it iterates over. While this works
> fine if we just need to compute something from the data, it does not
> work if we need to collect some of that data. It fails both with Java
> serialization (*org.apache.hadoop.io.serializer.JavaSerialization*) and
> with the default Hadoop serialization
> (*org.apache.hadoop.io.serializer.WritableSerialization*), as
> demonstrated by the corresponding test methods. For the same reason,
> *JavaPairRDD.sortByKey()* does not work either, which is actually our
> main problem; this is also demonstrated in a separate test method (a
> sketch of a possible workaround follows below this message).
>
> If this is indeed a bug, we can raise an issue in JIRA.
>
> --
> Best regards,
> Sergey Parhomenko
>
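
The thread linked above discusses the same record reuse. The commonly
suggested workaround is to copy each record into fresh objects before any
operation that buffers tuples (collect(), sortByKey(), caching). Below is a
minimal sketch in Java, assuming Text keys and values and the mapToPair API
from newer Spark versions (older Java APIs use map with a PairFunction); the
path and class name are illustrative:

    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SequenceFileWorkaround {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "seqfile-workaround");

            // HadoopRDD reuses a single mutable Text instance for the key and
            // another for the value, so collecting the raw RDD yields N
            // references to the same two objects.
            JavaPairRDD<Text, Text> raw =
                    sc.sequenceFile("/tmp/pairs.seq", Text.class, Text.class);

            // Workaround: copy every record into fresh objects before collecting.
            JavaPairRDD<Text, Text> copied = raw.mapToPair(
                    t -> new Tuple2<>(new Text(t._1()), new Text(t._2())));

            for (Tuple2<Text, Text> t : copied.collect()) {
                System.out.println(t._1() + "\t" + t._2());
            }

            sc.stop();
        }
    }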

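Converting the Writables to immutable Java types sidesteps the reuse for
*sortByKey()* as well, since no shared mutable objects survive into the sort
buffers. Continuing from the sketch above, under the same assumptions:

    // Mapping to immutable Strings also fixes sortByKey(): each record now
    // carries its own data instead of a reference to the reused Text objects.
    JavaPairRDD<String, String> asStrings = raw.mapToPair(
            t -> new Tuple2<>(t._1().toString(), t._2().toString()));

    for (Tuple2<String, String> t : asStrings.sortByKey().collect()) {
        System.out.println(t._1() + "\t" + t._2());
    }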