spark-user mailing list archives

From Crystal Xing <crystalxin...@gmail.com>
Subject Re: Spark distinct() returns incorrect results for some types?
Date Thu, 11 Jun 2015 20:14:39 GMT
I see. It makes a lot of sense now. It is not unique to Spark, but it would
be great if it were mentioned in the Spark documentation.

I have been using Hadoop for a while and I was not aware of it!


Zheng zheng

On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs <wrbriggs@gmail.com> wrote:

> To be fair, this is a long-standing issue due to optimizations for object
> reuse in the Hadoop API, and isn't necessarily a failing in Spark - see
> this blog post (
> https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
> from 2011 documenting a similar issue.
>
>
> On June 11, 2015, at 3:17 PM, Sean Owen <sowen@cloudera.com> wrote:
>
>
> Yep you need to use a transformation of the raw value; use toString for
> example.
>
> On Thu, Jun 11, 2015, 8:54 PM Crystal Xing <crystalxing06@gmail.com>
> wrote:
>
>> That is a little scary.
>> So you mean that, in general, we shouldn't use Hadoop's Writable types
>> as keys in an RDD?
>>
>> Zheng zheng
>>
>> On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> Guess: it has something to do with the Text object being reused by
>>> Hadoop? You can't in general keep around refs to them since they change. So
>>> you may have a bunch of copies of one object at the end that become just
>>> one in each partition.
>>>
>>> On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxing06@gmail.com>
>>> wrote:
>>>
>>>> I load a list of IDs from a text file using NLineInputFormat, and when I
>>>> call distinct(), it returns an incorrect count.
>>>>
>>>>  JavaRDD<Text> idListData = jvc
>>>>                 .hadoopFile(idList, NLineInputFormat.class,
>>>>                         LongWritable.class, Text.class)
>>>>                 .values()
>>>>                 .distinct();
>>>>
>>>>
>>>> I should have 7000K distinct values; however, it only returns 7000
>>>> values, which is the same as the number of tasks. The type I am using is
>>>> import org.apache.hadoop.io.Text;
>>>>
>>>>
>>>> However, if I switch to using String instead of Text, it works correctly.
>>>>
>>>> I think the Text class should have a correct implementation of equals()
>>>> and hashCode(), since it is a Hadoop class.
>>>>
>>>> Does anyone have a clue what is going on?
>>>>
>>>> I am using Spark 1.2.
>>>>
>>>> Zheng zheng
>>>>
>>>>
>>>>
>>
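
The object-reuse behaviour discussed in this thread can be sketched without
Spark or Hadoop. The example below is only an illustration under stated
assumptions: MutableText is a hypothetical stand-in for a mutable, reusable
value such as org.apache.hadoop.io.Text, and the loops imitate a record reader
that refills one instance for every record. It is not the Hadoop or Spark
implementation. Keeping references to the reused instance collapses every
record into one distinct value, while copying each record to an immutable
String first preserves the real values.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ObjectReuseDemo {

    /** Hypothetical stand-in for a mutable, reusable record type (assumption, not Hadoop's Text). */
    static final class MutableText {
        private String value = "";

        void set(String v) { value = v; }

        @Override public String toString() { return value; }

        @Override public boolean equals(Object o) {
            return o instanceof MutableText && ((MutableText) o).value.equals(value);
        }

        @Override public int hashCode() { return value.hashCode(); }
    }

    /** Retains references to the single reused instance, then counts distinct elements. */
    public static int distinctKeepingReferences(String[] lines) {
        MutableText reused = new MutableText();
        List<MutableText> kept = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);   // the "reader" overwrites the same object for each record
            kept.add(reused);   // every list entry points at that one object
        }
        return new HashSet<>(kept).size();
    }

    /** Copies each record to an immutable String before retaining it, then counts distinct elements. */
    public static int distinctAfterCopy(String[] lines) {
        MutableText reused = new MutableText();
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);
            kept.add(reused.toString());   // snapshot of the current contents
        }
        return new HashSet<>(kept).size();
    }

    public static void main(String[] args) {
        String[] ids = {"id-1", "id-2", "id-3", "id-2"};
        System.out.println(distinctKeepingReferences(ids)); // prints 1
        System.out.println(distinctAfterCopy(ids));         // prints 3
    }
}
```

Applied to the snippet in the original question, the suggestion to use
toString would presumably mean inserting something like
.map(t -> t.toString()) between values() and distinct(), which yields a
JavaRDD<String> instead of a JavaRDD<Text>. That line assumes Java 8 lambdas
and has not been checked against Spark 1.2.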
