spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Pairwise Processing of a List
Date Mon, 26 Jan 2015 11:52:59 GMT
AFAIK ordering is not strictly guaranteed unless the RDD is the
product of a sort. I think that in practice, you'll never find
elements of a file read in some random order, for example (although
see the recent issue about partition ordering potentially depending on
how the local file system lists them).

Likewise, I can't imagine you'd encounter elements from one Kafka
partition out of order. One receiver listens to one partition and creates
one block per block interval. What I'm not 100% clear on is whether
you get undefined ordering when you have multiple threads listening in
one receiver.

You can always sort RDDs by a timestamp of some sort to be sure,
although that adds overhead. I'm also curious about what, if anything,
is guaranteed here without a sort.
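
To make the "sort to be sure" idea concrete, here is a minimal sketch in
plain Python, not actual Spark code. It models partitions as lists that
may arrive in any order, restores a total order by sorting on a timestamp,
and only then forms consecutive pairs. The record layout (timestamp,
payload) and the helper names are assumptions made for illustration; in
Spark itself the corresponding step would be something like sortBy on the
RDD before zipWithIndex.

```python
import random


def sort_by_timestamp(partitions):
    """Flatten the partitions and order all records by timestamp."""
    records = [rec for part in partitions for rec in part]
    # Each record is a hypothetical (timestamp, payload) tuple.
    return sorted(records, key=lambda rec: rec[0])


def consecutive_pairs(records):
    """Pair each record with its successor: (r0, r1), (r1, r2), ..."""
    return list(zip(records, records[1:]))


# Two hypothetical partitions; shuffling stands in for the fact that
# partition order is not guaranteed without an explicit sort.
partitions = [[(3, "c"), (4, "d")], [(1, "a"), (2, "b")]]
random.shuffle(partitions)

ordered = sort_by_timestamp(partitions)
pairs = consecutive_pairs(ordered)
```

Whatever order the partitions show up in, the pairs come out the same,
which is the property you pay the sorting overhead for.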

On Mon, Jan 26, 2015 at 1:33 AM, Tobias Pfeiffer <tgp@preferred.jp> wrote:
> Sean,
>
> On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> Note that RDDs don't really guarantee anything about ordering though,
>> so this only makes sense if you've already sorted some upstream RDD by
>> a timestamp or sequence number.
>
>
> Speaking of order, is there some reading on guarantees and non-guarantees
> about order in RDDs? For example, when reading a file and doing
> zipWithIndex, can I assume that the lines are numbered in order? Does this
> hold for receiving data from Kafka, too?
>
> Tobias
>


