spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: How to check that a dataset is sorted after it has been written out?
Date Mon, 23 Mar 2015 11:31:46 GMT
No, data is not (necessarily) sorted when read from disk. A file might
even have many blocks, and while a block generally yields a partition,
the order in which those partitions appear in the RDD is not defined.
This is why you'd sort if you need the data sorted.

I think you could conceivably write a custom RDD or InputFormat that
reads blocks in a well-defined order; assuming the data is sorted in
some knowable way on disk, the resulting RDD would then be sorted too.
I think that's even been brought up.
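
As a rough illustration of what "a well-defined order" could mean, here is a
minimal sketch in plain Python (not Spark API) that orders part files by the
numeric index in their names. The helper names part_index and in_part_order
are hypothetical, and the "part-r-NNNNN" naming is assumed from the file names
quoted below; a real custom RDD or InputFormat would have to do something
equivalent when it lists its inputs.

```python
import re

# Assumption: part files are named like "part-r-00000.parquet" or
# "part-00000", as in the file names quoted in this thread.
PART_RE = re.compile(r"part-(?:[a-z]-)?(\d+)")


def part_index(path):
    """Return the numeric part index encoded in a part file name."""
    name = path.rsplit("/", 1)[-1]
    match = PART_RE.match(name)
    if match is None:
        raise ValueError("not a part file: " + path)
    return int(match.group(1))


def in_part_order(paths):
    """Return the paths ordered by part index rather than listing order."""
    return sorted(paths, key=part_index)
```

Reading the files one at a time in the order returned by in_part_order, and
concatenating the results, preserves the on-disk order only if each file is
itself read front to back.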

Deciding whether the data is sorted is quite different. You'd have to
decide what ordering you expect (is part 0 before part 1? should each
part file be sorted internally?) and then just verify that externally.
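
A minimal sketch of such an external check, in plain Python, assuming the
expected ordering is "each part is sorted internally, and part i ends no
later than part i+1 begins". The function names are hypothetical, and each
part is assumed to have already been loaded into a list, in part order.

```python
from itertools import tee


def is_sorted(values):
    """True if values are in non-decreasing order."""
    a, b = tee(values)
    next(b, None)  # advance the second iterator to form adjacent pairs
    return all(x <= y for x, y in zip(a, b))


def check_global_order(parts):
    """Check a list of parts (each a list of values, in part order).

    Returns (ok, message). Empty parts are allowed and skipped when
    comparing boundaries.
    """
    previous_last = None
    for i, part in enumerate(parts):
        if not is_sorted(part):
            return (False, "part %d is not sorted internally" % i)
        if part:
            if previous_last is not None and part[0] < previous_last:
                return (False, "part %d starts before the previous part ends" % i)
            previous_last = part[-1]
    return (True, "ok")
```

For large data one would compare only the first and last record of each part
at the boundaries, and stream through each part for the internal check,
rather than holding everything in memory.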

On Fri, Mar 20, 2015 at 10:41 PM, Michael Albert
<m_albert137@yahoo.com.invalid> wrote:
> Greetings!
>
> I sorted a dataset in Spark and then wrote it out in avro/parquet.
>
> Then I wanted to check that it was sorted.
>
> It looks like each partition has been sorted, but when reading in, the first
> "partition" (i.e., as seen in the partition index of mapPartitionsWithIndex)
> is not the same as implied by the names of the parquet files (even when the
> number of partitions is the same in the rdd which was read as on disk).
>
> If I "take()" a few hundred values, they are sorted, but they are *not* the
> same as if I explicitly open "part-r-00000.parquet" and take values from that.
>
> It seems that when opening the rdd, the "partitions" of the rdd are not in
> the same order as implied by the data on disk (i.e., "part-r-00000.parquet",
> "part-r-00001.parquet", etc.).
>
> So, how might one read the data so that one maintains the sort order?
>
> And while on the subject, after the "terasort", how did they check that the
> data was actually sorted correctly? (or did they :-) ? ).
>
> Is there any way to read the data back in so as to preserve the sort, or do
> I need to "zipWithIndex" before writing it out, and write the index at that
> time? (I haven't tried the latter yet).
>
> Thanks!
> -Mike
>
