spark-user mailing list archives

From: Michael Albert
Subject: How to check that a dataset is sorted after it has been written out?
Date: Fri, 20 Mar 2015 22:41:16 GMT
I sorted a dataset in Spark and then wrote it out as Avro/Parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading the data back in, the first
"partition" (i.e., the one with index 0 in mapPartitionsWithIndex) is not the one implied by
the names of the Parquet files (even when the number of partitions in the RDD that was read
is the same as on disk).
If I "take()" a few hundred values, they are sorted, but they are *not* the same values I get
if I explicitly open "part-r-00000.parquet" and take values from that.
It seems that when opening the RDD, its "partitions" are not in the same order as
implied by the files on disk (i.e., "part-r-00000.parquet", "part-r-00001.parquet", etc.).
So, how might one read the data so that the sort order is maintained?
And while on the subject, after the "terasort", how did they check that the data was actually
sorted correctly? (Or did they? :-) )
Is there any way to read the data back in so as to preserve the sort, or do I need to "zipWithIndex"
before writing it out, and write the index at that time? (I haven't tried the latter yet.)
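
To make the check concrete, here is a minimal sketch of what "sorted on disk" would mean, written in plain Python rather than against the Spark API. It assumes the contents of each part file have already been loaded into a list keyed by file name; the helper names `part_index` and `is_globally_sorted` are hypothetical, and the "part-r-NNNNN" naming pattern is taken from the file names mentioned above.

```python
import re
from itertools import chain
from typing import Dict, Sequence


def part_index(file_name: str) -> int:
    """Extract the numeric index from a name like 'part-r-00003.parquet'."""
    match = re.search(r"part-r-(\d+)", file_name)
    if match is None:
        raise ValueError(f"unexpected part file name: {file_name}")
    return int(match.group(1))


def is_globally_sorted(parts: Dict[str, Sequence[int]]) -> bool:
    """Return True if the values are non-decreasing when the part files
    are concatenated in the order of their numeric index.

    `parts` maps a part file name to the values read from that file
    (an assumption for illustration; loading the files is not shown).
    """
    # Order the files by their numeric index, not by the order in which
    # they happened to be listed or read.
    ordered_names = sorted(parts, key=part_index)
    values = list(chain.from_iterable(parts[name] for name in ordered_names))
    # Compare each value with its successor, including across file boundaries.
    return all(a <= b for a, b in zip(values, values[1:]))
```

The point of ordering by `part_index` first is that the check then depends only on the file names, not on whatever partition order the reader assigns.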
