crunch-user mailing list archives

From Josh Wills
Subject Re: ParallelDo - DoFn in-order processing
Date Fri, 15 Nov 2013 16:45:26 GMT
One way, of course, is to do a group by key and force all of the records to
a single reducer.
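
To make the single-reducer idea concrete, here is a minimal plain-Java model of it. This is not Crunch code: SingleReducerModel, OrderCheckingProcessor, run and bytes are all hypothetical names made up for illustration, and the TreeMap merely stands in for the shuffle that a group by key with one partition would perform (in Crunch that would presumably be something like groupByKey(1) on the PTable). It only sketches why a stateful, order-dependent function can then rely on ascending keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model (NOT the Crunch API) of grouping by key into one partition.
final class SingleReducerModel {

    // Hypothetical stand-in for an order-dependent DoFn: it keeps state between calls
    // and fails loudly if a key ever arrives before a smaller one.
    static final class OrderCheckingProcessor {
        private long lastKey = Long.MIN_VALUE;
        private final List<String> out = new ArrayList<>();

        void process(long key, ByteBuffer value) {
            if (key < lastKey) {
                throw new IllegalStateException("out of order: " + key + " after " + lastKey);
            }
            lastKey = key;
            // duplicate() so decoding does not move the caller's buffer position
            out.add(key + ":" + StandardCharsets.UTF_8.decode(value.duplicate()));
        }

        List<String> output() {
            return out;
        }
    }

    // Groups all records into a single sorted map (the "one reducer" view) and
    // feeds them to the processor in ascending key order. Values under the same
    // key keep their input order here; a real shuffle does not promise that.
    static List<String> run(List<Map.Entry<Long, ByteBuffer>> records) {
        TreeMap<Long, List<ByteBuffer>> grouped = new TreeMap<>();
        for (Map.Entry<Long, ByteBuffer> r : records) {
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        OrderCheckingProcessor processor = new OrderCheckingProcessor();
        for (Map.Entry<Long, List<ByteBuffer>> e : grouped.entrySet()) {
            for (ByteBuffer v : e.getValue()) {
                processor.process(e.getKey(), v);
            }
        }
        return processor.output();
    }

    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, ByteBuffer>> records = new ArrayList<>();
        records.add(new AbstractMap.SimpleEntry<>(3L, bytes("c")));
        records.add(new AbstractMap.SimpleEntry<>(1L, bytes("a")));
        records.add(new AbstractMap.SimpleEntry<>(2L, bytes("b")));
        System.out.println(run(records)); // [1:a, 2:b, 3:c]
    }
}
```

The obvious cost is that one reducer handles every record, so this only makes sense when the data is small enough for a single task.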

Post-sort, I believe it's a safe assumption that the records will be
processed by a DoFn in sorted order, although it's not necessarily the case
that records with the same value of the key (if that ever happens in your
data) will be processed in the same shard/DoFn.


On Fri, Nov 15, 2013 at 8:38 AM, Hrishikesh P wrote:

> Hello -
> In the parallelDo-DoFn processing, is it possible to ensure that the
> records in the PTable will be processed in the given order? I have a PTable
> of long and bytes (PTable<Long, ByteBuffer>) which is sorted by the long
> value and I want to make sure that when the DoFn#process is called, the
> records will be processed in the sorted order, as there may be a dependency
> between the records.
> I thought of a few options, like storing the sorted results to a text file
> and using the file to process the records in the DoFn or using a table to
> track the records being processed but wasn't sure if they would give
> correct results and was wondering if there is a better approach.
> Thanks.
