crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-51) PCollection#sort relies on using a single reducer for total order sorting
Date Sun, 24 Mar 2013 15:21:16 GMT


Gabriel Reid commented on CRUNCH-51:

Just took a look through this as well as giving it a pretty thorough test run.

The sorting stuff looks good to me, although I did run into a separate object-reuse issue
in the reservoir sampling (for which I've created CRUNCH-183).
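
For readers unfamiliar with this class of bug, here is a minimal sketch of how object reuse can corrupt a reservoir sample. It is not the Crunch code and does not claim to reproduce the details of CRUNCH-183: the `ReservoirSketch`, `Record`, and `reusingIterator` names are all hypothetical, and the iterator simply imitates an input source that hands back one mutable instance on every call. The sampler is the textbook "Algorithm R"; the only point of interest is the `copy` argument.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names come from Crunch.
public class ReservoirSketch {

    // Mutable holder standing in for a record object that the input source reuses.
    public static final class Record {
        public int value;
        public Record(int value) { this.value = value; }
    }

    // Iterator that returns the SAME Record instance every time, mutated in place.
    public static Iterator<Record> reusingIterator(final int n) {
        final Record shared = new Record(0);
        return new Iterator<Record>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < n; }
            @Override public Record next() { shared.value = i++; return shared; }
        };
    }

    // Algorithm R: keep k items so that each input item is retained with probability k/seen.
    // The copy function decides whether the reservoir stores the reused instance or a snapshot.
    public static List<Record> sample(Iterator<Record> input, int k, Random rng,
                                      UnaryOperator<Record> copy) {
        List<Record> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (input.hasNext()) {
            Record r = input.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(copy.apply(r));        // fill phase
            } else {
                int j = rng.nextInt(seen);           // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, copy.apply(r)); // replace a random slot
                }
            }
        }
        return reservoir;
    }
}
```

Calling `sample(reusingIterator(10), 3, rng, r -> r)` stores three references to the one shared instance, so every sampled element ends up showing the last value read. Passing `r -> new Record(r.value)` instead takes a snapshot of each element before the source overwrites it.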

This is good to commit as far as I'm concerned. Nice work getting this taken care of;
it was really important, and undoubtedly not simple to do.
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>                 Key: CRUNCH-51
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>         Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch, CRUNCH-51.patch, CRUNCH-51.patch, CRUNCH-51-rat.txt, CRUNCH-51-ready.patch,
> The total-order sorting provided by the Sort class (and therefore PCollection#sort) relies on using a single reducer. This is very inefficient for large datasets, and should be replaced with a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).
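
The idea behind a total order partitioner can be sketched in plain Java without any Hadoop or Crunch dependencies. This is an illustration under stated assumptions, not the code in the attached patches: the `TotalOrderSketch` class and its methods are hypothetical, keys are plain `int`s, and the "reducers" are just in-memory lists. Split points are taken from a sample of the keys, each key is routed to the partition whose range contains it, and every partition is sorted on its own. Because the ranges do not overlap and are themselves ordered, concatenating the partitions gives a globally sorted result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Crunch or Hadoop code.
public class TotalOrderSketch {

    // Choose (numPartitions - 1) split points from a sorted copy of the sample.
    // Assumes the sample is non-empty.
    public static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Route a key to the partition whose key range contains it.
    // Partition i receives keys k with splits[i-1] <= k < splits[i].
    public static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Partition by range, sort each partition independently, then concatenate.
    public static List<Integer> sort(int[] keys, int[] sample, int numPartitions) {
        int[] splits = splitPoints(sample, numPartitions);
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            partitions.get(partitionFor(key, splits)).add(key);
        }
        List<Integer> result = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition); // each "reducer" sorts only its own range
            result.addAll(partition);
        }
        return result;
    }
}
```

The quality of the sample only affects how evenly the work is spread: a skewed sample produces unbalanced partitions, but the output is still correctly ordered. With a single reducer, by contrast, all keys land in one partition and nothing is parallelised.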

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators