crunch-dev mailing list archives

From "Josh Wills (JIRA)"
Subject [jira] [Commented] (CRUNCH-192) Document and enforce the semantics around reducer-based Iterables
Date Sat, 06 Apr 2013 18:47:16 GMT


Josh Wills commented on CRUNCH-192:

Hey Gabriel, I'm +1 for this, but I'd like to have the same semantics for the in-memory pipeline,
so that my unit tests would fail quickly if I was doing multiple passes over the data using
the same Iterable instance.
> Document and enforce the semantics around reducer-based Iterables
> -----------------------------------------------------------------
>                 Key: CRUNCH-192
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-192.patch
> As reported on by Chad Urso McDaniel:
> BLUF: The Iterable parameter to CombineFn.process implies you can iterate multiple times
> when you cannot, and this leads to surprising behavior.
> As many of you probably know, the signature of CombineFn.process is 
> ---
> process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
> ---
> The corresponding Hadoop Reducer signature is
> ---
> reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
> ---
> I assume the Crunch use of Iterable is for convenient use in "for" loops.
> Unfortunately, this Iterable seems to return the same Iterator object each time
> Iterable.iterator() is called.
> This makes sense to me based on the underlying Hadoop MapReduce, but violates what I
> think most people expect from the Iterable interface.
> I understand that it's too late to change the interface, but could we at least have a
> Javadoc note, or an exception thrown if the Iterable is used more than once?
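
To make the request above concrete, here is a minimal sketch of an Iterable that throws on a second call to iterator(). The class name SinglePassIterable and its constructor are hypothetical, chosen only for illustration; this is not the attached CRUNCH-192.patch and not an existing Crunch class. It only assumes the values arrive as a single underlying Iterator, as they do in a Hadoop reducer.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical wrapper (not part of Crunch): exposes a one-shot Iterator as an
 * Iterable, and fails fast if a caller asks for a second pass.
 */
final class SinglePassIterable<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    SinglePassIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            // A second call would silently hand back the already-drained
            // Iterator; throwing makes the mistake visible immediately.
            throw new IllegalStateException(
                "This Iterable supports only a single pass");
        }
        consumed = true;
        return source;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3);
        Iterable<Integer> values = new SinglePassIterable<>(data.iterator());

        int sum = 0;
        for (int v : values) {   // first pass: allowed
            sum += v;
        }
        System.out.println("sum = " + sum);   // prints "sum = 6"

        try {
            values.iterator();   // second pass: rejected
        } catch (IllegalStateException e) {
            System.out.println("second pass rejected: " + e.getMessage());
        }
    }
}
```

Using the same kind of wrapper around in-memory collections would, under the same assumptions, let unit tests fail as soon as a function makes a second pass over the same Iterable instance.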
