Hi Chad,
Good point -- I know that this has tripped people up in the past. Documenting this, and
possibly enforcing it, sounds like a good idea -- I've logged a ticket in JIRA (with the
content of your mail), see https://issues.apache.org/jira/browse/CRUNCH-192
- Gabriel
On 05 Apr 2013, at 21:30, Chad Urso McDaniel <chadum@gmail.com> wrote:
> BLUF: The Iterable parameter to CombineFn.process implies you can iterate multiple times
> when you cannot and this leads to surprising behavior.
>
> As many of you probably know, the signature of CombineFn.process is
> ---
> process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
> ---
>
> The corresponding Hadoop Reducer signature is
> ---
> reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
> ---
>
> I assume the Crunch use of Iterable is for convenient use in "for" loops.
>
> Unfortunately, this Iterable seems to return the same Iterator object each time
> Iterable.iterator() is called, so a second loop over it sees no values.
>
> This makes sense to me given the underlying Hadoop MapReduce, but it violates what I
> think most people expect from the Iterable interface.
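
A minimal sketch of the behavior described here, in plain Java with no Crunch or Hadoop dependency. The class SinglePassDemo and its helpers singlePass and count are hypothetical names used only for illustration; they mimic an Iterable that hands back the same Iterator on every call and are not the actual Crunch implementation.

```java
import java.util.Arrays;
import java.util.Iterator;

class SinglePassDemo {
    // Hypothetical stand-in for the Iterable passed to process():
    // every iterator() call returns the same underlying Iterator.
    static <V> Iterable<V> singlePass(final Iterator<V> values) {
        return new Iterable<V>() {
            @Override
            public Iterator<V> iterator() {
                return values;
            }
        };
    }

    // Counts the elements seen by one for-each pass over the Iterable.
    static int count(Iterable<?> items) {
        int n = 0;
        for (Object ignored : items) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterable<Integer> values = singlePass(Arrays.asList(1, 2, 3).iterator());
        System.out.println(count(values)); // first pass: prints 3
        System.out.println(count(values)); // second pass: prints 0, the Iterator is exhausted
    }
}
```

The second loop fails silently rather than loudly, which is what makes the behavior easy to miss.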
>
> I understand that it's too late to change the interface, but could we at least have a
> javadoc note, or an exception thrown if the Iterable is used more than once?
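
One way the "exception thrown" option could look, sketched in plain Java. SingleUseIterable is a hypothetical wrapper invented for this example, not an existing Crunch class, and the choice of IllegalStateException is an assumption; the idea is only that a second call to iterator() fails loudly instead of returning an exhausted Iterator.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical guard: wraps a one-shot Iterator and rejects a second traversal.
class SingleUseIterable<V> implements Iterable<V> {
    private final Iterator<V> values;
    private boolean consumed = false;

    SingleUseIterable(Iterator<V> values) {
        this.values = values;
    }

    @Override
    public Iterator<V> iterator() {
        if (consumed) {
            throw new IllegalStateException(
                "This Iterable can only be iterated once; copy the values if you need a second pass");
        }
        consumed = true;
        return values;
    }

    public static void main(String[] args) {
        SingleUseIterable<Integer> values =
            new SingleUseIterable<Integer>(Arrays.asList(1, 2, 3).iterator());

        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        System.out.println(sum); // prints 6

        try {
            values.iterator(); // second traversal attempt
        } catch (IllegalStateException e) {
            System.out.println("second pass rejected: " + e.getMessage());
        }
    }
}
```

A function that really needs two passes would have to copy the values into a list during the first pass, at the cost of holding every value for that key in memory.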