crunch-user mailing list archives

From Chad Urso McDaniel
Subject Concerning the use of the Iterable parameter to CombineFn
Date Fri, 05 Apr 2013 19:30:14 GMT
BLUF: The Iterable parameter to CombineFn.process implies that you can
iterate over the values multiple times. You cannot, and this leads to
surprising behavior.

As many of you probably know, the signature of CombineFn.process is
process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)

The corresponding Hadoop Reducer signature is
reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)

I assume the Crunch use of Iterable is for convenient use in "for" loops.

Unfortunately, this Iterable seems to return the same Iterator object each
time Iterable.iterator() is called, so a second pass over the values finds
the Iterator already exhausted.

This makes sense to me given the underlying Hadoop MapReduce, but it
violates what I think most people expect from the Iterable interface.
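
To make the surprise concrete, here is a minimal sketch in plain Java. It is
not Crunch code: SinglePassDemo, singlePass and count are hypothetical names,
and the wrapper only imitates the behavior I am describing by handing back the
same Iterator on every call.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for the values Iterable passed to a combiner:
// every call to iterator() returns the same underlying Iterator.
final class SinglePassDemo {

    // Wraps one Iterator so it can be used in a for-each loop.
    static <T> Iterable<T> singlePass(final Iterator<T> source) {
        return new Iterable<T>() {
            @Override
            public Iterator<T> iterator() {
                return source; // same object each time, possibly already advanced
            }
        };
    }

    // Counts the elements seen in one for-each pass over the Iterable.
    static int count(Iterable<?> values) {
        int n = 0;
        for (Object ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterable<Integer> values = singlePass(Arrays.asList(1, 2, 3).iterator());
        System.out.println(count(values)); // first pass sees all three values
        System.out.println(count(values)); // second pass sees none of them
    }
}
```

The second loop does not fail; it silently sees no values, which is what makes
the problem easy to miss.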

I understand that it's too late to change the interface, but could we at
least have a Javadoc note, or an exception thrown if the Iterable is used
more than once?
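
As a rough sketch of the exception idea, a guard could look something like
the following. SingleUseIterable is a hypothetical class, not anything in
Crunch, and I am assuming the values arrive as a plain java.util.Iterator that
can be wrapped before being handed to user code.

```java
import java.util.Iterator;

// Hypothetical guard (not part of Crunch): allows exactly one call to
// iterator() and fails loudly on any later call.
final class SingleUseIterable<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    SingleUseIterable(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            throw new IllegalStateException(
                "These values can only be iterated once; copy them if you need a second pass");
        }
        consumed = true;
        return source;
    }
}
```

With this in place a second "for" loop over the values throws an
IllegalStateException instead of quietly doing nothing. The trade-off is that
code which calls iterator() twice but only reads from the first Iterator would
also start failing.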
