No, I'm fine w/having Trevni support in Crunch. It should stay compatible w/whatever version of Avro we have, so no new dependency issues AFAIK.


On Wed, Apr 24, 2013 at 4:21 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
Logged:
https://issues.apache.org/jira/browse/CRUNCH-197

Do you see any issues with adding Trevni support to Crunch? Is it something you'd rather avoid in favor of waiting on Parquet?


On Wed, Apr 24, 2013 at 6:07 PM, Josh Wills <jwills@cloudera.com> wrote:



On Wed, Apr 24, 2013 at 4:02 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
I think it's just that. It seems relatively low-risk to me (e.g., we already use AvroKey in the AvroPairConverter for PTables).

Ok sounds good.  Do you want me to log a bug for this?

Yes please. I'm running the small fix through regression tests now.
 

I'm also curious if you're looking at Parquet for this use case?

Yeah, I was going to look at it after Trevni. Its Avro support is not as far along (looks like ~16 days). The goal is to help get support for both into Crunch eventually so we can choose whichever is better for our job.

Fair enough.
 

On Wed, Apr 24, 2013 at 5:52 PM, Josh Wills <jwills@cloudera.com> wrote:



On Wed, Apr 24, 2013 at 3:49 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
Is the change simply:

  private AvroWrapper<K> getWrapper() {
    if (wrapper == null) {
      // wrapper = new AvroWrapper<K>();
      wrapper = new AvroKey<K>();
    }
    return wrapper;
  }

Or are there more changes I might be missing?  Doing that got me past the ClassCastException (though still trying to get my code working).

As I indicated I'm still just trying to prove out my code and if it pans out we can probably wait till the 0.7.0 release (assuming the current ~2 month release cycle).  I'll leave it to you to evaluate the risk. 

I think it's just that. It seems relatively low-risk to me (e.g., we already use AvroKey in the AvroPairConverter for PTables).
 

I'm guessing the issue of injecting a converter will be more significant if I try out the other Trevni format[1], where I'd need the converter to support AvroValue instead of NullWritable. So I'm fine with holding off on a rushed change before a release in favor of a more holistic solution to both parts.


I'm also curious if you're looking at Parquet for this use case?
 



On Wed, Apr 24, 2013 at 5:29 PM, Josh Wills <jwills@cloudera.com> wrote:
Hey Micah,

It seems like having the AvroKeyConverter use the AvroKey as the return type instead of AvroWrapper is the easiest way to solve this, since AvroKey is a subclass of AvroWrapper. That said, I agree, that's a thorny problem. We're just getting ready for the 0.6.0 release, but I'd be fine to get the switch in there if that solved this problem for you.

J


On Wed, Apr 24, 2013 at 3:23 PM, Micah Whitacre <mkwhitacre@gmail.com> wrote:
As an alternative to the standard AvroInput/OutputFormat, I've been playing around with how to support alternate Avro file types like Trevni[1], which give benefits when we want to only retrieve a subset of the Avro object.  

Picking one of the implementations (AvroTrevniKeyInputFormat/AvroTrevniKeyOutputFormat)[2], I implemented the various Source/Target/SourceTarget implementations. When I started trying to test it out (to see if I did any of it right), I hit the issue that the AvroKeyConverter only produces AvroWrapper objects and the output format requires AvroKey. So I get a ClassCastException in the CrunchOutputs.write(...) method.

Caused by: java.lang.ClassCastException: org.apache.avro.mapred.AvroWrapper cannot be cast to org.apache.avro.mapred.AvroKey
    at org.apache.trevni.avro.mapreduce.AvroTrevniKeyRecordWriter.write(AvroTrevniKeyRecordWriter.java:34)
    at org.apache.crunch.io.CrunchOutputs.write(CrunchOutputs.java:129)
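
For anyone reading along, the failure in the trace above, and why the AvroKey change discussed in this thread gets past it, can be reproduced without Avro on the classpath. This is only a rough sketch: Wrapper and Key are hypothetical stand-ins for org.apache.avro.mapred.AvroWrapper and AvroKey, and write, convertOld and convertNew are made-up names that only imitate the record writer and the converter, not the real Crunch or Trevni code.

```java
// Stand-in classes only: they mimic the AvroWrapper/AvroKey relationship
// but are NOT the real Avro, Trevni, or Crunch types.
final class WrapperCastDemo {

  /** Hypothetical stand-in for org.apache.avro.mapred.AvroWrapper. */
  static class Wrapper<T> {
    private T datum;
    T datum() { return datum; }
    void datum(T datum) { this.datum = datum; }
  }

  /** Hypothetical stand-in for org.apache.avro.mapred.AvroKey (a subclass). */
  static class Key<T> extends Wrapper<T> { }

  /** Imitates a record writer that requires the subclass, as the Trevni one does. */
  @SuppressWarnings("unchecked")
  static <T> T write(Object key) {
    // Downcast: throws ClassCastException unless 'key' really is a Key.
    return ((Key<T>) key).datum();
  }

  /** Converter before the change: instantiates the base type. */
  static <T> Wrapper<T> convertOld(T value) {
    Wrapper<T> w = new Wrapper<T>();
    w.datum(value);
    return w;
  }

  /** Converter after the change: same declared type, subclass instance. */
  static <T> Wrapper<T> convertNew(T value) {
    Wrapper<T> w = new Key<T>();
    w.datum(value);
    return w;
  }

  /** Returns true if the imitation writer accepts the given key object. */
  static boolean writeSucceeds(Object key) {
    try {
      write(key);
      return true;
    } catch (ClassCastException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("old converter: " + writeSucceeds(convertOld("record")));
    System.out.println("new converter: " + writeSucceeds(convertNew("record")));
  }
}
```

Because the declared return type stays the base class, callers that only expect the base type keep compiling and working; only the runtime class of the instance changes, which is what a writer that downcasts needs.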

I was hoping that the target would be able to take any PCollection<? extends AvroType> but it looks like I'd need to implement my own PType and force consumers to use that just to change the converter to produce AvroKey instead.  

Is implementing a custom PType the only way to inject an alternate converter? That seems like a high cost on the implementation side, and it forces a restriction onto others in the pipeline who are generally happy with the standard AvroType and shouldn't be burdened with how the data might be stored later on in the processing.




--
Director of Data Science
Twitter: @josh_wills








--
Director of Data Science
Cloudera
Twitter: @josh_wills