spark-user mailing list archives

From Nira Amit
Subject Re: Wrong runtime type when using newAPIHadoopFile in Java
Date Mon, 06 Mar 2017 12:25:09 GMT
Hi Sean,
Yes, we discussed this in Jira and you suggested I take this discussion to
the mailing list, so I did.
I don't have the option to migrate the code I'm working on to Datasets at
the moment (or to Scala, as another developer suggested in the Jira
discussion), so I have to work with the Java RDD API.
I've been working with Java for many years and understand that not all type
errors can be caught at compile time. What I don't understand is how you
manage to create an object of type AvroKey<MyCustomType> with the actual
datum it encloses being GenericData$Record. If my code threw a
RuntimeException in the line `MyCustomAvroKey customKey = first._1;` for
example, saying it has an AvroKey<GenericData$Record> - then there would be
no confusion. But what happens in practice is that somehow my customKey is
of type AvroKey<MyCustomType> and only when I try to retrieve the
MyCustomType datum I get the exception. There must be some hackish things
going on under the hood here, because this is just not how Java is supposed
to work.
Which is why I still think that this should be considered a bug.
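
The behavior described above can be reproduced in plain Java, without Spark or Avro. The following is only a minimal sketch: Wrapper, MyCustomType, GenericRecordLike, readKey and whereItFails are hypothetical names standing in for AvroKey, the custom class, GenericData$Record and the loading code, and it is not how Spark or Avro is actually implemented. It shows an unchecked generic cast succeeding when the key is assigned, and a ClassCastException surfacing only when the datum is read as MyCustomType.

```java
// Hypothetical stand-ins only; none of these classes come from Spark or Avro.
class ErasureDemo {

    // Plays the role of the custom Avro-generated class.
    static class MyCustomType {
        String getSomeField() { return "value"; }
    }

    // Plays the role of GenericData$Record.
    static class GenericRecordLike { }

    // Plays the role of AvroKey<T>: a generic wrapper around a datum.
    static class Wrapper<T> {
        private final T datum;
        Wrapper(T datum) { this.datum = datum; }
        T datum() { return datum; }
    }

    // Mimics a loader that builds the wrapper without knowing T at runtime.
    @SuppressWarnings("unchecked")
    static Wrapper<MyCustomType> readKey() {
        Wrapper<?> untyped = new Wrapper<Object>(new GenericRecordLike());
        // Unchecked cast: generics are erased, so at runtime this only
        // checks that the object is a Wrapper, not what it wraps.
        return (Wrapper<MyCustomType>) untyped;
    }

    static String whereItFails() {
        // Succeeds: the variable's static type is Wrapper<MyCustomType>.
        Wrapper<MyCustomType> key = readKey();
        // Succeeds: no cast is needed to treat the datum as Object.
        Object raw = key.datum();
        try {
            // Fails: the compiler inserts a cast to MyCustomType here.
            MyCustomType typed = key.datum();
            return "no failure: " + typed.getSomeField();
        } catch (ClassCastException e) {
            return "ClassCastException on datum(), actual class: "
                    + raw.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(whereItFails());
    }
}
```

Running main should report the ClassCastException at the typed datum() call, not at the assignment of the key, which matches the order of events described above.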

On Mon, Mar 6, 2017 at 1:02 PM, Sean Owen wrote:

> I think this is the same thing we already discussed extensively on your
> JIRA.
> The types of the key/value class arguments to newAPIHadoopFile are not the
> type of your custom class, but of the Writable describing the encoding of
> keys and values in the file. I think that's the start of part of the problem.
> This is how all Hadoop-related APIs would work, because Hadoop uses
> Writables for encoding.
> You're asking again why it isn't caught at compile time, and that stems
> from two basic causes. First is the way the underlying Hadoop API works,
> needing Class parameters because of its Java roots. Second is the
> Scala/Java difference; the Scala API will accept, for instance,
> non-Writable arguments if you can supply implicit conversion to Writable
> (if I recall correctly). This isn't available in Java, leaving its API
> expressing flexibility that isn't there. This isn't the exact issue here;
> it's that you're using raw class literals in Java which have no generic
> types -- they are Class<?>. The InputFormat arg expresses nothing about the
> key/value types; there's nothing to 'contradict' your declaration, which
> doesn't represent the actual types correctly. (You can cast class literals
> to (Class<..>) to express this if you want. It's a bit of a mess in Java.)
> That's why it compiles just as any Java code with an invalid cast compiles
> but fails at runtime.
> It is a bit weird if you're not familiar with the Hadoop APIs, Writables,
> or how Class arguments shake out in the context of generics. It does take
> the research you did. It does work as you've found. The reason you were
> steered several times to the DataFrame API is that it can hide a lot of
> this from you, including details of Avro and Writables. You're directly
> accessing Hadoop APIs that are foreign to you.
> This and the JIRA do not describe a bug.
> On Mon, Mar 6, 2017 at 11:29 AM Nira wrote:
>> I tried to load a custom type from avro files into a RDD using the
>> newAPIHadoopFile. I started with the following naive code:
>> JavaPairRDD<MyCustomClass, NullWritable> events =
>>                 sc.newAPIHadoopFile("file:/path/to/data.avro",
>>                 AvroKeyInputFormat.class, MyCustomClass.class,
>>                 NullWritable.class,
>>                 sc.hadoopConfiguration());
>> Tuple2<MyCustomClass, NullWritable> first = events.first();
>> This doesn't work and shouldn't work, because the AvroKeyInputFormat returns
>> a GenericData$Record. The thing is it compiles, and you can even assign the
>> first tuple to the variable "first". You will get a runtime error only when
>> you try to access a field of MyCustomClass from the tuple (e.g.
>> first._1.getSomeField()).
>> This behavior sent me on a wild goose chase that took many hours over many
>> weeks to figure out, because I never expected the method to return a wrong
>> type at runtime. If there's a mismatch between what the InputFormat returns
>> and the class I'm trying to load - shouldn't this be a compilation error? Or
>> at least the runtime error should occur already when I try to assign the
>> tuple to a variable of the wrong type. This is very unexpected behavior.
>> Moreover, I actually fixed my code and implemented all the required wrapper
>> and custom classes:
>> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>>                 sc.newAPIHadoopFile("file:/path/to/data.avro",
>>                         MyCustomInputFormat.class, MyCustomAvroKey.class,
>>                         NullWritable.class,
>>                         sc.hadoopConfiguration());
>> Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
>> MyCustomAvroKey customKey = first._1;
>> But this time I forgot that I moved the class to another package so the
>> namespace in the schema file was wrong. And again, at runtime the method
>> datum() of customKey returned a GenericData$Record instead of a
>> MyCustomClass.
>> Now, I understand that this has to do with the avro library (the
>> GenericDatumReader class has an "expected" and "actual" schema, and it
>> defaults to a GenericData$Record if something is wrong with my schema). But
>> does it really make sense to return a different class from this API, which
>> is not even assignable to my class, when this happens? Why would I ever get
>> a class U from a wrapper class declared to be a Wrapper<T>? It's just
>> confusing and makes it so much harder to pinpoint the real problem.
>> As I said, this weird behavior cost me a lot of time, and I've been googling
>> this for weeks and am getting the impression that very few Java developers
>> figured this API out. I posted a question
>> <
>> runtime-type-in-rdd-when-reading-from-avro-with-custom-serializer>
>> about it on Stack Overflow and got several views and upvotes but no replies
>> (a similar question
>> <
>> avroio-default-coder-in-dataflow/>
>> about loading custom types in Google Dataflow got answered within a couple
>> of days).
>> I think this behavior should be considered a bug.