spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Wrong runtime type when using newAPIHadoopFile in Java
Date Mon, 06 Mar 2017 12:02:12 GMT
I think this is the same thing we already discussed extensively on your
JIRA.

The key/value class arguments to newAPIHadoopFile are not the type of your
custom class, but of the Writables describing the encoding of keys and
values in the file. I think that's the start of the problem.
This is how all Hadoop-related APIs work, because Hadoop uses
Writables for encoding.

You're asking again why it isn't caught at compile time, and that stems
from two basic causes. First is the way the underlying Hadoop API works,
needing Class parameters because of its Java roots. Second is the
Scala/Java difference; the Scala API will accept, for instance,
non-Writable arguments if you can supply implicit conversion to Writable
(if I recall correctly). This isn't available in Java, leaving its API
expressing flexibility that isn't there. This isn't the exact issue here;
it's that you're using raw class literals in Java which have no generic
types -- they are Class<?>. The InputFormat arg expresses nothing about the
key/value types; there's nothing to 'contradict' your declaration, which
doesn't represent the actual types correctly. (You can cast class literals
to (Class<..>) to express this if you want. It's a bit of a mess in Java.)
That's why it compiles just as any Java code with an invalid cast compiles
but fails at runtime.
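
To make the erasure point concrete, here is a minimal sketch in plain Java with no Spark, Hadoop, or Avro on the classpath. Every name in it (ErasureDemo, GenericRecord, MyCustomClass, Holder, load) is a hypothetical stand-in chosen for illustration, not a real API. It only assumes that the loader takes a Class argument to fix the generic type and never checks it against what it actually produces, which is roughly the situation described above.

```java
// Stand-ins only: these classes are hypothetical, not Spark, Hadoop, or Avro types.
class ErasureDemo {
    // Plays the role of GenericData.Record: what the reader really produces.
    static class GenericRecord { }

    // Plays the role of the custom class the caller expects.
    static class MyCustomClass {
        String getSomeField() { return "value"; }
    }

    // Plays the role of Tuple2: its type parameter is erased at runtime.
    static class Holder<K> {
        final K key;
        Holder(K key) { this.key = key; }
    }

    // The Class<K> argument only fixes K for the compiler; nothing checks it
    // against the object that is actually produced.
    @SuppressWarnings("unchecked")
    static <K> Holder<K> load(Class<K> keyClass) {
        Object decoded = new GenericRecord();
        return new Holder<>((K) decoded); // unchecked cast: no runtime check happens here
    }

    static String describe() {
        // Compiles and runs, even though the holder contains a GenericRecord.
        Holder<MyCustomClass> first = load(MyCustomClass.class);

        // Reading the key through a wildcard view involves no cast at all.
        Holder<?> untyped = first;
        Object raw = untyped.key;
        String actual = raw.getClass().getSimpleName();

        try {
            // The compiler inserts a cast to MyCustomClass here, and it fails.
            first.key.getSomeField();
            return actual + ": no error";
        } catch (ClassCastException e) {
            return actual + ": ClassCastException on field access";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The assignment to `first` succeeds because the generic parameter is erased; the failure surfaces only where the compiler needs the value to really be a MyCustomClass, which matches the point at which the error showed up in your code.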

It is a bit weird if you're not familiar with the Hadoop APIs, Writables,
or how Class arguments shake out in the context of generics. It does take
the research you did. It does work as you've found. The reason you were
steered several times to the DataFrame API is that it can hide a lot of
this from you, including details of Avro and Writables. You're directly
accessing Hadoop APIs that are foreign to you.

This and the JIRA do not describe a bug.



On Mon, Mar 6, 2017 at 11:29 AM Nira <amitnira@gmail.com> wrote:

> I tried to load a custom type from Avro files into an RDD using
> newAPIHadoopFile. I started with the following naive code:
>
> JavaPairRDD<MyCustomClass, NullWritable> events =
>                 sc.newAPIHadoopFile("file:/path/to/data.avro",
>                 AvroKeyInputFormat.class, MyCustomClass.class,
>                 NullWritable.class,
>                 sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = events.first();
>
> This doesn't work and shouldn't work, because the AvroKeyInputFormat returns
> a GenericData$Record. The thing is, it compiles, and you can even assign the
> first tuple to the variable "first". You will get a runtime error only when
> you try to access a field of MyCustomClass from the tuple (e.g.
> first._1.getSomeField()).
> This behavior sent me on a wild goose chase that took many hours over many
> weeks to figure out, because I never expected the method to return a wrong
> type at runtime. If there's a mismatch between what the InputFormat returns
> and the class I'm trying to load, shouldn't this be a compilation error? Or
> at least the runtime error should occur already when I try to assign the
> tuple to a variable of the wrong type. This is very unexpected behavior.
>
> Moreover, I actually fixed my code and implemented all the required wrapper
> and custom classes:
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>                 sc.newAPIHadoopFile("file:/path/to/data.avro",
>                         MyCustomInputFormat.class, MyCustomAvroKey.class,
>                         NullWritable.class,
>                         sc.hadoopConfiguration());
> Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
> MyCustomAvroKey customKey = first._1;
>
> But this time I forgot that I had moved the class to another package, so the
> namespace in the schema file was wrong. And again, at runtime the method
> datum() of customKey returned a GenericData$Record instead of a
> MyCustomClass.
>
> Now, I understand that this has to do with the Avro library (the
> GenericDatumReader class has an "expected" and "actual" schema, and it
> defaults to a GenericData$Record if something is wrong with my schema). But
> does it really make sense to return a different class from this API, which
> is not even assignable to my class, when this happens? Why would I ever get
> a class U from a wrapper class declared to be a Wrapper<T>? It's just
> confusing and makes it so much harder to pinpoint the real problem.
>
> As I said, this weird behavior cost me a lot of time, and I've been googling
> this for weeks and am getting the impression that very few Java developers
> have figured this API out. I posted a question
> <http://stackoverflow.com/questions/41836851/wrong-runtime-type-in-rdd-when-reading-from-avro-with-custom-serializer>
> about it on StackOverflow and got several views and upvotes but no replies
> (a similar question
> <http://stackoverflow.com/questions/41834120/override-avroio-default-coder-in-dataflow/>
> about loading custom types in Google Dataflow got answered within a couple
> of days).
>
> I think this behavior should be considered a bug.