I think this is the same thing we already discussed extensively on your JIRA.

The key/value class arguments to newAPIHadoopFile are not the types of your custom class, but of the Writable that describes how keys and values are encoded in the file. I think that's at least part of the problem. This is how all Hadoop-related APIs work, because Hadoop uses Writables for encoding.

You're asking again why it isn't caught at compile time, and that stems from two basic causes. First is the way the underlying Hadoop API works, needing Class parameters because of its Java roots. Second is the Scala/Java difference; the Scala API will accept, for instance, non-Writable arguments if you can supply an implicit conversion to Writable (if I recall correctly). This isn't available in Java, leaving its API expressing flexibility that isn't there. That isn't the exact issue here, though; it's that you're using raw class literals in Java, which carry no generic types -- they are Class<?>. The InputFormat arg expresses nothing about the key/value types, so there's nothing to 'contradict' your declaration, which doesn't represent the actual types correctly. (You can cast class literals to (Class<..>) to express this if you want. It's a bit of a mess in Java.) That's why it compiles, just as any Java code with an invalid unchecked cast compiles but fails at runtime.
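To make that last point concrete, here is a minimal sketch in plain Java, with no Spark or Hadoop involved. Everything in it (ErasureDemo, GenericRecord, the read method) is a hypothetical stand-in I made up for illustration; only the shape of the call is meant to resemble newAPIHadoopFile. It shows that the assignment to a wrongly typed variable succeeds, and that the ClassCastException appears only once the key is actually used as MyCustomClass.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class ErasureDemo {
    /** Hypothetical stand-in for the record type the input format really produces. */
    static class GenericRecord {}

    /** Hypothetical stand-in for the user's class. */
    static class MyCustomClass {
        String getSomeField() { return "value"; }
    }

    /**
     * Assumed shape, loosely modeled on newAPIHadoopFile: the Class arguments
     * only fix K and V for the compiler. Nothing ties them to formatClass or
     * to the objects that are actually produced.
     */
    @SuppressWarnings("unchecked")
    static <K, V> Map.Entry<K, V> read(Class<?> formatClass,
                                       Class<K> keyClass,
                                       Class<V> valueClass) {
        // What the "format" really returns, regardless of keyClass.
        Object actualKey = new GenericRecord();
        Object entry = new SimpleEntry<Object, Object>(actualKey, null);
        // Unchecked cast: erased at runtime, so it cannot fail here.
        return (Map.Entry<K, V>) entry;
    }

    /** Returns true if the mismatch surfaces only when the key is used. */
    static boolean failsOnlyOnAccess() {
        // Compiles and runs: no check against MyCustomClass happens here.
        Map.Entry<MyCustomClass, Void> first =
                read(Object.class, MyCustomClass.class, Void.class);
        try {
            // The compiler inserts the cast to MyCustomClass at this call.
            first.getKey().getSomeField();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("fails only on access: " + failsOnlyOnAccess());
    }
}
```

This is the same thing you saw with events.first(): the Tuple2 assignment is fine because generics are erased, and the failure shows up at first._1.getSomeField().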

It is a bit weird if you're not familiar with the Hadoop APIs, Writables, or how Class arguments shake out in the context of generics. It does take the research you did. It does work as you've found. The reason you were steered several times to the DataFrame API is that it can hide a lot of this from you, including details of Avro and Writables. You're directly accessing Hadoop APIs that are foreign to you.

This and the JIRA do not describe a bug.



On Mon, Mar 6, 2017 at 11:29 AM Nira <amitnira@gmail.com> wrote:
I tried to load a custom type from avro files into a RDD using the
newAPIHadoopFile. I started with the following naive code:

JavaPairRDD<MyCustomClass, NullWritable> events =
                sc.newAPIHadoopFile("file:/path/to/data.avro",
                        AvroKeyInputFormat.class, MyCustomClass.class,
                        NullWritable.class,
                        sc.hadoopConfiguration());
Tuple2<MyCustomClass, NullWritable> first = events.first();

This doesn't work and shouldn't work, because the AvroKeyInputFormat returns
a GenericData$Record. The thing is it compiles, and you can even assign the
first tuple to the variable "first". You will get a runtime error only when
you try to access a field of MyCustomClass from the tuple (e.g.
first._1.getSomeField()).
This behavior sent me on a wild goose chase that took many hours over many
weeks to figure out, because I never expected the method to return a wrong
type at runtime. If there's a mismatch between what the InputFormat returns
and the class I'm trying to load - shouldn't this be a compilation error? Or
at least the runtime error should occur already when I try to assign the
tuple to a variable of the wrong type. This is very unexpected behavior.

Moreover, I actually fixed my code and implemented all the required wrapper
and custom classes:
JavaPairRDD<MyCustomAvroKey, NullWritable> records =
                sc.newAPIHadoopFile("file:/path/to/data.avro",
                        MyCustomInputFormat.class, MyCustomAvroKey.class,
                        NullWritable.class,
                        sc.hadoopConfiguration());
Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
MyCustomAvroKey customKey = first._1;

But this time I forgot that I moved the class to another package so the
namespace in the schema file was wrong. And again, at runtime the method
datum() of customKey returned a GenericData$Record instead of a
MyCustomClass.

Now, I understand that this has to do with the avro library (the
GenericDatumReader class has an "expected" and "actual" schema, and it
defaults to a GenericData$Record if something is wrong with my schema). But
does it really make sense to return a different class from this API, which
is not even assignable to my class, when this happens? Why would I ever get
a class U from a wrapper class declared to be a Wrapper<T>? It's just
confusing and makes it so much harder to pinpoint the real problem.

As I said, this weird behavior cost me a lot of time, and I've been googling
this for weeks and am getting the impression that very few Java developers
figured this API out. I posted a question
<http://stackoverflow.com/questions/41836851/wrong-runtime-type-in-rdd-when-reading-from-avro-with-custom-serializer>
about it on StackOverflow and got several views and upvotes but no replies
(a similar question
<http://stackoverflow.com/questions/41834120/override-avroio-default-coder-in-dataflow/>
about loading custom types in Google Dataflow got answered within a couple
of days).

I think this behavior should be considered a bug.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Wrong-runtime-type-when-using-newAPIHadoopFile-in-Java-tp28459.html