spark-issues mailing list archives

From "Nira Amit (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer
Date Wed, 01 Feb 2017 23:03:52 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848981#comment-15848981 ]

Nira Amit edited comment on SPARK-19424 at 2/1/17 11:03 PM:
------------------------------------------------------------

[~srowen] thanks for elaborating, but in that case I just can't find a way to do this in Java other than loading a GenericData.Record and converting it to my class afterwards (a rough sketch of that fallback is at the end of this comment). From what I've googled, it appears to be possible in Scala this way:
{code}
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable])
{code}
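A direct Java translation of that call has to go through the old org.apache.avro.mapred classes (AvroInputFormat/AvroWrapper), and Java has no classOf, so getting the parameterized Class objects needs something like an unchecked double-cast. Roughly this (just a sketch of what I mean by the translation; sc is the JavaSparkContext):
{code}
// old "mapred" Avro classes: org.apache.avro.mapred.AvroInputFormat / AvroWrapper
@SuppressWarnings("unchecked")
Class<AvroWrapper<MyClassInAvroFile>> keyClass =
        (Class<AvroWrapper<MyClassInAvroFile>>) (Class<?>) AvroWrapper.class;
@SuppressWarnings("unchecked")
Class<AvroInputFormat<MyClassInAvroFile>> inputFormatClass =
        (Class<AvroInputFormat<MyClassInAvroFile>>) (Class<?>) AvroInputFormat.class;

// JavaSparkContext.hadoopFile(path, inputFormatClass, keyClass, valueClass)
JavaPairRDD<AvroWrapper<MyClassInAvroFile>, NullWritable> records =
        sc.hadoopFile("/path/to/the/avro/file.avro",
                inputFormatClass, keyClass, NullWritable.class);
{code}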
So I tried to "translate" this to Java as best I could, along those lines (hence the funky way of getting the Class objects), but nothing works. I also tried with classes that extend AvroKey and FileInputFormat:
{code}
public static class MyCustomAvroKey extends AvroKey<MyCustomClass> { }

public static class MyCustomAvroReader
        extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
    // with my custom schema and all the required methods...
}

public static class MyCustomInputFormat extends FileInputFormat<MyCustomAvroKey, NullWritable> {

    @Override
    public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
            throws IOException, InterruptedException {
        return new MyCustomAvroReader();
    }
}
...
JavaPairRDD<MyCustomAvroKey, NullWritable> records =
        sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                MyCustomInputFormat.class, MyCustomAvroKey.class,
                NullWritable.class,
                sc.hadoopConfiguration());
Tuple2<MyCustomAvroKey, NullWritable> first = records.first();
System.out.println("Got a result, id: " + first._1.datum().getSomeField());
{code}
But again, the object inside MyCustomAvroKey is a GenericData.Record and not a MyCustomClass:
{code}
    java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.package.containing.MyCustomClass
{code}
Am I still doing it wrong, or is this just not possible in Java?
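
For completeness, the fallback I mentioned at the top (loading GenericData.Record and converting to my class afterwards) would look roughly like this; the field name and setter on MyCustomClass are just placeholders:
{code}
// new "mapreduce" Avro classes: org.apache.avro.mapreduce.AvroKeyInputFormat;
// keys come back as GenericRecord
@SuppressWarnings("unchecked")
Class<AvroKey<GenericRecord>> keyClass =
        (Class<AvroKey<GenericRecord>>) (Class<?>) AvroKey.class;
@SuppressWarnings("unchecked")
Class<AvroKeyInputFormat<GenericRecord>> inputFormatClass =
        (Class<AvroKeyInputFormat<GenericRecord>>) (Class<?>) AvroKeyInputFormat.class;

JavaPairRDD<AvroKey<GenericRecord>, NullWritable> raw =
        sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                inputFormatClass, keyClass, NullWritable.class,
                sc.hadoopConfiguration());

// convert each GenericRecord into MyCustomClass immediately, so the record reader's
// object reuse never leaks out of this map
JavaRDD<MyCustomClass> converted = raw.map(pair -> {
    GenericRecord record = pair._1().datum();
    MyCustomClass obj = new MyCustomClass();
    obj.setSomeField((Long) record.get("someField")); // placeholder field/setter
    return obj;
});
{code}
Doing the conversion inside the map also avoids the usual Hadoop-RDD caveat about the RecordReader reusing key objects when the RDD is cached or shuffled directly.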


> Wrong runtime type in RDD when reading from avro with custom serializer
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19424
>                 URL: https://issues.apache.org/jira/browse/SPARK-19424
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.0.2
>         Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>            Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code compiles fine, but at runtime I'm getting a ClassCastException. Here is what my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
>     kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
>                 sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>                 AvroKeyInputFormat.class, MyCustomClass.class, NullWritable.class,
>                 sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD has a kClassTag of my.package.containing.MyCustomClass, the variable first contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error rather than a runtime error?


