spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Miller <cmiller11...@gmail.com>
Subject Repeating Records w/ Spark + Avro?
Date Fri, 11 Mar 2016 23:19:01 GMT
I have a bit of a strange situation:

*****************
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper, AvroKey}
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.{NullWritable, WritableUtils}

val path = "/path/to/data.avro"

val rdd = sc.newAPIHadoopFile(path,
classOf[AvroKeyInputFormat[GenericRecord]],
classOf[AvroKey[GenericRecord]], classOf[NullWritable])
rdd.take(10).foreach( x => println( x._1.datum() ))
*****************

In this situation, I get the right number of records returned, and if I
look at the contents of rdd I see the individual records as tuple2's...
however, if I println on each one as shown above, I get the same result
every time.

Apparently this has to do with something in Spark or Avro keeping a
reference to the item its iterating over, so I need to clone the object
before I use it. However, if I try to clone it (from the spark-shell
console), I get:

*****************
rdd.take(10).foreach( x => {
  val clonedDatum = x._1.datum().clone()
  println(clonedDatum.datum())
})

<console>:37: error: method clone in class Object cannot be accessed in
org.apache.avro.generic.GenericRecord
 Access to protected method clone not permitted because
 prefix type org.apache.avro.generic.GenericRecord does not conform to
 class $iwC where the access take place
                val clonedDatum = x._1.datum().clone()
*****************

So, how can I clone the datum?

Seems I'm not the only one who ran into this problem:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't
figure out how to fix it in my case without hacking away like the person in
the linked PR did.

Suggestions?

--
Chris Miller

Mime
View raw message