spark-user mailing list archives

From Simone Franzini <captainfr...@gmail.com>
Subject Re: AVRO specific records
Date Fri, 07 Nov 2014 18:19:34 GMT
OK, that turned out to be a dependency conflict between Hadoop1 and Hadoop2 that
I have not fully solved yet. I am able to run with Hadoop1 and AVRO in
standalone mode, but not with Hadoop2 (even after trying to fix the
dependencies).
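
In case anyone else hits the same wall: the usual suspect seems to be pulling
in a build of avro-mapred compiled against the Hadoop1 APIs while running on
Hadoop2. A build change along these lines is what I am experimenting with
(the version number is just the one I happen to pin; substitute your own):

```scala
// sbt: request the "hadoop2" classifier of avro-mapred, so that Avro's
// mapreduce classes (AvroKeyInputFormat etc.) are compiled against the
// Hadoop 2 APIs, where TaskAttemptContext is an interface.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"
```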

Anyway, I am now trying to write to AVRO, using a snippet very similar to the
one I use to read from AVRO:

val withValues: RDD[(AvroKey[Subscriber], NullWritable)] =
  records.map { s => (new AvroKey(s), NullWritable.get) }
val outPath = "myOutputPath"
val writeJob = new Job()
FileOutputFormat.setOutputPath(writeJob, new Path(outPath))
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Any]])
records.saveAsNewAPIHadoopFile(outPath,
  classOf[AvroKey[Subscriber]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[Subscriber]],
  writeJob.getConfiguration)

Now, my problem is that this writes to a plain text file. I need to write
to binary AVRO. What am I missing?
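
One thing I notice re-reading my own snippet: I build withValues but then call
saveAsNewAPIHadoopFile on records, so the untransformed RDD is what actually
gets written. A sketch of what I will try next (same class names as above;
untested):

```scala
// Save the (AvroKey, NullWritable) pairs rather than the raw records,
// and keep the output format's type parameter consistent with Subscriber.
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Subscriber]])
withValues.saveAsNewAPIHadoopFile(outPath,
  classOf[AvroKey[Subscriber]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[Subscriber]],
  writeJob.getConfiguration)
```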

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Thu, Nov 6, 2014 at 3:15 PM, Simone Franzini <captainfranz@gmail.com>
wrote:

> Benjamin,
>
> Thanks for the snippet. I have tried using it, but unfortunately I get the
> following exception. I am at a loss as to what might be wrong. Any ideas?
>
> java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
>     at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>     at org.apache.spark.scheduler.Task.run(Task.scala:54)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
> On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin <
> Benjamin.Laird@capitalone.com> wrote:
>
>> Something like this works and is how I create an RDD of specific records:
>>
>> val avroRdd = sc.newAPIHadoopFile("twitter.avro",
>>   classOf[AvroKeyInputFormat[twitter_schema]],
>>   classOf[AvroKey[twitter_schema]],
>>   classOf[NullWritable], conf)
>>
>> (From
>> https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
>> Keep in mind you'll need to use the Kryo serializer as well.
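>>
>> A minimal sketch of the Kryo setup I mean (the registrator class name is a
>> placeholder; adapt it to wherever you register your Avro classes):

```scala
// Switch Spark to Kryo serialization and point it at a custom registrator
// that knows about the Avro specific-record classes.
// "com.example.MyAvroRegistrator" is hypothetical, not a real class.
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyAvroRegistrator")
```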
>>
>> From: Frank Austin Nothaft <fnothaft@berkeley.edu>
>> Date: Wednesday, November 5, 2014 at 5:06 PM
>> To: Simone Franzini <captainfranz@gmail.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: AVRO specific records
>>
>> Hi Simone,
>>
>> Matt Massie put together a good tutorial on his blog
>> <http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/>. If you’re
>> looking for more code using Avro, we use it pretty extensively in our
>> genomics project. Our Avro schemas are here
>> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>,
>> and we have serialization code here
>> <https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization>.
>> We use Parquet for storing the Avro records, but there is also an Avro
>> HadoopInputFormat.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466
>>
>> On Nov 5, 2014, at 1:25 PM, Simone Franzini <captainfranz@gmail.com>
>> wrote:
>>
>> How can I read/write AVRO specific records?
>> I found several snippets using generic records, but nothing with specific
>> records so far.
>>
>> Thanks,
>> Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>>
>>
>
>
