spark-user mailing list archives

From thomas j <beanb...@googlemail.com>
Subject Re: How can I read this avro file using spark & scala?
Date Fri, 21 Nov 2014 13:04:11 GMT
I've been able to load a different avro file based on GenericRecord with:

val person = sqlContext.avroFile("/tmp/person.avro")

When I call `first()` on it, I get a NotSerializableException again:

person.first()

...
14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 20)
java.io.NotSerializableException: org.apache.avro.generic.GenericData$Record
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    ...
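
For anyone reading along, the trace above (writeObject0 -> writeArray -> writeObject0) is what plain Java serialization does when an array's elements are not Serializable. Below is a minimal sketch that reproduces the same failure without Spark or Avro. `FakeRecord`, `javaSerializable` and `toPlain` are hypothetical names invented for this illustration. `FakeRecord` only stands in for a record class that does not implement java.io.Serializable. This shows the mechanism, not a confirmed fix for spark-avro.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a record type that does NOT implement java.io.Serializable.
class FakeRecord(val userId: Int, val name: String)

object SerializationSketch {
  // True if Java serialization accepts the value, false if it throws NotSerializableException.
  def javaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Copy the fields into plain serializable values (a tuple of Int and String).
  def toPlain(r: FakeRecord): (Int, String) = (r.userId, r.name)

  def main(args: Array[String]): Unit = {
    val records = Array(new FakeRecord(1, "a"), new FakeRecord(2, "b"))
    // The array itself is serializable, but its elements are not.
    println(javaSerializable(records))
    // After copying each record into a tuple, serialization goes through.
    println(javaSerializable(records.map(toPlain)))
  }
}
```

The general idea is that anything shipped back to the driver has to be serializable, so copying the needed fields into plain values before they leave the executor avoids this class of error.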

Apart from this, I want to transform the records into pairs of (user_id,
record). I can do this by specifying the offset of the user_id column with
something like this:

person.map(r => (r.getInt(2), r)).take(4)

Is there a way to specify the column name ("user_id") instead of needing to
know or calculate the offset?
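
One possible approach, sketched here in plain Scala so it runs without Spark: resolve the offset from the ordered list of field names once, then use that index in the map. The field names and rows below are invented for illustration, and `ColumnByName` and `indexOfColumn` are hypothetical helpers. I am assuming the field names can be obtained from the loaded data's schema (for example something like person.schema), which I have not verified against the 1.2 snapshot.

```scala
object ColumnByName {
  // Stand-in for the schema's ordered field names (invented for this example).
  val fieldNames: Seq[String] = Seq("name", "age", "user_id")

  // Resolve the offset of a column by name, failing early if it is missing.
  def indexOfColumn(names: Seq[String], column: String): Int = {
    val i = names.indexOf(column)
    require(i >= 0, s"no such column: $column")
    i
  }

  def main(args: Array[String]): Unit = {
    // Look the index up once, outside the per-record function.
    val userIdIdx = indexOfColumn(fieldNames, "user_id")

    // Stand-in rows; with Spark the equivalent would be
    // person.map(r => (r.getInt(userIdIdx), r))
    val rows: Seq[Seq[Any]] = Seq(Seq("ann", 31, 7), Seq("bob", 45, 9))
    val pairs = rows.map(r => (r(userIdIdx).asInstanceOf[Int], r))
    pairs.foreach(println)
  }
}
```

Looking the index up once keeps the per-record work the same as the hard-coded offset, and the `require` turns a misspelt column name into an immediate error rather than a wrong column.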

Thanks again


On Fri, Nov 21, 2014 at 11:48 AM, thomas j <beanboy3@googlemail.com> wrote:

> Thanks for the pointer Michael.
>
> I've downloaded spark 1.2.0 from
> https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
> built the spark-avro repo you linked to.
>
> When I run it against the example avro file linked to in the documentation
> it works. However, when I try to load my avro file (linked to in my
> original question) I receive the following error:
>
> java.lang.RuntimeException: Unsupported type LONG
>     at scala.sys.package$.error(package.scala:27)
>     at com.databricks.spark.avro.AvroRelation.com$databricks$spark$avro$AvroRelation$$toSqlType(AvroRelation.scala:116)
>     at com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:97)
>     at com.databricks.spark.avro.AvroRelation$$anonfun$5.apply(AvroRelation.scala:96)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     ...
>
> If this is useful, I'm happy to try loading the various avro files I have
> to battle-test spark-avro.
>
> Thanks
>
> On Thu, Nov 20, 2014 at 6:30 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> One option (starting with Spark 1.2, which is currently in preview) is to
>> use the Avro library for Spark SQL.  This is very new, but we would love to
>> get feedback: https://github.com/databricks/spark-avro
>>
>> On Thu, Nov 20, 2014 at 10:19 AM, al b <beanboy3@googlemail.com> wrote:
>>
>>> I've read several posts of people struggling to read avro in spark. The
>>> examples I've tried don't work. When I try this solution (
>>> https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
>>> I get errors:
>>>
>>> spark java.io.NotSerializableException:
>>> org.apache.avro.mapred.AvroWrapper
>>>
>>> How can I read the following sample file in spark using scala?
>>>
>>> http://www.4shared.com/file/SxnYcdgJce/sample.html
>>>
>>> Thomas
>>>
>>
>>
>
