spark-user mailing list archives

From "M. Dale" <medal...@yahoo.com.INVALID>
Subject Re: spark sql writing in avro
Date Fri, 13 Mar 2015 20:57:29 GMT
I probably did not do a good enough job explaining the problem. If you 
used Maven with the default Maven repository, you have an old version of 
spark-avro that does not contain AvroSaver and does not have the 
saveAsAvro method implemented.

Assuming you use the default Maven repo location:
cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver

Comes up empty. The jar file does not contain this class because 
AvroSaver.scala wasn't added until January 21, while the published jar 
dates from November 14, 2014.

So:
git clone git@github.com:databricks/spark-avro.git
cd spark-avro
sbt publish-m2

This publishes the latest master code (including AvroSaver etc.) to your 
local Maven repository, so Maven will pick up the latest version of 
spark-avro on this machine.
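
To verify, the same jar check from above should now find the class in the 
freshly published artifact (assuming the default local repo path):

jar tvf ~/.m2/repository/com/databricks/spark-avro_2.10/0.1/spark-avro_2.10-0.1.jar | grep AvroSaver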

Now you should be able to compile and run.
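
For example, a minimal spark-shell sketch of both write paths once the 
rebuilt jar is on the classpath (the input and output paths below are 
placeholders, not from the original setup):

    import com.databricks.spark.avro._
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val records = sqlContext.avroFile("/path/to/input.avro")

    // Explicit saver object, as in the original post:
    AvroSaver.save(records, "/path/to/output")

    // Implicit saveAsAvroFile added to SchemaRDD by the spark-avro import:
    records.saveAsAvroFile("/path/to/output2")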

HTH,
Markus

On 03/12/2015 11:55 PM, Kevin Peng wrote:
> Dale,
>
> I basically have the same Maven dependency as above, but my code will 
> not compile because it cannot reference AvroSaver, though the saveAsAvro 
> reference compiles fine, which is weird.  Even though saveAsAvro 
> compiles for me, it errors out when running the Spark job because the 
> method is not implemented (the job quits and reports a "not implemented" 
> method or something along those lines).
>
> I will try going through the spark-shell and passing in the jar built 
> from GitHub, since I haven't tried that quite yet.
>
> On Thu, Mar 12, 2015 at 6:44 PM, M. Dale <medale94@yahoo.com> wrote:
>
>     Short answer: if you downloaded spark-avro from the
>     repo.maven.apache.org repo you might be using an old version
>     (pre-November 14, 2014) - see timestamps at
>     http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>     Lots of changes at https://github.com/databricks/spark-avro since
>     then.
>
>     Databricks, thank you for sharing the Avro code!!!
>
>     Could you please push out the latest version, or update the version
>     number and republish to repo.maven.apache.org (I have no idea how
>     jars get there)? Or is there a different repository that users
>     should point to for this artifact?
>
>     Workaround: download from https://github.com/databricks/spark-avro,
>     build with the latest functionality (still version 0.1), and add it
>     to your local Maven or Ivy repo.
>
>     Long version:
>     I used a default Maven build and declared my dependency on:
>
>             <dependency>
>                 <groupId>com.databricks</groupId>
>                 <artifactId>spark-avro_2.10</artifactId>
>                 <version>0.1</version>
>             </dependency>
>
>     Maven downloaded the 0.1 version from
>     http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>     and included it in my app code jar.
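>
>     (If you use sbt rather than Maven, the equivalent coordinate should
>     be the following - a sketch, using the same group/artifact/version
>     as the POM snippet above:
>
>         libraryDependencies += "com.databricks" %% "spark-avro" % "0.1"
>
>     where %% appends the Scala binary version suffix, here _2.10.)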
>
>     From spark-shell:
>
>     import com.databricks.spark.avro._
>     import org.apache.spark.sql.SQLContext
>     val sqlContext = new SQLContext(sc)
>
>     // This schema includes LONG for time in millis - see
>     // https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl
>     val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
>     java.lang.RuntimeException: Unsupported type LONG
>
>     However, after checking out the spark-avro code from its GitHub repo
>     and adding a test case against the MailRecord avro, everything ran
>     fine.
>
>     So I built the databricks spark-avro locally on my box, put it in my
>     local Maven repo, and everything worked from spark-shell when adding
>     that jar as a dependency.
>
>     Hope this helps for the "save" case as well. On the pre-November 14
>     version, avro.scala says:
>
>       // TODO: Implement me.
>       implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
>         def saveAsAvroFile(path: String): Unit = ???
>       }
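>
>     (As an aside - a sketch of why the runtime failure looks the way it
>     does: ??? is Scala's Predef.???, which throws
>     scala.NotImplementedError when invoked. With a stand-in stub in the
>     REPL:
>
>       scala> def saveStub(path: String): Unit = ???
>       scala> saveStub("/tmp/out")
>       scala.NotImplementedError: an implementation is missing
>
>     which matches the "not implemented" error from the original post.)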
>
>     Markus
>
>     On 03/12/2015 07:05 PM, kpeng1 wrote:
>
>         Hi All,
>
>         I am currently trying to write out a SchemaRDD to avro.  I
>         noticed that there is a databricks spark-avro library and I have
>         included that in my dependencies, but it looks like I am not
>         able to access the AvroSaver object.  On compilation of the job
>         I get this:
>         error: not found: value AvroSaver
>         [ERROR]     AvroSaver.save(resultRDD, args(4))
>
>         I also tried calling saveAsAvro on the resultRDD (the actual
>         RDD with the results), and that passes compilation, but when I
>         run the code I get an error that says saveAsAvro is not
>         implemented.  I am using version 0.1 of spark-avro_2.10.

