spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: NEW to spark and sparksql
Date Wed, 19 Nov 2014 21:45:28 GMT
In general you should be able to read full directories of files as a single
RDD/SchemaRDD.  For documentation I'd suggest the programming guides:

http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/sql-programming-guide.html
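
As a rough sketch of the "whole directory as one RDD" idea: Spark's file-based read calls such as sc.textFile take a path string that may contain globs and comma-separated entries, so a day or a week of hourly files can be named in one string. The directory layout (root/YYYY/MM/DD/*.avro), the root path, and the helper names below are assumptions for illustration, not something from your setup, and I have not verified here how the Avro library treats such paths.

```python
from datetime import date, timedelta


def day_glob(root, day):
    """Glob matching every hourly file in one day's directory.

    Assumes a hypothetical layout of <root>/YYYY/MM/DD/<hour>.avro.
    """
    return f"{root}/{day:%Y/%m/%d}/*.avro"


def range_paths(root, start, days):
    """Comma-separated globs covering `days` consecutive days from `start`."""
    return ",".join(
        day_glob(root, start + timedelta(days=i)) for i in range(days)
    )


# Hypothetical root directory; substitute your own.
ROOT = "hdfs:///data/events"

one_day = day_glob(ROOT, date(2014, 11, 19))
one_week = range_paths(ROOT, date(2014, 11, 13), 7)

# With a live SparkContext `sc`, a string like `one_week` can be passed to a
# single read call (for text input, sc.textFile(one_week)), which gives one
# RDD over all matching files rather than one RDD per file.
print(one_day)
```

Widening the window to a month is then just a larger `days` value; the read call itself does not change.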

For Avro in particular, I have been working on a library for Spark SQL.
It's very early code, but you can find it here:
https://github.com/databricks/spark-avro

Bug reports welcome!

Michael

On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint <sam.flint@magnetic.com> wrote:

> Hi,
>
>     I am new to Spark.  I have begun reading to understand Spark's RDD
> files as well as SparkSQL.  My question is more on how to build out the RDD
> files and best practices.   I have data that is broken down by hour into
> files on HDFS in Avro format.   Do I need to create a separate RDD for each
> file, or, using SparkSQL, a separate SchemaRDD?
>
> I want to be able to pull, let's say, an entire day of data into Spark and
> run some analytics on it.  Then possibly a week, a month, etc.
>
>
> If there is documentation on this procedure or best practices for building
> RDDs, please point me to them.
>
> Thanks for your time,
>    Sam
>
>
>
>
