spark-user mailing list archives

From Yana Kadiyska <yana.kadiy...@gmail.com>
Subject Re: Using Spark SQL with multiple (avro) files
Date Wed, 14 Jan 2015 15:20:15 GMT
If the wildcard path you have doesn't work, you should probably open a bug.
I had a similar problem with Parquet, and it turned out to be a bug that was
recently closed. I'm not sure whether sqlContext.avroFile shares a code path
with .parquetFile. You could try running with bits that include the fix for
.parquetFile, or look at the source.
Here is my question, for reference:
http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAAswR-5rFMU-y-7HtLUj2EQQaeCWJs8jH+iRrZHm7g1Ex7vSUw@mail.gmail.com%3E

On Wed, Jan 14, 2015 at 4:34 AM, David Jones <letsnumsperiods@gmail.com>
wrote:

> Hi,
>
> I have a program that loads a single avro file using spark SQL, queries
> it, transforms it and then outputs the data. The file is loaded with:
>
> val records = sqlContext.avroFile(filePath)
> records.registerTempTable("data")  // returns Unit, so there is nothing to assign
> ...
>
>
> Now I want to run it over tens of thousands of Avro files (all with
> schemas that contain the fields I'm interested in).
>
> Is it possible to load multiple avro files recursively from a top-level
> directory using wildcards? All my avro files are stored under
> s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
> these on EMR.
>
> If that's not possible, is there some way to load multiple avro files into
> the same table/RDD so the whole dataset can be processed (and in that case
> I'd supply paths to each file concretely, but I *really* don't want to have
> to do that).
>
> Thanks
> David
>
