spark-user mailing list archives

From David Jones <>
Subject Re: Using Spark SQL with multiple (avro) files
Date Wed, 14 Jan 2015 15:53:15 GMT
Should I be able to pass multiple paths separated by commas? I haven't
tried but didn't think it'd work. I'd expected a function that accepted a
list of strings.
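
For illustration, a list-of-paths loader along these lines could be sketched as below. This is a minimal sketch, not a confirmed API: `loadAll` is a hypothetical helper, the file names are made up, and the stand-in loader only wraps each path in a `Seq` so that the example runs without Spark. With Spark, `load` would presumably be `sqlContext.avroFile(_)` and `union` would be `_ unionAll _`, assuming the loaded tables have compatible schemas.

```scala
object MultiPathLoad {
  // Hypothetical helper: load every path with `load`, then fold the results
  // together with `union`. Returns None when the list of paths is empty.
  def loadAll[T](paths: Seq[String])(load: String => T)(union: (T, T) => T): Option[T] =
    paths.map(load).reduceOption(union)

  def main(args: Array[String]): Unit = {
    // Made-up file names, for illustration only.
    val paths = Seq("a.avro", "b.avro", "c.avro")

    // Stand-in loader and union so the sketch runs without Spark.
    // Assumption: with Spark SQL this might instead read
    //   loadAll(paths)(sqlContext.avroFile(_))(_ unionAll _)
    val combined = loadAll(paths)(p => Seq(p))(_ ++ _)
    println(combined)
  }
}
```

Chaining one union per file may get slow with tens of thousands of files, so a wildcard or comma-separated path would still be preferable if avroFile turns out to accept one.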

On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska <> wrote:

> If the wildcard path you have doesn't work you should probably open a bug
> -- I had a similar problem with Parquet and it was a bug which recently got
> closed. Not sure if sqlContext.avroFile shares a codepath with
> .parquetFile, but you can try running with bits that have the fix for
> .parquetFile or look at the source...
> Here was my question for reference:
> On Wed, Jan 14, 2015 at 4:34 AM, David Jones <>
> wrote:
>> Hi,
>> I have a program that loads a single avro file using spark SQL, queries
>> it, transforms it and then outputs the data. The file is loaded with:
>> val records = sqlContext.avroFile(filePath)
>> records.registerTempTable("data")
>> ...
>> Now I want to run it over tens of thousands of Avro files (all with
>> schemas that contain the fields I'm interested in).
>> Is it possible to load multiple avro files recursively from a top-level
>> directory using wildcards? All my avro files are stored under
>> s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
>> these on EMR.
>> If that's not possible, is there some way to load multiple avro files
>> into the same table/RDD so the whole dataset can be processed (and in that
>> case I'd supply paths to each file concretely, but I *really* don't want to
>> have to do that).
>> Thanks
>> David
