spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Using Spark SQL with multiple (avro) files
Date Fri, 16 Jan 2015 21:54:22 GMT
I'd open an issue on GitHub asking us to support Hadoop's glob syntax for
the path.
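
Until that is supported, one possible workaround is to expand the glob
with Hadoop's FileSystem.globStatus and union the per-file results. This is
only a rough sketch. It assumes a spark-shell style session where sc and
sqlContext already exist, the spark-avro implicit that provides
sqlContext.avroFile, and the bucket layout from the original question.
loadAvroGlob is a hypothetical helper name, not part of Spark or spark-avro.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SchemaRDD}
import com.databricks.spark.avro._ // assumed: provides the implicit avroFile

// Hypothetical helper: expand a Hadoop glob, load each match, union the results.
def loadAvroGlob(sc: SparkContext, sqlContext: SQLContext, glob: String): SchemaRDD = {
  val pattern = new Path(glob)
  val fs = pattern.getFileSystem(sc.hadoopConfiguration)
  // globStatus can return null when nothing matches, so wrap it in Option.
  val matches = Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
  val paths = matches.map(_.getPath.toString)
  require(paths.nonEmpty, s"no files matched $glob")
  // Load each file separately and combine them into a single SchemaRDD.
  paths.map(p => sqlContext.avroFile(p)).reduce(_ unionAll _)
}

val records = loadAvroGlob(sc, sqlContext, "s3://my-bucket/avros/*/DATE/*.avro")
records.registerTempTable("data")
```

Note that unionAll expects every file to have the same schema, so files
with differing schemas would probably need their common fields selected
first. Unioning tens of thousands of inputs may also produce a very large
query plan, so this may only be practical for smaller batches.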

On Thu, Jan 15, 2015 at 4:57 AM, David Jones <letsnumsperiods@gmail.com>
wrote:

> I've tried this now. Spark can load multiple avro files from the same
> directory by passing a path to a directory. However, passing multiple paths
> separated with commas didn't work.
>
>
> Is there any way to load all avro files in multiple directories using
> sqlContext.avroFile?
>
> On Wed, Jan 14, 2015 at 3:53 PM, David Jones <letsnumsperiods@gmail.com>
> wrote:
>
>> Should I be able to pass multiple paths separated by commas? I haven't
>> tried but didn't think it'd work. I'd expected a function that accepted a
>> list of strings.
>>
>> On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska <yana.kadiyska@gmail.com>
>> wrote:
>>
>>> If the wildcard path you have doesn't work you should probably open a
>>> bug -- I had a similar problem with Parquet and it was a bug which recently
>>> got closed. Not sure if sqlContext.avroFile shares a codepath with .parquetFile...you
>>> can try running with bits that have the fix for .parquetFile or look at the
>>> source...
>>> Here was my question for reference:
>>>
>>> http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAAswR-5rFMU-y-7HtLUj2EQQaeCWJs8jH+iRrZHm7g1Ex7vSUw@mail.gmail.com%3E
>>>
>>> On Wed, Jan 14, 2015 at 4:34 AM, David Jones <letsnumsperiods@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a program that loads a single avro file using spark SQL, queries
>>>> it, transforms it and then outputs the data. The file is loaded with:
>>>>
>>>> val records = sqlContext.avroFile(filePath)
>>>> records.registerTempTable("data")
>>>> ...
>>>>
>>>>
>>>> Now I want to run it over tens of thousands of Avro files (all with
>>>> schemas that contain the fields I'm interested in).
>>>>
>>>> Is it possible to load multiple avro files recursively from a top-level
>>>> directory using wildcards? All my avro files are stored under
>>>> s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all
>>>> of these on EMR.
>>>>
>>>> If that's not possible, is there some way to load multiple avro files
>>>> into the same table/RDD so the whole dataset can be processed (and in that
>>>> case I'd supply paths to each file concretely, but I *really* don't want
>>>> to have to do that).
>>>>
>>>> Thanks
>>>> David
>>>>
>>>
>>>
>>
>
