spark-user mailing list archives

From thomas j <beanb...@googlemail.com>
Subject Querying over multiple (Avro) files using Spark SQL
Date Tue, 13 Jan 2015 17:25:15 GMT
Hi,

I have a program that loads a single Avro file using Spark SQL, queries it,
transforms it, and then outputs the data. The file is loaded with:

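import com.databricks.spark.avro._  // spark-avro implicits that provide avroFile

val records = sqlContext.avroFile(filePath)
// registerTempTable returns Unit, so there is nothing useful to assign
records.registerTempTable("data")
...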


Now I want to run it over tens of thousands of Avro files (all with schemas
that contain the fields I'm interested in).

Is it possible to load multiple Avro files recursively from a top-level
directory using wildcards? All my Avro files are stored under
s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
them.

If that's not possible, is there some way to load multiple Avro files into
the same table/RDD so the whole dataset can be processed? In that case I'd
supply the path to each file explicitly, but I *really* don't want to have
to do that.

Thanks
