spark-user mailing list archives

From David Jones <>
Subject Using Spark SQL with multiple (avro) files
Date Wed, 14 Jan 2015 09:34:59 GMT

I have a program that loads a single Avro file using Spark SQL, queries it,
transforms it, and then outputs the data. The file is loaded with:

import com.databricks.spark.avro._

val records = sqlContext.avroFile(filePath)
records.registerTempTable("data")

(Note that registerTempTable returns Unit, so there's nothing useful to
assign its result to.)

Now I want to run it over tens of thousands of Avro files (all with schemas
that contain the fields I'm interested in).

Is it possible to load multiple Avro files recursively from a top-level
directory using wildcards? All my Avro files are stored under
s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
these on EMR.
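Concretely, what I'd hope would work is passing a Hadoop-style glob straight to avroFile, something like the sketch below. I haven't confirmed that spark-avro's avroFile accepts globs, and DATE here stands in for the actual date directories as in my layout above:

```scala
// Sketch (unverified): rely on Hadoop path-glob expansion so every
// matching .avro file under the bucket is loaded into one SchemaRDD.
import com.databricks.spark.avro._

val records = sqlContext.avroFile("s3://my-bucket/avros/*/DATE/*.avro")
records.registerTempTable("data")
```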

If that's not possible, is there some way to load multiple Avro files into
the same table/RDD so the whole dataset can be processed? (In that case I'd
have to supply the path to each file explicitly, which I *really* don't
want to do.)
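The fallback I had in mind, if a single glob isn't supported, is to load each file into its own SchemaRDD and union them. This is just a sketch, assuming the per-file schemas are union-compatible; the paths sequence is a placeholder I'd have to populate by listing the bucket:

```scala
// Hypothetical fallback: one avroFile call per path, combined with
// unionAll into a single table. Placeholder paths, not real ones.
import com.databricks.spark.avro._

val paths: Seq[String] = Seq(/* one entry per .avro file */)
val all = paths.map(p => sqlContext.avroFile(p)).reduce(_ unionAll _)
all.registerTempTable("data")
```

Building that paths list for tens of thousands of files is exactly the part I'd rather avoid, hence the glob question above.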

