Hi Michael, 

I have got the directory-based column support working, at least in a trial. I have put the trial code here - DirIndexParquet.scala - though it has involved copying quite a lot of newParquet.

There are some tests here that illustrate parquet use.

I’d be keen to help in any way with the datasources API changes that you mention - would you like to discuss?

Thanks

Mick



On 30 Dec 2014, at 17:40, Michael Davies <michael.belldavies@gmail.com> wrote:

Hi Michael, 

I’ve looked through the example and the test cases and I think I understand what we need to do - so I’ll give it a go. 

I think what I’d like to try to do is allow files to be added at any time, so perhaps I can cache partition info. It would also be useful for us to derive the schema from the set of all files - hopefully this is achievable too.
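Roughly, the schema-from-all-files idea could look like this - a minimal sketch with hypothetical names (not the actual trial code), treating each file's schema as a column-name-to-type map and taking the union across files:

```scala
// Hypothetical sketch: merge per-file schemas into one union schema.
// Assumes columns with the same name have the same type in every file.
object SchemaMerge {
  type Schema = Map[String, String] // column name -> type name

  // Fold all per-file schemas together; later files add any new columns.
  def merge(schemas: Seq[Schema]): Schema =
    schemas.foldLeft(Map.empty[String, String])(_ ++ _)

  def main(args: Array[String]): Unit = {
    val fileA = Map("id" -> "int", "region" -> "string")
    val fileB = Map("id" -> "int", "amount" -> "double")
    println(merge(Seq(fileA, fileB)))
  }
}
```

A real implementation would of course need to detect and report conflicting types rather than silently letting the last file win.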

Thanks

Mick


On 30 Dec 2014, at 04:49, Michael Armbrust <michael@databricks.com> wrote:

You can't do this now without writing a bunch of custom logic (see here for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)

I would like to make this easier as part of improvements to the datasources API that we are planning for Spark 1.3.

On Mon, Dec 29, 2014 at 2:19 AM, Mickalas <Michael.BellDavies@gmail.com> wrote:
I see that there is already a request to add wildcard support to the
SQLContext.parquetFile function
https://issues.apache.org/jira/browse/SPARK-3928.

What seems like a useful thing for our use case is to associate the
directory structure with certain columns in the table, but it does not seem
like this is supported.

For example we want to create parquet files on a daily basis associated with
geographic regions and so will create a set of files under directories such
as:

* 2014-12-29/Americas
* 2014-12-29/Asia
* 2014-12-30/Americas
* ...

Where queries have predicates that match column values determinable from the
directory structure, it would be good to extract data only from the matching
files.
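To make the idea concrete, here is a minimal sketch (hypothetical names, not
Spark's API): derive column values from a path like "2014-12-29/Americas" and
drop non-matching files before any parquet data is read:

```scala
// Hypothetical sketch of directory-derived columns and file pruning.
object DirPruning {
  // Partition column values recovered purely from the file's path.
  case class PartitionedFile(path: String, date: String, region: String)

  // Parse "<date>/<region>/part-N.parquet" into its partition columns.
  def parse(path: String): PartitionedFile = {
    val parts = path.split("/")
    PartitionedFile(path, parts(0), parts(1))
  }

  // Keep only files whose directory-derived columns satisfy the predicate,
  // so the contents of the other files never need to be scanned.
  def prune(paths: Seq[String])(pred: PartitionedFile => Boolean): Seq[String] =
    paths.map(parse).filter(pred).map(_.path)

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "2014-12-29/Americas/part-0.parquet",
      "2014-12-29/Asia/part-0.parquet",
      "2014-12-30/Americas/part-0.parquet")
    // e.g. WHERE region = 'Americas' touches only two of the three files
    println(prune(files)(_.region == "Americas"))
  }
}
```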

Does anyone know if something like this is supported, or whether this is a
reasonable thing to request?

Mick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mapping-directory-structure-to-columns-in-SparkSQL-tp20880.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org