spark-user mailing list archives

From: Michael Armbrust
Subject: Re: Can Spark benefit from Hive-like partitions?
Date: Mon, 26 Jan 2015 16:06:46 GMT
You can create a partitioned Hive table using Spark SQL.
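
As a rough sketch of what that could look like for the layout in the quoted message, the helpers below only build the HiveQL strings for an external table partitioned by y, m and d. The table name `events` and the data columns `user_id` and `amount` are hypothetical, since the thread does not give the CSV schema, and the partition columns are declared as STRING on the assumption that values such as `01` should keep their leading zero.

```python
def create_table_sql(table, columns, partition_keys, location):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for comma-delimited files."""
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    # Partition columns are not stored in the files; Hive takes them from the path.
    parts = ", ".join(f"{key} STRING" for key in partition_keys)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION '{location}'"
    )


def add_partition_sql(table, location, partition):
    """Build an ALTER TABLE statement that registers one key=value directory."""
    spec = ", ".join(f"{k}='{v}'" for k, v in partition)
    path = "/".join(f"{k}={v}" for k, v in partition)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location.rstrip('/')}/{path}/'"
    )


# Hypothetical schema: replace with the real CSV columns.
ddl = create_table_sql(
    "events", [("user_id", "STRING"), ("amount", "DOUBLE")],
    ["y", "m", "d"], "s3n://blah/",
)
add = add_partition_sql(
    "events", "s3n://blah/", [("y", "2015"), ("m", "01"), ("d", "25")]
)
print(ddl)
print(add)
```

With a HiveContext, each string would be passed to its sql() method; a later query filtering on y and m should then only need to read the matching directories.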

On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates wrote:

> Hi,
> I've got a bunch of data stored in S3 under directories like this:
> s3n://blah/y=2015/m=01/d=25/lots-of-files.csv
> In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
> it only scans the necessary directories for files to read.
> As far as I can tell from searching and reading the docs, the right way of
> loading this data into Spark is to use sc.textFile("s3n://blah/*/*/*/")
> 1) Is there any way in Spark to access y, m and d as fields? In Hive, you
> declare them in the schema, but you don't put them in the CSV files - their
> values are extracted from the path.
> 2) Is there any way to get Spark to use the y, m and d fields to minimise
> the files it transfers from S3?
> Thanks,
> Danny.
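
For anyone who wants to stay with sc.textFile rather than a Hive table, the two questions above can be approximated by hand. This is only a sketch under assumptions: the helper names `partition_fields` and `pruned_glob` are made up for illustration, and the code handles plain strings only, with no Spark calls.

```python
def partition_fields(path):
    """Extract key=value directory components from a path as a dict."""
    fields = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        # Keep only well-formed key=value segments; skip everything else.
        if sep and key and value:
            fields[key] = value
    return fields


def pruned_glob(root, keys, wanted):
    """Build a glob that pins the requested partition values and wildcards the rest."""
    parts = [f"{k}={wanted.get(k, '*')}" for k in keys]
    return root.rstrip("/") + "/" + "/".join(parts) + "/"


fields = partition_fields("s3n://blah/y=2015/m=01/d=25/lots-of-files.csv")
glob = pruned_glob("s3n://blah/", ["y", "m", "d"], {"y": "2015", "m": "01"})
print(fields)
print(glob)
```

The resulting glob could be passed to sc.textFile in place of the `*/*/*/` pattern so that only the January 2015 directories are listed. Note that sc.textFile does not hand back the file path with each record, so recovering y, m and d per row this way would need the path from somewhere else, for example sc.wholeTextFiles, which returns (path, content) pairs.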
