spark-user mailing list archives

From Wei Chen <wei.chen.ri...@gmail.com>
Subject optimal way to load parquet files with partition
Date Tue, 02 Feb 2016 18:07:42 GMT
Hi All,

I have data partitioned by year=yyyy/month=mm/day=dd. What is the best way
to get two months of data from a given year (say, June and July)?

Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)

2. use filter after loading the whole year
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')

Which of the above is better? Or are there better ways to handle this?
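
For comparison, a third variant would be to pass both month directories to a
single read call instead of unioning two DataFrames. This is only a minimal
sketch under assumptions: it relies on the basePath read option (Spark 1.6 or
later) so that year and month are still discovered as columns, month_paths and
load_months are hypothetical helper names, and 'xxx' is the same placeholder
path as in the two options above.

```python
def month_paths(base, year, months):
    """Build one partition directory path per requested month."""
    root = base.rstrip('/')
    return ['{}/year={}/month={}'.format(root, year, m) for m in months]


def load_months(sqc, base, year, months):
    """Read only the given month directories in one call.

    Assumes `sqc` is an existing SQLContext. The basePath option
    (assumed Spark 1.6+) makes partition discovery start at `base`,
    so year, month and day are kept as columns in the result.
    """
    paths = month_paths(base, year, months)
    return sqc.read.option('basePath', base).parquet(*paths)


# Usage, with the placeholder path from the options above:
# df = load_months(sqc, 'xxx', 2015, [6, 7])
```

Without basePath, reading 'xxx/year=2015/month=6' directly would treat that
directory as the table root, so the year and month columns would be missing
from the result.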


Thank you,
Wei
