spark-user mailing list archives

From Adam Gilmore <dragoncu...@gmail.com>
Subject Parquet predicate pushdown
Date Mon, 05 Jan 2015 23:38:20 GMT
Hi all,

I have a question regarding predicate pushdown for Parquet.

My understanding was that pushdown uses the metadata in Parquet's
blocks/pages to skip entire chunks that cannot match, without decoding and
filtering every value in the table.
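
To make that understanding concrete, here is a minimal sketch in Python. It
is a toy model, not Parquet's or Spark's actual code: RowGroup,
make_row_groups and groups_to_scan are made-up names, and the row counts are
scaled down. It assumes each chunk carries only min/max statistics and that
a chunk can be skipped only when the filter value falls outside that range.

```python
from dataclasses import dataclass
import random


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet chunk: raw values plus min/max statistics."""
    values: list
    min_value: int
    max_value: int


def make_row_groups(values, group_size):
    """Split a column into fixed-size chunks and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def groups_to_scan(groups, wanted):
    """Keep only the chunks whose [min, max] range could contain `wanted`."""
    return [g for g in groups if g.min_value <= wanted <= g.max_value]


random.seed(0)
n_rows, group_size, n_customers = 100_000, 10_000, 100

# Customer IDs in arbitrary order: each chunk holds almost every ID.
shuffled_ids = [random.randrange(n_customers) for _ in range(n_rows)]
# The same IDs sorted: each chunk covers only a narrow range of IDs.
sorted_ids = sorted(shuffled_ids)

unsorted_groups = make_row_groups(shuffled_ids, group_size)
sorted_groups = make_row_groups(sorted_ids, group_size)

scanned_unsorted = len(groups_to_scan(unsorted_groups, 42))
scanned_sorted = len(groups_to_scan(sorted_groups, 42))
print(scanned_unsorted, "of", len(unsorted_groups), "chunks scanned (unsorted)")
print(scanned_sorted, "of", len(sorted_groups), "chunks scanned (sorted)")
```

In this toy model, whether anything is skipped depends on how the values are
laid out across chunks: a chunk whose min/max range covers the filter value
still has to be read in full.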

I was testing a scenario where I had 100M rows in a Parquet file.

Summing over a column took about 2-3 seconds.

I also have a column (e.g. customer ID) with approximately 100 unique
values.  I assumed that filtering on this column would reduce the query
time significantly, though not exactly linearly, because entire segments
would be skipped based on the metadata.

In fact, the filtered query took much longer - somewhere in the vicinity of
4-5 seconds.  That suggested to me that it's reading all the values for the
key column (100M values), then filtering, then reading all the relevant
segments/values for the "measure" column, hence the increase in time.
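
The arithmetic behind that suspicion can be sketched as follows. This is
only a rough count of decoded values under my hypothesis, not a measurement;
the function names and the key_chunks_skipped parameter are made up for
illustration, and the keys are assumed to be evenly distributed.

```python
def values_decoded_plain_sum(n_rows: int) -> int:
    """A plain SUM only has to decode the measure column."""
    return n_rows


def values_decoded_filtered_sum(n_rows: int, n_distinct_keys: int,
                                key_chunks_skipped: float = 0.0) -> int:
    """Filtered SUM under the hypothesis: decode the key column for every
    chunk that is not skipped, then decode the measure values of the
    matching rows (assumes evenly distributed keys)."""
    if not 0.0 <= key_chunks_skipped <= 1.0:
        raise ValueError("key_chunks_skipped must be between 0 and 1")
    key_values = round(n_rows * (1.0 - key_chunks_skipped))
    matching_rows = n_rows // n_distinct_keys
    return key_values + matching_rows


n_rows = 100_000_000
plain = values_decoded_plain_sum(n_rows)
filtered_no_skip = values_decoded_filtered_sum(n_rows, 100)
filtered_99_skip = values_decoded_filtered_sum(n_rows, 100,
                                               key_chunks_skipped=0.99)
print(plain, filtered_no_skip, filtered_99_skip)
```

With no chunks skipped, the filtered query decodes more values than the
plain sum, which would be consistent with the timings I'm seeing; with most
chunks skipped, it should decode far fewer.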

In the logs, I could see it was successfully pushing down a Parquet
predicate, so I don't understand why the filtered query takes longer.

Could anyone shed some light on this or point out where I'm going wrong?
Thanks!
