spark-user mailing list archives

From: Michael Armbrust <mich...@databricks.com>
Subject: Re: Parquet predicate pushdown
Date: Tue, 06 Jan 2015 02:17:39 GMT
Predicate push down into the input format is turned off by default because
there is a bug in the current Parquet library that throws null pointer
exceptions when a row group is entirely null.

https://issues.apache.org/jira/browse/SPARK-4258

You can turn it on if you want:
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
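
For example, here is a minimal sketch using the 1.2-era Scala API (the file
and column names below, events.parquet, customerId, and amount, are just
placeholders):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // Enable Parquet filter pushdown (off by default because of SPARK-4258).
  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

  val events = sqlContext.parquetFile("events.parquet")
  events.registerTempTable("events")

  // With pushdown on, the customerId predicate is handed to the Parquet input
  // format, which can skip row groups whose statistics rule out a match.
  sqlContext.sql("SELECT SUM(amount) FROM events WHERE customerId = 42").collect()

Just keep the bug above in mind: with pushdown enabled you can hit the NPE if
a filtered column has row groups that are entirely null.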

On Mon, Jan 5, 2015 at 3:38 PM, Adam Gilmore <dragoncurve@gmail.com> wrote:

> Hi all,
>
> I have a question regarding predicate pushdown for Parquet.
>
> My understanding was that this would use the metadata in Parquet's
> blocks/pages to skip entire chunks that can't match, without needing to
> decode and filter every value in the table.
>
> I was testing a scenario where I had 100M rows in a Parquet file.
>
> Summing over a column took about 2-3 seconds.
>
> I also have a column (e.g. customer ID) with approximately 100 unique
> values.  My assumption, even if the speedup isn't exactly linear, was that
> filtering on this column would reduce the query time significantly, since
> entire segments could be skipped based on the metadata.
>
> In fact, it took much longer, somewhere in the vicinity of 4-5 seconds,
> which suggested to me it's reading all the values for the key column (100M
> values), then filtering, then reading the relevant segments/values for the
> "measure" column, hence the increase in time.
>
> In the logs, I could see it was successfully pushing down a Parquet
> predicate, so I'm not sure I'm understanding why this is taking longer.
>
> Could anyone shed some light on this or point out where I'm going wrong?
> Thanks!
>
