spark-user mailing list archives

From Sandeep Joshi <sanjos...@gmail.com>
Subject SQL predicate pushdown on parquet or other columnar formats
Date Mon, 01 Aug 2016 18:17:12 GMT
Hi

I just want to confirm my understanding of the physical plan generated by
Spark SQL while reading from a Parquet file.

When multiple predicates are pushed down to the PrunedFilteredScan, does
Spark ensure that the Parquet file is not read multiple times, i.e. once per
predicate?
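
(One way I found to sanity-check the pushdown itself is to flip
spark.sql.parquet.filterPushdown and diff the plans of the same query; the
flag name is from the 1.6 docs, if I have it right:)

> sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
> // re-run explain() on the query below: the PushedFilters entry should
> // disappear from the Scan node, while the Filter above it stays
> sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")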

In general, is this optimization done for all columnar databases or file
formats?
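
(For ORC I believe the analogous switch is spark.sql.orc.filterPushdown,
which is off by default in 1.6 and needs a Hive-enabled build — so, as a
sketch, with "people.orc" being a hypothetical ORC copy of the same data:)

> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val orcDF = sqlContext.read.orc("people.orc")
> orcDF.filter("age = 50 AND name = 'someone'").explain()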

When I ran the following query in the spark-shell:

> val nameDF = sqlContext.sql(
>     "SELECT name FROM parquetFile WHERE age = 50 AND name = 'someone'")
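
(For completeness, parquetFile here is just the Parquet file registered as a
temp table, along these lines — the path is my local one:)

> val people = sqlContext.read.parquet("file:/home/spark/spark-1.6.1/people.parquet")
> people.registerTempTable("parquetFile")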

I saw that both filters are pushed down, but I can't seem to find where they
are actually applied to the file data.

> nameDF.explain()

shows

Project [name#112]
+- Filter ((age#111L = 50) && (name#112 = someone))
   +- Scan ParquetRelation[name#112,age#111L]
      InputPaths: file:/home/spark/spark-1.6.1/people.parquet,
      PushedFilters: [EqualTo(age,50), EqualTo(name,someone)]
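
If I read the code right, the Scan turns those PushedFilters into a
parquet-mr FilterPredicate, conceptually something like the sketch below
(filter2 API; age is INT64 and name a UTF8 binary column in my schema):

> import org.apache.parquet.filter2.predicate.FilterApi
> import org.apache.parquet.io.api.Binary
>
> // rough equivalent of [EqualTo(age,50), EqualTo(name,someone)]
> val pred = FilterApi.and(
>   FilterApi.eq(FilterApi.longColumn("age"), java.lang.Long.valueOf(50L)),
>   FilterApi.eq(FilterApi.binaryColumn("name"), Binary.fromString("someone")))

but I still can't tell at which level (row group vs. individual record) that
predicate is actually evaluated against the file.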
