spark-user mailing list archives

From Nikodimos Nikolaidis <niknik...@csd.auth.gr>
Subject Time delay in multiple predicate Filter
Date Fri, 16 Mar 2018 13:27:24 GMT
Hello,

Here’s a behavior that I find strange. A filter with a single
zero-selectivity predicate runs much faster than a filter with multiple
predicates, even when that same zero-selectivity predicate comes first.

For example, on Google’s English One Million 1-grams dataset (Spark 2.2,
wholeStageCodegen enabled, local mode):

val df = spark.read.parquet("../../1-grams.parquet")

def metric[T](f: => T): T = {
  val t0 = System.nanoTime()
  val res = f
  println("Executed in " + (System.nanoTime() - t0) / 1000000 + "ms")
  res
}

df.count()
// res11: Long = 261823186

metric { df.filter('gram === "does not exist").count() }
// Executed in 1794ms
// res13: Long = 0

metric { df.filter('gram === "does not exist" && 'year > 0 && 'times > 0 && 'books > 0).count() }
// Executed in 4233ms
// res15: Long = 0
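For comparison, here is a standalone sketch without Spark (the Row type and the generated data are made up for illustration). On the JVM, && short-circuits, so with a zero-selectivity first predicate the extra predicates are never evaluated, and plain Scala collection filtering shows no comparable slowdown; this suggests the gap in Spark comes from somewhere other than predicate evaluation itself.

```scala
object ShortCircuitSketch {
  // Hypothetical in-memory stand-in for the 1-grams rows.
  final case class Row(gram: String, year: Int, times: Int, books: Int)

  // Same timing helper as in the post, with a label added.
  def metric[T](label: String)(f: => T): T = {
    val t0 = System.nanoTime()
    val res = f
    println(label + " executed in " + (System.nanoTime() - t0) / 1000000 + "ms")
    res
  }

  // One zero-selectivity predicate.
  def countOne(rows: Array[Row]): Int =
    rows.count(_.gram == "does not exist")

  // Same predicate first, three more after it; && short-circuits,
  // so the three extra predicates are never evaluated.
  def countFour(rows: Array[Row]): Int =
    rows.count(r =>
      r.gram == "does not exist" && r.year > 0 && r.times > 0 && r.books > 0)

  def main(args: Array[String]): Unit = {
    val rows = Array.tabulate(1000000)(i =>
      Row("gram" + i, 1900 + i % 100, i % 7, i % 3))
    metric("one predicate")(countOne(rows))   // count is 0
    metric("four predicates")(countFour(rows)) // count is 0
  }
}
```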

In the generated code, the behavior is exactly the same: when the first
predicate fails, the loop continues from the same point in both cases. Why
does this time difference exist?

Thanks

