spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: Given multiple .filter()'s, is there a way to set the order?
Date Fri, 14 Nov 2014 19:02:05 GMT
In the situation you show, Spark will pipeline the filters together and
apply them one at a time to each row, effectively constructing an "&&"
expression. You would only see a performance difference if the filter code
itself is somewhat expensive; in that case you would want it to run on as
few rows as possible, so place it after the filters that discard the most
rows. Otherwise, the runtime difference between "a == b && b == c && c == d"
and "a == b & b == c & c == d" is minimal, the latter being sort of the
worst-case scenario, as it would always run all filters (though, as I said,
Spark acts like the former).
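
To make the "&&" versus "&" comparison concrete, here is a minimal sketch in
plain Python rather than Spark. It assumes that Python's lazy built-in
filter() is a fair stand-in for pipelined RDD filters; the names counted,
f1 and f2 are hypothetical and exist only for this illustration.

```python
from collections import Counter

calls = Counter()

def counted(name, predicate):
    """Wrap a predicate so each invocation is tallied under `name`."""
    def wrapper(row):
        calls[name] += 1
        return predicate(row)
    return wrapper

rows = range(100)
f1 = counted("f1", lambda r: r % 10 == 0)  # keeps 10 of the 100 rows
f2 = counted("f2", lambda r: r % 20 == 0)  # keeps 5 of the 100 rows

# Chained lazy filters: a row reaches f2 only if it passed f1 (like "&&").
chained = list(filter(f2, filter(f1, rows)))
chained_calls = dict(calls)

calls.clear()

# Non-short-circuit "&": both predicates run on every row.
eager = [r for r in rows if f1(r) & f2(r)]
eager_calls = dict(calls)

print(chained_calls)  # f1 runs on all 100 rows, f2 only on the 10 survivors
print(eager_calls)    # f1 and f2 each run on all 100 rows
```

Both versions keep the same rows; only the number of times the second
predicate runs differs, which matters only if that predicate is costly.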

Spark does not reorder the filters automatically. It uses the explicit
ordering you provide.
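
Since the order is yours to choose, a rough way to reason about it is to
count how many rows each filter ends up seeing. The sketch below is plain
Python, not Spark code; run, drops_40 and drops_1 are hypothetical names,
and the 40% and 1% predicates are made up to mirror the proportions in the
question.

```python
def run(rows, predicates):
    """Apply (name, predicate) pairs in order; return (kept rows, calls per name)."""
    calls = {name: 0 for name, _ in predicates}
    stream = iter(rows)
    for name, pred in predicates:
        # Bind name/pred as defaults so each wrapper keeps its own pair.
        def counted(row, name=name, pred=pred):
            calls[name] += 1
            return pred(row)
        stream = filter(counted, stream)  # lazy, like a pipelined filter
    return list(stream), calls

rows = range(1000)
drops_40 = ("drops_40", lambda r: r % 10 >= 4)   # filters out 40% of rows
drops_1 = ("drops_1", lambda r: r % 100 != 0)    # filters out 1% of rows

kept_a, calls_a = run(rows, [drops_40, drops_1])
kept_b, calls_b = run(rows, [drops_1, drops_40])

print(calls_a)  # drops_1 sees only the 600 rows that survived drops_40
print(calls_b)  # drops_40 sees 990 rows because drops_1 removed so few
```

The output rows are identical either way. If drops_1 were the expensive
check, running it second would cut its invocations from 1000 to 600; with
cheap predicates the difference is unlikely to be measurable.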

On Fri, Nov 14, 2014 at 10:20 AM, YaoPau <jonrgregg@gmail.com> wrote:

> I have an RDD "x" of millions of STRINGs, each of which I want to pass
> through a set of filters.  My filtering code looks like this:
>
> x.filter(filter#1, which will filter out 40% of data).
>    filter(filter#2, which will filter out 20% of data).
>    filter(filter#3, which will filter out 2% of data).
>    filter(filter#4, which will filter out 1% of data)
>
> There is no ordering requirement (filter #2 does not depend on filter #1,
> etc), but the filters are drastically different in the % of rows they
> should eliminate.  What I'd like is an ordering similar to a "||"
> statement, where if it fails on filter#1 the row automatically gets
> filtered out before the other three filters run.
>
> But when I play around with the ordering of the filters, the runtime
> doesn't seem to change.  Is Spark somehow intelligently guessing how
> effective each filter will be and ordering it correctly regardless of how
> I order them?  If not, is there a way I can set the filter order?
