spark-dev mailing list archives

From Garren Staubli <gstau...@gmail.com>
Subject Re: [Pyspark, SQL] Very slow IN operator
Date Wed, 05 Apr 2017 22:41:39 GMT
Query building time is significant: the query is simple, but it is long, at
almost 4,000 characters.

When I run your test, task deserialization takes an inordinate amount of
time (0.9s), and building the query itself takes several seconds.

I would recommend using a JOIN (a broadcast join if your data set is small
enough) when the alternative is a massive IN statement.

On Wed, Apr 5, 2017 at 2:31 PM, Maciej Bryński [via Apache Spark Developers
List] <ml-node+s1001551n21307h26@n3.nabble.com> wrote:

> Hi,
> I'm trying to run queries with many values in the IN operator.
>
> With more than 10K values, the IN operator gets slower.
>
> For example, this code takes about 20 seconds to run.
>
> df = spark.range(0,100000,1,1)
> df.where('id in ({})'.format(','.join(map(str,range(100000))))).count()
>
> Any ideas how to improve this?
> Is it a bug?
> --
> Maciek Bryński