spark-issues mailing list archives

From "Anton Baranau (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-18748) UDF multiple evaluations causes very poor performance
Date Wed, 02 Oct 2019 21:33:03 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943176#comment-16943176 ]

Anton Baranau edited comment on SPARK-18748 at 10/2/19 9:32 PM:
----------------------------------------------------------------

I hit the same problem on Spark 2.4.4 with the code below:
{code:python}
df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 'scores.score2')
{code}
In my case the data isn't huge, so I can afford to cache it, as below:
{code:python}
df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1', 'scores.score2')
{code}
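
When caching the whole DataFrame is not affordable, a complementary mitigation is to memoize the expensive function itself, so that repeated evaluations of the same input inside one Python process return a stored result. The sketch below is plain Python with no Spark dependency. `expensive_nlp` and `call_count` are hypothetical stand-ins for the real spaCy-based function, and the cache size is an arbitrary assumption. It does not stop Spark from invoking the UDF more than once; it only makes the repeated invocations cheap, and it only applies when the inputs are hashable.

```python
from functools import lru_cache

call_count = 0  # hypothetical counter, only here to make the effect visible


@lru_cache(maxsize=10_000)  # assumed bound; tune to the available memory
def expensive_nlp(text: str) -> tuple:
    """Hypothetical stand-in for the costly spaCy call."""
    global call_count
    call_count += 1
    # Return an immutable value so callers cannot mutate the cached result.
    return (len(text), text.count(" ") + 1)


# Simulate the same expression being evaluated three times for one row.
results = [expensive_nlp("some text") for _ in range(3)]
```

After the three calls, `call_count` is 1 because the second and third calls are served from the cache. How the memoized function is wrapped as a UDF and shipped to the executors is left out of this sketch.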


was (Author: ahtokca):
I got the same problem having the code below with 2.4.4
{code:python}
df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1',
'scores.score2')
{code}
In my case data isn't huge so I can afford to cahce it like below
{code:python}
df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1',
'scores.score2')
{code}

> UDF multiple evaluations causes very poor performance
> -----------------------------------------------------
>
>                 Key: SPARK-18748
>                 URL: https://issues.apache.org/jira/browse/SPARK-18748
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Ohad Raviv
>            Priority: Major
>
> We have a use case with a relatively expensive UDF. The problem is that instead of being calculated once, it is calculated over and over again.
> For example:
> {quote}
> def veryExpensiveCalc(str:String) = {println("blahblah1"); "nothing"}
> hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _)
> hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is not null and c<>''").show
> {quote}
> with the output:
> {quote}
> blahblah1
> blahblah1
> blahblah1
> +-------+
> |      c|
> +-------+
> |nothing|
> +-------+
> {quote}
> You can see that each reference to column "c" triggers the println.
> This causes very poor performance in our real use case.
> This has also come up on Stack Overflow:
> http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns
> http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/
> Two problematic workarounds have been suggested:
> 1. Call cache() after the first evaluation, e.g.
> {quote}
> hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not null and c<>''").show
> {quote}
> While this works, we can't use it because our table is too big to cache.
> 2. Round-trip through an RDD:
> {quote}
> val df = hiveContext.sql("select veryExpensiveCalc('a') as c")
> hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and c<>''").show
> {quote}
> This works, but we then lose some optimizations, such as predicate pushdown, and it is very ugly.
> Any ideas on how we can make the UDF get calculated just once in a reasonable way?
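
The three `blahblah1` lines in the report are consistent with the aliased expression being inlined at every place `c` is referenced: twice in the filter and once in the final projection. The toy model below is plain Python, not Spark code. `very_expensive_calc`, `inlined_plan`, and `materialized_plan` are hypothetical names, and the inlining behavior is an assumption inferred from the output above, not taken from Spark's source.

```python
calls = 0


def very_expensive_calc(s: str) -> str:
    """Stand-in for the UDF in the report; counts its invocations."""
    global calls
    calls += 1
    return "nothing"


def inlined_plan(arg: str):
    """Model of a plan where every reference to `c` re-evaluates the UDF."""
    # Filter: c is not null and c <> ''  (two references to c)
    if very_expensive_calc(arg) is not None and very_expensive_calc(arg) != "":
        # Projection: select c  (one more reference to c)
        return very_expensive_calc(arg)
    return None


def materialized_plan(arg: str):
    """Model of a plan where `c` is computed once and then reused."""
    c = very_expensive_calc(arg)
    return c if c is not None and c != "" else None


calls = 0
inlined_result = inlined_plan("a")
inlined_calls = calls  # three evaluations, matching the three printed lines

calls = 0
materialized_result = materialized_plan("a")
materialized_calls = calls  # a single evaluation
```

Under this reading, both workarounds in the report (`cache()` and the RDD round-trip) act as a barrier that turns the first shape into the second, at the cost the reporter describes.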



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

