spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.
Date Mon, 03 Aug 2020 14:40:49 GMT
If you use pandas_udfs in 2.4 they should be quite performant (or at least
won't suffer serialization overhead), might be worth looking into.

I didn't run your code but one consideration is that the while loop might
be making the DAG a lot bigger than it has to be. You might see if defining
those columns with list comprehensions forming a single select() statement
makes for a smaller DAG.

On Mon, Aug 3, 2020 at 10:06 AM Henrique Oliveira <hesouol@gmail.com> wrote:

> Hi Patrick, thank you for your quick response.
> That's exactly what I think. Actually, the result of this processing is an
> intermediate table that is going to be used for other views generation.
> Another approach I'm trying now, is to move the "explosion" step for this
> "view generation" step, this way I don't need to explode every column but
> just those used for the final client.
>
> ps.I was avoiding UDFs for now because I'm still on Spark 2.4 and the
> python udfs I tried had very bad performance, but I will give it a try in
> this case. It can't be worse.
> Thanks again!
>
> Em seg., 3 de ago. de 2020 às 10:53, Patrick McCarthy <
> pmccarthy@dstillery.com> escreveu:
>
>> This seems like a very expensive operation. Why do you want to write out
>> all the exploded values? If you just want all combinations of values, could
>> you instead do it at read-time with a UDF or something?
>>
>> On Sat, Aug 1, 2020 at 8:34 PM hesouol <hesouol@gmail.com> wrote:
>>
>>> I forgot to add an information. By "can't write" I mean it keeps
>>> processing
>>> and nothing happens. The job runs for hours even with a very small file
>>> and
>>> I have to force the stoppage.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016

Mime
View raw message