spark-user mailing list archives

From "Lalwani, Jayesh" <jlalw...@amazon.com.INVALID>
Subject Re: Count distinct and driver memory
Date Mon, 19 Oct 2020 18:02:36 GMT
I was caching it because I didn't want to re-execute the DAG when I ran the count query. If
you have a Spark application with multiple actions, Spark re-executes the entire DAG for each
action unless there is a cache in between. I was trying to avoid reloading half a terabyte
of data. Also, cache should use up executor memory, not driver memory.
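
In rough Scala terms, the pattern is something like this (a minimal sketch; the paths,
column names, and the spark session value are placeholders, not the actual job):

    import org.apache.spark.sql.functions.countDistinct

    // placeholder input path; the real job reads about half a terabyte
    val df = spark.read.parquet("s3://bucket/input/")

    // keep the materialized partitions around so the second action
    // does not go back to the source data
    df.cache()

    df.write.parquet("s3://bucket/output/")                      // action 1: write
    val rows = df.agg(countDistinct("a", "b", "c")).collect()    // action 2: count, served from the cache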

As it turns out, cache was the problem. I expected cache to take executor memory and spill
over to disk; I don't know why it's taking driver memory. The input data has millions of
partitions, which results in millions of tasks. Perhaps the high driver memory usage is a
side effect of caching the results of so many tasks.
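
One way to sidestep the cache entirely (a sketch with placeholder paths, not what the job
currently does) is to run the count against the parquet that was just written, so nothing
has to track millions of cached blocks:

    import org.apache.spark.sql.functions.countDistinct

    df.write.parquet("s3://bucket/output/")

    // re-read the freshly written output; parquet column pruning means
    // only columns a, b and c are scanned for the count
    val written = spark.read.parquet("s3://bucket/output/")
    val rows = written.agg(countDistinct("a", "b", "c")).collect()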

On 10/19/20, 1:27 PM, "Nicolas Paris" <nicolas.paris@riseup.net> wrote:

    > Before I write the data frame to parquet, I do df.cache. After writing
    > the file out, I do df.countDistinct(“a”, “b”, “c”).collect()
    if you write the df to parquet, why would you also cache it? caching by
    default loads the data into memory. this might affect later use, such as
    collect. the resulting GC can be explained by both caching and collect
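
    for reference, Dataset.cache() defaults to the MEMORY_AND_DISK storage
    level; a sketch of an explicit level that keeps cached blocks off the
    executor heap (df stands for the DataFrame in question):

        import org.apache.spark.storage.StorageLevel

        // spill cached partitions to local disk instead of holding them in memory
        df.persist(StorageLevel.DISK_ONLY)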


    Lalwani, Jayesh <jlalwani@amazon.com.INVALID> writes:

    > I have a DataFrame with around 6 billion rows and about 20 columns. First of all, I
    > want to write this DataFrame out to parquet. Then, out of the 20 columns, I have 3
    > columns of interest, and I want to find how many distinct values of those columns
    > there are in the file. I don't need the actual distinct values; I just need the
    > count. I know that there are around 10-16 million distinct values.
    >
    > Before I write the DataFrame to parquet, I do df.cache. After writing the file out,
    > I do df.countDistinct("a", "b", "c").collect()
    >
    > When I run this, I see that the memory usage on my driver steadily increases until
    > it starts getting future timeouts. I guess it's spending time in GC. Does
    > countDistinct cause this behavior? Does Spark try to get all 10 million distinct
    > values into the driver? Is countDistinct not recommended for DataFrames with a
    > large number of distinct values?
    >
    > What's the solution? Should I use approx_count_distinct?
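
    a rough sketch of both options, assuming the DataFrame is called df and
    that a, b and c are cast to strings so the combination can be fed to the
    HyperLogLog++ based approx_count_distinct (which trades exactness for
    bounded memory):

        import org.apache.spark.sql.functions.{approx_count_distinct, col, concat_ws, countDistinct}

        // exact distinct count over the (a, b, c) combination
        val exact = df.agg(countDistinct("a", "b", "c")).collect()

        // approximate distinct count with a 1% target relative error;
        // the separator assumes "|" does not appear inside the values
        val approx = df.agg(approx_count_distinct(
          concat_ws("|", col("a").cast("string"),
                         col("b").cast("string"),
                         col("c").cast("string")), 0.01)).collect()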


    --
    nicolas paris


