spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lalwani, Jayesh" <>
Subject Count distinct and driver memory
Date Mon, 19 Oct 2020 03:23:33 GMT
I have a Dataframe with around 6 billion rows, and about 20 columns. First of all, I want to
write this dataframe out to parquet. The, Out of the 20 columns, I have 3 columns of interest,
and I want to find how many distinct values of the columns are there in the file. I don’t
need the actual distinct values. I just need the count. I knoe that there are around 10-16million
distinct values

Before I write the data frame to parquet, I do df.cache. After writing the file out, I do
df.countDistinct(“a”, “b”, “c”).collect()

When I run this, I see that the memory usage on my driver steadily increases until it starts
getting future time outs. I guess it’s spending time in GC. Does countDistinct cause this
behavior? Does Spark try to get all 10 million distinct values into the driver? Is countDistinct
not recommended for data frames with large number of distinct values?

What’s the solution? Should I use approx._count_distinct?
View raw message