spark-user mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Re: [pyspark 2.3+] CountDistinct
Date Sat, 29 Jun 2019 09:33:39 GMT
How large is the DataFrame, and what data type is the column you are
counting distinct values for?
I use countDistinct quite a bit and haven't noticed anything peculiar.

Also, which exact 2.3.x version are you on?
And are you performing any operations on the DF before the countDistinct?

I recall there was a bug when I did countDistinct(PythonUDF(x)) in the same
query, which was resolved in one of the minor versions in 2.3.x.
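
To narrow it down, it may help to run the same aggregation several times on
the same DataFrame and tally the values that come back. Below is a minimal
sketch of such a check. The helper name repeat_and_tally, the DataFrame df
and the column "user_id" are placeholders I made up for illustration, not
anything from your job.

```python
from collections import Counter


def repeat_and_tally(compute, runs=5):
    """Call compute() `runs` times and tally the values it returns.

    `compute` is any zero-argument callable; a stable aggregation should
    produce a tally with exactly one key.
    """
    return Counter(compute() for _ in range(runs))


# Hypothetical usage, assuming a DataFrame `df` with a column "user_id":
#
#   from pyspark.sql import functions as F
#
#   tally = repeat_and_tally(
#       lambda: df.select(F.countDistinct("user_id")).first()[0]
#   )
#   print(tally)  # more than one key means the result changed between runs
```

If the tally has more than one key, I would look next at whatever produces
the DataFrame, since countDistinct itself is meant to be exact.
approx_count_distinct, by contrast, is an estimate by design, and its rsd
argument sets the maximum relative error you are willing to accept.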

On Sat, Jun 29, 2019, 10:32 Rishi Shah <rishishah.star@gmail.com> wrote:

> Hi All,
>
> Just wanted to check in to see if anyone has any insight about this
> behavior. Any pointers would help.
>
> Thanks,
> Rishi
>
> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah <rishishah.star@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Recently we noticed that countDistinct on a larger dataframe doesn't
>> always return the same value. Any idea? If this is the case then what is
>> the difference between countDistinct & approx_count_distinct?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>
