spark-user mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Re: [pyspark 2.3+] CountDistinct
Date Tue, 02 Jul 2019 04:20:38 GMT
I can't exactly reproduce this. Here is what I tried quickly:

import uuid

import findspark
findspark.init()  # noqa
import pyspark
from pyspark.sql import functions as F  # noqa: N812

spark = pyspark.sql.SparkSession.builder.getOrCreate()

# One single-column row per UUID (450000 rows), matching the counts printed below.
df = spark.createDataFrame(
    [(str(uuid.uuid4()),) for _ in range(450000)],
    ['col1'])

print('>>>> Spark version:', spark.sparkContext.version)
print('>>>> Null count:', df.filter(F.col('col1').isNull()).count())
print('>>>> Value count:', df.filter(F.col('col1').isNotNull()).count())
# Run the same aggregation twice to check that the result is stable.
print('>>>> Distinct Count 1:',
      df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])
print('>>>> Distinct Count 2:',
      df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])

This always returns:
>>>> Spark version: 2.4.0
>>>> Null count: 0
>>>> Value count: 450000
>>>> Distinct Count 1: 450000
>>>> Distinct Count 2: 450000




On Sat, Jun 29, 2019 at 6:51 PM Rishi Shah <rishishah.star@gmail.com> wrote:

> Thanks Abdeali! Please find details below:
>
> df.agg(countDistinct(col('col1'))).show() --> 450089
> df.agg(countDistinct(col('col1'))).show() --> 450076
> df.filter(col('col1').isNull()).count() --> 0
> df.filter(col('col1').isNotNull()).count() --> 450063
>
> col1 is a string
> Spark version 2.4.0
> datasize: ~ 500GB
>
>
> On Sat, Jun 29, 2019 at 5:33 AM Abdeali Kothari <abdealikothari@gmail.com>
> wrote:
>
>> How large is the data frame and what data type are you counting distinct
>> for?
>> I use count distinct quite a bit and haven't noticed anything peculiar.
>>
>> Also, which exact version in 2.3.x?
>> And are you performing any operations on the DF before the countDistinct?
>>
>> I recall there was a bug when I did countDistinct(PythonUDF(x)) in the
>> same query which was resolved in one of the minor versions in 2.3.x
>>
>> On Sat, Jun 29, 2019, 10:32 Rishi Shah <rishishah.star@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Just wanted to check in to see if anyone has any insight about this
>>> behavior. Any pointers would help.
>>>
>>> Thanks,
>>> Rishi
>>>
>>> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah <rishishah.star@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Recently we noticed that countDistinct on a larger dataframe doesn't
>>>> always return the same value. Any idea? If this is the case then what is
>>>> the difference between countDistinct & approx_count_distinct?
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Rishi Shah
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
> Regards,
>
> Rishi Shah
>
