spark-dev mailing list archives

From: dstuck
Subject: DataFrame Distinct Sample Bug?
Date: Tue, 03 Jan 2017 23:15:16 GMT
I ran into an issue where I get unstable results after sampling a
DataFrame that has had distinct() called on it. Although the sample seed is
fixed, the following code should print different answers each time.

from pyspark.sql import functions as F
d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]),
sampled = d.distinct().sample(False, 0.01, 478)
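
The snippet above appears to be cut off in the archive (the createDataFrame call is not closed and no print statement follows). A self-contained reproduction might look roughly like the sketch below. Several details are assumptions rather than the author's code: the column name "t", the use of count() as the quantity compared across runs, and the helper names build_sampled, repeated_counts and is_stable. The sample arguments (False, 0.01, 478) and the 100000-row range are taken from the message.

```python
def build_sampled(sql_context, spark_context, n=100000, fraction=0.01, seed=478):
    """Build the distinct-then-sample DataFrame described in the message."""
    rdd = spark_context.parallelize([[x] for x in range(n)])
    # The column name "t" is an assumption; the original schema argument is lost.
    df = sql_context.createDataFrame(rdd, ["t"])
    return df.distinct().sample(False, fraction, seed)


def repeated_counts(df, runs=5):
    """Trigger the same action several times and collect the results."""
    return [df.count() for _ in range(runs)]


def is_stable(values):
    """True when every evaluation returned the same value."""
    return len(set(values)) <= 1


# In a PySpark shell that already provides sqlContext and sc:
#   counts = repeated_counts(build_sampled(sqlContext, sc))
#   print(counts, is_stable(counts))
```

If the behaviour described in the message is present, is_stable would be expected to return False for the collected counts, but that depends on the Spark version and cluster setup.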

Either removing distinct or caching after sampling fixes the problem (as does
using a smaller DataFrame). The Spark bug reporting docs dissuaded me from
creating a JIRA issue without checking with this mailing list that this is

I'm not familiar enough with the Spark code to fix this :\
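
The workarounds mentioned above (dropping distinct, or caching after sampling) can be sketched as follows. They are read here as two independent workarounds, which is an interpretation of the message. The helper names are hypothetical, and the defaults simply mirror the arguments used in the message. The message reports only that these changes made the results stable, not why.

```python
def sample_without_distinct(df, fraction=0.01, seed=478):
    """Workaround 1: sample the DataFrame directly, skipping distinct()."""
    return df.sample(False, fraction, seed)


def sample_distinct_cached(df, fraction=0.01, seed=478):
    """Workaround 2: keep distinct(), but cache the sampled result."""
    # cache() is lazy: the first action materializes the sampled rows,
    # and later actions can reuse them instead of recomputing the sample.
    return df.distinct().sample(False, fraction, seed).cache()
```

Note that cached partitions can be evicted and recomputed, so caching is a workaround for repeated evaluation rather than a guarantee of stable results.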
