spark-dev mailing list archives

From dstuck <david.e.st...@gmail.com>
Subject DataFrame Distinct Sample Bug?
Date Tue, 03 Jan 2017 23:15:16 GMT
I ran into an issue where I get unstable results after sampling a
DataFrame that has had distinct() called on it. The following code
prints a different answer each time, even though the sample uses a fixed seed.

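from pyspark.sql import functions as F

# Assumes a PySpark shell, where `sc` and `sqlContext` are predefined.
d = sqlContext.createDataFrame(
    sc.parallelize([[x] for x in range(100000)]), ['t'])
# Sample about 1% of the distinct rows without replacement, seed 478.
sampled = d.distinct().sample(False, 0.01, 478)
# Each collect() recomputes `sampled`; the printed minimum can differ per call.
print(sampled.select(F.min('t').alias('t')).collect())
print(sampled.select(F.min('t').alias('t')).collect())
print(sampled.select(F.min('t').alias('t')).collect())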

Either removing distinct() or caching after sampling fixes the problem (as
does using a smaller DataFrame). The Spark bug reporting docs dissuaded me
from creating a JIRA issue without first checking with this mailing list that
this is reproducible.
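
One possible explanation, stated here as an assumption and not confirmed anywhere in this message, is that a seeded sample draws one random number per row in arrival order, so the same seed only yields the same sample if the rows arrive in the same order on every recomputation. A shuffle such as the one behind distinct() may not guarantee that order. The plain-Python sketch below illustrates the idea without Spark; `bernoulli_sample` is a hypothetical helper, not a Spark API, and reversing the list merely stands in for a shuffle returning rows in a different order.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row with probability `fraction`, drawing one number per row in order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100000))

# Same seed and same row order: the sample is reproducible.
stable_a = bernoulli_sample(rows, 0.01, 478)
stable_b = bernoulli_sample(rows, 0.01, 478)

# Same seed and same set of rows, but a different arrival order
# (an assumed stand-in for a shuffle that reorders rows on recomputation).
reordered = list(reversed(rows))
unstable = bernoulli_sample(reordered, 0.01, 478)

print(min(stable_a), min(stable_b), min(unstable))
```

If this is what is happening, it would also be consistent with caching after sampling making the result stable, since the sampled rows are then computed once and reused.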

I'm not familiar enough with the spark code to fix this :\


