[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-26166.
----------------------------------
Resolution: Not A Problem
I would not consider that a bug, but rather a subtle side effect of using random values
(which depend on data order and partitioning) in Spark in general. Indeed, it can't be guaranteed
unless you materialize the dataframe, as you point out.
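
For illustration, a minimal sketch of the order/partitioning dependence (the local master, partition counts, and column name are illustrative, not taken from the issue): rand(seed) is seeded per partition, not per logical row, so the same row can receive a different value whenever the physical layout of the DataFrame changes between recomputations.

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.master("local[2]").getOrCreate()

base = spark.range(0, 10)

# Same seed, different physical layout: the value attached to a given id can
# change because rand(seed) depends on partition index and position, not id.
a = base.repartition(2).select("id", rand(42).alias("r")).orderBy("id").collect()
b = base.repartition(4).select("id", rand(42).alias("r")).orderBy("id").collect()
print(a == b)  # typically False: the same ids got different random values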
> CrossValidator.fit() bug, training and validation dataset may overlap
> --------------------------------------------------------------------
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Xinyong Tian
> Priority: Major
>
> In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with
> df = dataset.select("*", rand(seed).alias(randCol))
> the code should add
> df = df.checkpoint()
> If df is not checkpointed, it will be recomputed each time the train and validation
dataframes need to be created. The order of rows in df, which rand(seed) depends on,
is not deterministic. Thus the random column value could differ for a specific row on each
recomputation, even with a fixed seed. Note that checkpoint() cannot be replaced with cache(),
because when a node fails the cached table must be recomputed, so the random numbers could again differ.
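
A sketch of the suggested fix, for illustration only: the fold-splitting filter below paraphrases how the rand column is used and is not the verbatim CrossValidator source, and the checkpoint directory, column name, seed, and nFolds are assumed values. Checkpointing materializes the random column once, so every fold's train and validation filters see identical values.

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # required before checkpoint()

dataset = spark.range(0, 100)              # stand-in for the real training data
seed, randCol, nFolds = 42, "rand_col", 3  # illustrative values

# Freeze the random column so it cannot be recomputed differently later.
df = dataset.select("*", rand(seed).alias(randCol))
df = df.checkpoint()  # checkpoint() returns a new DataFrame; reassignment is needed

h = 1.0 / nFolds
for i in range(nFolds):
    lb, ub = i * h, (i + 1) * h
    validation = df.filter((df[randCol] >= lb) & (df[randCol] < ub))
    train = df.filter((df[randCol] < lb) | (df[randCol] >= ub))
    # With a frozen randCol the two filters partition df, so the folds cannot overlap.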
> This may especially be a problem when the input 'dataset' dataframe results from
a query that includes a 'where' clause; see below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org