spark-issues mailing list archives

From "Sean R. Owen (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap
Date Sat, 26 Oct 2019 23:48:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-26166.
----------------------------------
    Resolution: Not A Problem

I would not consider that a bug, but rather a subtle side effect of generating random values
(which depend on data order and partitioning) in Spark in general. Indeed, a stable result
can't be guaranteed unless you materialize the DataFrame, as you point out.
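
To make the effect concrete, here is a minimal plain-Python simulation (not Spark code). It assumes
that seeded random values are handed out in row order, so a recomputation that sees the rows in a
different order assigns different values to the same rows. The helper names with_rand_column and
fold are hypothetical, and reversing the row list merely stands in for whatever reordering a
recomputed plan might produce.

```python
import random


def with_rand_column(rows, seed):
    """Attach a seeded pseudo-random value to each row, drawn in row order."""
    rng = random.Random(seed)
    return {row: rng.random() for row in rows}


def fold(rand_by_row, lower, upper, validation):
    """Validation rows fall in [lower, upper); training rows are the rest."""
    in_fold = {row for row, value in rand_by_row.items() if lower <= value < upper}
    return in_fold if validation else set(rand_by_row) - in_fold


rows = list(range(1000))
reordered = list(reversed(rows))  # stand-in for a recomputation with a different row order

# Not materialized: the random column is recomputed for each split,
# and the second computation sees the rows in a different order.
train_lazy = fold(with_rand_column(rows, 42), 0.0, 0.5, validation=False)
valid_lazy = fold(with_rand_column(reordered, 42), 0.0, 0.5, validation=True)
overlap_lazy = train_lazy & valid_lazy

# Materialized: the random column is computed once and reused for both splits.
materialized = with_rand_column(rows, 42)
train_mat = fold(materialized, 0.0, 0.5, validation=False)
valid_mat = fold(materialized, 0.0, 0.5, validation=True)
overlap_mat = train_mat & valid_mat

print("overlap without materializing:", len(overlap_lazy))
print("overlap with materializing:", len(overlap_mat))
```

When the random column is computed once and reused, the two splits partition the rows exactly. When
it is recomputed against a different row order, the same seed no longer protects against rows
landing in both splits.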

> CrossValidator.fit() bug,training and validation dataset may overlap
> --------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with
> df = dataset.select("*", rand(seed).alias(randCol))
> the code should also call
> df = df.checkpoint()
> If df is not checkpointed, it is recomputed each time a training or validation
> DataFrame needs to be created. The order of rows in df, which rand(seed) depends on,
> is not deterministic. Thus the random column value for a specific row can differ between
> computations, even with a fixed seed. Note that checkpoint() cannot be replaced with
> cache(), because when a node fails the cached table must be recomputed, and the random
> numbers could then be different.
> This might especially be a problem when the input 'dataset' DataFrame results from
> a query that includes a 'where' clause. See below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

