spark-issues mailing list archives

From "Sean R. Owen (Jira)" <>
Subject [jira] [Resolved] (SPARK-26166) bug,training and validation dataset may overlap
Date Sat, 26 Oct 2019 23:48:00 GMT


Sean R. Owen resolved SPARK-26166.
    Resolution: Not A Problem

I would not consider that a bug, but rather a subtle side effect of using random values
(which depend on data order and partitioning) in Spark in general. Indeed, stable values can't
be guaranteed unless you materialize the DataFrame, as you point out.

> bug,training and validation dataset may overlap
> --------------------------------------------------------------------
>                 Key: SPARK-26166
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
> In the code, after adding the random column
> df = dataset.select("*", rand(seed).alias(randCol))
> the following should be added:
> df = df.checkpoint()
> If df is not checkpointed, it is recomputed each time the training and validation
> DataFrames are created. The order of rows in df, which rand(seed) depends on,
> is not deterministic. Thus the random column value for a specific row can differ
> from one computation to the next, even with a seed. Note that checkpoint() cannot be
> replaced with cache(), because when a node fails, the cached data must be recomputed,
> so the random numbers could differ.
> This might especially be a problem when the input 'dataset' DataFrame is the result of
> a query that includes a 'where' clause. See below.
> []
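
The mechanism described in the issue can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: `rand_column` is a hypothetical stand-in for `rand(seed)` that hands the n-th draw to the n-th row evaluated, and the shuffled second pass stands in for a recomputation that returns rows in a different order. It is not Spark's actual implementation.

```python
import random

def rand_column(rows_in_order, seed):
    # Hypothetical stand-in for rand(seed): the n-th row evaluated receives
    # the n-th draw, so a row's value depends on its position, not its content.
    rng = random.Random(seed)
    return {row: rng.random() for row in rows_in_order}

rows = list(range(1000))
seed = 42

# First evaluation (building the training set): rows arrive in one order.
first_pass = rand_column(rows, seed)
train = {r for r, v in first_pass.items() if v < 0.75}

# Second evaluation (building the validation set): same seed, but the
# recomputed input arrives in a different order.
reordered = rows[:]
random.Random(7).shuffle(reordered)
second_pass = rand_column(reordered, seed)
validation_recomputed = {r for r, v in second_pass.items() if v >= 0.75}

# Materialized case: the random column is computed once and reused.
validation_materialized = {r for r, v in first_pass.items() if v >= 0.75}

overlap_recomputed = train & validation_recomputed
overlap_materialized = train & validation_materialized
print(len(overlap_recomputed), len(overlap_materialized))
```

When the random column is reused, the two sets are disjoint and together cover every row. When it is recomputed over reordered input, some rows land in both sets, which is the overlap the reporter describes.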
