spark-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <>
Subject [jira] [Resolved] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights
Date Wed, 09 May 2018 20:02:00 GMT


Marcelo Vanzin resolved SPARK-23709.
    Resolution: Information Provided

Please use the mailing lists to ask questions.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights
> -------------------------------------------------------------------------------------------
>                 Key: SPARK-23709
>                 URL:
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Olivier Sannier
>            Priority: Critical
> When using a bagging method like RandomForest, the theory says that each tree is
> trained on a copy of the source dataset built from a sample of its rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where each row is
> associated with one weight per final dataset, i.e. one weight for each tree requested
> from the RandomForest.
> RandomForest requires that the dataset for each tree be a random draw with replacement
> from the source data, with the same size as the source data.
> However, during our investigations, we found that the count value used to compute the
> variance is not always equal to the source data count: it is sometimes less, sometimes more.
> I went digging in the source and found the BaggedPoint.convertToBaggedRDDSamplingWithReplacement
> method, which uses a Poisson distribution to assign a weight to each row. This distribution
> does not guarantee that the total of the weights for a given tree equals the source data count.
> Looking around here, it seems this is done for performance reasons, because the approximation
> it gives is good enough, especially when dealing with very large datasets.
> However, I could not find any documentation that clearly explains this. Do you have
> a link on the subject?
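
For illustration only, the behaviour described in the report can be reproduced outside Spark with a small Python sketch. It assumes that each row independently receives one Poisson-distributed weight per tree, with mean 1.0 (that is, a subsampling rate of 1.0, which is an assumption here). The names poisson_draw, bagged_weights and per_tree_totals are hypothetical helpers written for this example, not part of Spark's API. Exact sampling with replacement gives each row a Binomial(N, 1/N) count, which approaches Poisson(1) as N grows; drawing the weights independently drops the constraint that they sum to N, so each tree's total is itself Poisson(N), with mean N and standard deviation sqrt(N).

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    # Keep multiplying uniforms until the product drops below exp(-lam).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def bagged_weights(num_rows, num_trees, rate=1.0, seed=42):
    """Return one weight vector per row, holding one Poisson weight per tree.

    Hypothetical stand-in for the per-row weights described in the report;
    `rate` plays the role of an assumed subsampling rate.
    """
    rng = random.Random(seed)
    return [[poisson_draw(rng, rate) for _ in range(num_trees)]
            for _ in range(num_rows)]


def per_tree_totals(weights):
    """Sum the weights column-wise: the effective sample count of each tree."""
    return [sum(column) for column in zip(*weights)]


num_rows = 10_000
totals = per_tree_totals(bagged_weights(num_rows, num_trees=5))
for tree, total in enumerate(totals):
    print(f"tree {tree}: total weight {total} (source rows: {num_rows})")
```

With 10,000 rows the standard deviation of each total is 100, so the totals are close to the source row count but rarely equal to it. The relative deviation shrinks like 1/sqrt(N), which is consistent with the remark that the approximation is good enough for very large datasets.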
