spark-issues mailing list archives

From "Olivier Sannier (JIRA)" <>
Subject [jira] [Reopened] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights
Date Mon, 14 May 2018 15:46:00 GMT


Olivier Sannier reopened SPARK-23709:

I'm sorry, but I already asked this question on the mailing list and did not receive any answer.

I waited a couple of weeks before creating the issue here.

To me, this is an issue, at least a documentation one: the theory dictates that every per-tree
dataset has the same size as the source data, which is not the case with the current implementation.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights
> -------------------------------------------------------------------------------------------
>                 Key: SPARK-23709
>                 URL:
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Olivier Sannier
>            Priority: Critical
> When using a bagging method like RandomForest, the theory dictates that the source dataset is copied over with a subsample of rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where each row is associated with a weight for the final dataset, i.e. one weight for each tree requested from the RandomForest.
> RandomForest requires that the dataset for each tree be a random draw with replacement from the source data, with the same size as the source data.
> However, during investigations, we found that the count value used to compute the variance is not always equal to the source data count; it is sometimes less, sometimes more.
> I went digging in the source and found the BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a Poisson distribution to assign a weight to each row. This distribution does not guarantee that the total of weights for a given tree is equal to the source dataset size.
> Looking around in here, it seems this is done for performance reasons, because the approximation it gives is good enough, especially when dealing with very large datasets.
> However, I could not find any documentation that clearly explains this. Would you have any link on the subject?
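
The behaviour described above can be reproduced outside Spark with a rough sketch. This is not Spark's code: the function names are hypothetical, and the Poisson mean of 1.0 is an assumption chosen to mimic sampling with replacement at a subsampling rate of 1. Each row gets an independent Poisson weight per tree, so the per-tree total is itself a Poisson variable with mean N and standard deviation sqrt(N). It is therefore only approximately N, with a relative deviation that shrinks as N grows. For comparison, an exact bootstrap (N draws with replacement) always totals N.

```python
import math
import random


def poisson_draw(rng, mean=1.0):
    """Draw one Poisson(mean) sample using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def bagged_weights(num_rows, num_trees, rng):
    """Return weights[row][tree]: how often each row appears in each tree's sample.

    Hypothetical stand-in for the Poisson weighting described in the issue.
    """
    return [[poisson_draw(rng) for _ in range(num_trees)] for _ in range(num_rows)]


def per_tree_totals(weights):
    """Sum the weights column-wise, giving the effective sample size of each tree."""
    return [sum(column) for column in zip(*weights)]


def exact_bootstrap_weights(num_rows, rng):
    """Exact alternative for one tree: draw num_rows indices with replacement and count them."""
    counts = [0] * num_rows
    for _ in range(num_rows):
        counts[rng.randrange(num_rows)] += 1
    return counts


rng = random.Random(42)
num_rows, num_trees = 1000, 5
totals = per_tree_totals(bagged_weights(num_rows, num_trees, rng))
exact_total = sum(exact_bootstrap_weights(num_rows, rng))

# The Poisson totals scatter around num_rows; the exact bootstrap total equals it.
print("Poisson totals per tree:", totals)
print("Exact bootstrap total:", exact_total)
```

The Poisson variant needs no coordination between rows, which is presumably why it suits a distributed pass over the data. The exact variant has to know N up front and draw against the whole dataset.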
