spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data
Date Wed, 02 Nov 2016 18:11:33 GMT
I would also only fit these on training data. There are probably some
corner cases where letting these ancillary transforms see test data results
in a target leak, though I can't really think of a good example.

More to the point, you're probably fitting these as part of a pipeline and
that pipeline as a whole is only fed with training data during model
building.
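
To make the fit/transform split concrete for something like IDF, here is a
minimal plain-Python sketch. It is not Spark code: fit_idf and
transform_tfidf are hypothetical helpers that stand in for what IDF.fit and
IDFModel.transform do conceptually, and the toy documents are made up. It
assumes the smoothed formula log((m + 1) / (df + 1)) that the Spark MLlib
docs give for IDF. The point is that the document frequencies are learned
from the training documents only, and the test document is merely rescaled
with those frozen weights.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn per-term IDF weights from the training documents only."""
    num_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))  # count each term once per document
    # Smoothed IDF: log((m + 1) / (df + 1))
    weights = {
        term: math.log((num_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }
    return weights, num_docs


def transform_tfidf(weights, num_docs, doc):
    """Rescale raw term counts with the already-fitted IDF weights."""
    # A term never seen during fitting has df = 0, so its weight is log(m + 1).
    unseen_weight = math.log(num_docs + 1)
    term_freq = Counter(doc)
    return {
        term: count * weights.get(term, unseen_weight)
        for term, count in term_freq.items()
    }


# Hypothetical toy data: three training documents, one test document.
train_docs = [
    ["spark", "ml", "pipeline"],
    ["spark", "idf"],
    ["ml", "model"],
]
test_doc = ["spark", "spark", "streaming"]

# "fit" sees only the training documents ...
weights, num_docs = fit_idf(train_docs)

# ... and "transform" applies the frozen weights to unseen data.
test_vector = transform_tfidf(weights, num_docs, test_doc)
print(test_vector)
```

Nothing about the test document feeds back into weights here, which is the
same guarantee you get when the whole pipeline is fit on training data.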

On Wed, Nov 2, 2016 at 6:05 PM Nirav Patel <npatel@xactlycorp.com> wrote:

> It is very clear that for ML algorithms (classification, regression) the
> Estimator is fit only on training data, but it's less clear for other
> estimators such as IDF.
> IDF is a feature transformation model, but having both an IDF estimator and
> a transformer makes it a little confusing what exactly fitting on one
> dataset versus transforming another dataset does.
