spark-user mailing list archives

From Sean Owen <>
Subject Re: Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data
Date Wed, 02 Nov 2016 18:11:33 GMT
I would also only fit these on training data. There are probably some
corner cases where letting these ancillary transforms see test data results
in a target leak, though I can't really think of a good example.

More to the point, you're probably fitting these as part of a pipeline and
that pipeline as a whole is only fed with training data during model

On Wed, Nov 2, 2016 at 6:05 PM Nirav Patel <> wrote:

> It is very clear that for ML algorithms (classification, regression) the
> Estimator is fit only on training data, but it is not as clear for other
> estimators, such as IDF.
> IDF is a feature transformation model, but having both an IDF estimator and
> a transformer makes it a little confusing what exactly happens when fitting
> on one dataset versus transforming another.
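
To make the fit/transform split for IDF concrete, here is a minimal plain-Python sketch, not Spark code. The helper names fit_idf and transform_tfidf and the toy documents are hypothetical. The weight formula log((m + 1) / (df + 1)) is the smoothed form given in the Spark MLlib documentation; giving weight 0 to terms never seen during fitting is a simplifying assumption of this sketch and is not claimed to match Spark's behaviour.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn one IDF weight per term, using only the documents passed in."""
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: log((m + 1) / (df + 1)).
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}


def transform_tfidf(idf, doc):
    """Apply already-learned weights without updating them.

    Assumption of this sketch: terms not seen during fitting get weight 0.
    """
    tf = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


# Hypothetical toy data: each document is a list of tokens.
train = [["spark", "ml"], ["spark", "idf"], ["spark", "pipeline"]]
test = [["spark", "idf", "leak"]]

# "Fitting" computes corpus statistics from the training documents only.
idf_model = fit_idf(train)

# "Transforming" reuses those statistics on documents the model never saw.
test_features = [transform_tfidf(idf_model, doc) for doc in test]
```

The learned weights depend only on train, so nothing about the test documents (for example, the term "leak") influences them. Fitting on train plus test would change m and the document frequencies, which is how held-out data could end up shaping the features.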
