spark-user mailing list archives

From Robin East <robin.e...@xense.co.uk>
Subject Re: Spark ML - Is IDF model reusable
Date Tue, 01 Nov 2016 11:24:37 GMT
Fit it on the training data when you evaluate the model. After that you can either apply that
same model to unseen data, or re-train (fit) on all your data before applying it to unseen data.

fit and transform are two different things: fit creates a model; transform applies a model to
data to produce transformed output. If you use your training data in a subsequent step
(e.g. running logistic regression or some other machine learning algorithm), then you need
to transform the training data with the IDF model before passing it to that step.
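
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch (not Spark code). It assumes the smoothed weighting log((|D| + 1) / (DF(t, D) + 1)) given in the Spark MLlib documentation. The names fit_idf, idf_weight and transform and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """'fit': learn |D| and per-term document frequencies from tokenized docs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {"n_docs": len(docs), "doc_freq": doc_freq}


def idf_weight(model, term):
    """Smoothed IDF: log((|D| + 1) / (DF(t, D) + 1)).

    Assumption of this sketch: a term never seen during fit has DF = 0.
    """
    return math.log((model["n_docs"] + 1) / (model["doc_freq"].get(term, 0) + 1))


def transform(model, docs):
    """'transform': TF comes from each document given here, IDF from the fitted model."""
    return [
        {term: count * idf_weight(model, term) for term, count in Counter(tokens).items()}
        for tokens in docs
    ]


# Hypothetical toy corpus, already tokenized.
train = [["spark", "ml", "idf"], ["spark", "graphx"], ["spark", "idf", "idf"]]
test = [["spark", "idf", "streaming"]]

model = fit_idf(train)                # fit on training data only
train_vecs = transform(model, train)  # transform training data for the next stage
test_vecs = transform(model, test)    # reuse the same fitted model on held-out data
```

In this sketch transform never updates |D| or DF. Only the term counts come from the documents being transformed, which is why one fitted model can be reused on test data and later on unseen data.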

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

> On 1 Nov 2016, at 11:18, Nirav Patel <npatel@xactlycorp.com> wrote:
> 
> Just to reiterate what you said: I should fit the IDF model only on training data, then
> reuse it for the test data and later for unseen data when making predictions.
> 
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.east@xense.co.uk> wrote:
> The point of setting aside a portion of your data as a test set is to mimic applying
> your model to unseen data. If you fit your IDF model to all your data, the IDF weights
> have already seen the test documents, so any evaluation on your test set is likely to
> look better than performance on ‘real’ unseen data. Effectively you would have overfit
> your model.
> 
>> On 1 Nov 2016, at 10:15, Nirav Patel <npatel@xactlycorp.com> wrote:
>> 
>> FYI, I do reuse the IDF model when making predictions on new unlabeled data, but
>> not between training and test data while training a model.
>> 
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npatel@xactlycorp.com> wrote:
>> I am using the IDF estimator/model (TF-IDF) to convert text features into vectors. Currently,
>> I fit the IDF model on all sample data and then transform it. I read somewhere that I should
>> split my data into training and test sets before fitting the IDF model: fit IDF only on the
>> training data and then use the same transformer to transform both training and test data.
>> This raises more questions:
>> 1) Why would you do that? What exactly does IDF learn during fitting that it can
>> reuse to transform any new dataset? Perhaps the idea is to keep the same values for |D|
>> and DF(t, D) while using a new TF(t, d)?
>> 2) If not, then fitting and transforming separately seems redundant for the IDF model.