spark-user mailing list archives

From Nirav Patel <npa...@xactlycorp.com>
Subject Re: Spark ML - Is IDF model reusable
Date Tue, 01 Nov 2016 11:54:08 GMT
Yes, I do apply NaiveBayes after IDF.

"you can re-train (fit) on all your data before applying it to unseen
data." Did you mean I can reuse that model to transform both training and
test data?

Here's the process:

Datasets:

   1. Full sample data (labeled)
   2. Training (labeled)
   3. Test (labeled)
   4. Unseen (non-labeled)

Here are two workflow options I see:

Option - 1 (currently using)

   1. Fit IDF model (idf-1) on full sample data
   2. Apply (transform) idf-1 on full sample data
   3. Split data set into training and test data
   4. Fit ML model on training data
   5. Apply (transform) model on test data
   6. Apply (transform) idf-1 on unseen data
   7. Apply (transform) model on unseen data

Option - 2

   1. Split sample data into training and test data
   2. Fit IDF model (idf-1) only on training data
   3. Apply (transform) idf-1 on training data
   4. Apply (transform) idf-1 on test data
   5. Fit ML model on training data
   6. Apply (transform) model on test data
   7. Apply (transform) idf-1 on unseen data
   8. Apply (transform) model on unseen data

So you are suggesting Option-2 in this particular case, right?
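
To make Option 2 concrete, here is a minimal sketch in plain Python. It is a stand-in for illustration only, not Spark's API: the helper names fit_idf and transform, the toy documents, and the fixed split are all assumptions. The weight formula log((m + 1) / (df + 1)) is the smoothed IDF described in the Spark MLlib documentation. The classifier steps are left as comments.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Fit step: learn the document count and per-term document frequencies."""
    df = Counter(term for doc in docs for term in set(doc))
    return {"num_docs": len(docs), "df": df}

def transform(model, doc):
    """Transform step: TF comes from the new document, IDF from the fitted model."""
    m, df = model["num_docs"], model["df"]
    tf = Counter(doc)
    # Terms never seen during fit have df == 0, so they get weight log(m + 1).
    return {t: n * math.log((m + 1) / (df[t] + 1)) for t, n in tf.items()}

# Hypothetical labeled sample: (tokens, label)
sample = [
    (["spark", "ml", "idf"], 1),
    (["spark", "sql", "join"], 0),
    (["idf", "weights", "ml"], 1),
    (["sql", "query", "plan"], 0),
]

# Step 1: split first (a fixed split here, purely for reproducibility)
train, test = sample[:3], sample[3:]

# Step 2: fit idf-1 on training documents only
idf_1 = fit_idf([tokens for tokens, _ in train])

# Steps 3-4: transform training and test data with the same idf-1
train_vecs = [transform(idf_1, tokens) for tokens, _ in train]
test_vecs = [transform(idf_1, tokens) for tokens, _ in test]

# Steps 5-6: fit the classifier on train_vecs, evaluate on test_vecs (omitted)

# Steps 7-8: unseen data goes through the same idf-1, then the classifier
unseen_vecs = [transform(idf_1, ["spark", "query"])]
```

Under Option 1 the only change would be calling fit_idf on all four documents before the split, which lets the test document shift the weights that the classifier is later evaluated with.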

On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.east@xense.co.uk> wrote:

> Fit it on training data to evaluate the model. You can either use that
> model to apply to unseen data or you can re-train (fit) on all your data
> before applying it to unseen data.
>
> fit and transform are 2 different things: fit creates a model, transform
> applies a model to data to create transformed output. If you are using your
> training data in a subsequent step (e.g. running logistic regression or
> some other machine learning algorithm) then you need to transform your
> training data using the IDF model before passing it through the next step.
>
> ------------------------------------------------------------
> -------------------
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 1 Nov 2016, at 11:18, Nirav Patel <npatel@xactlycorp.com> wrote:
>
> Just to re-iterate what you said, I should fit IDF model only on training
> data and then re-use it for both test data and then later on unseen data to
> make predictions.
>
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.east@xense.co.uk> wrote:
>
>> The point of setting aside a portion of your data as a test set is to try
>> and mimic applying your model to unseen data. If you fit your IDF model to
>> all your data, any evaluation you perform on your test set is likely to
>> look better than it would on ‘real’ unseen data, because the test documents
>> have already influenced the IDF weights. Effectively you would have
>> overfit your model.
>>
>> On 1 Nov 2016, at 10:15, Nirav Patel <npatel@xactlycorp.com> wrote:
>>
>> FYI, I do reuse the IDF model when making predictions against new unlabeled
>> data, but not between training and test data while training a model.
>>
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npatel@xactlycorp.com>
>> wrote:
>>
>>> I am using the IDF estimator/model (TF-IDF) to convert text features into
>>> vectors. Currently, I fit the IDF model on all sample data and then
>>> transform that same data. I read somewhere that I should split my data into
>>> training and test sets before fitting the IDF model: fit IDF only on the
>>> training data and then use the same transformer to transform both training
>>> and test data.
>>> This raises more questions:
>>> 1) Why would you do that? What exactly does IDF learn during the fitting
>>> process that it can reuse to transform any new dataset? Perhaps the idea is
>>> to keep the same values for |D| and DF(t, D) while using a new TF(t, d)?
>>> 2) If not, then fitting and transforming seem redundant for the IDF model.
>>>
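
For reference, a sketch of the arithmetic behind question 1, using the smoothed formula given in the Spark MLlib feature documentation. Here D is the corpus the IDF model was fit on, and d is any document later passed to transform, whether or not it was part of D.

```latex
\mathrm{IDF}(t, D) = \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1}
\qquad
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
```

Read this way, fitting fixes |D| and DF(t, D), and transforming only recomputes TF(t, d) for each new document.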
>>
>>
>>
>>
>> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
>>
>> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
>> <https://www.linkedin.com/company/xactly-corporation>  [image: Twitter]
>> <https://twitter.com/Xactly>  [image: Facebook]
>> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
>> <http://www.youtube.com/xactlycorporation>
>>
>>
>>
>
>
>
