mahout-user mailing list archives

From Alok Tanna <tannaa...@gmail.com>
Subject Re: Mahout : 20-newsgroups Classification Example : Split command
Date Thu, 14 Jan 2016 22:00:32 GMT
Thank you, Andrew, for your input. I will try the example in Scala.

So this 20-newsgroups example cannot be used with other data sets to test
it once the split is done, is that right?

Thanks,
Alok Tanna

On Thu, Jan 14, 2016 at 4:26 PM, Andrew Palumbo <ap.dev@outlook.com> wrote:

> The poor results you are seeing by testing are because you've run
> seq2sparse on each set independently.   This will create two different
> dictionaries, which serve as the vector index for each term in your
> vocabulary.  You must use the same dictionary that you trained your model
> on to vectorize your holdout set.  There is an example for doing this in
> Scala, using the new Mahout Samsara environment here:
>
>
> http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
>
> See the "Define a function to tokenize and vectorize new text using our
> current dictionary" section.
>
>
>
> ________________________________________
> From: Alok Tanna <tannaalok@gmail.com>
> Sent: Thursday, January 14, 2016 2:31 PM
> To: user@mahout.apache.org
> Subject: Mahout : 20-newsgroups Classification Example : Split command
>
> Hi,
>
> This request is in reference to the 20-newsgroups Classification Example at
> the link below:
> https://mahout.apache.org/users/classification/twenty-newsgroups.html
>
> I am able to run the example and get the results as mentioned in the link,
> but when I try to run this example without the split command, the
> results are not the same. Also, when I run other test data against
> the same model, the results are not accurate.
>
> Can this example be run without the split command?
>
> Basically, I am trying to do this:
>
> I took two datasets, one for training and one for testing.
>
> I ran the commands below on both sets:
> 1. seqdirectory
> 2. seq2sparse
>
> Now I have vectors generated for both datasets.
> - Run the trainnb command using the first dataset's vector output. So instead of
> training a model on 80% of the data, I am using the whole dataset.
> - Run the testnb command using the second dataset's vector output. This is not the
> 20% of the data; it's a completely new dataset, used solely for testing.
>
> So instead of using mahout split, I have specified a separate dataset for
> testing the model.
>
> The results of this exercise are totally different from what I get when I
> use the split command to split the data.
>
>
> Thanks & Regards,
>
> Alok R. Tanna
>



-- 
Thanks & Regards,

Alok R. Tanna
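
Andrew's point about the two dictionaries can be illustrated with a small
sketch. This is plain Python, not Mahout code: the functions build_dictionary
and vectorize and the sample documents are hypothetical stand-ins for what
seq2sparse does, and real seq2sparse output also carries term weights that
this sketch ignores. It only shows that vectorizing each set independently
assigns different indices to the same term, so a model trained on one index
space reads the other set's vectors as different words.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct term to an integer index, in order of first appearance."""
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


def vectorize(doc, dictionary):
    """Return a sparse term-count vector {index: count}.

    Terms that are not in the dictionary are dropped.
    """
    counts = Counter(t for t in doc.lower().split() if t in dictionary)
    return {dictionary[t]: n for t, n in counts.items()}


# Hypothetical toy corpora standing in for the training and holdout sets.
train_docs = ["space shuttle launch", "hockey game tonight"]
test_docs = ["hockey playoffs tonight", "shuttle launch delayed"]

# Building a dictionary per set mirrors running seq2sparse on each set alone.
train_dict = build_dictionary(train_docs)
test_dict = build_dictionary(test_docs)

# Vectorized with its own dictionary, the test document lands on indices
# that mean different terms in the training dictionary.
mismatched = vectorize(test_docs[0], test_dict)

# Vectorized with the training dictionary, the indices line up with the model.
consistent = vectorize(test_docs[0], train_dict)

print("index of 'hockey' in train dictionary:", train_dict["hockey"])
print("index of 'hockey' in test dictionary: ", test_dict["hockey"])
print("test doc, own dictionary:  ", mismatched)
print("test doc, train dictionary:", consistent)
```

In the sketch, "playoffs" is simply dropped when the training dictionary is
used, because the model has no feature for a term it never saw during
training.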
