mahout-user mailing list archives

From Bhaskar Ghosh <>
Subject Re: unknown test data twenty-newsgroups example
Date Fri, 01 Oct 2010 19:56:55 GMT
Thanks Ted, Robin, and Neil. My doubts are now cleared, and I will try the 
approach now.
Bhaskar Ghosh
Hyderabad, India

"Ignorance is Bliss... Knowledge never brings Peace!!!"

From: Ted Dunning <>
Cc: Bhaskar Ghosh <>;
Sent: Sat, 2 October, 2010 12:11:53 AM
Subject: Re: unknown test data twenty-newsgroups example

Yes.  Instance = training example.

Your method of duplicating lines is just what Robin meant.

On Fri, Oct 1, 2010 at 3:55 AM, Robin Anil <> wrote:

>> Let me list what I understood. Please confirm whether I got it right.
>> Add duplicate extra lines many times in an extra file (conforming to the
>> format required by the Bayes Classifier) in the format
>> <class-name1><tab><word1> <word2>
>> if I want to increase the weight of word1 and word2, so that text with
>> those words has a higher chance of getting classified as <class-name1>
> No. Duplicating lines increases DF and therefore decreases IDF (inverse
> document frequency), so the weight goes down. To increase the weight of a
> word, repeat the word in the same line.
