mahout-user mailing list archives

From "Chandra Mohan, Ananda Vel Murugan" <>
Subject RE: Feature vector generation from Bag-of-Words
Date Tue, 18 Jun 2013 13:00:12 GMT

Thanks. It did help. 


-----Original Message-----
From: Suneel Marthi [] 
Sent: Tuesday, June 18, 2013 10:55 AM
To: Chandra Mohan, Ananda Vel Murugan;
Subject: Re: Feature vector generation from Bag-of-Words

Check this link -

 From: "Chandra Mohan, Ananda Vel Murugan" <>
To: "" <>; Suneel Marthi <>

Sent: Tuesday, June 18, 2013 12:59 AM
Subject: RE: Feature vector generation from Bag-of-Words


I am implementing a slightly different variation of this solution and need some guidance.

I have a CSV file with two columns, REMARKS and CATEGORY. Based on the remarks, I train a Naïve
Bayes model that automatically assigns a category to each REMARKS entry. I followed this link
It works fine.

Now I have a slightly different requirement. I want the text in the REMARKS column to be tokenized
differently. I have some keywords, and when those keywords occur in the REMARKS text, I
want them to be kept intact and not split any further. For example, if the REMARKS text is "Sump pressure
is low", the default analyzer would split it into four tokens: "Sump", "pressure",
"is", "low". But I want it to be tokenized as "Sump pressure", "is", "low". I have implemented
a custom tokenizer that does this.
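
For readers following along, the tokenization described above could look roughly like the sketch below. It is not the poster's tokenizer and not a Lucene Analyzer; `PhraseTokenizer` is a hypothetical plain-Java helper, and the whitespace splitting, case-insensitive matching, and greedy longest-match rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Mahout or Lucene): keeps known keyword phrases as single tokens. */
class PhraseTokenizer {
    private final Set<String> phrases = new HashSet<>();
    private int maxWords = 1; // length in words of the longest known phrase

    PhraseTokenizer(List<String> keywordPhrases) {
        for (String p : keywordPhrases) {
            String normalized = p.trim().toLowerCase();
            phrases.add(normalized);
            maxWords = Math.max(maxWords, normalized.split("\\s+").length);
        }
    }

    /** Splits on whitespace, but emits a multi-word keyword as one token (longest match first). */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        if (text == null || text.trim().isEmpty()) {
            return tokens;
        }
        String[] words = text.trim().split("\\s+");
        int i = 0;
        while (i < words.length) {
            int consumed = 0;
            // Try the longest candidate phrase starting at word i, then shorter ones.
            for (int n = Math.min(maxWords, words.length - i); n >= 2; n--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                if (phrases.contains(candidate.toLowerCase())) {
                    tokens.add(candidate);
                    consumed = n;
                    break;
                }
            }
            if (consumed == 0) {
                tokens.add(words[i]); // no phrase starts here, emit the single word
                consumed = 1;
            }
            i += consumed;
        }
        return tokens;
    }
}
```

With the single keyword phrase "Sump pressure", calling `tokenize("Sump pressure is low")` yields the three tokens "Sump pressure", "is", "low".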

Now I want to vectorize this. I tried the pseudo code suggested below. I don't know how to
serialize these vectors into sequence files. When I run seq2sparse, apart from vectors, it
creates some other labelindex and dictionary files. I could not see the code to create those
files here. Am I missing something? I have started looking into org.apache.mahout.vectorizer
package. Any pointers would be of great help. 
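
For orientation, the dictionary that seq2sparse writes alongside the vectors is essentially a mapping from each term to an integer index, and each vector stores weights at those indices. The plain-Java sketch below illustrates only that idea. It does not use Hadoop's SequenceFile format or any Mahout class, and `TermDictionary` and its methods are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of a term dictionary plus sparse term-frequency vectors. */
class TermDictionary {
    // Insertion-ordered so indices are assigned in order of first appearance.
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

    /** Returns the index of a term, assigning the next free index on first sight. */
    int indexOf(String term) {
        Integer index = termToIndex.get(term);
        if (index == null) {
            index = termToIndex.size();
            termToIndex.put(term, index);
        }
        return index;
    }

    /** Builds a sparse vector (index -> term frequency) for one tokenized document. */
    Map<Integer, Double> vectorize(List<String> tokens) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : tokens) {
            vector.merge(indexOf(token), 1.0, Double::sum);
        }
        return vector;
    }

    int size() {
        return termToIndex.size();
    }
}
```

In a real pipeline the dictionary and the vectors would each be persisted in whatever format the downstream job expects; this sketch keeps both in memory.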


-----Original Message-----
From: Suneel Marthi [] 
Sent: Tuesday, May 21, 2013 10:21 PM
Subject: Re: Feature vector generation from Bag-of-Words

It should be easy to convert the pseudocode below to MapReduce to scale to a large collection
of documents.

From: Suneel Marthi <>
To: "" <> 
Sent: Tuesday, May 21, 2013 12:20 PM
Subject: Re: Feature vector generation from Bag-of-Words


Here's how I would do it.

1.  Create a collection of the 100 keywords that are of interest.

     Collection<String> keywords = new ArrayList<String>();
     keywords.addAll(<your 100 keywords>);

2.  For each text document, create a Multiset (which is a bag of words), retain only
      the terms of interest from (1), and use Mahout's feature encoders to build the vector.

     // Imports: com.google.common.collect.{HashMultiset, Multiset}, org.apache.mahout.math.{RandomAccessSparseVector, Vector},
     //          org.apache.mahout.vectorizer.encoders.{FeatureVectorEncoder, StaticWordValueEncoder}

     // Iterate through all the documents
     // (assumes each document is already tokenized into a List<String> of terms)
     for (List<String> terms : documents) {

       // Create a bag of words for each document
       Multiset<String> multiset = HashMultiset.create();

       // Create a RandomAccessSparseVector
       Vector v = new RandomAccessSparseVector(100); // 100 features for the 100 keywords

       for (String term : terms) {
         // Retain only those keywords that are of interest (from step 1)
         if (keywords.contains(term)) {
           multiset.add(term);
         }
       }

       // You now have a bag of words containing only the keywords with their term frequencies.
       // Use one of the Feature Encoders; refer to Section 14.3 of Mahout in Action for a
       // more detailed description of this process.
       FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
       for (Multiset.Entry<String> entry : multiset.entrySet()) {
         encoder.addToVector(entry.getElement(), entry.getCount(), v);
       }
     }
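
If a fixed position per keyword is preferred over hashed feature encoding, the keyword counts can be written straight into an array. The following is a plain-Java sketch of that alternative, not Mahout API: `KeywordVectorizer` is a hypothetical name, documents are assumed to be already tokenized, and the case-insensitive matching is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one vector dimension per keyword, weighted by term frequency. */
class KeywordVectorizer {
    private final Map<String, Integer> keywordIndex = new HashMap<>();

    KeywordVectorizer(List<String> keywords) {
        for (String keyword : keywords) {
            // Duplicates keep their first index; new keywords get the next free one.
            keywordIndex.putIfAbsent(keyword.toLowerCase(), keywordIndex.size());
        }
    }

    /** Counts how often each keyword occurs in the document; all other terms are ignored. */
    double[] vectorize(List<String> terms) {
        double[] vector = new double[keywordIndex.size()];
        for (String term : terms) {
            Integer index = keywordIndex.get(term.toLowerCase());
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }
}
```

Unlike a hashed encoder, this layout has no collisions and each dimension can be traced back to its keyword, at the cost of keeping the keyword-to-index map around.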



From: Stuti Awasthi <>
To: "" <> 
Sent: Tuesday, May 21, 2013 7:17 AM
Subject: Feature vector generation from Bag-of-Words

Hi all,

I have a query regarding feature vector generation for text documents.
I have read Mahout in Action and understood how to convert a text document into a feature vector
weighted by the TF or TF-IDF schemes. My use case is slightly different.

I have a few keywords, say 100, and I want to create the feature vectors of the text documents
using only these 100 keywords. So I would like to calculate the frequency of each keyword in
each document and generate the feature vector with those frequencies as weights.

Is there an existing way to do this, or will I need to write custom code?

Stuti Awasthi



