mahout-user mailing list archives

From Junaid Surve <>
Subject Mahout's Text Similarity using HBase
Date Mon, 21 May 2012 14:12:38 GMT

In my project we are trying to calculate the text similarity of a set of
documents, and I am facing two issues.


   I do not want to recalculate the term frequency (TF) of documents I have
   already processed. For example, say I have 10 documents and have calculated
   the TF and inverse document frequency (IDF) for all of them. Then 2 more
   documents arrive. I want to calculate the TF only for the 2 new documents,
   then use the TFs of all 12 documents to calculate the IDF for the 12
   documents as a whole.

   *How can I calculate the IDF of all the documents without calculating the
   TFs of the existing documents again?*
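
   To make the first issue concrete, here is a minimal sketch in plain Java
   of the incremental behaviour described above. It is not Mahout code: the
   class IncrementalTfIdf and all of its methods are hypothetical names, and
   the IDF formula is the textbook ln(N / df), which may differ from the
   weighting Mahout applies. The idea is that the per-document TF is computed
   once, a per-term document count is updated as each document arrives, and
   the IDF is derived from those counts on demand.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: stores TF per document and document frequency per term. */
class IncrementalTfIdf {
    // docId -> (term -> raw count); computed once per document, never recomputed
    private final Map<String, Map<String, Integer>> termFrequencies = new HashMap<>();
    // term -> number of documents that contain the term
    private final Map<String, Integer> documentFrequencies = new HashMap<>();

    /** Computes the TF of one new document and updates the document frequencies. */
    void addDocument(String docId, List<String> tokens) {
        if (termFrequencies.containsKey(docId)) {
            throw new IllegalArgumentException("Document already indexed: " + docId);
        }
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        termFrequencies.put(docId, tf);
        // Each distinct term in the new document raises its document frequency by one.
        for (String term : tf.keySet()) {
            documentFrequencies.merge(term, 1, Integer::sum);
        }
    }

    int documentCount() {
        return termFrequencies.size();
    }

    int documentFrequency(String term) {
        return documentFrequencies.getOrDefault(term, 0);
    }

    /** Textbook IDF, ln(N / df). Returns 0 for unseen terms (a choice made for this sketch). */
    double idf(String term) {
        int df = documentFrequency(term);
        return df == 0 ? 0.0 : Math.log((double) documentCount() / df);
    }

    /** TF-IDF weight using the stored TF and the current corpus-wide IDF. */
    double tfIdf(String docId, String term) {
        Map<String, Integer> tf = termFrequencies.get(docId);
        int count = (tf == null) ? 0 : tf.getOrDefault(term, 0);
        return count * idf(term);
    }
}
```

   With this shape, adding documents 11 and 12 only tokenises those two
   documents. The IDF of every term still reflects all 12, because it is
   derived from the stored counts and the current document total.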

   The number of documents may keep growing, which means the in-memory
   approach (InMemoryBayesDatastore) might become cumbersome. I would like to
   save the TF of all documents in an HBase table. When new documents arrive,
   I would calculate their TF, save it in the same table, and then read the
   TF of all documents from that table to calculate the IDF.

   *How can I use HBase to provide data to Mahout's Text Similarity instead
   of fetching it from the sequence file?*
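
   As an illustration of the storage layout I have in mind, here is a small
   plain-Java sketch with one row per document and one column per term. It
   deliberately uses no HBase or Mahout classes: TermFrequencyStore,
   InMemoryTermFrequencyStore and DocumentFrequencies are hypothetical names,
   and the sorted in-memory map only stands in for an HBase table. How to
   plug a real table into Mahout is exactly the open question.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical storage abstraction; an HBase-backed class could implement the same methods. */
interface TermFrequencyStore {
    void putTermFrequencies(String docId, Map<String, Integer> tf);

    Map<String, Map<String, Integer>> scanAll();
}

/** In-memory stand-in that mimics a sorted "row key -> column -> value" layout. */
class InMemoryTermFrequencyStore implements TermFrequencyStore {
    // Row key is the document id; each column is a term holding its raw count.
    private final TreeMap<String, TreeMap<String, Integer>> rows = new TreeMap<>();

    @Override
    public void putTermFrequencies(String docId, Map<String, Integer> tf) {
        rows.computeIfAbsent(docId, key -> new TreeMap<>()).putAll(tf);
    }

    @Override
    public Map<String, Map<String, Integer>> scanAll() {
        // Return a copy so callers cannot modify the stored rows.
        Map<String, Map<String, Integer>> copy = new TreeMap<>();
        rows.forEach((docId, tf) -> copy.put(docId, new TreeMap<>(tf)));
        return copy;
    }
}

class DocumentFrequencies {
    /** Derives "term -> number of documents" from stored rows, without re-reading any text. */
    static Map<String, Integer> fromStore(TermFrequencyStore store) {
        Map<String, Integer> df = new TreeMap<>();
        for (Map<String, Integer> tf : store.scanAll().values()) {
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }
}
```

   Only new documents are written with putTermFrequencies. The document
   frequencies needed for the IDF come from a scan over the stored rows, so
   the original text of the older documents is never processed again.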

