mahout-user mailing list archives

From Angel Luis Scull <ascu...@facinf.uho.edu.cu>
Subject Document vector
Date Tue, 26 Nov 2013 20:50:58 GMT
Hello, I'm trying to use Mahout in a Topic Detection and Tracking (TDT) system.
Currently I'm working on the tracking task of TDT, and I need to implement the
following algorithm using Mahout:

1 Th = set of training documents
2 VTd = the vector representation of Th
3 For each document D in the stream of documents (the number of documents
is unknown)
     do
         (a) Use D to update the idf statistics
         (b) Apply tf*idf to VD and to VTd (where VD is the vector
representation of document D)
         (c) Compute the similarity between VD and VTd
          and so on ...
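
To make steps (a) to (c) concrete, here is a minimal sketch in plain Java. It
does not use Mahout: sparse maps keyed by term stand in for the document
vectors, and the class name IncrementalTfIdf, its methods, the smoothed idf
formula and the choice of cosine as the similarity measure are all
assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: keeps document frequencies
// up to date as documents arrive and weights raw term counts by tf*idf.
class IncrementalTfIdf {
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    // Step (a): fold one document's terms into the idf statistics.
    void update(List<String> terms) {
        numDocs++;
        // Count each distinct term once per document.
        for (String t : new HashSet<>(terms)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    // Raw term counts of a document as a sparse map (term -> count).
    static Map<String, Double> termFreq(List<String> terms) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : terms) {
            tf.merge(t, 1.0, Double::sum);
        }
        return tf;
    }

    // Step (b): apply tf*idf using the current statistics.
    // Assumed idf variant: log((N + 1) / (df + 1)) + 1, which stays
    // positive and finite even for terms never seen before.
    Map<String, Double> weight(Map<String, Double> tf) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0;
            out.put(e.getKey(), e.getValue() * idf);
        }
        return out;
    }

    // Step (c): cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            nb += v * v;
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }
}
```

In the loop, each incoming D would be passed to update(), then both the raw
counts of D and the raw counts of Th would be re-weighted with weight() before
calling cosine(). The raw counts are kept unweighted because the idf values
change with every new document.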

My problem is creating a RandomAccessSparseVector: I don't know how to
build that vector from a sequence file that contains the current document
in the stream.

Thanks in advance.


