Hi Ted,
First of all, thanks for your answer.
> In any case, you should look at the format already in use by Mahout tools to
> match those to what you do.
>
>> I currently see one major issue with this approach: our feature space,
>> and thus our feature vectors, will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
>
> Not necessarily. If you use sparse vectors, then the dot products required
> are pretty fast.
Yes, we are already using the Mahout provided sparse vector implementation.
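For anyone following along, here is a minimal Python sketch (not the Mahout implementation; names are illustrative) of why sparse dot products stay cheap: the cost scales with the number of non-zero entries, not with the nominal size of the feature space.

```python
# Sketch of a sparse dot product: cost is proportional to the number of
# non-zero entries, not to the dimensionality of the feature space.
# Illustrative only; Mahout's sparse vectors use the same idea in Java.

def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {index: value} maps."""
    # Iterate over the smaller map and look up entries in the larger one.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(index, 0.0) for index, value in a.items())

# Two documents in a nominally huge feature space, but with few non-zeros:
doc1 = {3: 0.5, 1_000_000: 1.2}
doc2 = {3: 2.0, 42: 0.7}
print(sparse_dot(doc1, doc2))  # only the shared index 3 contributes: 1.0
```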
> In addition, you could make use of some of the random indexing or simhash
> methods. At the simplest level, simply assign a low-dimensional, unit-length
> random vector to each feature. Then the vector for each document would be
> the IDF-weighted sum of the vectors for each feature. If you use a moderate-
> to high-dimensional vector (>50 and <500 dimensions), this will give you nice
> fixed-length vectors that will pretty accurately preserve dot products of
> the original data.
Random indexing sounds like a really interesting technique for feature
extraction in the tuwoc project. I think this approach, combined with
more selective data cleansing and feature selection, should provide us
with reasonable feature vectors. Thanks for the hint!
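To make sure I understood the simplest variant correctly, here is a small Python sketch (my own toy version, not Mahout code; DIM, index_vector, and document_vector are made-up names): each feature gets a fixed low-dimensional random unit vector, and a document vector is the IDF-weighted sum of its features' vectors.

```python
import numpy as np

# Toy sketch of the simplest random-indexing scheme described above.
# Every name here is illustrative; this is not a Mahout API.

DIM = 100  # between ~50 and ~500 dimensions, per the suggestion
rng = np.random.default_rng(42)
_index_vectors = {}

def index_vector(feature: str) -> np.ndarray:
    """Fixed random unit-length vector assigned to a feature."""
    if feature not in _index_vectors:
        v = rng.normal(size=DIM)
        _index_vectors[feature] = v / np.linalg.norm(v)
    return _index_vectors[feature]

def document_vector(tf: dict, idf: dict) -> np.ndarray:
    """IDF-weighted sum of the index vectors of the document's terms."""
    doc = np.zeros(DIM)
    for term, count in tf.items():
        doc += count * idf.get(term, 0.0) * index_vector(term)
    return doc

# Documents sharing weighted terms end up with a large positive dot product:
idf = {"hadoop": 2.0, "cluster": 1.5, "banana": 3.0}
d1 = document_vector({"hadoop": 2, "cluster": 1}, idf)
d2 = document_vector({"hadoop": 1, "cluster": 2}, idf)
```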
Cheers
Max
PS:
Do you have further details on the SVD implementation you were
mentioning (e.g. a paper detailing the approach)? I once implemented
a streaming SVD algorithm for Cell and would be interested to see the
algorithm that was used for Mahout. Thanks.
On Mon, Nov 30, 2009 at 12:25 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <mheimel@googlemail.com> wrote:
>
>> ...
>> Currently we use a rather simple process: compute, for each document,
>> the TF-IDF of all terms in the corpus. This is implemented
>> straightforwardly as a three-step map/reduce job. First, a map job
>> computes (and serializes to HBase) TF histograms for each document.
>> Then a reduce job computes the IDF of all terms occurring in the
>> corpus and serializes the list of term/IDF pairs to HDFS. Finally, a
>> third map job uses the serialized term/IDF pairs and TF histograms to
>> compute a feature vector for each document. So basically, our feature
>> space is the set of all term/IDF pairs.
>>
>
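The stages described above can be sketched in plain Python (in-memory only, without the Hadoop/HBase/HDFS machinery; the toy corpus and names are made up):

```python
import math
from collections import Counter

# In-memory sketch of the pipeline: TF histograms per document,
# corpus-wide IDF, then sparse TF-IDF feature vectors.

docs = {
    "d1": ["mahout", "clustering", "mahout"],
    "d2": ["clustering", "hadoop"],
    "d3": ["hadoop", "hadoop", "hbase"],
}

# Stage 1 (map): per-document term-frequency histograms.
tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

# Stage 2 (reduce): IDF for every term in the corpus.
n_docs = len(docs)
doc_freq = Counter(term for hist in tf.values() for term in hist)
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Stage 3 (map): combine TF histograms with IDF into sparse vectors.
vectors = {
    doc_id: {term: count * idf[term] for term, count in hist.items()}
    for doc_id, hist in tf.items()
}
```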
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
>
> In any case, you should look at the format already in use by Mahout tools to
> match those to what you do.
>
>> I currently see one major issue with this approach: our feature space,
>> and thus our feature vectors, will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
>
> Not necessarily. If you use sparse vectors, then the dot products required
> are pretty fast.
>
>
>> We will probably have to perform some kind of feature
>> reduction during feature extraction to get smaller, but still
>> expressive, feature vectors. One idea would, for example, be to
>> perform PCA on the "complete" feature vectors in order to identify
>> dimensions that can be pruned. However, this might be computationally
>> too expensive. Since I am not very experienced in this field, I hoped
>> that some of you could share your thoughts or suggestions on the issue.
>>
>
> Read the archives for mahout-dev. Several developers are working on SVD
> algorithms that run in parallel to do what you need.
>
> In addition, you could make use of some of the random indexing or simhash
> methods. At the simplest level, simply assign a low-dimensional, unit-length
> random vector to each feature. Then the vector for each document would be
> the IDF-weighted sum of the vectors for each feature. If you use a moderate-
> to high-dimensional vector (>50 and <500 dimensions), this will give you nice
> fixed-length vectors that will pretty accurately preserve dot products of
> the original data. This is pretty much what a simhash does, except that
> simhashes go one step further and binarize the vectors.
>
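A minimal sketch of that extra binarization step, assuming the projected document vectors from the scheme above (illustrative Python, not a production simhash):

```python
import numpy as np

# Simhash's extra step: keep only the signs of the projected document
# vector, giving a compact 0/1 signature. Similar documents (small angle)
# agree on most bits, so Hamming distance approximates angular distance.

def simhash_bits(doc_vector: np.ndarray) -> np.ndarray:
    """Binarize a projected document vector into a 0/1 signature."""
    return (doc_vector > 0).astype(np.uint8)

rng = np.random.default_rng(0)
DIM = 64
v1 = rng.normal(size=DIM)
v2 = v1 + 0.1 * rng.normal(size=DIM)   # a near-duplicate document
v3 = rng.normal(size=DIM)              # an unrelated document

h1, h2, h3 = simhash_bits(v1), simhash_bits(v2), simhash_bits(v3)
# The near-duplicate differs from v1 on far fewer bits than the
# unrelated document does:
print(int(np.sum(h1 != h2)), int(np.sum(h1 != h3)))
```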
> A slightly more involved approach would be to use these same initial random
> vectors and update them so that the new vector for each term or other
> feature is the IDF-weighted sum of the index vectors of terms or features
> that occur nearby. This makes it so that terms appearing in similar contexts
> will have similar directions, which is a simple step toward getting vectors
> with semantic value. This can be done very efficiently in a single
> map/reduce pass over your data. You would use these feature vectors just as
> you would with the simhash technique. This class of techniques is known as
> random indexing.
>
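A toy, single-pass version of that context-update step (my own sketch; the context window size and all names are assumptions, not from the original mail):

```python
import numpy as np

# Context-based random indexing: each term's vector becomes the
# IDF-weighted sum of the index vectors of terms occurring nearby, so
# terms that appear in similar contexts point in similar directions.

DIM = 100
WINDOW = 2  # context radius; an arbitrary choice for this sketch
rng = np.random.default_rng(1)

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def context_vectors(corpus: list, idf: dict) -> dict:
    # Fixed random unit-length index vectors, one per term.
    vocab = {t for doc in corpus for t in doc}
    index = {t: unit(rng.normal(size=DIM)) for t in vocab}
    # Single pass over the data, accumulating neighbours' index vectors.
    ctx = {t: np.zeros(DIM) for t in vocab}
    for doc in corpus:
        for i, term in enumerate(doc):
            for j in range(max(0, i - WINDOW), i + WINDOW + 1):
                if j != i and j < len(doc):
                    ctx[term] += idf.get(doc[j], 1.0) * index[doc[j]]
    return ctx

corpus = [["big", "data", "cluster"], ["big", "data", "hadoop"]]
idf = {"big": 1.0, "data": 1.0, "cluster": 2.0, "hadoop": 2.0}
ctx = context_vectors(corpus, idf)
# "cluster" and "hadoop" share identical contexts here, so their
# context vectors coincide.
```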
> A third approach is to use random projections to make computation of the SVD
> of the document x term matrix tractable. In fact, for your purposes, you can
> use a truncated and simplified algorithm that only computes the singular
> vectors for the terms, which you would then use in a similar way to the
> first two methods. Alternatively, you can use the truncated algorithm to
> only compute the singular vectors for the documents without ever computing
> term vectors. This is useful if all you want to do is cluster the documents
> and don't really care about the term vectors. As I mentioned, there should
> be a bit of code appearing shortly that does some or all of this. The speed
> of this approach should be quite high.
>
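For reference, a small dense sketch of the random-projection route to truncated document vectors (plain NumPy, a toy stand-in for whatever the Mahout parallel implementation does; the oversampling amount and all names are my assumptions):

```python
import numpy as np

# Randomized truncated SVD sketch: project the document x term matrix onto
# a few random directions, orthonormalize, then run a small dense SVD.
# Only document-side singular vectors are formed, never term vectors.

rng = np.random.default_rng(7)

def truncated_document_vectors(A: np.ndarray, k: int) -> np.ndarray:
    """Approximate rank-k document coordinates of A (docs x terms)."""
    n_docs, n_terms = A.shape
    omega = rng.normal(size=(n_terms, k + 5))  # random projection + oversampling
    Y = A @ omega                   # (n_docs, k+5); cheap when A is sparse
    Q, _ = np.linalg.qr(Y)          # orthonormal basis for the range of A
    B = Q.T @ A                     # small (k+5, n_terms) matrix
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small[:, :k]          # approximate left singular vectors
    return U * s[:k]                # scaled document coordinates

A = rng.normal(size=(20, 1000))     # toy document x term matrix
docs_k = truncated_document_vectors(A, k=5)
print(docs_k.shape)  # (20, 5)
```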
