mahout-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Clustering Demo
Date Fri, 23 May 2008 18:15:48 GMT

On 17 May 2008, at 13:39, Grant Ingersoll wrote:
>
> On May 12, 2008, at 11:24 AM, Karl Wettin wrote:


Did anyone do anything with this? If not, I'll come up with something
in the beginning of June. I think it should be abstract enough to
handle other similar data sources (Apache mbox archives).


>> In what way can we prepare so it makes as much sense for as many
>> things as possible we might want to show off? What class fields can
>> we extract from the headers except for author and thread identity?
>> How do we want to tokenize the text (grams of words and
>> characters, stemming, stopwords, etc.), do we want to separate
>> quotation from author text so we can use different weights for
>> quotation, etc.?
>
> Let's just start simple with words and then enhance.


It might be interesting to take a look at what sort of tokenization
other libs do, the Weka StringToWordVector for instance (best viewed
from their GUI). We should be able to do much better than that with
what's available in Lucene. But a default chain of token streams that
is easy to set up is not a bad idea.
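
As a rough sketch of what such a default chain could look like (in
Python only for brevity; the real thing would be built from Lucene
token streams). All names here (word_tokenizer, lowercase, stop_filter,
default_chain) are made up for illustration and are not an existing
API:

```python
import re
from typing import Callable, Iterable, Iterator, List

# A filter takes a stream of tokens and returns a new stream of tokens.
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def word_tokenizer(text: str) -> Iterator[str]:
    """Split text into runs of ASCII letters, digits and apostrophes."""
    return (m.group(0) for m in re.finditer(r"[A-Za-z0-9']+", text))


def lowercase(tokens: Iterable[str]) -> Iterator[str]:
    """Lowercase every token."""
    for token in tokens:
        yield token.lower()


def stop_filter(stopwords: Iterable[str]) -> TokenFilter:
    """Build a filter that drops every token found in `stopwords`."""
    stop = frozenset(stopwords)

    def apply(tokens: Iterable[str]) -> Iterator[str]:
        return (t for t in tokens if t not in stop)

    return apply


def default_chain(text: str, filters: List[TokenFilter]) -> List[str]:
    """Tokenize `text`, then run the tokens through each filter in order."""
    tokens: Iterable[str] = word_tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)


tokens = default_chain(
    "We should be able to do better than that.",
    [lowercase, stop_filter({"we", "to", "be", "than", "that"})],
)
```

Keeping each step a plain function over a token stream means stemming
or n-grams could be added later as one more entry in the list.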

I also think we want some simple algorithmic stop word extraction.  
There is a simple one in LUCENE-1025 with the incorrect name  
HacGqfTermReducer.java.
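
To make "algorithmic stop word extraction" concrete, here is a minimal
sketch using a plain document-frequency cutoff. This is a swapped-in
heuristic, not a description of what the LUCENE-1025 class does; the
function name and the 0.8 threshold are assumptions:

```python
from collections import Counter
from typing import Iterable, List, Set


def extract_stopwords(docs: Iterable[List[str]],
                      max_doc_ratio: float = 0.8) -> Set[str]:
    """Return terms occurring in more than `max_doc_ratio` of all documents."""
    doc_freq: Counter = Counter()
    n_docs = 0
    for tokens in docs:
        n_docs += 1
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    if n_docs == 0:
        return set()
    return {term for term, df in doc_freq.items()
            if df / n_docs > max_doc_ratio}


corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
stopwords = extract_stopwords(corpus)
```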

It would be a simple thing to support different weights for subject
and body. Or any other field we might extract in the future (quoted
body, etc.).
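
Per-field weighting could be as simple as the sketch below: each
token adds its field's weight to the term's score. The function name,
the field names and the weights 2.0 and 1.0 are assumptions chosen
for illustration only:

```python
from collections import Counter
from typing import Dict, List


def weighted_term_vector(fields: Dict[str, List[str]],
                         weights: Dict[str, float],
                         default_weight: float = 1.0) -> Dict[str, float]:
    """Sum term occurrences over all fields, scaled by each field's weight."""
    vector: Counter = Counter()
    for name, tokens in fields.items():
        # Fields without an explicit weight fall back to `default_weight`.
        weight = weights.get(name, default_weight)
        for token in tokens:
            vector[token] += weight
    return dict(vector)


message = {
    "subject": ["clustering", "demo"],
    "body": ["clustering", "mail"],
}
vector = weighted_term_vector(message, {"subject": 2.0, "body": 1.0})
```

A new field such as quoted body would only need one more entry in
both dictionaries.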

We also want to get rid of signatures with quotes and whatnot in them.
That should be handled by some pre-pre-processing layer though, if you
ask me. LUCENE-725 can help out.
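
A very naive version of such a pre-pre-processing step is sketched
below. It assumes quoted lines start with ">" and that a signature
follows the conventional "-- " delimiter line; neither holds for all
mail (a signature without a delimiter is not caught, and attribution
lines like "X wrote:" end up as authored text). The function name is
made up, and this is not what LUCENE-725 does:

```python
from typing import Dict, List


def split_body(body: str) -> Dict[str, List[str]]:
    """Split a mail body into authored and quoted lines, dropping the signature."""
    authored: List[str] = []
    quoted: List[str] = []
    for line in body.splitlines():
        # "-- " is the usual delimiter; also accept it with the space stripped.
        if line.rstrip() == "--":
            break  # everything after the delimiter is treated as signature
        if line.lstrip().startswith(">"):
            # Strip any depth of ">" markers and the spaces around them.
            quoted.append(line.lstrip().lstrip("> ").rstrip())
        else:
            authored.append(line)
    return {"authored": authored, "quoted": quoted}


parts = split_body("> Let's just start simple.\nSounds good.\n-- \nkarl")
```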


Should we perhaps make this thread an issue?



      karl
