mahout-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Clustering Demo
Date Sat, 24 May 2008 16:10:53 GMT

24 maj 2008 kl. 13.13 skrev Grant Ingersoll:
> These are interesting. Perhaps you want to commit LUCENE-725?

If I end up using it for this, then I will. I have never tried it out,
and there are no test cases, so I have no clue how well it works. Nor
are there any demonstrations of the features in the patch, but I
suppose our demo could be used to produce one.

I'll train it with the last few paragraphs on a per-author basis to
see how well it works.


We might want to wash out stuff like "24 maj 2008 kl. 13.13 skrev
Grant Ingersoll" too. That should not be too hard to figure out using
the headers if the data is stored in a way that allows for navigation
in the thread.
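
A minimal sketch of that clean-up, assuming the parent message's author
name has already been looked up from the thread headers. The function
name strip_reply_noise and the parent_author parameter are hypothetical,
and the heuristic does not handle attribution lines that a mail client
has wrapped over two lines:

```python
def strip_reply_noise(body: str, parent_author: str) -> str:
    """Drop quoted lines and the attribution line that introduces them.

    parent_author is assumed to come from the From header of the message
    this one replies to (found by navigating the thread).
    """
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        # Quoted text copied from the parent message.
        if stripped.startswith(">"):
            continue
        # Attribution line: names the parent author and ends with a colon,
        # e.g. "24 maj 2008 kl. 13.13 skrev Grant Ingersoll:".
        if stripped.endswith(":") and parent_author in stripped:
            continue
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    message = (
        "24 maj 2008 kl. 13.13 skrev Grant Ingersoll:\n"
        "> These are interesting.\n"
        "\n"
        "If I end up using it for this, then I will."
    )
    print(strip_reply_noise(message, "Grant Ingersoll"))
```

Matching on the author name rather than on a word like "wrote" keeps
the check independent of the language the mail client happens to use.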


But I'm honestly not sure whether these are premature, overkill
solutions. Perhaps the algorithms automatically penalise unrelated text
when given enough semiotic data. Perhaps attribute selection does the
same job in a shorter time.
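
As a rough illustration of the attribute selection idea, here is a
sketch that filters terms by document frequency, so that boilerplate
appearing in nearly every message is dropped without any header
parsing. This is one simple form of attribute selection chosen for
illustration, not something taken from Mahout or from LUCENE-725; the
function name select_terms and the thresholds max_df and min_df are
assumptions:

```python
from collections import Counter


def select_terms(docs, max_df=0.5, min_df=2):
    """Return terms whose document frequency n satisfies
    min_df <= n <= max_df * len(docs).

    docs is a list of token lists, one per message.
    """
    df = Counter()
    for tokens in docs:
        # Count each term once per document.
        df.update(set(tokens))
    limit = max_df * len(docs)
    return {term for term, n in df.items() if min_df <= n <= limit}


if __name__ == "__main__":
    docs = [
        ["skrev", "lucene", "analyzer"],
        ["skrev", "mahout", "cluster"],
        ["skrev", "lucene", "patch"],
        ["skrev", "mahout", "kmeans"],
    ]
    # "skrev" occurs in every message and is filtered out as too common.
    print(sorted(select_terms(docs)))
```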

>  I was wondering whether we should consider asking Lucene to put up  
> an Analyzer-only jar (i.e. a separate jar that combines the
> Analyzer/TokenStream definitions with the contrib Analyzers  
> package.)  Of course, we may have uses for the rest of Lucene as  
> well, so maybe not.


To me that just sounds like more work for both projects.

It'd be great if we managed to put all future text analysis
improvements as patches in Lucene rather than Mahout, but in the long
run I think we'll be branching quite a bit of the Lucene analysis code
to avoid spending time writing backwards-compatible code to support
Lucene users rather than Mahout users. See LUCENE-889.


      karl
