mahout-user mailing list archives

From Karl Wettin
Subject Re: Clustering Demo
Date Thu, 05 Jun 2008 19:09:37 GMT
Any more thoughts on this subject? I'll start coding this Tuesday.


On 24 May 2008 at 18:10, Karl Wettin wrote:
> On 24 May 2008 at 13:13, Grant Ingersoll wrote:
>> These are interesting. Perhaps you want to commit LUCENE-725?
> If I end up using it for this, then I will. I have never tried it out, and
> there are no test cases, so I have no idea how well it works. Nor
> are there any demonstrations of the features in the patch, but I
> suppose our demo could be used to produce one.
> I'll train it with the last few paragraphs on a per-author basis to
> see how well it works.
> We might want to wash out stuff like "24 maj 2008 kl. 13.13 skrev
> Grant Ingersoll" too. That should not be too hard to figure out using
> the headers, if the data is stored in a way that allows for
> navigation in the thread.
> But I'm honestly not sure whether this is preemptive overkill.
> Perhaps the algorithms automatically penalise unrelated text when given
> enough semiotic data. Perhaps attribute selection does the same job
> in a shorter time.
>> I was wondering whether we should consider asking Lucene to put up  
>> an Analyzer-only jar (i.e. a separate jar that combines the
>> Analyzer/TokenStream definitions with the contrib Analyzers  
>> package.)  Of course, we may have uses for the rest of Lucene as  
>> well, so maybe not.
> To me that just sounds like more work for both projects.
> It'd be great if we managed to put all future text analysis
> improvements as patches in Lucene rather than Mahout, but in the
> long run I think we'll be branching quite a bit of the Lucene
> analysis code to avoid spending time writing backwards-compatible
> code to support Lucene users rather than Mahout users. See LUCENE-889.
>     karl
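
The quoted message suggests washing out reply attribution lines such as "24 maj 2008 kl. 13.13 skrev Grant Ingersoll" by using the message headers. A minimal Python sketch of that idea follows. It is an illustration only, not code from Mahout or Lucene: the function name strip_attribution_lines and its parameters are hypothetical, and it assumes the author names have already been collected from the From headers of the messages in the thread.

```python
import re


def strip_attribution_lines(body, author_names):
    """Remove lines that end with a known thread author's name and a colon.

    Hypothetical helper: author_names is assumed to come from the From
    headers of the other messages in the same thread.
    """
    if not author_names:
        return body
    names = "|".join(re.escape(name) for name in author_names)
    # Any line (quoted or not) whose tail is "<author name>:" is treated
    # as a reply attribution, whatever language the mail client wrote it in.
    attribution = re.compile(r"^.*\b(?:" + names + r")\s*:\s*$")
    kept = [line for line in body.splitlines() if not attribution.match(line)]
    return "\n".join(kept)


sample = "\n".join([
    "Any more thoughts on this subject?",
    "",
    "24 maj 2008 kl. 18.10 skrev Karl Wettin:",
    "> 24 maj 2008 kl. 13.13 skrev Grant Ingersoll:",
    ">> These are interesting.",
    "> If I end up using it for this, then I will.",
])
print(strip_attribution_lines(sample, ["Karl Wettin", "Grant Ingersoll"]))
```

Matching on author names rather than on phrases like "skrev" or "wrote" keeps the sketch independent of the mail client's language, at the cost of also dropping any ordinary line that happens to end with an author's name and a colon.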
