mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Clustering Demo
Date Thu, 05 Jun 2008 22:02:33 GMT

On Jun 5, 2008, at 3:09 PM, Karl Wettin wrote:

> Any more thoughts on this subject? I'll start coding this Tuesday.

+1.  Much easier to have thoughts on a patch.

>          karl
> 24 maj 2008 kl. 18.10 skrev Karl Wettin:
>> 24 maj 2008 kl. 13.13 skrev Grant Ingersoll:
>>> These are interesting. Perhaps you want to commit LUCENE-725?
>> If I end up using it for this, then I will. I have never tried it out and  
>> there are no test cases, so I have no clue how well it works. Nor  
>> are there any demonstrations of the features in the patch, but I  
>> suppose our demo could be used to produce that.
>> I'll train it with the last few paragraphs on a per-author basis  
>> to see how well it works.
>> We might want to wash out stuff like  "24 maj 2008 kl. 13.13 skrev  
>> Grant Ingersoll" too. That should not be too hard to figure out  
>> using the headers if the data is stored in a way that allows for  
>> navigation in the thread.
>> But I'm honestly not sure whether this is a preemptive, overkill solution.  
>> Perhaps algorithms automatically penalise unrelated text when given  
>> enough semiotic data. Perhaps attribute selection does the same job  
>> in a shorter time.
>>> I was wondering whether we should consider asking Lucene to put up  
>>> an Analyzer-only jar (i.e. a separate jar that combines the  
>>> Analyzer/TokenStream definitions with the contrib Analyzers  
>>> package.)  Of course, we may have uses for the rest of Lucene as  
>>> well, so maybe not.
>> To me that just sounds like more work for both projects.
>> It'd be great if we managed to put all future text analysis  
>> improvements as patches in Lucene rather than Mahout, but in the  
>> long run I think we'll be branching quite a bit of the Lucene  
>> analysis code to avoid spending time writing backwards compatible  
>> code to support Lucene users rather than Mahout users. See LUCENE-889.
>>    karl

Grant Ingersoll
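
The quoted suggestion of washing out reply-attribution lines such as "24 maj 2008 kl. 13.13 skrev Grant Ingersoll" by using the headers could be sketched roughly as below. This is only an illustration, not code from the thread or from Mahout: it assumes the author names have already been collected from the From headers of the messages in the thread, and the function names (attribution_regex, strip_attributions) and the two recognised verbs ("wrote", "skrev") are hypothetical choices. Real mail clients produce many more variants.

```python
import re


def attribution_regex(authors):
    """Build a pattern for lines that end in '<name> wrote:' or 'skrev <name>:'.

    `authors` is assumed to come from the From headers of the thread.
    Only the English and Swedish forms seen in this thread are covered.
    """
    names = "|".join(re.escape(a) for a in authors)
    return re.compile(rf"(?:\bskrev (?:{names})|(?:{names}) wrote):?\s*$")


def strip_attributions(body, authors):
    """Return `body` without lines that look like reply attributions."""
    if not authors:
        # Without known author names there is nothing reliable to match on.
        return body
    pattern = attribution_regex(authors)
    kept = [line for line in body.splitlines() if not pattern.search(line)]
    return "\n".join(kept)
```

Matching only at the end of the line keeps ordinary sentences that merely mention an author, while quoted attribution lines (prefixed with ">") are still caught. Whether this step is needed at all is the open question raised in the thread, since attribute selection may already discount such text.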
