mahout-user mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Clustering of text data on external categories
Date Fri, 11 Oct 2013 14:48:20 GMT
Just a hint - if you're using Solr/Lucene, then you should (probably)
also disable field norms (in Solr, omitNorms="true" on the field), so
that each category is scored equally regardless of the length of its
content. You can also add boosts to individual terms at query time, so
that when you have a document that mentions "selling" more frequently
you can query for: selling^1.5 payment^0.5, etc.

There are virtually no limits on how you can score/boost terms, so
experiment all you like :)

Dawid
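
A minimal sketch of the query-time boosting described above, in Python.
The helper name boosted_query and the boost formula (term count divided
by the mean term count) are assumptions made for illustration; they are
not part of Solr or Lucene. Only the term^boost syntax comes from the
Lucene query parser.

```python
import re
from collections import Counter


def boosted_query(text, stopwords=frozenset()):
    """Build a Lucene-style query string with per-term boosts.

    Hypothetical weighting: boost = term count / mean term count,
    so terms mentioned more often than average get a boost above 1.
    """
    # Keep only lowercase alphanumeric tokens, which also avoids
    # query-syntax characters that would otherwise need escaping.
    terms = [t for t in re.findall(r"[a-z0-9]+", text.lower())
             if t not in stopwords]
    counts = Counter(terms)
    if not counts:
        return ""
    mean = sum(counts.values()) / len(counts)
    # Most frequent terms first; terms are space-separated, so the
    # engine's default operator (OR, as suggested in the thread) applies.
    return " ".join(f"{term}^{count / mean:.2f}"
                    for term, count in counts.most_common())


if __name__ == "__main__":
    print(boosted_query("selling selling selling payment"))
```

The resulting string could be sent as the q parameter of a Solr request.
Other weighting schemes may work just as well, which is the point of the
advice to experiment.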

On Fri, Oct 11, 2013 at 4:42 PM, Jens Bonerz <jbonerz@googlemail.com> wrote:
> what a nice idea :-) really like that approach
>
>
> 2013/10/11 Ted Dunning <ted.dunning@gmail.com>
>
>> You don't need Mahout for this.
>>
>> A very easy way to do this is to gather all the words for each category
>> into a document.  Thus:
>>
>> CatA:selling buying sales payment
>> CatB:gathering collecting
>> CatC:information data info
>>
>> Then put these into a text retrieval engine so that you have one document
>> per category.
>>
>> When you get a new document to categorize, just use the document as a query
>> and you will get a list of possible categories back.  Make sure you set the
>> default query mode to OR for this.
>>
>> See http://wiki.apache.org/solr/SolrQuerySyntax for more on the syntax.
>>
>>
>>
>> On Fri, Oct 11, 2013 at 5:04 AM, Kasi Subrahmanyam
>> <kasisubbu440@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I have a problem that I would like to solve using Mahout clustering.
>> >
>> > I have input text documents with data like below.
>> >
>> > Document1: This is the first document of selling information.
>> > Document2: This is the second document of gathering information.
>> >
>> > I also have a lookup file with data like below:
>> > selling:CatA
>> > gathering:CatB
>> > information:CatC
>> >
>> > Now I would like to cluster the documents, with the output being generated as:
>> > Document1:CatA,CatC
>> > Document2:CatB,CatC
>> >
>> > Please let me know how to achieve this.
>> >
>> > Thanks,
>> > Subbu
>> >
>>
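
The one-document-per-category idea from Ted's message can be tried out
without a search engine. The sketch below is a plain-Python stand-in,
not Solr's actual scoring: it counts how many query terms appear in each
category's word list, with OR semantics and no length normalization.
The names CATEGORY_DOCS, tokenize and categorize are hypothetical; the
category words and sample documents are the ones from the thread.

```python
import re
from collections import Counter

# One bag of words per category, as in the quoted message.
CATEGORY_DOCS = {
    "CatA": "selling buying sales payment",
    "CatB": "gathering collecting",
    "CatC": "information data info",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def categorize(text, category_docs=CATEGORY_DOCS):
    """Use the document as an OR query against the category documents.

    Returns (category, score) pairs for every category that shares at
    least one term with the document, best score first.
    """
    query_counts = Counter(tokenize(text))
    scores = {}
    for name, doc in category_docs.items():
        vocab = set(tokenize(doc))
        # Score = number of query-term occurrences found in the category.
        score = sum(count for term, count in query_counts.items()
                    if term in vocab)
        if score > 0:
            scores[name] = score
    # Sort by descending score, then by category name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(categorize("This is the first document of selling information."))
    print(categorize("This is the second document of gathering information."))
```

A real retrieval engine would add its own term weighting on top of
this, so the ranking it returns may differ from these raw match counts.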
