mahout-user mailing list archives

From "Miles Osborne" <mi...@inf.ed.ac.uk>
Subject Re: text clustering noob
Date Wed, 04 Jun 2008 09:26:33 GMT
if you have:

--a set of snippets
--a set of articles

and for each snippet, you want to find the `matching` set of articles, then
you could:

--treat this as an IR task (a snippet becomes a query)

--treat this as co-clustering (eg http://citeseer.ist.psu.edu/447871.html)

nutch could do the first for you; right now there is no support in mahout
that i know of for co-clustering
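
to make the IR option concrete, here is a rough sketch in plain Python. it is not Nutch or Mahout code; every name in it (`match_snippets`, `min_score`, and the helpers) is made up for illustration. each snippet is treated as a query, articles are ranked by TF-IDF cosine similarity, and a snippet whose best score falls below an assumed threshold is left unassigned, which is one simple way to handle the "noise" case:

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of ASCII letters/digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    # Smoothed inverse document frequency computed over the articles.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    # Sparse TF-IDF vector; terms never seen in any article are dropped.
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def match_snippets(snippets, articles, min_score=0.1):
    # For each snippet, return the index of the best-matching article,
    # or None when the best score is below min_score (treated as noise).
    # min_score is a hypothetical cut-off and would need tuning on real data.
    idf = build_idf(articles)
    article_vecs = [tfidf(a, idf) for a in articles]
    matches = []
    for snippet in snippets:
        query = tfidf(snippet, idf)
        scores = [cosine(query, vec) for vec in article_vecs]
        best = max(range(len(articles)), key=scores.__getitem__)
        matches.append(best if scores[best] >= min_score else None)
    return matches
```

snippets that map to the same article index then form one group. a real search engine adds stemming, stop words and an inverted index on top of this, so treat it as a toy version of the idea only.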

Miles

2008/6/4 Marcus Persson Lindqvist <marcus.persson@gmail.com>:

> Hi list!
>
> I've been looking at mahout since the start and am very excited. However,
> I'm an ML noob and need some introductory pointers before I can start playing.
>
> What I want to do is fairly simple: I have a small set of text snippets which
> I need to match to a smaller set of articles, so that an article consists of
> one or more of the text snippets. So I need to group those snippets into
> articles. Preferably, I would also like to be able to detect "noise" (a snippet
> that has too little or too dirty information and is not classified as an article).
>
> I have access to large training sets of "complete" articles.
>
> Now, has anyone got any tips on how to achieve this? Which of the algos
> discussed here would be sufficient?
>
> Any help much appreciated.
>
> /Marcus
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
