mahout-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Clustering single doc as multiple docs
Date Fri, 30 Apr 2010 17:10:36 GMT
Hi Grant,

You are probably right.
What I want is to use my Mahout setup to extract topics from a single
document.
So, in popular terms, I am trying to do topic extraction via document
clustering.
Does it make sense to split a doc into sub-docs so that I can leverage
the clustering algorithm and thus find the key topics of the
document?
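
To make the question concrete, here is a minimal sketch of the kind of split
I mean. It assumes paragraphs are separated by blank lines and that a sentence
ends at ".", "!" or "?" followed by whitespace, which will be wrong for
abbreviations and the like. The function names are hypothetical and not part
of Mahout; each returned string would still have to be turned into an input
vector separately.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs on one or more blank lines (assumed convention)."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse hard line wraps so each paragraph becomes a single line.
    return [" ".join(p.split()) for p in parts if p.strip()]


def split_sentences(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


doc = "First point. Second point.\n\nAnother paragraph\nwrapped over two lines."
sub_docs = split_paragraphs(doc)          # one sub-doc per paragraph
sentences = split_sentences(sub_docs[0])  # or one sub-doc per sentence
```

Paragraphs give fewer, longer sub-docs; sentences give many short ones with
very sparse vectors, so the choice probably matters for the clustering.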

Best regards,
Bogdan

On Fri, Apr 30, 2010 at 6:18 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> This strikes me a little bit as an XY problem:
> http://people.apache.org/~hossman/#xyproblem
>
> Perhaps it would be helpful if you could back up a little and describe the
> higher level problem you are trying to solve.  You certainly can split up
> your documents and then cluster them, but I'm not sure that is actually
> going to give you what you need.
>
> Cheers,
> Grant
>
> On Apr 30, 2010, at 5:29 AM, Bogdan Vatkov wrote:
>
> > Hi,
> >
> > I would like to run clustering on a single document and get multiple
> > clusters out of it.
> > I guess I have to find a way to split the doc into multiple docs / input
> > vectors, but I am wondering if there are any best practices on how to do
> > the split.
> > Should I derive vectors based on sentences or paragraphs? Is there a
> > paragraph boundary detection tool around?
> > Any recommendations will be appreciated.
> >
> > Best regards,
> > Bogdan
>
>
>