mahout-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Clustering single doc as multiple docs
Date Sat, 01 May 2010 11:27:42 GMT
Thanks Ted! That was what I needed!

On Fri, Apr 30, 2010 at 10:21 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Yes.  Splitting by paragraph should work fine (been there, done that).
>
> Splitting by sentence works well if you do something like SVD to smooth
> over the fact that you have few words per sentence.
>
> Splitting by paragraph is pretty easy, but corpus specific.  For plain text,
> try looking for blank lines.  For HTML make a list of breaking markup and
> insert split points wherever you find those.  For other formats you will
> need to put on your thinking cap.
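
The blank-line and breaking-markup approach described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the function names and the `BREAKING_TAGS` list are hypothetical, treating `<br>` as a paragraph break is an assumption that will not suit every corpus, and HTML entities are left undecoded.

```python
import re

# Hypothetical list of "breaking" tags; adjust it for your own corpus.
BREAKING_TAGS = ("p", "br", "div", "li", "tr", "blockquote",
                 "h1", "h2", "h3", "h4", "h5", "h6")

# Matches an opening or closing breaking tag, with optional attributes.
_BREAK_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(BREAKING_TAGS),
                       re.IGNORECASE)
# Matches any remaining tag so it can be stripped.
_TAG_RE = re.compile(r"<[^>]+>")


def split_paragraphs(text):
    """Split plain text into paragraphs at runs of blank lines."""
    # A blank line is a line that holds nothing but whitespace.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Collapse line wraps inside each paragraph into single spaces.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def split_html_paragraphs(html):
    """Turn breaking markup into blank lines, then split as plain text."""
    marked = _BREAK_RE.sub("\n\n", html)
    stripped = _TAG_RE.sub("", marked)
    return split_paragraphs(stripped)


if __name__ == "__main__":
    sample = "First paragraph,\nwrapped.\n\nSecond paragraph."
    for paragraph in split_paragraphs(sample):
        print(paragraph)
```

Each returned paragraph can then be fed to the clusterer as its own small document.
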
>
> Sentence splitting is easy to do 90% correctly, hard to do better than 99%
> especially in some domains.  For your purposes, 90% is probably fine.  Start
> with the simplest possible case and add a few special cases and you will be
> set.  There may be usable software to be found on the net, but your needs
> are very modest.
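
A minimal sketch of the "simplest possible case plus a few special cases" idea might look like the following. The boundary rule (sentence-ending punctuation, whitespace, then a capital letter or opening quote) and the `ABBREVIATIONS` set are assumptions chosen for illustration, and the names are hypothetical. It will still get some cases wrong, for example a sentence that genuinely ends in "etc.".

```python
import re

# Hypothetical special cases: abbreviations that should not end a sentence.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "e.g", "i.e", "etc", "vs"}

# Simplest case: ., ! or ? followed by whitespace and a capital letter,
# an opening quote, or an opening parenthesis.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")


def _ends_with_abbreviation(sentence):
    """True if the last word of the sentence is a known abbreviation."""
    last_word = sentence.split()[-1].rstrip(".").lower()
    return last_word in ABBREVIATIONS


def split_sentences(text):
    """Split text into sentences using one rule and one special case."""
    normalized = " ".join(text.split())  # collapse line wraps and extra spaces
    sentences = []
    for piece in _BOUNDARY_RE.split(normalized):
        if not piece:
            continue
        if sentences and _ends_with_abbreviation(sentences[-1]):
            # The previous split was a false boundary after an abbreviation.
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He was late!"):
        print(sentence)
```

New special cases can be added to the abbreviation set as they turn up in the corpus.
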
>
> Good luck!
>
> Let us know how it goes.
>
> On Fri, Apr 30, 2010 at 10:32 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
>
> > Btw, why do you think splitting and clustering won't work? Has anybody
> > tried this?
> > I am not sure it will be successful, but I also have no argument for why
> > it should not lead to a meaningful result.
> > If I split a doc per sentence it might not give good results, but if I use
> > larger pieces, e.g. paragraphs, it might give some topics (sets of keywords).
> > Has anyone tried something like this?
> >
>



-- 
Best regards,
Bogdan
