mahout-user mailing list archives

From syed kather <>
Subject Re: Clustering or Classification?
Date Wed, 01 Aug 2012 18:02:10 GMT
Hi Salman Mahmood,
    Why don't you try clustering first? Once you have applied high-level
clustering, check the top terms of each cluster. Set aside the clusters
that look good and look more closely at the ones that seem confused, until
all the clusters look fine. To make the clusters cleaner, I indexed all the
documents into Solr, because with Solr I could remove stop words and apply
a Snowball filter and the like.
Once you have identified all the clusters, verify that each cluster's top
terms are good. Then use the cluster points to split the documents
according to their cluster. You will now have the documents in groups,
which you can use as the training set for classification.
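
A minimal sketch of the "split the documents by cluster, then use the groups
as training data" step, in plain Python rather than Mahout. The `documents`
and `assignments` dictionaries are hypothetical stand-ins for the indexed
documents and the cluster points output, and the `approved` labels are assumed
to be assigned by hand after reviewing each cluster's top terms.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: in practice these would come from the indexed
# documents and from the clustering output (document id -> cluster id).
documents = {
    "d1": "apple launches new iphone",
    "d2": "iphone sales lift apple shares",
    "d3": "google updates search ranking",
    "d4": "google search ads revenue grows",
    "d5": "stocks fall as investment slows",
}
assignments = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 2}


def group_by_cluster(assignments):
    """Map each cluster id to the sorted list of document ids assigned to it."""
    groups = defaultdict(list)
    for doc_id, cluster_id in sorted(assignments.items()):
        groups[cluster_id].append(doc_id)
    return dict(groups)


def top_terms(doc_ids, documents, n=3):
    """Return the n most frequent terms across the given documents."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(documents[doc_id].split())
    return [term for term, _ in counts.most_common(n)]


# Labels chosen by hand after checking the top terms; cluster 2 is left
# out because it was judged too general to stand for a single company.
approved = {0: "Apple", 1: "Google"}


def build_training_set(assignments, documents, approved):
    """Turn the approved clusters into (text, label) training pairs."""
    training = []
    for cluster_id, doc_ids in group_by_cluster(assignments).items():
        if cluster_id in approved:
            for doc_id in doc_ids:
                training.append((documents[doc_id], approved[cluster_id]))
    return training


training_set = build_training_set(assignments, documents, approved)
```

The resulting `training_set` is a list of (text, label) pairs that could be
converted into whatever input format the chosen classifier expects.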

I hope that makes sense. This is the approach I followed. Let me know if
you have other ideas.
Syed Abdul kather
Sent from Samsung S3
On Aug 1, 2012 10:38 PM, "Salman Mahmood" <> wrote:

> Hi all,
> I am stuck deciding whether to apply classification or clustering to the
> data set I have. The more I think about it, the more confused I get. Here's
> what I am confronted with.
> I have news documents (around 3000 and continuously increasing)
> containing news about companies, investment, stocks, the economy, quarterly
> income, etc. My goal is to sort the news so that I know which news item
> corresponds to which company. E.g., for the news item "Apple
> launches new iphone", I need to associate the company Apple with it. A
> particular news item/document contains only a 'title' and a 'description',
> so I have to analyze the text to find out which company the news
> refers to. It could be multiple companies too.
> To solve this, I turned to Mahout.
> I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel',
> etc. as top terms in my clusters, and from there I would know that the news
> in a cluster corresponds to its cluster label, but things turned out a bit
> different.
> I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal',
> 'shares', 'street', 'olympics' and lots of other terms as the top ones
> (which makes sense, as clustering algorithms look for common terms). Although
> there were some 'Apple' clusters, the news items associated with them were
> very few. I thought maybe clustering is not suited to this kind of problem,
> as much of the company news goes into more general clusters (investment,
> profit) instead of the specific company cluster (Apple).
> I started reading about classification, which requires training data. The
> name was convincing too, as I actually want to 'classify' my news items into
> 'company names'. As I read on, I got the impression that the name
> classification is a bit deceiving and the technique is used more for
> prediction than for classification. My other confusion
> was: how can I prepare training data for news documents? Let's
> assume I have a list of companies that I am interested in. I write a
> program to produce training data for the classifier. The program checks
> whether the news title or description contains the company name 'Apple';
> if so, it is a news story about Apple. Is this how I can prepare training
> data? (Of course, I read that training data is actually a set of predictors
> and target variables.) If so, then why should I use Mahout classification in
> the first place? I should ditch Mahout and instead use this little program
> that I wrote for the training data (which actually does the classification).
> You can see how confused I am about how to address this issue. Another
> thing that concerns me is whether it is possible to make a system so
> intelligent that, if the news says 'iphone sales at a record high' without
> using the word 'Apple', the system can still classify it as news related to
> Apple.
> Thank you in advance for pointing me in the right direction.
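
The two questions above (is name matching enough, and can a story that never
says 'Apple' still be assigned to Apple) can be illustrated with a small
sketch. It uses plain Python and a hand-written multinomial Naive Bayes with
add-one smoothing, not Mahout's API; the company list, the sample headlines,
and all function names are assumptions made up for illustration. Name matching
only produces the training labels; the classifier then learns co-occurring
words such as 'iphone' and can label text that contains no company name.

```python
import math
from collections import Counter, defaultdict

# Hypothetical list of companies of interest; only the bare name is matched.
COMPANIES = ["apple", "google"]


def label_by_name(text, companies=COMPANIES):
    """Return every company whose name appears as a word in the text."""
    words = set(text.lower().split())
    return [company for company in companies if company in words]


def train(labeled_docs):
    """Collect the per-label counts needed by multinomial Naive Bayes."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest add-one-smoothed log probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up headlines standing in for the real news items.
news = [
    "Apple launches new iphone",
    "Apple reports record iphone sales",
    "Google updates search ranking",
    "Google search ads revenue grows",
]

# Step 1: name matching gives the training labels (one pair per match).
labeled = [(text, label) for text in news for label in label_by_name(text)]

# Step 2: the trained model can label text with no company name in it.
model = train(labeled)
prediction = classify("iphone sales at a record high", model)
```

In this toy data, `label_by_name` returns nothing for 'iphone sales at a
record high', while the trained model assigns it to 'apple' because 'iphone',
'sales' and 'record' appeared only in Apple-labeled headlines. A story about
several companies would need one decision per company rather than a single
best label, which this sketch does not attempt.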
