mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Regarding classification of URL's
Date Tue, 01 Mar 2011 17:36:44 GMT
Scraping or spidering is, indeed, the first step.  Associated with each URL,
you should retain the plain text (without markup), the domain name, all
anchor text for links pointing to each page and a small neighborhood of text
around each link.
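
As a rough illustration of the per-URL record described above, the sketch below uses only the Python standard library. The names `PageFeatures`, `extract`, and `window` are hypothetical and chosen for this example. It records the anchor text and surrounding words of each page's outgoing links; gathering the anchor text that points at a given page would then mean grouping these link records by `href` across the whole crawl, which is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageFeatures(HTMLParser):
    """Collects plain-text words and the anchor text of each outgoing link."""

    def __init__(self):
        super().__init__()
        self.words = []    # plain-text tokens with markup removed
        self.links = []    # [href, anchor tokens, index of first anchor token]
        self._open = None  # the link currently being read, if any
        self._skip = 0     # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, [], len(self.words)]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a" and self._open is not None:
            self.links.append(self._open)
            self._open = None

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and stylesheet bodies
        tokens = data.split()
        self.words.extend(tokens)
        if self._open is not None:
            self._open[1].extend(tokens)


def extract(url, html, window=5):
    """Build one record per page; `window` is the context size in words."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    links = []
    for href, anchor, start in parser.links:
        lo = max(0, start - window)
        hi = start + len(anchor) + window
        links.append({
            "href": href,
            "anchor": " ".join(anchor),
            "context": " ".join(parser.words[lo:hi]),
        })
    return {
        "url": url,
        "domain": urlparse(url).netloc,
        "text": " ".join(parser.words),
        "links": links,
    }
```

Calling `extract(url, html)` on each fetched page yields a dictionary with the URL, its domain, the plain text, and a list of link records that can be stored alongside the crawl.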

From there, you can use the Naive Bayes classifiers as Vineet suggests or
you can use the SGD classifiers.  The SGD classifiers are more flexible but
performance in terms of accuracy should be similar.  The SGD classifiers are
significantly easier to integrate into other code.
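
To make the SGD idea concrete, here is a minimal sketch of a two-class logistic regression trained by stochastic gradient descent on bag-of-words counts. It is plain Python and does not use Mahout's API; the function names, the learning rate, the epoch count, and the toy training sentences are all assumptions made for illustration.

```python
import math


def features(text):
    """Map text to a sparse {token: count} bag of words."""
    vec = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec


def predict(weights, vec):
    """Probability that the example belongs to the positive class."""
    score = sum(weights.get(tok, 0.0) * v for tok, v in vec.items())
    score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=20, rate=0.5):
    """examples: list of (text, label) pairs with label 1 or 0."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            vec = features(text)
            error = label - predict(weights, vec)
            # One SGD step on the log-loss for this single example.
            for tok, v in vec.items():
                weights[tok] = weights.get(tok, 0.0) + rate * error * v
    return weights


# Toy labelled data: 1 = "cricket", 0 = anything else (hypothetical).
examples = [
    ("cricket batsman wicket innings", 1),
    ("bowler wicket cricket over", 1),
    ("election parliament vote minister", 0),
    ("minister budget vote policy", 0),
]
weights = train(examples)
```

After training, `predict(weights, features(page_text))` returns a score between 0 and 1 for the positive class. Handling several categories would need one such model per category or a multinomial variant, neither of which is shown here.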

You will need to have labels on a fair number of pages from each category.
 If you can have users tag these pages, that might be helpful.

If you have user interaction logs, you can also use those.

On Tue, Mar 1, 2011 at 3:57 AM, vineet yadav <vineet.yadav.iiit@gmail.com> wrote:

> Hi Arjun,
> you need to scrape the content from the website for a given URL, and then
> prepare training datasets from the scraped content for Bayesian
> classification.
> Also check out the Mahout twenty newsgroups example for reference:
> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
> Thanks
> Vineet Yadav
>
> On Tue, Mar 1, 2011 at 5:05 PM, Arjun Kumar Reddy
> <charjunkumar.reddy@iiitb.net> wrote:
> > Hi list,
> >
> > I am a newbie in mahout and I want to know some details regarding this
> > project.
> >
> > I am in need of a classification tool which gives me the category to which
> > the URL or content belongs.
> >
> > For example, if I give this particular URL
> >
> > http://www.espncricinfo.com/icc_cricket_worldcup2011/content/current/player/49764.html
> >
> > it should give me the category as "cricket".
> >
> > I was able to do this with other existing APIs like alchemy, evri, textwise
> > etc. and I am looking for something better in terms of performance.
> >
> > Could anyone please help me with how I can use this mahout tool for classifying
> > the documents?
> >
> >
> > Thanks and regards,
> > Ch. Arjun Kumar Reddy,
> > International Institute of Information Technology – Bangalore (IIITB),
> > 26/C, Electronics City, Hosur Road,
> > Bangalore 560 100
> > Ph: 8800710999
> >
>
