nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Filter spam URLs
Date Fri, 07 Dec 2007 13:51:59 GMT
Ned Rockson wrote:
> I've been searching for a bit on the forums to see if anyone is in the 
> process of producing a spam filter heuristic for URLs.  I assume that 
> most spam is nondeterministic, but after a crawl of ~50M URLs, there are 
> a bunch that are obviously spam because their URLs are simply 
> nonsensical (like I would automatically filter 
> out).  Is anyone currently working on this or has there been any effort 
> in the past?  Also, does anyone know of any literature published about 
> this?  A quick google search netted only email spam filters using naive 
> bayes.

If you have an ACM Library subscription, this is a good source for 
published papers on the subject. Similarly Citeseer, although the papers 
there tend to be older.

Apart from that, I sometimes use a heuristic that all-numeric (or mostly 
numeric) url components indicate spam links. This is easy to implement as a 
URLFilter, in addition to other simple checks (e.g. max. url length, max. 
number of path levels, presence of special characters, abundance of 
non-plain-text looking sections, ...).
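
The URL checks above can be sketched roughly as follows. This is plain 
Python rather than an actual Nutch URLFilter plugin; it only borrows the 
convention of returning the URL to accept it and None to reject it. All 
names, thresholds and the special-character set are assumptions made for 
illustration and would need tuning against a real crawl.

```python
from urllib.parse import urlsplit

# Illustrative assumptions, not values taken from Nutch or from any crawl.
MAX_URL_LENGTH = 200
MAX_PATH_LEVELS = 8
MIN_COMPONENT_LENGTH = 6      # ignore short components such as "2007"
MAX_NUMERIC_RATIO = 0.8       # "mostly numeric" cutoff
SPECIAL_CHARS = set("*!@$'(),;")


def mostly_numeric(component):
    """True if a long-enough component consists mostly of digits."""
    if len(component) < MIN_COMPONENT_LENGTH:
        return False
    alnum = [c for c in component if c.isalnum()]
    if not alnum:
        return False
    digits = sum(c.isdigit() for c in alnum)
    return digits / len(alnum) >= MAX_NUMERIC_RATIO


def filter_url(url):
    """Return the URL if it passes every check, or None to reject it."""
    if len(url) > MAX_URL_LENGTH:
        return None
    parts = urlsplit(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > MAX_PATH_LEVELS:
        return None
    if any(c in SPECIAL_CHARS for c in parts.path):
        return None
    # Check host labels (without the top-level domain) and path segments.
    labels = host.split(".")[:-1]
    if any(mostly_numeric(c) for c in labels + segments):
        return None
    return url
```

Note that a numeric check on path segments will also reject legitimate 
date-style or ID-style paths longer than the minimum length, so it may be 
safer to apply it to host names only.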

Other techniques depend on link graph analysis - especially interesting 
is to collect per-host or per-domain or per-subdomain link statistics, 
both the outgoing and incoming. This requires writing a relatively 
simple map-reduce job to aggregate the results per-host from an existing 

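The aggregation step could look roughly like the sketch below. It is 
plain Python that imitates the map, shuffle and reduce phases in memory, 
not a Hadoop job, and it assumes the links are already available as 
(source url, target url) pairs. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Host part of a URL, or an empty string if it cannot be parsed."""
    return urlsplit(url).hostname or ""


def map_link(src_url, dst_url):
    """Map step: emit (host, (outgoing, incoming)) records for one link."""
    src, dst = host_of(src_url), host_of(dst_url)
    if not src or not dst or src == dst:
        return  # skip malformed links and links within the same host
    yield src, (1, 0)
    yield dst, (0, 1)


def reduce_counts(values):
    """Reduce step: sum the outgoing and incoming counts for one host."""
    out_links = sum(v[0] for v in values)
    in_links = sum(v[1] for v in values)
    return out_links, in_links


def host_link_stats(links):
    """Run map and reduce over an iterable of (src_url, dst_url) pairs."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for src_url, dst_url in links:
        for host, value in map_link(src_url, dst_url):
            grouped[host].append(value)
    return {host: reduce_counts(vals) for host, vals in grouped.items()}
```

Grouping per domain or per subdomain instead of per host would only 
change host_of().
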
You can use the results in many interesting ways - here's a broad 
overview of some strategies: you could detect dense link communities, 
which may indicate spammy reciprocal linking, or detect domains with an 
abundance of links to / from known spam sites, etc. This data could 
then be used on the fly (in a URLFilter), or to flag existing urls in 
crawldb as spammy. Then, you could implement a scoring filter, which 
uses such flags to carry around a "spam score", in order to poison the 
score of pages linked from known spam pages.
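
As a rough illustration of the flagging and score-poisoning idea, the 
sketch below flags hosts whose outgoing links mostly point at known spam 
hosts, then lowers a page score for each flagged host linking to it. It 
is plain Python and not a Nutch scoring filter; the function names, the 
0.5 threshold and the 0.5 penalty are all assumptions.

```python
def spam_link_fraction(out_links, known_spam):
    """Per host, the share of outgoing links that point at known spam hosts."""
    fractions = {}
    for host, targets in out_links.items():
        if targets:
            hits = sum(1 for t in targets if t in known_spam)
            fractions[host] = hits / len(targets)
    return fractions


def flag_spammy(out_links, known_spam, threshold=0.5):
    """Known spam hosts plus hosts linking mostly to known spam hosts."""
    flagged = set(known_spam)
    for host, frac in spam_link_fraction(out_links, known_spam).items():
        if frac >= threshold:
            flagged.add(host)
    return flagged


def poisoned_score(score, linking_hosts, flagged, penalty=0.5):
    """Multiply the score by `penalty` once per flagged host linking in."""
    for host in linking_hosts:
        if host in flagged:
            score *= penalty
    return score
```

For example, with out_links mapping each host to the list of hosts it 
links to, flag_spammy() gives the set to store as flags, and 
poisoned_score() shows how such flags could reduce the score of pages 
linked from flagged hosts.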

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
