nutch-dev mailing list archives

From Ned Rockson <...@discoveryengine.com>
Subject Filter spam URLs
Date Fri, 07 Dec 2007 01:14:11 GMT
I've been searching the forums for a while to see whether anyone is 
working on a spam filter heuristic for URLs.  I assume that most spam 
cannot be detected deterministically, but after a crawl of ~50M URLs 
there are a bunch that are obviously spam because their URLs are simply 
nonsensical (for example, I would automatically filter out something 
like 01118273.domain.com).  Is anyone currently working on this, or has 
there been any effort in the past?  Also, does anyone know of any 
published literature on this?  A quick Google search turned up only 
email spam filters using naive Bayes.
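
To make the kind of check I have in mind concrete, here is a rough sketch in Python. It is only an illustration, not an existing Nutch filter: the function name looks_like_spam_host and the max_digit_ratio threshold are made up for the example, and the rule (flag hosts whose leftmost label is mostly digits) is just one guess at what "nonsensical" could mean.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_like_spam_host(url, max_digit_ratio=0.5):
    """Hypothetical heuristic: flag URLs whose leftmost host label is mostly digits.

    The threshold is an assumption chosen for illustration, not a tuned value.
    The URL must include a scheme (e.g. "http://") so the host can be parsed.
    """
    host = urlsplit(url).hostname or ""

    # Plain IP addresses are all digits by nature, so do not flag them here.
    try:
        ipaddress.ip_address(host)
        return False
    except ValueError:
        pass

    labels = host.split(".")
    # Only consider hosts that actually have a subdomain (three or more labels).
    if len(labels) < 3:
        return False

    first = labels[0]
    if not first:
        return False
    digits = sum(ch.isdigit() for ch in first)
    return digits / len(first) > max_digit_ratio
```

With this sketch, looks_like_spam_host("http://01118273.domain.com/") is flagged, while an ordinary host such as http://www.domain.com/ is not. A real filter would presumably need more signals than this one, since legitimate sites can also use numeric subdomains.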
