nutch-dev mailing list archives

From: AJ Chen <anjun.c...@sbcglobal.net>
Subject: Re: Automating workflow using ndfs
Date: Fri, 02 Sep 2005 15:43:49 GMT
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it
seems that a new urlfilter is a good place to extend the inclusion-regex
capability.  The new urlfilter would be defined by the urlfilter.class
property, which gets loaded by the URLFilterFactory.
Regex is necessary because we want to include only URLs matching certain
patterns.
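
For concreteness, here is a minimal sketch of what such a filter might look
like, assuming the filter(String) contract described in the wiki: return the
URL to keep it, or null to drop it. The class name and the lookup helper are
hypothetical, and the exact interface package and wiring may differ between
the URLFilterFactory mechanism in the doc and the 0.7 codebase.

  import java.net.MalformedURLException;
  import java.net.URL;

  // Hypothetical sketch of a domain-based urlfilter. It assumes the
  // filter(String) contract from the wiki: return the URL to accept it,
  // or null to reject it. It would implement the URLFilter interface
  // (the interface's package name depends on the Nutch version).
  public class DomainURLFilter {

    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        // accept only URLs whose host is found in the inclusion table
        return isIncluded(host) ? urlString : null;
      } catch (MalformedURLException e) {
        return null;  // reject anything we cannot parse
      }
    }

    private boolean isIncluded(String host) {
      // hash-table lookup, sketched further down
      return false;
    }
  }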

Can anybody who has implemented a URLFilter plugin before share some
thoughts on this approach? I expect the new filter must have all the
capabilities that the current RegexURLFilter.java has so that it won't
require changes in any other classes. The difference is that the new
filter uses a hash table to efficiently look up the regexes for the
included domains (a large number of them!).
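
To make the hash-table idea concrete, here is a rough sketch (a hypothetical
class, not existing Nutch code): a map from domain to its compiled patterns,
so each URL only runs the handful of regexes registered for its own domain
instead of all 100,000.

  import java.util.HashMap;
  import java.util.Map;
  import java.util.regex.Pattern;

  // Hypothetical lookup table: domain -> the regexes that apply to it.
  // Only the patterns registered for a URL's own domain are ever checked.
  public class DomainRegexTable {

    private final Map<String, Pattern[]> table = new HashMap<String, Pattern[]>();

    public void add(String domain, Pattern[] patterns) {
      table.put(domain.toLowerCase(), patterns);
    }

    // true if the URL's host has an entry and one of its patterns matches
    public boolean accepts(String host, String url) {
      // walk from the full host down to shorter suffixes, DNS-style:
      // good.site.com -> site.com -> com
      String key = host.toLowerCase();
      while (key != null) {
        Pattern[] patterns = table.get(key);
        if (patterns != null) {
          for (int i = 0; i < patterns.length; i++) {
            if (patterns[i].matcher(url).find()) {
              return true;
            }
          }
          return false;  // domain is listed but no page-level pattern matched
        }
        int dot = key.indexOf('.');
        key = (dot < 0) ? null : key.substring(dot + 1);
      }
      return false;  // host not in the inclusion table at all
    }
  }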

BTW, I can't find the urlfilter.class property in any of the configuration
files in Nutch-0.7. Does the 0.7 version still support urlfilter extension?
Is there any difference relative to what's described in the
DissectingTheNutchCrawler doc cited above?

Thanks,
AJ

Earl Cahill wrote:

>>The goal is to avoid entering 100,000 regexes in
>>crawl-urlfilter.txt and checking ALL of these
>>regexes for each URL. Any comment?
>>
>
>Sure seems like just some hash lookup table could
>handle it.  I am having a hard time seeing when you
>really need a regex and a fixed list wouldn't do.
>Especially if you have a forward and maybe a backwards
>lookup as well in a multi-level hash, to perhaps
>include/exclude at a certain subdomain level, like
>
>include: com->site->good (for good.site.com stuff)
>exclude: com->site->bad (for bad.site.com)
>
>and kind of walk backwards, kind of like DNS.  Then
>you could just do a few hash lookups instead of
>100,000 regexes.
>
>I realize I am talking about host-level and not
>page-level filtering, but if you want to include
>everything from your 100,000 sites, I think such a
>strategy could work.
>
>Hope this makes sense.  Maybe I could write some code
>and see if it works in practice.  If nothing else,
>maybe the hash stuff could just be another filter
>option in conf/crawl-urlfilter.txt.
>
>Earl
>
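
For reference, a quick sketch of the backwards, DNS-like walk described
above (again a hypothetical class; a flat map keyed by domain suffix stands
in for the nested com->site->good structure, and the most specific suffix
wins):

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical sketch of the include/exclude-by-suffix idea: store a
  // decision per domain suffix and let the most specific suffix win,
  // walking backwards through the host the way DNS resolution does.
  public class SuffixDecisionTable {

    private final Map<String, Boolean> decisions = new HashMap<String, Boolean>();

    public void include(String domain) { decisions.put(domain.toLowerCase(), Boolean.TRUE); }
    public void exclude(String domain) { decisions.put(domain.toLowerCase(), Boolean.FALSE); }

    // e.g. include("good.site.com"); exclude("bad.site.com");
    public boolean accepts(String host) {
      String key = host.toLowerCase();
      while (key != null) {
        Boolean decision = decisions.get(key);
        if (decision != null) {
          return decision.booleanValue();  // most specific suffix wins
        }
        int dot = key.indexOf('.');
        key = (dot < 0) ? null : key.substring(dot + 1);
      }
      return false;  // no entry at any level: excluded by default
    }
  }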

-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
