nutch-dev mailing list archives

From: AJ Chen <anjun.c...@sbcglobal.net>
Subject: Re: Automating workflow using ndfs
Date: Fri, 02 Sep 2005 17:30:11 GMT
Matt,
This is great! It would be very useful to Nutch developers if your code
could be shared.  I'm sure quite a few applications will benefit from it
because it fills a gap between whole-web crawling and crawling a single
site (or a handful of sites).  I'll be interested in adapting your plugin
to Nutch conventions.
Thanks,
-AJ

Matt Kangas wrote:

> AJ and Earl,
>
> I've implemented URLFilters before. In fact, I have a  
> WhitelistURLFilter that implements just what you describe: a  
> hashtable of regex-lists. We implemented it specifically because we  
> want to be able to crawl a large number of known-good paths through  
> sites, including paths through CGIs. The hash is a Nutch ArrayFile,  
> which provides low runtime overhead. We've tested it on 200+ sites  
> thus far, and haven't seen any indication that it will have problems  
> scaling further.
>
> The filter and its supporting WhitelistWriter currently rely on a few  
> custom classes, but it should be straightforward to adapt to Nutch  
> naming conventions, etc. If you're interested in doing this work, I  
> can see if it's ok to publish our code.
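>
> Just to give a flavor of the approach, here's a rough sketch (not our
> actual code; the class and field names are purely illustrative, and the
> real table is backed by a Nutch ArrayFile rather than an in-memory
> map). It shows a host-keyed table of compiled regexes behind a
> URLFilter-style filter(String) method that returns the URL when one of
> that host's patterns matches, and null otherwise:
>
>   import java.net.MalformedURLException;
>   import java.net.URL;
>   import java.util.HashMap;
>   import java.util.List;
>   import java.util.Map;
>   import java.util.regex.Pattern;
>
>   /** Illustrative whitelist filter: host -> allowed URL patterns. */
>   public class WhitelistFilterSketch {
>
>     // host (lowercased) -> regexes describing known-good paths on that host
>     private final Map<String, List<Pattern>> patternsByHost =
>         new HashMap<String, List<Pattern>>();
>
>     /** Returns the URL if whitelisted, or null to drop it. */
>     public String filter(String urlString) {
>       try {
>         String host = new URL(urlString).getHost().toLowerCase();
>         List<Pattern> patterns = patternsByHost.get(host);
>         if (patterns == null) {
>           return null;                    // host is not whitelisted at all
>         }
>         for (Pattern p : patterns) {      // only this host's regexes run
>           if (p.matcher(urlString).find()) {
>             return urlString;
>           }
>         }
>       } catch (MalformedURLException e) {
>         // malformed URL: reject below
>       }
>       return null;
>     }
>   }
>
> The point is just that the per-URL cost is one hash lookup plus the
> handful of regexes registered for that host, rather than every regex in
> the whole list.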
>
> BTW, we're currently alpha-testing the site that uses this plugin,  
> and preparing for a public beta. I'll be sure to post here when we're  
> finally open for business. :)
>
> --Matt
>
>
> On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
>
>> From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
>> it seems that a new urlfilter is a good place to extend the
>> inclusion-regex capability.  The new urlfilter will be defined by the
>> urlfilter.class property, which gets loaded by the URLFilterFactory.
>> Regexes are necessary because you want to include URLs matching
>> certain patterns.
>>
>> Can anybody who has implemented a URLFilter plugin before share some
>> thoughts about this approach? I expect the new filter must have all
>> the capabilities that the current RegexURLFilter.java has, so that it
>> won't require changes in any other classes. The difference is that
>> the new filter uses a hash table to efficiently look up the regexes
>> for the included domains (a large number of them!).
>>
>> BTW, I can't find the urlfilter.class property in any of the
>> configuration files in Nutch-0.7. Does the 0.7 release still support
>> the urlfilter extension? Is there any difference relative to what's
>> described in the DissectingTheNutchCrawler doc cited above?
>>
>> Thanks,
>> AJ
>>
>> Earl Cahill wrote:
>>
>>
>>>> The goal is to avoid entering 100,000 regexes in
>>>> crawl-urlfilter.txt and checking ALL these regexes for each URL.
>>>> Any comment?
>>>>
>>>>
>>>
>>> Sure seems like just a hash lookup table could
>>> handle it.  I am having a hard time seeing when you
>>> really need a regex and a fixed list wouldn't do.  Especially if you
>>> have a forward and maybe a backward
>>> lookup as well in a multi-level hash, to perhaps
>>> include/exclude at a certain subdomain level, like
>>>
>>> include: com->site->good (for good.site.com stuff)
>>> exclude: com->site->bad (for bad.site.com)
>>>
>>> and kind of walk backwards, kind of like DNS.  Then
>>> you could just do a few hash lookups instead of
>>> 100,000 regexes.
>>>
>>> I realize I am talking about host and not page level
>>> filtering, but if you want to include everything from
>>> your 100,000 sites, I think such a strategy could
>>> work.
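>>>
>>> As a rough, untested sketch (all names made up), I'm picturing
>>> something like this: split the host on dots, walk from the TLD
>>> inward, and let the most specific include/exclude entry you hit win:
>>>
>>>   import java.util.HashMap;
>>>   import java.util.Map;
>>>
>>>   /** Illustrative DNS-style host list: longest matching suffix wins. */
>>>   public class HostListSketch {
>>>
>>>     // keys are reversed host suffixes, e.g. "com.site" or "com.site.bad"
>>>     private final Map<String, Boolean> rules = new HashMap<String, Boolean>();
>>>
>>>     public void include(String reversedSuffix) { rules.put(reversedSuffix, Boolean.TRUE); }
>>>     public void exclude(String reversedSuffix) { rules.put(reversedSuffix, Boolean.FALSE); }
>>>
>>>     /** Walk the host labels backwards, DNS-style; a few hash lookups, no regexes. */
>>>     public boolean allows(String host) {
>>>       String[] labels = host.toLowerCase().split("\\.");
>>>       String key = "";
>>>       boolean allowed = false;           // default: not included
>>>       for (int i = labels.length - 1; i >= 0; i--) {
>>>         key = (key.length() == 0) ? labels[i] : key + "." + labels[i];
>>>         Boolean rule = rules.get(key);
>>>         if (rule != null) {
>>>           allowed = rule.booleanValue();  // more specific entries override
>>>         }
>>>       }
>>>       return allowed;
>>>     }
>>>   }
>>>
>>> So include("com.site.good") and exclude("com.site.bad") would give you
>>> the good.site.com / bad.site.com behavior above with just a handful of
>>> lookups per URL.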
>>>
>>> Hope this makes sense.  Maybe I could write some code
>>> and see if it works in practice.  If nothing else,
>>> maybe the hash stuff could just be another filter
>>> option in conf/crawl-urlfilter.txt.
>>>
>>> Earl
>>>
>>>
>>
>> -- 
>> AJ (Anjun) Chen, Ph.D.
>> Canova Bioconsulting Marketing * BD * Software Development
>> 748 Matadero Ave., Palo Alto, CA 94306, USA
>> Cell 650-283-4091, anjun.chen@sbcglobal.net
>> ---------------------------------------------------
>>
>
> -- 
> Matt Kangas / kangas@gmail.com
>
>
>

-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------
