nutch-dev mailing list archives

From Anjun Chen <anjun.c...@sbcglobal.net>
Subject Re: Automating workflow using ndfs
Date Fri, 02 Sep 2005 19:45:21 GMT
I'm going to make a request in Jira now. -AJ

--- Matt Kangas <kangas@gmail.com> wrote:

> Great! Is there a ticket in JIRA requesting this feature? If not, we
> should file one and get a few votes in favor of it. AFAIK, that's the
> process for getting new features into Nutch.
> 
> On Sep 2, 2005, at 1:30 PM, AJ Chen wrote:
> 
> > Matt,
> > This is great! It would be very useful to Nutch
> developers if your  
> > code can be shared.  I'm sure quite a few
> applications will benefit  
> > from it because it fills a gap between whole-web
> crawling and  
> > single site (or a handful of sites) crawling. 
> I'll be interested  
> > in adapting your plugin to Nutch convention.
> > Thanks,
> > -AJ
> >
> > Matt Kangas wrote:
> >
> >
> >> AJ and Earl,
> >>
> >> I've implemented URLFilters before. In fact, I have a
> >> WhitelistURLFilter that implements just what you describe: a
> >> hashtable of regex-lists. We implemented it specifically because we
> >> want to be able to crawl a large number of known-good paths through
> >> sites, including paths through CGIs. The hash is a Nutch ArrayFile,
> >> which provides low runtime overhead. We've tested it on 200+ sites
> >> thus far, and haven't seen any indication that it will have problems
> >> scaling further.
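> >>
> >> Very roughly, the lookup has the shape of the sketch below. This is
> >> illustrative only, not our actual code: the class and method names
> >> are made up, and a plain in-memory HashMap stands in for the
> >> ArrayFile-backed table. In a real plugin the class would implement
> >> the Nutch URLFilter interface, whose filter(String) returns the URL
> >> if it is accepted or null if it is rejected.
> >>
> >> import java.net.URL;
> >> import java.util.ArrayList;
> >> import java.util.HashMap;
> >> import java.util.List;
> >> import java.util.Map;
> >> import java.util.regex.Pattern;
> >>
> >> // Sketch only -- not the real WhitelistURLFilter.
> >> public class WhitelistFilterSketch {
> >>
> >>   // host -> regexes describing the known-good paths on that host
> >>   private final Map<String, List<Pattern>> whitelist =
> >>       new HashMap<String, List<Pattern>>();
> >>
> >>   public void allow(String host, String pathRegex) {
> >>     List<Pattern> patterns = whitelist.get(host);
> >>     if (patterns == null) {
> >>       patterns = new ArrayList<Pattern>();
> >>       whitelist.put(host, patterns);
> >>     }
> >>     patterns.add(Pattern.compile(pathRegex));
> >>   }
> >>
> >>   // Mirrors URLFilter.filter(): return the URL if whitelisted, else null.
> >>   public String filter(String urlString) {
> >>     try {
> >>       URL url = new URL(urlString);
> >>       List<Pattern> patterns = whitelist.get(url.getHost()); // one hash lookup
> >>       if (patterns == null) return null;                     // host not listed
> >>       String path = url.getFile();                           // path + query, so CGI URLs work
> >>       for (Pattern p : patterns) {
> >>         if (p.matcher(path).matches()) return urlString;     // a known-good path
> >>       }
> >>     } catch (Exception e) {
> >>       // malformed URL: reject
> >>     }
> >>     return null;
> >>   }
> >> }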
> >>
> >> The filter and its supporting WhitelistWriter currently rely on a
> >> few custom classes, but it should be straightforward to adapt to
> >> Nutch naming conventions, etc. If you're interested in doing this
> >> work, I can see if it's ok to publish our code.
> >>
> >> BTW, we're currently alpha-testing the site that uses this plugin,
> >> and preparing for a public beta. I'll be sure to post here when
> >> we're finally open for business. :)
> >>
> >> --Matt
> >>
> >>
> >> On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
> >>
> >>
> >>> From reading
> >>> http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it seems
> >>> that a new urlfilter is a good place to extend the inclusion regex
> >>> capability. The new urlfilter will be defined by the urlfilter.class
> >>> property, which gets loaded by the URLFilterFactory. Regex is
> >>> necessary because you want to include urls matching certain
> >>> patterns.
> >>>
> >>> Can anybody who has implemented a URLFilter plugin before share
> >>> some thoughts about this approach? I expect the new filter must
> >>> have all the capabilities that the current RegexURLFilter.java has,
> >>> so that it won't require changes in any other classes. The
> >>> difference is that the new filter uses a hash table for efficiently
> >>> looking up the regexes for included domains (a large number!).
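> >>>
> >>> As I read the doc, the only contract the new filter has to honor to
> >>> be a drop-in replacement is something like the stub below. This is
> >>> just my reading -- the exact package and signature in 0.7 need to
> >>> be checked against the source -- but if it holds, a hash-based
> >>> filter can slot in without touching any of the calling classes:
> >>>
> >>> // Assumed shape of the filter contract (per DissectingTheNutchCrawler);
> >>> // verify the real interface name and package in the 0.7 source.
> >>> public interface URLFilterContract {
> >>>   // return the URL string to accept it, or null to drop it
> >>>   String filter(String urlString);
> >>> }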
> >>>
> >>> BTW, I can't find the urlfilter.class property in any of the
> >>> configuration files in Nutch-0.7. Does the 0.7 version still
> >>> support the urlfilter extension? Any difference relative to what's
> >>> described in the doc DissectingTheNutchCrawler cited above?
> >>>
> >>> Thanks,
> >>> AJ
> >>>
> >>> Earl Cahill wrote:
> >>>
> >>>
> >>>
> >>>>> The goal is to avoid entering 100,000 regexes in the
> >>>>> crawl-urlfilter.txt and checking ALL these regexes for each URL.
> >>>>> Any comment?
> >>>>
> >>>> Sure seems like just some hash lookup table could handle it. I am
> >>>> having a hard time seeing when you really need a regex and a fixed
> >>>> list wouldn't do. Especially if you have a forward and maybe a
> >>>> backwards lookup as well in a multi-level hash, to perhaps
> >>>> include/exclude at a certain subdomain level, like
> >>>>
> >>>> include: com->site->good (for good.site.com stuff)
> >>>> exclude: com->site->bad (for bad.site.com)
> >>>>
> >>>> and kind of walk backwards, kind of like DNS. Then you could just
> >>>> do a few hash lookups instead of 100,000 regexes.
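> >>>>
> >>>> Something like this little sketch is what I have in mind (purely
> >>>> illustrative, names made up): reverse the host into its labels,
> >>>> walk a nested map, and let the most specific include/exclude rule
> >>>> win.
> >>>>
> >>>> import java.util.HashMap;
> >>>> import java.util.Map;
> >>>>
> >>>> public class DomainTreeSketch {
> >>>>
> >>>>   static class Node {
> >>>>     Boolean include;                       // null = no rule at this level
> >>>>     Map<String, Node> children = new HashMap<String, Node>();
> >>>>   }
> >>>>
> >>>>   private final Node root = new Node();
> >>>>
> >>>>   // e.g. addRule("site.com", true); addRule("bad.site.com", false);
> >>>>   public void addRule(String host, boolean include) {
> >>>>     Node n = root;
> >>>>     String[] labels = host.split("\\.");
> >>>>     for (int i = labels.length - 1; i >= 0; i--) {   // walk com -> site -> bad
> >>>>       Node child = n.children.get(labels[i]);
> >>>>       if (child == null) {
> >>>>         child = new Node();
> >>>>         n.children.put(labels[i], child);
> >>>>       }
> >>>>       n = child;
> >>>>     }
> >>>>     n.include = Boolean.valueOf(include);
> >>>>   }
> >>>>
> >>>>   // A few hash lookups per host instead of 100,000 regex tests.
> >>>>   public boolean accept(String host) {
> >>>>     Node n = root;
> >>>>     boolean decision = false;                        // default: exclude
> >>>>     String[] labels = host.split("\\.");
> >>>>     for (int i = labels.length - 1; i >= 0 && n != null; i--) {
> >>>>       n = n.children.get(labels[i]);
> >>>>       if (n != null && n.include != null) {
> >>>>         decision = n.include.booleanValue();         // most specific rule wins
> >>>>       }
> >>>>     }
> >>>>     return decision;
> >>>>   }
> >>>>
> >>>>   public static void main(String[] args) {
> >>>>     DomainTreeSketch t = new DomainTreeSketch();
> >>>>     t.addRule("site.com", true);                     // include the site...
> >>>>     t.addRule("bad.site.com", false);                // ...but exclude one subdomain
> >>>>     System.out.println(t.accept("good.site.com"));   // true
> >>>>     System.out.println(t.accept("bad.site.com"));    // false
> >>>>     System.out.println(t.accept("other.com"));       // false
> >>>>   }
> >>>> }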
> >>>>
> >>>> I realize I am talking about host and not page level filtering,
> >>>> but if you want to include everything from your 100,000 sites, I
> >>>> think such a strategy could work.
> >>>>
> >>>> Hope this makes sense. Maybe I could write some code and see if
> >>>> it works in practice. If nothing else, maybe the hash stuff could
> >>>> just be another filter option in conf/crawl-urlfilter.txt.
> >>>>
> >>>> Earl
> >>>>
> >>>>
> >>>>
> >>>
> >>> -- 
> >>> AJ (Anjun) Chen, Ph.D.
> >>> Canova Bioconsulting Marketing * BD * Software Development
> >>> 748 Matadero Ave., Palo Alto, CA 94306, USA
> >>> Cell 650-283-4091, anjun.chen@sbcglobal.net
> >>>
> >>> ---------------------------------------------------
> >>>
> >>>
> >>
> >> -- 
> >> Matt Kangas / kangas@gmail.com
> >>
> >>
> >>
> >>
> >
> > -- 
> > AJ (Anjun) Chen, Ph.D.
> > Canova Bioconsulting Marketing * BD * Software Development
> > 748 Matadero Ave., Palo Alto, CA 94306, USA
> > Cell 650-283-4091, anjun.chen@sbcglobal.net
> >
> > ---------------------------------------------------
> >
> 
> --
> Matt Kangas / kangas@gmail.com
> 
> 
> 

