lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: SOLR + Nutch set up (UNCLASSIFIED)
Date Wed, 03 Aug 2016 23:07:20 GMT
Depending on your settings, Nutch does this as well. It is even possible to set different
increment and decrement factors per MIME type.
The algorithms are pluggable and can be overridden at any point of interest. You can go all the
way.
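
To illustrate the per-MIME-type idea, here is a minimal sketch in Python rather than Nutch's Java. The table contents, the names MIME_FACTORS and adjust_interval, and the chosen factors are all hypothetical; it does not reproduce any Nutch class or configuration file format. Only the 0.2 defaults come from the Javadoc quoted further down this thread.

```python
# Hypothetical per-MIME-type (inc_factor, dec_factor) pairs, for illustration only.
MIME_FACTORS = {
    "text/html": (0.2, 0.4),        # shorten quickly when HTML pages change
    "application/pdf": (0.4, 0.1),  # PDFs rarely change, so back off faster
}
DEFAULT_FACTORS = (0.2, 0.2)        # defaults quoted in the AdaptiveFetchSchedule Javadoc


def adjust_interval(interval, changed, mime_type):
    """Scale a fetch interval (seconds) using the factors for this MIME type."""
    inc, dec = MIME_FACTORS.get(mime_type, DEFAULT_FACTORS)
    if changed:
        return interval * (1.0 - dec)
    return interval * (1.0 + inc)
```

Unknown MIME types fall back to the default pair, so the table only needs entries for types that should behave differently.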
 
-----Original message-----
> From:Walter Underwood <wunder@wunderwood.org>
> Sent: Wednesday 3rd August 2016 20:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had not changed.
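
A rough sketch of the policy described here: reset the interval when the page changes, and back off exponentially up to a bound when it does not. This is not Ultraseek's actual implementation. The function name next_interval and the base, factor, and cap values are assumptions chosen for illustration.

```python
HOUR = 3600.0
DAY = 24 * HOUR


def next_interval(current, changed, base=HOUR, factor=2.0, cap=30 * DAY):
    """Return the next re-fetch interval in seconds.

    base, factor and cap are hypothetical values, not taken from any crawler.
    """
    if changed:
        # Reset to the base estimate instead of shortening gradually.
        return base
    # Bounded exponential backoff while the page stays unchanged.
    return min(current * factor, cap)


interval = HOUR
for _ in range(12):
    interval = next_interval(interval, changed=False)
# After enough unchanged fetches the interval sits at the cap.
```

A single observed change drops the interval back to base, which is the difference from multiplying by a decrement factor on each change.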
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscalone@gmail.com> wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> >> This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that have changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   - if SYNC_DELTA property is true, then:
> >>      - calculate a delta = fetchTime - modifiedTime
> >>      - try to synchronize with the time of change, by shifting the next
> >>      fetchTime by a fraction of the difference between the last modification
> >>      time and the last fetch time. I.e. the next fetch time will be set to fetchTime
> >>      + fetchInterval - delta * SYNC_DELTA_RATE
> >>      - if the adjusted fetch interval is bigger than the delta, then fetchInterval
> >>      = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> >> the algorithm, so that the fetch interval either increases or decreases
> >> infinitely, with little relevance to the page changes. Please use
> >> main(String[])
> >> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
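
The steps in the quoted Javadoc can be sketched as follows. This is a Python reading of the description above, not the Nutch source, and the real class may differ in details. The class and method names are made up for the sketch, all times are in seconds, and the sync_delta and sync_delta_rate defaults are assumptions because the quoted text does not give them.

```python
from dataclasses import dataclass

MINUTE = 60.0
DAY = 24 * 60 * MINUTE


@dataclass
class AdaptiveSchedule:
    inc_factor: float = 0.2          # default quoted above
    dec_factor: float = 0.2          # default quoted above
    min_interval: float = MINUTE     # default quoted above (1 minute)
    max_interval: float = 365 * DAY  # default quoted above (365 days)
    sync_delta: bool = False         # assumption: disabled in this sketch
    sync_delta_rate: float = 0.3     # assumption: illustrative value only

    def next_fetch(self, fetch_time, modified_time, interval, changed):
        """Return (next_fetch_time, new_interval), all in seconds."""
        if changed:
            interval *= 1.0 - self.dec_factor
        else:
            interval *= 1.0 + self.inc_factor

        ref_time = fetch_time
        if self.sync_delta:
            delta = fetch_time - modified_time
            # As worded in the quoted description: cap the interval at delta.
            if interval > delta:
                interval = delta
            # Shift the reference time toward the time of the last change.
            ref_time = fetch_time - delta * self.sync_delta_rate

        # Keep the interval within [min_interval, max_interval].
        interval = max(self.min_interval, min(self.max_interval, interval))
        return ref_time + interval, interval
```

With the defaults, a page on a 100-second interval moves to 80 seconds after a change and to 120 seconds after an unchanged fetch, so the interval shortens gradually rather than resetting.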
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood <wunder@wunderwood.org>:
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> >> in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> >> to Mike Lynch. He looked at me like I had three heads and didn’t even
> >> answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you use
> >> that, you’ll need to find a way to do that with another crawler.
> >> 
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wunder@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >> 
> >> 
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn.ctr@mail.mil> wrote:
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >>> 
> >>> We are currently using Ultraseek and looking to deprecate it in favor of Solr/Nutch.
> >>> Ultraseek runs all the time, auto-detects when pages have changed, and automatically reindexes them.
> >>> Is this possible with Solr/Nutch?
> >>> 
> >>> Thanks,
> >>> Kris
> >>> 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Kris T. Musshorn
> >>> FileMaker Developer - Contractor - Catapult Technology Inc.
> >>> US Army Research Lab
> >>> Aberdeen Proving Ground
> >>> Application Management & Development Branch
> >>> 410-278-7251
> >>> kris.t.musshorn.ctr@mail.mil
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> 
> >>> 
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >> 
> >> 
> 
> 
