lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: SOLR + Nutch set up (UNCLASSIFIED)
Date Wed, 03 Aug 2016 18:03:26 GMT
That’s good news.

Nutch's schedule should reset the interval estimate when a page changes instead of slowly shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had not
changed.
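
To make the difference concrete, here is a rough Python sketch of the two update rules. It is neither Nutch nor Ultraseek code. The first function follows the quoted Javadoc, reading "by a factor of" as multiplying the interval by (1 - DEC_FACTOR) or (1 + INC_FACTOR) and ignoring SYNC_DELTA. The second is the reset-plus-bounded-backoff idea; its `initial` and `multiplier` parameters are assumptions picked for illustration, not values from either product.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass
class Bounds:
    # Defaults taken from the quoted Nutch documentation.
    min_interval: float = 60.0        # MIN_INTERVAL: 1 minute
    max_interval: float = 365 * DAY   # MAX_INTERVAL: 365 days


def clamp(interval, bounds):
    """Keep the interval inside [min_interval, max_interval]."""
    return max(bounds.min_interval, min(bounds.max_interval, interval))


def gradual_update(interval, changed, bounds, inc=0.2, dec=0.2):
    """Gradual rule from the quoted Javadoc (SYNC_DELTA not modeled).

    Shrinks the interval by `dec` on change, grows it by `inc` otherwise.
    """
    factor = (1.0 - dec) if changed else (1.0 + inc)
    return clamp(interval * factor, bounds)


def reset_with_backoff(interval, changed, bounds, initial=DAY, multiplier=2.0):
    """Hypothetical rule: reset on change, bounded exponential backoff otherwise.

    `initial` and `multiplier` are illustrative assumptions.
    """
    if changed:
        return clamp(initial, bounds)
    return clamp(interval * multiplier, bounds)


if __name__ == "__main__":
    bounds = Bounds()
    interval = 30 * DAY
    # A page that sat unchanged for a long time has just changed.
    print(gradual_update(interval, True, bounds) / DAY)
    print(reset_with_backoff(interval, True, bounds) / DAY)
```

With the gradual rule, a page that starts changing after a long quiet period needs many fetches before its interval gets short. The reset rule gets there in one step, and the upper bound keeps the backoff from growing without limit.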

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscalone@gmail.com> wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that have changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>      - calculate a delta = fetchTime - modifiedTime
>>      - try to synchronize with the time of change, by shifting the next
>>      fetchTime by a fraction of the difference between the last modification
>>      time and the last fetch time. I.e. the next fetch time will be set to fetchTime
>>      + fetchInterval - delta * SYNC_DELTA_RATE
>>      - if the adjusted fetch interval is bigger than the delta, then fetchInterval
>>      = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>> the algorithm, so that the fetch interval either increases or decreases
>> infinitely, with little relevance to the page changes. Please use
>> main(String[])
>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wunder@wunderwood.org>:
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>> in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for
>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>> answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you use
>> that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn.ctr@mail.mil> wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using Ultraseek and looking to deprecate it in favor of Solr/Nutch.
>>> Ultraseek runs all the time, auto-detects when pages have changed, and automatically reindexes them.
>>> Is this possible with Solr/Nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn.ctr@mail.mil
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 