lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
Date Wed, 03 Aug 2016 23:10:17 GMT
No, just run it continuously, always! By default, everything is refetched (if possible) every
30 days. Just read the descriptions for the adaptive schedule and its Javadoc. It is simple to
use, but its outcome is sometimes hard to predict, because you never know what changes or
when.

You will be fine with defaults if you have a small site. Just set the interval to a few days,
or more if your site is slightly larger.

M.

 
 
-----Original message-----
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn.ctr@mail.mil>
> Sent: Wednesday 3rd August 2016 20:08
> To: solr-user@lucene.apache.org
> Subject: RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> Shall I assume that, even though Nutch has adaptive capability, I would still have to
> figure out how to trigger it to go look for content that needs an update?
> 
> Thanks,
> Kris
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.      
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn.ctr@mail.mil
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org] 
> Sent: Wednesday, August 03, 2016 2:03 PM
> To: solr-user@lucene.apache.org
> Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> ----
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had
> not changed.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
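The policy suggested above (reset the interval when a page changes, back off exponentially up to a bound while it stays unchanged) could be sketched roughly as follows. This is only an illustration of the idea, not Ultraseek's or Nutch's actual code; the function name, the doubling factor, and the one-hour and 30-day bounds are all assumptions chosen for the example.

```python
HOUR = 60 * 60          # seconds
DAY = 24 * HOUR


def backoff_interval(interval, changed, base=HOUR, factor=2.0, cap=30 * DAY):
    """Return the next re-fetch interval in seconds.

    base, factor and cap are illustrative assumptions, not values
    taken from Ultraseek or Nutch.
    """
    if changed:
        # Page changed: drop straight back to the shortest interval.
        return float(base)
    # Page unchanged: grow the interval geometrically, but never past the cap.
    return float(min(interval * factor, cap))


if __name__ == "__main__":
    interval = float(HOUR)
    # Simulate four unchanged fetches followed by one detected change.
    for changed in (False, False, False, False, True):
        interval = backoff_interval(interval, changed)
        print(f"changed={changed!s:5}  next interval = {interval / HOUR:.0f} h")
```

The difference from the scheme quoted below is that a single detected change shortens the interval all the way to the base value at once, rather than by a fixed fraction per fetch.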
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscalone@gmail.com> wrote:
> > 
> > Nutch also has an adaptive strategy:
> > 
> >> This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that have changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>      - calculate a delta = fetchTime - modifiedTime
> >>      - try to synchronize with the time of change, by shifting the next
> >>      fetchTime by a fraction of the difference between the last modification
> >>      time and the last fetch time. I.e. the next fetch time will be set to fetchTime
> >>      + fetchInterval - delta * SYNC_DELTA_RATE
> >>      - if the adjusted fetch interval is bigger than the delta, then fetchInterval
> >>      = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
> >> destabilize the algorithm, so that the fetch interval either 
> >> increases or decreases infinitely, with little relevance to the page 
> >> changes. Please use the main(String[]) method
> >> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> > 
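The interval update described in the quoted Javadoc can be sketched as below. This follows the quoted text literally and is not the Nutch source, which may differ in detail. The names `AdaptiveParams` and `next_fetch` are hypothetical, the `sync_delta_rate` value of 0.3 is an assumption (the quoted text gives no default), and so is the order in which the delta rule and the min/max bounds are applied. Times are in seconds.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass(frozen=True)
class AdaptiveParams:
    inc_factor: float = 0.2            # from the quoted Javadoc
    dec_factor: float = 0.2            # from the quoted Javadoc
    min_interval: float = 60.0         # 1 minute, from the quoted Javadoc
    max_interval: float = 365.0 * DAY  # 365 days, from the quoted Javadoc
    sync_delta: bool = False
    sync_delta_rate: float = 0.3       # assumption: not given in the quoted text


def next_fetch(fetch_time, modified_time, interval, changed, p=AdaptiveParams()):
    """Return (next_fetch_time, new_interval) per the quoted description."""
    if changed:
        interval *= 1.0 - p.dec_factor   # changed: shorten the interval
    else:
        interval *= 1.0 + p.inc_factor   # unchanged: lengthen the interval

    shift = 0.0
    if p.sync_delta:
        delta = fetch_time - modified_time
        shift = delta * p.sync_delta_rate
        # As quoted: "if the adjusted fetch interval is bigger than the
        # delta, then fetchInterval = delta."
        if interval > delta:
            interval = delta

    # Keep the interval within [min_interval, max_interval].
    interval = max(p.min_interval, min(interval, p.max_interval))
    return fetch_time + interval - shift, interval


if __name__ == "__main__":
    interval = 30.0 * DAY
    now = 0.0
    # A page that changes on every fetch: the interval shrinks by 20% each time.
    for _ in range(5):
        now, interval = next_fetch(now, now, interval, changed=True)
        print(f"next fetch at day {now / DAY:6.2f}, interval {interval / DAY:5.2f} days")
```

With these factors, an interval that starts at 30 days needs many consecutive changed fetches to become short, which is the slow shortening that the reply above this quote comments on.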
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood <wunder@wunderwood.org>:
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
> >> crawler in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for 
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the 
> >> argument to Mike Lynch. He looked at me like I had three heads and 
> >> didn’t even answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you 
> >> use that, you’ll need to find a way to do that with another crawler.
> >> 
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wunder@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >> 
> >> 
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> >>> <kris.t.musshorn.ctr@mail.mil> wrote:
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >>> 
> >>> We are currently using Ultraseek and looking to deprecate it in favor of
> >>> Solr/Nutch.
> >>> Ultraseek runs all the time and auto-detects when pages have changed and
> >>> automatically reindexes them.
> >>> Is this possible with Solr/Nutch?
> >>> 
> >>> Thanks,
> >>> Kris
> >>> 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Kris T. Musshorn
> >>> FileMaker Developer - Contractor - Catapult Technology Inc.
> >>> US Army Research Lab
> >>> Aberdeen Proving Ground
> >>> Application Management & Development Branch
> >>> 410-278-7251
> >>> kris.t.musshorn.ctr@mail.mil
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> 
> >>> 
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >> 
> >> 
> 
> 
> CLASSIFICATION: UNCLASSIFIED
> 
