lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: SOLR + Nutch set up (UNCLASSIFIED)
Date Thu, 04 Aug 2016 00:03:13 GMT
Ah, the difference between open source and a product. With Ultraseek, we chose a solid, stable
algorithm that worked well for 3000 customers. In open source, it is a research project for
every single customer.

I love open source. I’ve brought Solr into Netflix and Chegg. But there is a clear difference
between developer-driven and customer-driven software.

I first learned about bounded binary exponential backoff in the Digital/Intel/Xerox (“DIX”)
Ethernet spec in 1980. It is a solid algorithm for events with a Poisson distribution, like
packet arrival times or web page next change times. There is no need for configuring algorithms
here, especially configurations that lead to an unstable estimate. The only meaningful choices
are the minimum revisit time, the maximum revisit time, and the number of bins. Those will
be different for CNN (a launch customer for Ultraseek) or Sun documentation (another launch
customer). CNN news articles change minute by minute, while new Sun documentation appeared weekly
or monthly.
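
As a rough illustration of the scheme described above, here is a minimal Python sketch of a bounded binary exponential backoff for revisit times. It is not Ultraseek's code. The class name, the field names, and the default values are all assumptions made for the example; the only things taken from the message are the three knobs (minimum revisit time, maximum revisit time, number of bins), the doubling when a page has not changed, and the reset when it has.

```python
from dataclasses import dataclass


@dataclass
class RevisitSchedule:
    """Hypothetical bounded binary exponential backoff for page revisits."""

    min_interval: float = 60.0         # seconds; assumed default
    max_interval: float = 7 * 86400.0  # seconds; assumed default
    bins: int = 16                     # number of backoff steps; assumed default
    step: int = 0                      # current bin, 0 means "revisit soonest"

    def interval(self) -> float:
        # Double the minimum interval once per step, capped at the maximum.
        return min(self.min_interval * (2 ** self.step), self.max_interval)

    def record_fetch(self, changed: bool) -> float:
        """Update the estimate after a fetch and return the next interval."""
        if changed:
            # Page changed: reset the estimate instead of shortening it slowly.
            self.step = 0
        else:
            # Page unchanged: back off one bin, never past the last bin.
            self.step = min(self.step + 1, self.bins - 1)
        return self.interval()


if __name__ == "__main__":
    schedule = RevisitSchedule(min_interval=60, max_interval=600, bins=5)
    for changed in (False, False, False, False, True):
        print(changed, schedule.record_fetch(changed))
```

Because the interval can only move along a fixed ladder of bins between the two bounds, there is no tuning factor that can push the estimate into runaway growth or decay.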

Sorry for the rant, but “you can fix the algorithm yourself” almost always means a bad
installation, an unhappy admin, and another black eye for open source.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 4:07 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> 
> Depending on your settings, Nutch does this as well. It is even possible to set up different
> inc/decremental values per mime-type.
> The algorithms are pluggable and overridable at any point of interest. You can go all
> the way.
> 
> -----Original message-----
>> From:Walter Underwood <wunder@wunderwood.org>
>> Sent: Wednesday 3rd August 2016 20:03
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>> 
>> That’s good news.
>> 
>> It should reset the interval estimate on page change instead of slowly shortening
>> it.
>> 
>> I’m pretty sure that Ultraseek used a bounded exponential backoff when the page
>> had not changed.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscalone@gmail.com> wrote:
>>> 
>>> Nutch also has adaptive strategy:
>>> 
>>> This class implements an adaptive re-fetch algorithm. This works as
>>>> follows:
>>>> 
>>>>  - for pages that have changed since the last fetchTime, decrease their
>>>>  fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>>>  - for pages that haven't changed since the last fetchTime, increase
>>>>  their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>>>  If SYNC_DELTA property is true, then:
>>>>     - calculate a delta = fetchTime - modifiedTime
>>>>     - try to synchronize with the time of change, by shifting the next
>>>>     fetchTime by a fraction of the difference between the last modification
>>>>     time and the last fetch time. I.e. the next fetch time will be set to
>>>>     fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>>>>     - if the adjusted fetch interval is bigger than the delta, then fetchInterval
>>>>     = delta.
>>>>  - the minimum value of fetchInterval may not be smaller than
>>>>  MIN_INTERVAL (default is 1 minute).
>>>>  - the maximum value of fetchInterval may not be bigger than
>>>>  MAX_INTERVAL (default is 365 days).
>>>> 
>>>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>>>> the algorithm, so that the fetch interval either increases or decreases
>>>> infinitely, with little relevance to the page changes. Please use
>>>> main(String[])
>>>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>>>> method to test the values before applying them in a production system.
>>>> 
>>> 
>>> From:
>>> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
>>> 
>>> 
>>> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wunder@wunderwood.org>:
>>> 
>>>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>>>> in Ultraseek.
>>>> 
>>>> I think we were the only people who built an adaptive crawler for
>>>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>>>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>>>> answer me.
>>>> 
>>>> Ultraseek also has great support for sites that need login. If you use
>>>> that, you’ll need to find a way to do that with another crawler.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> Former Ultraseek Principal Engineer
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> 
>>>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>>>> <kris.t.musshorn.ctr@mail.mil> wrote:
>>>>> 
>>>>> CLASSIFICATION: UNCLASSIFIED
>>>>> 
>>>>> We are currently using ultraseek and looking to deprecate it in favor of
>>>>> solr/nutch.
>>>>> Ultraseek runs all the time and auto detects when pages have changed and
>>>>> automatically reindexes them.
>>>>> Is this possible with SOLR/nutch?
>>>>> 
>>>>> Thanks,
>>>>> Kris
>>>>> 
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> Kris T. Musshorn
>>>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>>>> US Army Research Lab
>>>>> Aberdeen Proving Ground
>>>>> Application Management & Development Branch
>>>>> 410-278-7251
>>>>> kris.t.musshorn.ctr@mail.mil
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> 
>>>>> 
>>>>> 
>>>>> CLASSIFICATION: UNCLASSIFIED
>>>> 
>>>> 
>> 
>> 

