nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: [VOTE] Release Apache Nutch 1.15 RC#1
Date Wed, 01 Aug 2018 09:59:02 GMT
However, the test crawl ran (and is still running) fine in the background, with no errors.
But just now, while watching the fetcher, I noticed that the crawl delay is not always
respected. The only configuration change I made is setting the http.agent.* directives,
which are required to run.

2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/rqlNNVQgix (queue crawl delay=5000ms)
2018-08-01 11:47:41,319 INFO  fetcher.FetcherThread - FetcherThread 51 fetching http://planet.apache.org/ (queue crawl delay=5000ms)
2018-08-01 11:47:41,324 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2018-08-01 11:47:41,325 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://schema.org/Event (queue crawl delay=5000ms)
2018-08-01 11:47:41,515 INFO  fetcher.FetcherThread - FetcherThread 44 fetching http://people.apache.org/~jianhe (queue crawl delay=5000ms)
2018-08-01 11:47:41,532 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2018-08-01 11:47:41,533 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://en.wikipedia.org/wiki/Internet_marketing (queue crawl delay=5000ms)
2018-08-01 11:47:41,600 INFO  fetcher.FetcherThread - FetcherThread 44 fetching https://apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.zip.asc (queue crawl delay=5000ms)
2018-08-01 11:47:41,607 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2018-08-01 11:47:41,608 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://twitter.com/i/directory/profiles/5 (queue crawl delay=5000ms)
2018-08-01 11:47:41,673 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Categories (queue crawl delay=5000ms)
2018-08-01 11:47:41,688 INFO  fetcher.FetcherThread - FetcherThread 52 fetching http://photomatt.net/ (queue crawl delay=5000ms)
2018-08-01 11:47:41,696 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://cy.wikipedia.org/wiki/Wicipedia:Cysylltwch_%C3%A2_ni (queue crawl delay=5000ms)
2018-08-01 11:47:41,752 INFO  fetcher.FetcherThread - FetcherThread 48 fetching https://mobile.twitter.com/david_kunz/followers (queue crawl delay=5000ms)
2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT (queue crawl delay=5000ms)
2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Q9BJ0FhzzF (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/wWIMOZ3wxg (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/dImmnEeXjb (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/IPPSdW6o52 (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/Y85UlnueSC (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/TvZSGiZC9D (queue crawl delay=5000ms)
2018-08-01 11:47:41,864 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/jG7BvlobXD (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/ZJmzbWVFrh (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/atVcrbCi5q (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://avro.apache.org/releases.html (queue crawl delay=5000ms)
2018-08-01 11:47:41,865 INFO  fetcher.FetcherThread - FetcherThread 43 fetching https://issues.apache.org/jira/browse/HADOOP-15283 (queue crawl delay=5000ms)
2018-08-01 11:47:42,175 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=500, fetchQueues.getQueueCount=67
2018-08-01 11:47:42,225 INFO  fetcher.FetcherThread - FetcherThread 47 fetching http://www.aetna.com/ (queue crawl delay=5000ms)
2018-08-01 11:47:42,316 INFO  fetcher.FetcherThread - FetcherThread 49 fetching http://www.miredot.com/ (queue crawl delay=5000ms)
2018-08-01 11:47:42,357 INFO  fetcher.FetcherThread - FetcherThread 48 fetching http://xmlgraphics.apache.org/batik/ (queue crawl delay=5000ms)
2018-08-01 11:47:42,402 INFO  fetcher.FetcherThread - FetcherThread 49 fetching https://t.co/XgG7zomVs8 (queue crawl delay=5000ms)
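
In the log above, FetcherThread 47 logs ten t.co fetches between 11:47:41,863 and
11:47:41,865, although the queue crawl delay is reported as 5000ms. A rough way to scan
a larger log for the same pattern is sketched below in Python. This is not part of Nutch:
the function name find_delay_violations and the SAMPLE text are made up for illustration,
and the sketch assumes that queues are keyed by host name (fetcher.queue.mode=byHost) and
that each queue should start at most one fetch per crawl delay. It only compares the
timestamps of consecutive "fetching" log lines per host, so it cannot tell whether a
request was really sent or, for example, skipped right after being logged.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Matches FetcherThread "fetching" entries. \s+ before "(queue ..." also
# matches a newline, so entries wrapped across two lines are handled too.
FETCH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\s+"
    r"fetcher\.FetcherThread - FetcherThread \d+ fetching (\S+)"
    r"\s+\(queue crawl delay=(\d+)ms\)",
    re.MULTILINE,
)


def find_delay_violations(log_text):
    """Return (host, previous_url, url, gap_ms) for every pair of consecutive
    fetches of the same host that are logged closer together than the
    crawl delay reported on the later log line.

    Assumption: one queue per host name (fetcher.queue.mode=byHost).
    """
    last_seen = {}   # host -> (timestamp, url) of the previous logged fetch
    violations = []
    for match in FETCH_RE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        url = match.group(2)
        delay_ms = int(match.group(3))
        host = urlsplit(url).hostname
        previous = last_seen.get(host)
        if previous is not None:
            gap_ms = round((stamp - previous[0]).total_seconds() * 1000)
            if gap_ms < delay_ms:
                violations.append((host, previous[1], url, gap_ms))
        last_seen[host] = (stamp, url)
    return violations


# Three entries copied from the log above, in the wrapped form of the mail.
SAMPLE = """\
2018-08-01 11:47:41,256 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/rqlNNVQgix
(queue crawl delay=5000ms)
2018-08-01 11:47:41,319 INFO  fetcher.FetcherThread - FetcherThread 51 fetching http://planet.apache.org/
(queue crawl delay=5000ms)
2018-08-01 11:47:41,863 INFO  fetcher.FetcherThread - FetcherThread 47 fetching https://t.co/xEOAFfp7lT
(queue crawl delay=5000ms)
"""

if __name__ == "__main__":
    for host, previous_url, url, gap_ms in find_delay_violations(SAMPLE):
        print(f"{host}: {url} logged {gap_ms}ms after {previous_url}")
```

If the crawl uses a different fetcher.queue.mode (byDomain or byIP), or more than one
thread per queue (fetcher.threads.per.queue), the grouping key and the expected delay
would need to be adapted accordingly.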

I believe this problem should be addressed prior to release; therefore I withdraw my +1.
Because this is not a breaking issue, I will not -1 this RC.

Regards,
Markus

 
 
-----Original message-----
> From:Markus Jelsma <markus.jelsma@openindex.io>
> Sent: Wednesday 1st August 2018 11:38
> To: dev@nutch.apache.org; user@nutch.apache.org
> Subject: RE: [VOTE] Release Apache Nutch 1.15 RC#1
> 
> All tests pass, crawler run fine so far, +1 for 1.15!
> 
> Regards,
> Markus
> 
>  
>  
> -----Original message-----
> > From:Sebastian Nagel <wastl.nagel@googlemail.com>
> > Sent: Thursday 26th July 2018 17:05
> > To: user@nutch.apache.org
> > Cc: dev@nutch.apache.org
> > Subject: [VOTE] Release Apache Nutch 1.15 RC#1
> > 
> > Hi Folks,
> > 
> > A first candidate for the Nutch 1.15 release is available at:
> > 
> >   https://dist.apache.org/repos/dist/dev/nutch/1.15/
> > 
> > The release candidate is a zip and tar.gz archive of the binary and sources in:
> >   https://github.com/apache/nutch/tree/release-1.15
> > 
> > The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
> >    555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> > 
> > In addition, a staged maven repository is available here:
> >    https://repository.apache.org/content/repositories/orgapachenutch-1015/
> > 
> > We addressed 119 Issues:
> >    https://s.apache.org/nczS
> > 
> > Please vote on releasing this package as Apache Nutch 1.15.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Nutch PMC votes are cast.
> > 
> > [ ] +1 Release this package as Apache Nutch 1.15.
> > [ ] -1 Do not release this package because…
> > 
> > Cheers,
> > Sebastian
> > (On behalf of the Nutch PMC)
> > 
> > P.S. Here is my +1.
> > 
> 
