nutch-dev mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: setting modifiedTime in DefaultFetchSchedule
Date Mon, 19 Nov 2012 23:12:48 GMT
Hi Cesare,

> modifiedTime = fetchTime;
> instead of:
> if (modifiedTime <= 0) modifiedTime = fetchTime;
This will always overwrite modified time with the time the fetch took place.
I would prefer the way it's done in AdaptiveFetchSchedule:
only set modifiedTime if it's unset (=0).

After a closer look at 1.x regarding this point I can confirm:
- with DefaultFetchSchedule the modifiedTime is never set / always 0
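
To illustrate the difference between the two variants, here is a minimal standalone sketch. The class and method names are made up for illustration and are not part of the Nutch API; only the guard "if (modifiedTime <= 0)" is taken from AdaptiveFetchSchedule as quoted above.

```java
// Hypothetical sketch, not Nutch code: contrasts the two assignments discussed above.
public class ModifiedTimeSketch {

    // Guarded variant (as quoted from AdaptiveFetchSchedule):
    // keep a known modified time, fall back to fetchTime only when unset.
    static long guarded(long modifiedTime, long fetchTime) {
        if (modifiedTime <= 0) {
            modifiedTime = fetchTime;
        }
        return modifiedTime;
    }

    // Unconditional variant: the previously stored modified time is always lost.
    static long unconditional(long modifiedTime, long fetchTime) {
        return fetchTime;
    }

    public static void main(String[] args) {
        long known = 1000L;  // assumed earlier modification time (ms)
        long fetch = 5000L;  // assumed time of the current fetch (ms)
        System.out.println("guarded, unset:  " + guarded(0L, fetch));
        System.out.println("guarded, known:  " + guarded(known, fetch));
        System.out.println("unconditional:   " + unconditional(known, fetch));
    }
}
```

With the guarded variant a known modification time survives later fetches; with the unconditional one it is replaced on every fetch.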

> I don't know if this is correct (probably not) but at least 304 seems to be
> handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> added a special case for 304:
>
> if (code == 304) { // got a not modified response
>     return new ProtocolOutput(response.toContent(),
>       ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
>         }
>
> I suppose this is NOT the right solution :-)
At first glance, it's not bad. Protocol-file obviously needs a revision:
the 304 is set properly in FileResponse.java, but in File.java it is treated as
a redirect:
   else if (code >= 300 && code < 400) { // handle redirect
So, thanks. Good catch!
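
The ordering problem can be sketched in isolation as follows. This is not the code of File.java: the enum and the method are hypothetical. It only shows that the 304 test has to come before the generic 3xx range test, otherwise 304 falls into the redirect branch.

```java
// Hypothetical sketch, not Nutch code: order of status code checks.
public class FileStatusSketch {

    enum Outcome { SUCCESS, NOT_MODIFIED, REDIRECT, NOT_FOUND, ERROR }

    static Outcome classify(int code) {
        if (code == 200) {
            return Outcome.SUCCESS;
        } else if (code == 304) {
            // Must be tested before the range below, which also matches 304.
            return Outcome.NOT_MODIFIED;
        } else if (code >= 300 && code < 400) {
            return Outcome.REDIRECT;
        } else if (code == 404) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.ERROR;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 301, 304, 404, 500}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```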

Would be great if you could open Jira issues for
- setting modified time in DefaultFetchSchedule
- 304 handling in protocol-file
If you can provide patches, even better. Thanks!

About your problem with removal / re-adding files:
- a file system is crawled as if it were a set of linked web pages:
  a directory is just an HTML page with all files and sub-directories
  as links.
- re-crawling does not necessarily remove deleted files from the index.
  The URL/path to a deleted file is kept forever
  until it's removed explicitly.
- You have to force a re-fetch of the URL/file to be sure it is still
  present or has been removed. If 304 handling is working, this should
  be quite cheap for file system crawls because a re-parse is not necessary.
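
Why the re-fetch is cheap can be sketched like this, assuming the check boils down to comparing the file's last-modified time with the time stored from the previous fetch. The class, the method and the stored-time parameter are hypothetical and do not reflect how protocol-file is actually implemented.

```java
import java.io.File;

// Hypothetical sketch, not Nutch code: a metadata-only re-check of a local file.
public class FileRecheckSketch {

    // Returns an HTTP-like status code without reading the file content.
    static int statusFor(File file, long storedModifiedTime) {
        if (!file.exists()) {
            return 404; // file was deleted: the entry can be marked as gone
        }
        if (storedModifiedTime > 0 && file.lastModified() <= storedModifiedTime) {
            return 304; // unchanged since the last fetch: no re-parse needed
        }
        return 200;     // new or changed: fetch and parse again
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("never fetched:   " + statusFor(dir, 0L));
        System.out.println("fetched earlier: " + statusFor(dir, Long.MAX_VALUE));
        System.out.println("missing path:    " + statusFor(new File(dir, "no-such-file-for-sketch"), 0L));
    }
}
```

A stored time of 0 means "never fetched", which is why a modifiedTime that is always 0 rules out any 304 answer.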

Ciao,
Sebastian


On 11/19/2012 05:30 PM, Cesare Zavattari wrote:
> Ciao,
> in the meantime I've run some more tests using Nutch 2.1 with
> DefaultFetchSchedule where I've put:
> 
> modifiedTime = fetchTime;
> 
> instead of:
> 
> if (modifiedTime <= 0) modifiedTime = fetchTime;
> 
> I don't know if this is correct (probably not) but at least 304 seems to be
> handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> added a special case for 304:
> 
> if (code == 304) { // got a not modified response
>     return new ProtocolOutput(response.toContent(),
>       ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
>         }
> 
> I suppose this is NOT the right solution :-)
> Anyway, this is another problem I have with protocol-file. I have the seed:
> 
> file://localhost/tmp/files/
> 
> this directory contains a couple of files, aa.txt and bbbbb.txt
> If a file is deleted, recrawled, and then re-added, it is ignored. I mean:
> 
> ./nutch crawl urls -depth 2 -topN 5
> rm /tmp/files/bbbbb.txt
> ./nutch crawl urls -depth 2 -topN 5
> echo "saaaszzz" >/tmp/files/bbbbb.txt
> ./nutch crawl urls -depth 2 -topN 5
> 
> ...
> Skipping file://localhost/tmp/files/bbbbb.txt; different batch id (null)
> ...
> 
> and the dump sticks with
> 
> ...
> baseUrl:        file://localhost/tmp/files/bbbbb.txt
> status: 1 (status_unfetched)
> ...
> protocolStatus: EXCEPTION, args=[org.apache.nutch.protocol.file.FileError:
> File Error: 404]
> 
> 
> 
> what am I doing wrong?
> 
> Thanks a lot!
> 
> 
> 
> 
> On Thu, Nov 15, 2012 at 7:25 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
> 
>> Hi Cesare,
>>
>> hmhh... Good catch!
>>
>> The modifiedTime is also set in CrawlDbReducer.reduce
>> right after FetchSchedule.setFetchSchedule is called and the signature
>> hasn't changed compared to the previous fetch, cf. NUTCH-1341.
>>
>> At first glance, it looks like the modifiedTime is indeed never set
>> with DefaultFetchSchedule.
>> I'll have a more detailed look at this and come back soon.
>>
>> Thanks,
>> Sebastian
>>
>> On 11/15/2012 12:33 PM, Cesare Zavattari wrote:
>>> Hi all,
>>> the AdaptiveFetchSchedule has the following line:
>>>
>>> if (modifiedTime <= 0) modifiedTime = fetchTime;
>>>
>>> that DefaultFetchSchedule does not have. This seems to
>>> prevent DefaultFetchSchedule from correctly handling possible 304 responses
>>> (modifiedTime seems to be always zero and HttpRequest.java doesn't
>>> set the If-Modified-Since request header).
>>>
>>> This is true for both nutch 1.x and 2.x.
>>>
>>> Is this the expected behaviour?
>>>
>>> Thanks
>>> Bye
>>>
>>
>>
> 
> 

