On Tue, Nov 20, 2012 at 12:12 AM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
Hi Cesare,

Ciao Sebastian and thanks for your email.
 
> modifiedTime = fetchTime;
> instead of:
> if (modifiedTime <= 0) modifiedTime = fetchTime;
This will always overwrite modified time with the time the fetch took place.
I would prefer the way it's done in AdaptiveFetchSchedule:
only set modifiedTime if it's unset (=0).

here's my problem:

- you fetch a page XXX the first time
- modifiedTime is 0, so it's set to fetchTime
- from now on I'll get 304...
- ... unless XXX changes
- modifiedTime is never updated after that, so I'll never get a 304 for XXX again: every fetch returns 200, because the If-Modified-Since date we send stays older than the page's last modification

this is why I always set modifiedTime. We could skip it if status is NOTMODIFIED.

The same issue seems to affect AdaptiveFetchSchedule.
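
To make it concrete, the update rule I have in mind is roughly this (just a sketch of mine with placeholder names, not the real FetchSchedule API):

  // Sketch only: placeholder names, not the real FetchSchedule signature.
  class ModifiedTimeRule {
    static final int STATUS_NOTMODIFIED = 304; // stand-in for the schedule's "not modified" state

    static long updateModifiedTime(long prevModifiedTime, long fetchTime, int status) {
      if (status == STATUS_NOTMODIFIED) {
        return prevModifiedTime; // 304: content unchanged, keep the old timestamp
      }
      return fetchTime;          // content was (re)fetched: remember when
    }
  }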

> I don't know if this is correct (probably not) but at least 304 seems to be
> handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> added a special case for 304:
>
> if (code == 304) { // got a not modified response
>     return new ProtocolOutput(response.toContent(),
>       ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> }
>
> I suppose this is NOT the right solution :-)
At first glance, it's not bad. Protocol-file obviously needs a revision:
the 304 is set properly in FileResponse.java, but in File.java it is treated as
a redirect:
   else if (code >= 300 && code < 400) { // handle redirect
So, thanks. Good catch!

Would be great if you could open Jira issues for
- setting modified time in DefaultSchedule
- 304 handling in protocol-file
If you can provide patches, even better. Thanks!

I want to be sure what the right solution for setting modifiedTime is.
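
For the protocol-file side, the main point is only that 304 has to be checked before the generic 3xx redirect branch; roughly like this (a sketch of mine, the names are placeholders, not the actual File.java code):

  // Sketch only: FileFetchStatus and classify() are placeholder names of mine,
  // not the real ProtocolStatusCodes constants (apart from NOTMODIFIED above).
  class FileStatusMapping {
    enum FileFetchStatus { SUCCESS, NOTMODIFIED, MOVED, NOTFOUND, EXCEPTION }

    static FileFetchStatus classify(int code) {
      if (code == 200) return FileFetchStatus.SUCCESS;
      if (code == 304) return FileFetchStatus.NOTMODIFIED;         // check 304 before the 3xx branch
      if (code >= 300 && code < 400) return FileFetchStatus.MOVED; // real redirects only
      if (code == 404) return FileFetchStatus.NOTFOUND;
      return FileFetchStatus.EXCEPTION;
    }
  }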

About your problem with removal / re-adding files:
- a file system is crawled as if it were a set of linked web pages:
  a directory is treated like an HTML page with all its files and
  sub-directories as links.
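
(If I picture it right, the fetcher turns a directory into a small page of links, something like the sketch below; this is my own illustration, not the actual protocol-file output format.)

  // My own illustration of the "directory as a page of links" idea,
  // not the actual protocol-file output format.
  import java.io.File;

  class DirAsPage {
    static String render(File dir) {
      StringBuilder html = new StringBuilder("<html><body>\n");
      String[] entries = dir.list();
      if (entries != null) {               // null if dir is not a directory
        for (String name : entries) {
          html.append("<a href=\"").append(name).append("\">")
              .append(name).append("</a><br>\n"); // each entry becomes an outlink
        }
      }
      return html.append("</body></html>\n").toString();
    }
  }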

this is clear. Let's consider a page A that links to a page B:

A -> B

A is seed. I use the following command:

./nutch crawl urls -depth 2 -topN 5

we crawl it. Ok.
Now let's remove page B.

./nutch crawl urls -depth 2 -topN 5

B gets a 404. Fine.

now let's restore B and crawl again.

This works as expected if A and B are HTML pages (B is fetched again by "./nutch crawl"). If A is a directory and B is a file, however, B is never fetched again. Moreover, in this case A gets a 200 because a new file was added, so the parsing/generate phases should force a refetch of B, shouldn't they?

Reproducing it is easy:

mkdir /tmp/files/
echo "AAA" >/tmp/files/aa.txt

the only seed is file://localhost/tmp/files/

./nutch crawl urls -depth 2 -topN 5    // both /tmp/files/ and /tmp/files/aa.txt are fetched
rm /tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5    // /tmp/files/aa.txt gets a 404
echo "AAA" >/tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5    // /tmp/files/ has changed and is fetched again (200), while aa.txt:

ParserJob: parsing all
Parsing file://localhost/tmp/files/
Skipping file://localhost/tmp/files/aa.txt; different batch id (null)

and is never fetched again, even though the page that links to it (the directory) has changed.
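
From that message my guess is that ParserJob only parses pages whose batch id matches the current batch, unless it is told to parse everything; something along these lines (my assumption, not the actual ParserJob code):

  // My guess at the check behind the "different batch id" message; the names
  // and the reparseAll flag are assumptions, not the actual ParserJob code.
  class BatchIdCheck {
    static boolean shouldParse(String pageBatchId, String currentBatchId, boolean reparseAll) {
      if (reparseAll) {
        return true;                           // "parsing all": ignore batch ids
      }
      // pages not fetched in this batch (batch id null or different) are skipped
      return currentBatchId.equals(pageBatchId);
    }
  }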

is this the expected behavior?

thanks a lot

-- 
Cesare