nutch-dev mailing list archives

From Gabriele Kahlout <>
Subject Re:
Date Sun, 27 Mar 2011 12:44:49 GMT
I'm still modifying.

On Sun, Mar 27, 2011 at 2:34 PM, Julien Nioche <> wrote:

> Gabriele,
> I think it is a good idea to have a script like this; however, your proposal
> could be improved. It currently works only on a single machine and uses
> commands such as mv, ls, etc., which won't work on a pseudo- or fully
> distributed cluster. You should use the 'hadoop fs' commands instead.

Okay, let's go for 3 editions:
1. abridged, works only with Solr (the tersest script)
2. unabridged, with local fs, for beginners
3. unabridged, with Hadoop
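
To keep editions 2 and 3 close to each other, here is a minimal sketch of
how the file operations could be routed through one helper, so the same
script runs on the local fs or through 'hadoop fs'. FS_MODE and the fs_*
function names are made up for illustration; they are not part of Nutch or
Hadoop. Only 'hadoop fs -ls' and 'hadoop fs -mv' are real commands here.

```shell
#!/bin/sh
# Hypothetical switch: "local" uses ls/mv, "hadoop" uses 'hadoop fs'.
FS_MODE="${FS_MODE:-local}"

# List a directory on whichever filesystem is selected.
fs_ls() {
  if [ "$FS_MODE" = "hadoop" ]; then
    hadoop fs -ls "$1"
  else
    ls "$1"
  fi
}

# Move (rename) a path on whichever filesystem is selected.
fs_mv() {
  if [ "$FS_MODE" = "hadoop" ]; then
    hadoop fs -mv "$1" "$2"
  else
    mv "$1" "$2"
  fi
}

# Demo in local mode: move a (made-up) segment directory into an archive.
work=$(mktemp -d)
mkdir -p "$work/segments/20110327" "$work/archive"
fs_mv "$work/segments/20110327" "$work/archive/20110327"
fs_ls "$work/archive"
```

Running the script with FS_MODE=hadoop would send the same calls to the
cluster. The output format of 'hadoop fs -ls' differs from plain ls, so
anything that parses the listing would still need separate handling.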

> If I understand the script correctly, you then merge different crawldbs.
> Why do you do that? There should be one crawldb per crawl so I don't think
> this is at all necessary.

So that I get a single dump with info about all the URLs crawled. On the
scale of the web this is probably a bad idea, isn't it? But then how else
could you inspect all the crawled URLs at once?

> Having a script would definitely be a plus for beginners and would give
> more flexibility than the crawl command.

I'd be the first of those beginners. Crawl is not recommended for whole-web
crawling, I guess because it doesn't work incrementally. Why not add such an
option to crawl? Shall I file a feature request or a patch for that?

> Julien
> P.S. I'm still modifying the page.

> --
> Open Source Solutions for Text Engineering

K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
