nutch-dev mailing list archives

From Gabriele Kahlout <gabri...@mysimpatico.com>
Subject Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling
Date Sun, 27 Mar 2011 12:44:49 GMT
P.S.
I'm still modifying.

On Sun, Mar 27, 2011 at 2:34 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Gabriele,
>
> I think it is a good idea to have a script like this; however, your proposal
> could be improved. It currently works only on a single machine and uses
> commands such as mv, ls, etc., which won't work on a pseudo-distributed or
> fully distributed cluster. You should use the 'hadoop fs' commands instead.
>

Okay, let's go for 3 editions:
1. abridged, works only with Solr (tersest script)
2. unabridged, with the local fs, for beginners
3. unabridged, with Hadoop
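
To make the difference between editions 2 and 3 concrete, here is a rough
sketch of how the same script could switch between plain commands and their
'hadoop fs' counterparts. FS_MODE, fs_cmd and fs are hypothetical names made
up for this illustration (they are not part of Nutch or Hadoop), and
crawl/segments is only an example path. It covers ls and mv, the two commands
mentioned above, which map directly onto 'hadoop fs -ls' and 'hadoop fs -mv'.

```shell
#!/bin/sh
# FS_MODE is a hypothetical switch: "local" (edition 2) or "hadoop" (edition 3).
FS_MODE="${FS_MODE:-local}"

# fs_cmd OP ARGS... prints the command line that would be run, so the
# mapping can be checked without a Hadoop installation.
fs_cmd() {
  op="$1"; shift
  case "$FS_MODE" in
    hadoop) echo "hadoop fs -$op $*" ;;
    *)      echo "$op $*" ;;
  esac
}

# fs OP ARGS... runs the operation against the selected file system.
fs() {
  op="$1"; shift
  case "$FS_MODE" in
    hadoop) hadoop fs "-$op" "$@" ;;
    *)      "$op" "$@" ;;
  esac
}

# Example (assumed path): show what listing the segments would run.
fs_cmd ls crawl/segments
```

With FS_MODE=hadoop the last line prints the 'hadoop fs -ls' form; calling
fs instead of fs_cmd would actually execute it.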


> If I understand the script correctly, you then merge different crawldbs.
> Why do you do that? There should be one crawldb per crawl so I don't think
> this is at all necessary.
>
So that I get a single dump with info about all the URLs crawled. On the
scale of the web this is probably a bad idea, isn't it? But then how else
could you inspect all the crawled URLs at once?


> Having a script would definitely be a plus for beginners and would give
> more flexibility than the crawl command.
>

I'd be the first of those beginners. Crawl is not recommended for whole-web
crawling, I guess because it doesn't work incrementally. Why not add such an
option to crawl? Shall I file a feature request or a patch for that?

> Thanks
>
> Julien
>
> P.S. I'm still modifying the page.


> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
