OK, I've hadoopized the script, though I've tried it only locally.
I rethought (laziness convinced me) and decided not to include the indexer parameter.
On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <firstname.lastname@example.org> wrote:
You don't need to have 2 and 3. The Hadoop commands will work on the local FS in a completely transparent way; it all depends on the way Hadoop is configured. It isolates the way data are stored (local or distributed) from the client code, i.e. Nutch. By adding a separate script using fs, you'd add more confusion and lead beginners to think that they HAVE to use fs.
I apologize for not having looked into Hadoop in detail yet, but I had understood that it would abstract over the single-machine FS.
No problem. It would be worth spending a bit of time reading about Hadoop if you want to get a better understanding of Nutch. Tom White's book is an excellent reference, but the wikis and tutorials would be a good start.
However, to get up and running after downloading Nutch, will the script just work or will I have to configure Hadoop? I assume the latter.
Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API for getting its inputs, so when you run it as you did, what actually happens is that you are getting the data from the local FS via Hadoop.
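For instance, with a stock download where fs.default.name is left at its file:/// default, the fs shell reads the local disk directly (the paths here are just illustrative):

    # no HDFS running; these operate on the local filesystem
    hadoop fs -ls crawl/crawldb        # same result as 'ls crawl/crawldb'
    hadoop fs -cat crawl/seeds/urls    # same result as 'cat crawl/seeds/urls'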
I'll look into it and update the script accordingly. From a beginner's perspective I like to reduce the magic (at first), see through the commands, and get up and running ASAP.
Hence 2. I'll be using 3.
Hadoop already reduces the magic for you :-)
Okay, if so I'll put the equivalent unix commands (mv/rm) in the comments of the hadoop commands and get rid of 2.
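Something along these lines, so the script stays readable for people who have never seen the fs shell (the paths are placeholders, not the real script):

    # equivalent unix command: mv crawl/crawldb crawl/crawldb.old
    hadoop fs -mv crawl/crawldb crawl/crawldb.old

    # equivalent unix command: rm -r crawl/crawldb.old
    hadoop fs -rmr crawl/crawldb.old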
As for legacy Lucene vs. Solr, what about having a parameter to determine which one should be used, and having a single script?
Excellent idea. The default is solr for 1 and 3, but if one passes the parameter 'll' it will use the legacy Lucene indexer. The default for 2 is ll, since we want to get up and running fast (before knowing what Solr is and setting it up).
It would be nice to have a third possible value (i.e. none) for the parameter -indexer (besides solr and lucene). A lot of people use Nutch as a crawling platform but do not do any indexing.
Agreed, will add that too.
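A rough sketch of what the parameter handling could look like; the option parsing, paths, and exact nutch invocations below are placeholders, not the final script:

    # -indexer accepts: solr (default), lucene, none
    INDEXER=solr
    if [ "$1" = "-indexer" ]; then
      INDEXER="$2"
      shift 2
    fi

    case "$INDEXER" in
      solr)   bin/nutch solrindex http://localhost:8983/solr/ \
                crawl/crawldb crawl/linkdb crawl/segments/* ;;
      lucene) bin/nutch index crawl/indexes \
                crawl/crawldb crawl/linkdb crawl/segments/* ;;
      none)   ;;   # crawl only, no indexing step
      *)      echo "unknown indexer: $INDEXER" >&2; exit 1 ;;
    esac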
Why do you want to get the info about ALL the urls? There is a readdb -stats command which gives a summary of the content of the crawldb. If you need to check a particular URL or domain, just use readdb -url and readdb -regex (or whatever the name of the param is).

At least when debugging/troubleshooting I found it useful to see which urls were fetched and the responses (robot_blocked, etc.).
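For reference, the two readdb uses mentioned above look roughly like this (the regex variant is omitted since the exact option name is uncertain):

    # summary of the whole crawldb: counts per fetch status, score stats, ...
    bin/nutch readdb crawl/crawldb -stats

    # full record for one URL: fetch status, fetch time, metadata
    bin/nutch readdb crawl/crawldb -url http://example.com/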
I can do that by examining each $it_crawldb in turn, since I don't know when that url was fetched (although since the fetching is pretty linear I could also find out, something like the index in seeds/urls divided by $it_size).
Better to do that by looking at the content of the segments using 'nutch readseg -dump' or using 'hadoop fs -libjars nutch.job segment/SEGMENTNUM/crawl_data', for instance. That's probably not something that most people will want to do, so maybe comment it out in your script?
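The readseg route would be something like this (the segment name is just an example timestamp):

    # dump the segment into a human-readable text file
    bin/nutch readseg -dump crawl/segments/20110328121500 segdump
    less segdump/dump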
Running Hadoop in pseudo-distributed mode and looking at the Hadoop web GUIs (http://localhost:50030) gives you a lot of information about your crawl.
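For anyone following along, getting there with a 0.20-era Hadoop is roughly this (after editing the conf files as described in the Hadoop quickstart):

    bin/hadoop namenode -format   # once, to initialise HDFS
    bin/start-all.sh              # namenode, datanode, jobtracker, tasktracker

    # web GUIs:
    #   http://localhost:50030/   JobTracker: per-job counters, task logs
    #   http://localhost:50070/   NameNode: HDFS browser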
It would definitely be better to have a single crawldb in your script.
Agreed; maybe again an option, with the default being none. But keep every $it_crawldb instead of deleting and merging them.
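If merging is wanted, Nutch already ships a merge tool; something like the following, where the per-iteration names are made up:

    # merge the per-iteration crawldbs into a single output crawldb
    bin/nutch mergedb crawl/crawldb_merged \
        crawl/crawldb_it1 crawl/crawldb_it2 crawl/crawldb_it3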
I should be looking into the necessary Hadoop bits today and will start updating the script accordingly.
--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).
If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).