nutch-dev mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject hadoop.job.history.user.location in nutch-default with CDH rendering job history useless
Date Tue, 07 Aug 2012 09:21:23 GMT
Hi,

nutch-default still contains the property
'hadoop.job.history.user.location', which redirects the creation of history
files from the job output location to a custom location. I noticed that the
current value does not work well with CDH, because ${hadoop.log.dir} is not
defined there. As a result, the job history in the jobtracker shows no
information at all, and jobs are listed with an 'incomplete' status.

Changing the value to, for example, /user/myname/history does work. However,
I have done some more testing and it seems that this property can be set to
'none', because the job history is also stored in the central jobtracker
location anyway. The 'hadoop.job.history.user.location' property only
specifies an extra location. But if it is set to an invalid value, the
history is not stored in the central location either. For more details,
please see:
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html

Setting this value to 'none' keeps the central history but prevents the job
from writing history to the job output location. If users want an extra
copy of the history files, nothing prevents them from specifying another
value in nutch-site, for example. Another option is to set it to 'history',
which does work with CDH. (This writes all logs to 'history' in the user
directory in the configured filesystem, usually dfs.) The final option is to
simply remove this value and not meddle with Hadoop properties at all. But
that requires all jobs to correctly ignore these files. I am not up to date
on how well this currently works with Nutch jobs.
This question is most relevant for trunk, since trunk heavily relies on the
filesystem for jobs.
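
To make the first option concrete, here is a rough sketch of an override in
nutch-site. It assumes the usual <configuration> wrapper and that nutch-site
takes precedence over nutch-default; the description text is only
illustrative. Replacing 'none' with 'history' would give the second option.

```xml
<?xml version="1.0"?>
<!-- Sketch of a nutch-site.xml override (illustrative, not tested on every distro) -->
<configuration>
  <property>
    <name>hadoop.job.history.user.location</name>
    <!-- 'none' disables the extra per-job history copy;
         the central jobtracker history is still written -->
    <value>none</value>
    <description>Illustrative override of the value in nutch-default.</description>
  </property>
</configuration>
```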

What do you think? It would be great if anyone could do some testing with
trunk and possibly another Hadoop distro (e.g. the official 1.0.3). Then
we would have some more input to decide what the best option is:
A) Set property to 'none'
B) Set property to 'history'
C) Remove property, see what happens, possibly fix jobs
D) ?

Ferdy.
