nutch-dev mailing list archives

From "Paul Baclace (JIRA)" <>
Subject [jira] Commented: (NUTCH-159) Specify temp/working directory for crawl
Date Tue, 10 Jan 2006 23:23:21 GMT

Paul Baclace commented on NUTCH-159:

mapred.temp.dir and mapred.local.dir are used for different purposes.

I think this is a sysadmin usability bug that really means:

1. defaults for these settings should be documented (of course)
2. it should be clear whether a path is abstract (applies to NDFS or the local FS, depending on configuration), local-FS-only, or NDFS-only (if any). Config attribute names should consistently indicate this.
3. some clues as to how much space might be needed (some of this is in transition, however).
4. when the space is exhausted, the error message should indicate the path(s) in question and the config param that is used to specify it.

Separately, I am preparing a patch that will do (4) for mapred.local.dir.
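
To illustrate the point about error messages naming both the path and the config parameter, here is a minimal sketch in plain Java. It is not the patch mentioned above and not Nutch code: the class DiskSpaceErrors and the method describe are hypothetical, and pairing /tmp/nutch with mapred.local.dir is only an assumption made for the example.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper for illustration; not part of Nutch. */
public class DiskSpaceErrors {

    /**
     * Wraps a low-level I/O failure in a message that names the directory
     * being written to and the config parameter that selects it.
     */
    public static IOException describe(IOException cause, File dir, String configKey) {
        String msg = "Write failed under " + dir.getAbsolutePath()
                + " (set by config parameter '" + configKey + "', "
                + dir.getUsableSpace() + " bytes usable): "
                + cause.getMessage();
        // Keep the original exception as the cause so the stack trace survives.
        return new IOException(msg, cause);
    }

    public static void main(String[] args) {
        IOException low = new IOException("No space left on device");
        // Assumed pairing of path and parameter, for illustration only.
        IOException wrapped = describe(low, new File("/tmp/nutch"), "mapred.local.dir");
        System.out.println(wrapped.getMessage());
    }
}
```

With a message of this shape, a sysadmin who hits "No space left on device" can see which directory filled up and which setting to change, without reading the source.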

> Specify temp/working directory for crawl
> ----------------------------------------
>          Key: NUTCH-159
>          URL:
>      Project: Nutch
>         Type: Bug
>   Components: fetcher, indexer
>     Versions: 0.8-dev
>  Environment: Linux/Debian
>     Reporter: byron miller

> I ran a crawl of 100k web pages and got:
> org.apache.nutch.fs.FSError: No space left on device
>         at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(
>         at org.apache.nutch.fs.FileUtil.copyContents(
>         at org.apache.nutch.fs.LocalFileSystem.renameRaw(
>         at org.apache.nutch.fs.NutchFileSystem.rename(
>         at org.apache.nutch.mapred.LocalJobRunner$
> Caused by: No space left on device
>         at Method)
>         at
>         at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(
>         ... 4 more
> Exception in thread "main" Job failed!
>         at org.apache.nutch.mapred.JobClient.runJob(
>         at org.apache.nutch.crawl.Fetcher.fetch(
>         at org.apache.nutch.crawl.Crawl.main(
> byron@db02:/data/nutch$ df -k
> It appears crawl created a /tmp/nutch directory that filled up even though I specified a db directory.
> Need to add a parameter to the command line or make a globally configurable /tmp (work area) for the nutch instance so that crawls won't fail.
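
The request for a configurable work area could look roughly like the sketch below: read a directory from a configuration property and fall back to a nutch subdirectory of the JVM temp directory when it is unset. This is an assumption-laden illustration, not how Nutch resolves its directories: it uses java.util.Properties instead of Nutch's own configuration classes, WorkDirResolver is a hypothetical name, and /data/nutch/tmp is a made-up path.

```java
import java.io.File;
import java.util.Properties;

/** Hypothetical sketch; does not use Nutch's own configuration classes. */
public class WorkDirResolver {

    /** Returns the directory named by key, or <java.io.tmpdir>/nutch if the key is unset or blank. */
    public static File resolve(Properties conf, String key) {
        String configured = conf.getProperty(key);
        if (configured != null && !configured.trim().isEmpty()) {
            return new File(configured.trim());
        }
        // Fallback: a "nutch" directory under the JVM's temp directory.
        return new File(System.getProperty("java.io.tmpdir"), "nutch");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Made-up path on a larger volume, for illustration only.
        conf.setProperty("mapred.local.dir", "/data/nutch/tmp");
        System.out.println(resolve(conf, "mapred.local.dir"));
        System.out.println(resolve(new Properties(), "mapred.local.dir"));
    }
}
```

Pointing the property at a volume with enough free space would keep a crawl from filling /tmp, provided the code that creates the working files actually consults that property.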

This message is automatically generated by JIRA.