nutch-dev mailing list archives

From "Phoebe Miller (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-7) analyze tool takes up all the disk space when there are circular links
Date Fri, 11 Mar 2005 04:34:53 GMT
     [ http://issues.apache.org/jira/browse/NUTCH-7?page=history ]

Phoebe Miller updated NUTCH-7:
------------------------------

    Description: 
It is repeatable by running an instance with these seeds:
http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
http://www.acf.hhs.gov/programs/ofs/

and limit it (for best effect) to just:
*.acf.hhs.gov/*

Let it run for about 12 cycles to build up the db; the temp file size roughly doubles
with each segment.

]$ ls -l /db/tmpdir2344la/
...

1503641425 Mar 10 17:42 scoreEdits.0.unsorted

for a very small db:

Stats for net.nutch.db.WebDBReader@89cf1e
-------------------------------
Number of pages: 6916
Number of links: 8085

scoreEdits.0.sorted.0 contains rows of links that look like the first seed URL, but with
more grants/ and data/ components in the subdirectories.


In the file DistributedAnalysisTool.java, the check at lines 345-347:
    if (curIndex - startIndex > extent) {
        break;
    }
is the hard stop.

Further down, at lines 381-385, the score is written:
    for (int i = 0; i < outLinks.length; i++) {
    ...
        scoreWriter.append(outLinks[i].getURL(), score);

Putting a check here stops the growth of the tmpdir.../scoreEdits.0 file,
but the links themselves should not be produced during generation either.
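
The report does not say what kind of check was tried. As one possible illustration only, the sketch below flags URLs in which a single path segment repeats many times, which is the pattern seen in the first seed URL. The class OutlinkGuard, the method looksCircular, and the threshold of 3 are hypothetical names and values chosen for this example; they are not part of Nutch, and the code needs Java 8 or later.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Nutch.
class OutlinkGuard {

    // Returns true when any single path segment occurs more than
    // maxRepeats times in the URL's path.
    static boolean looksCircular(String url, int maxRepeats) {
        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            // Assumption: an unparseable URL is not judged here.
            return false;
        }
        if (path == null) {
            return false; // opaque URIs have no path to inspect
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            int seen = counts.merge(segment, 1, Integer::sum);
            if (seen > maxRepeats) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm",
            "http://www.acf.hhs.gov/programs/ofs/"
        };
        for (String url : urls) {
            System.out.println(looksCircular(url, 3) + "  " + url);
        }
    }
}
```

A guard of this kind could be called on outLinks[i].getURL() before the scoreWriter.append call, skipping the append when it returns true. It is a heuristic: a legitimate URL that repeats a segment more often than the threshold would also be skipped, so the threshold would need tuning, and it does not address the second point above, that such links should not be generated in the first place.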



  was:
It is repeatable by running an instance with these seeds:
http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
http://www.acf.hhs.gov/programs/ofs/

and limit it (for best effect) to just:
*.acf.hhs.gov/*

Let it go for about 12 cycles to build it up and the temp file size roughly doubles with each
segment.

]$ ls -l /db/tmpdir2344la/

...

1503641425 Mar 10 17:42 scoreEdits.0.unsorted

for a very small db:

Stats for net.nutch.db.WebDBReader@89cf1e
-------------------------------
Number of pages: 6916
Number of links: 8085
.




> analyze tool takes up all the disk space when there are circular links
> ----------------------------------------------------------------------
>
>          Key: NUTCH-7
>          URL: http://issues.apache.org/jira/browse/NUTCH-7
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>  Environment: analyze runs for an excessive amount of time and creates huge temp files until it runs out of disk space (if you let the db grow)
>     Reporter: Phoebe Miller

>
> It is repeatable by running an instance with these seeds:
> http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
> http://www.acf.hhs.gov/programs/ofs/
> and limit it (for best effect) to just:
> *.acf.hhs.gov/*
> Let it go for about 12 cycles to build it up and the temp file size roughly doubles with each segment.
> ]$ ls -l /db/tmpdir2344la/
> ...
> 1503641425 Mar 10 17:42 scoreEdits.0.unsorted
> for a very small db:
> Stats for net.nutch.db.WebDBReader@89cf1e
> -------------------------------
> Number of pages: 6916
> Number of links: 8085
> scoreEdits.0.sorted.0 contains rows of links that looked like the first seed url, but with more grants/ and data/ in the sub dirs.
> In the File:
> .DistributedAnalysisTool.java
>  345                     if (curIndex - startIndex > extent) {
>  346                         break;
>  347                     }
> is the hard stop.
> Further down the score is written:
> 381  for (int i = 0; i < outLinks.length; i++) {
> ...
> 385     scoreWriter.append(outLinks[i].getURL(), score);
> Putting a check here stops the tmpdir.../scoreEdits.0 file growth
> but the links themselves should not be produced in the generation either.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira

