nutch-dev mailing list archives

From "misc" <m...@robotgenius.net>
Subject Two suggestions
Date Sat, 06 Oct 2007 01:25:56 GMT

Hello All-

    Two suggested (small) changes:

Change 1

    Use case: I want a list of all ".mov" files found during a crawl, but I 
don't want to actually download them and store them in the content database 
(too much bandwidth and space!).

    Partial solution: filter them out with regex-urlfilter.  The problem is 
that no record of a filtered URL having been found during parsing is stored 
anywhere.

    Full proposed solution: Change code in ParseOutputFormat from

(line 173)

    toUrl = filters.filter(toUrl);   // filter the url
    if (toUrl == null) {
      continue;
    }

to (the new line 173)

    if (filters.filter(toUrl) == null) {   // filter the url
      LOG.debug("filtering out " + toUrl);
      continue;
    }

    This way, all filtered-out URLs can be recorded by changing the log level 
to debug.  This is also useful for verifying that URLs aren't accidentally 
getting thrown away in a parse.
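
    A minimal, standalone sketch of the same idea, for illustration only.  It 
is not Nutch code: the class OutlinkFilterSketch, the method filterOutlinks, 
and the use of UnaryOperator as a stand-in for the URL filter chain are all 
assumptions.  Rejected URLs are collected in a list here, where the real code 
would call LOG.debug.  Unlike the snippet above, this version keeps the value 
returned by the filter, in case a filter ever returns something other than 
its input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the outlink loop in ParseOutputFormat; not Nutch code.
class OutlinkFilterSketch {

    // Applies the filter to each URL. Accepted URLs are returned (as the filter
    // returned them); every rejected URL is appended to `rejected`.
    static List<String> filterOutlinks(List<String> urls,
                                       UnaryOperator<String> filter,
                                       List<String> rejected) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : urls) {
            String filtered = filter.apply(toUrl);   // null means "filtered out"
            if (filtered == null) {
                rejected.add(toUrl);   // the real code would log at debug level here
                continue;
            }
            kept.add(filtered);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed filter for the demo: reject anything ending in ".mov".
        UnaryOperator<String> noMovies = url -> url.endsWith(".mov") ? null : url;
        List<String> rejected = new ArrayList<>();
        List<String> kept = filterOutlinks(
            Arrays.asList("http://example.com/a.html", "http://example.com/clip.mov"),
            noMovies, rejected);
        System.out.println("kept: " + kept);
        System.out.println("filtered out: " + rejected);
    }
}
```

    The point is only that the original URL has to be held in a separate 
variable from the filter result, otherwise there is nothing left to log once 
the filter returns null.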

Change 2

    Add pdf to the default regex-urlfilter removal list.  There doesn't 
seem to be any pdf parser (yet), and my output logs are filled with errors 
about this.
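
    To illustrate the kind of suffix rule meant here, a small sketch using 
java.util.regex.  The class name SuffixFilterSketch and the shortened suffix 
list are assumptions made for the example; the actual default exclusion list 
in regex-urlfilter covers more extensions than shown.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a suffix-exclusion rule; not the Nutch plugin itself.
class SuffixFilterSketch {

    // Assumed, shortened suffix list with pdf added for the example.
    static final Pattern SKIP = Pattern.compile("\\.(mov|MOV|pdf|PDF)$");

    // Returns true when the URL does not end in one of the excluded suffixes.
    static boolean accepts(String url) {
        return !SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/paper.pdf"));
        System.out.println(accepts("http://example.com/pdf-guide.html"));
    }
}
```

    Because the pattern is anchored with "$", only URLs that end in the 
suffix are excluded; a URL that merely contains "pdf" somewhere in its path 
still passes.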

                        thanks
                            -Jim

