nutch-dev mailing list archives

From "Michela Becchi (JIRA)" <>
Subject [jira] Created: (NUTCH-824) Crawling - File Error 404 when fetching a file with a hexadecimal character in the file name.
Date Thu, 20 May 2010 20:27:18 GMT
Crawling - File Error 404 when fetching a file with a hexadecimal character in the file name.

                 Key: NUTCH-824
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
         Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux
            Reporter: Michela Becchi
            Priority: Blocker


I am crawling a local file system.
The problem is that no file whose name contains percent-encoded (hexadecimal escape)
characters gets crawled.

For example, I see the following error:

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(
        at org.apache.nutch.fetcher.Fetcher$
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed
with: org.apache.nutch.protocol.file.FileError: File Error: 404

I am using nutch-1.0.
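The %28/%29 escapes in the failing path suggest the likely cause: the file protocol plugin appears to look up the percent-encoded name literally on disk instead of decoding it first, so the filesystem lookup misses the real file and is reported as a 404. A minimal sketch of the missing decoding step (`DecodeDemo` and `decodePath` are illustrative names, not Nutch code):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper, not Nutch code: shows the decoding step the
    // protocol-file plugin appears to skip for percent-escaped file names.
    // (Note: URLDecoder also maps '+' to a space, so real code may need
    // a stricter decoder for filesystem paths.)
    static String decodePath(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "A.M._%28album%29_8a09.html";
        // "%28"/"%29" stay literal unless decoded, so a lookup with the
        // encoded string misses the on-disk file "A.M._(album)_8a09.html".
        System.out.println(decodePath(encoded));
    }
}
```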

Among other standard settings, I configured nutch-site.xml as follows:

  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
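The quoted description belongs to the plugin.includes property; for a local file system crawl that property must list the protocol-file plugin. A hypothetical sketch of the full property, assuming a typical value (the regex below is an illustration, not the reporter's actual setting):

```xml
<!-- Hypothetical nutch-site.xml fragment; the value shown is an assumption,
     not the reporter's configuration. protocol-file is required for file: URLs. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  </description>
</property>
```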


Moreover, crawl-urlfilter.txt looks like:

# skip http:, ftp:, & mailto: urls

# skip image and other suffixes we can't yet parse

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

# accept hosts in MY.DOMAIN.NAME

# accept everything else




This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
