nutch-dev mailing list archives

From "Michela Becchi" <mbec...@nec-labs.com>
Subject Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.
Date Tue, 18 May 2010 14:18:23 GMT
Hello,

I am crawling a local file system with Nutch.

My problem is the following: all files whose names contain percent-encoded
hexadecimal sequences (such as %28 and %29) do not get crawled.

For example, I see the following error:

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
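
One way to narrow this down is to check whether the file exists on disk under the raw name (with a literal %28) or under the percent-decoded name (with a parenthesis). The sketch below is only a diagnostic aid: it assumes, without confirmation from the Nutch source, that the 404 may come from the path being percent-decoded before the file lookup. The class name PercentDecodeCheck and the method decodedPath are hypothetical names chosen for illustration.

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeCheck {

    /**
     * Decodes percent-escapes in a path as UTF-8, so "%28" becomes "(".
     * Note that URLDecoder also turns '+' into a space, which is fine for
     * this diagnostic but is not a general-purpose URL path decoder.
     */
    static String decodedPath(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Default path is taken from the error message above (without the "file:" scheme).
        String raw = args.length > 0
                ? args[0]
                : "/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        String decoded = decodedPath(raw);

        // Report which of the two candidate names is actually present on disk.
        System.out.println("raw:     " + raw + " exists=" + new File(raw).exists());
        System.out.println("decoded: " + decoded + " exists=" + new File(decoded).exists());
    }
}
```

If only the raw name exists while the fetcher looks up the decoded one, that mismatch would be consistent with the 404, although it would not by itself prove where the decoding happens.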

 

I am using nutch-1.0.

Among other standard settings, I configured nutch-site.xml as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

In addition, my crawl-urlfilter.txt looks like this:

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*
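
To rule out the URL filter as the cause, the rules above can be replayed against the failing URL outside of Nutch. The sketch below is a stand-alone approximation, not Nutch's own filter code: it assumes that rules are tried in file order, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The class name UrlFilterCheck and the method accepts are hypothetical.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {

    // Active rules copied from the crawl-urlfilter.txt above, in file order.
    // The first character is the sign: '+' accepts, '-' rejects.
    static final String[] RULES = {
        "-^(http|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
        "-[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+.*"
    };

    /** Returns true if the first matching rule is an accept rule. */
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            // Assumption: a rule matches when its pattern is found anywhere in the URL.
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // assumption: no matching rule means the URL is rejected
    }

    public static void main(String[] args) {
        String url = args.length > 0
                ? args[0]
                : "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        System.out.println(url + " accepted=" + accepts(url));
    }
}
```

If the failing URL is reported as accepted, the filter rules are unlikely to be what keeps these files from being crawled, which is also consistent with the fetcher having attempted the fetch at all.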

---

Thanks,

Michela