nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Resolved: (NUTCH-546) file URL are filtered out by the crawler
Date Mon, 10 Sep 2007 19:47:29 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-546.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Committed in rev. 574346.

Note that UrlValidator is now a plugin (urlfilter-validator) and is not enabled by default.

> file URL are filtered out by the crawler
> ----------------------------------------
>
>                 Key: NUTCH-546
>                 URL: https://issues.apache.org/jira/browse/NUTCH-546
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Windows XP
> Nutch trunk from Monday, August 20th 2007
>            Reporter: Marc Brette
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-546-validator-plugin_v1.patch, NUTCH-546.patch
>
>
> I tried to index a file system using the file:/ protocol, which worked fine in version 0.9.
> File URLs are now filtered out and not fetched at all.
> I investigated the code and found two issues:
> 1) The class UrlValidator: when validating a URL, it checks the 'authority', a combination
> of host and port. As this is null for file URLs, the URL is rejected.
> 2) Once this check is removed, files whose names contain space characters (and maybe other
> characters that must be URL-encoded) are also filtered out. This may be because the file
> protocol plugin doesn't URL-encode space characters and/or because UrlValidator enforces
> the rule that such characters be encoded.
> To work around these issues, I just commented out the UrlValidator checks, and it works fine.
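
The two symptoms described above can be reproduced outside Nutch with a minimal sketch. It uses the JDK's java.net.URI as a stand-in and is not Nutch's UrlValidator code; the class name FileUrlCheck, the helpers hasAuthority and parses, and the sample paths are all hypothetical and chosen only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; it only illustrates the behaviour described in the report.
class FileUrlCheck {

    // Assumption: a validator that insists on an authority (host[:port]) behaves like this check.
    static boolean hasAuthority(String url) {
        try {
            return new URI(url).getAuthority() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // True if the string is accepted by the strict single-argument URI parser.
    static boolean parses(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // An http URL carries an authority; a file:/ URL does not.
        System.out.println(hasAuthority("http://example.com:8080/a.html"));
        System.out.println(hasAuthority("file:/C:/docs/report.txt"));
        // A raw space is rejected by the parser; the percent-encoded form is accepted.
        System.out.println(parses("file:/C:/docs/my report.txt"));
        System.out.println(parses("file:/C:/docs/my%20report.txt"));
    }
}
```

Under these assumptions, any check that requires a non-null authority rejects every file:/ URL, and any strict syntax check rejects file names with unencoded spaces, which matches the two issues in the report.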

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

