nutch-dev mailing list archives

From "Hiran Chaudhuri (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2451) MalformedURLExceptions on perfectly looking URLs?
Date Tue, 07 Nov 2017 11:41:00 GMT


Hiran Chaudhuri commented on NUTCH-2451:

Let's assume no suitable URLStreamHandler is registered. 

The PluginRepository, which carries my proposed changes from NUTCH-2429, is registered
as URLStreamHandlerFactory. So it is definitely involved when the ftp:// URL is constructed.
There it either finds a suitable URLStreamHandler provided by a plugin, or it
falls back to the JVM default handlers, which can definitely handle ftp:// URLs. That
a suitable URLStreamHandler is found, either by the URLStreamHandlerFactory or by the
JVM, is evident: I only provided the ftp://nas URL, and Nutch crawled successfully until it found
the offending URL ftp://nas/MediaPC/usr/lib32/gconv/. That would not have worked
if FTP support were missing completely.
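
To illustrate the fallback described above, here is a minimal sketch of a URLStreamHandlerFactory that only knows handlers contributed from elsewhere and returns null for everything else. It is not the PluginRepository code from NUTCH-2429. The class name FallbackHandlerFactory, the register method, the "demo" protocol and the stub handler are all hypothetical and exist only for this example. The one thing it relies on is that java.net.URL, when the installed factory returns null, goes on to look for a built-in handler.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a plugin-aware factory; not Nutch code.
public class FallbackHandlerFactory implements URLStreamHandlerFactory {

    // Handlers contributed by "plugins", keyed by protocol name.
    private final Map<String, URLStreamHandler> pluginHandlers = new ConcurrentHashMap<>();

    public void register(String protocol, URLStreamHandler handler) {
        pluginHandlers.put(protocol, handler);
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        // Returning null tells java.net.URL to continue with the JVM's own handlers.
        return pluginHandlers.get(protocol);
    }

    // Stub handler for the example: URLs parse, but connections are refused.
    static URLStreamHandler stubHandler() {
        return new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) throws IOException {
                throw new IOException("not implemented in this sketch");
            }
        };
    }

    public static void main(String[] args) throws Exception {
        FallbackHandlerFactory factory = new FallbackHandlerFactory();
        factory.register("demo", stubHandler()); // "demo" is a made-up protocol

        // The JVM allows this call only once per process.
        URL.setURLStreamHandlerFactory(factory);

        // Served by the registered stub handler.
        System.out.println(new URL("demo://nas/share/").getProtocol());
        // The factory returns null for ftp, so the JVM default handler is used.
        System.out.println(new URL("ftp://nas/MediaPC/usr/lib32/gconv/").getProtocol());
    }
}
```

If constructing the ftp:// URL in main succeeds without a registered ftp handler, the JVM fallback is doing the work, which is the behaviour I am arguing for above.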

Upon further analysis I find that the stack trace points to the source code of org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(
which boils down to
{{u = new URL(response.getHeader("Location"));}}
This means the URL that gets constructed is not the FTP URL we see in the output but the value
of a header, which may not have been set by the protocol-ftp plugin.
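
A small sketch of why an unset header would produce exactly this kind of trace. As far as I know, on OpenJDK 8 passing null to the java.net.URL constructor does not surface as a plain NullPointerException. The constructor wraps it into a MalformedURLException with the NullPointerException as its cause. The class NullLocationDemo and the method tryParse are hypothetical names for this example, and the null argument stands in for a getHeader("Location") call that returned nothing, which is my assumption about what happens here.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; not part of Nutch.
public class NullLocationDemo {

    // Returns the exception raised while parsing spec, or null if parsing succeeded.
    static Throwable tryParse(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // Assumption: simulates a "Location" header that was never set.
        Throwable fromNull = tryParse(null);
        System.out.println("exception: " + fromNull.getClass().getName());
        System.out.println("cause:     " + fromNull.getCause());

        // The URL from the log parses without any exception.
        System.out.println("ftp URL:   " + tryParse("ftp://nas/MediaPC/usr/lib32/gconv/"));
    }
}
```

So the MalformedURLException in the log would say nothing about the ftp:// URL itself, which parses fine.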

Therefore I do not agree that NUTCH-2429 could be related to this problem, let alone be its cause.

> MalformedURLExceptions on perfectly looking URLs?
> -------------------------------------------------
>                 Key: NUTCH-2451
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.13
>         Environment: Ubuntu 16.04.3 LTS
> OpenJDK 1.8.0_131
> nutch 1.14-SNAPSHOT
> Synology RS816
>            Reporter: Hiran Chaudhuri
> I tried running Nutch on my Synology NAS. As SMB protocol is not contained in Nutch,
I turned on FTP service on the NAS and configured Nutch to crawl ftp://nas.
> The experience gives me varying results which seem to point to problems within Nutch.
However this may need further evaluation.
> As some files could not be downloaded and I could not see a good error message, I changed
the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to not only
return the protocol status but also send the full exception and stack trace to the logs:
> {{    } catch (Exception e) {
>       LOG.warn("Could not get {}", url, e);
>       return new ProtocolOutput(null, new ProtocolStatus(e));
>     }
> }}
> With this modification I suddenly see such messages in the logfile:
> {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/MediaPC/usr/lib32/gconv/
> 2017-10-25 22:09:32,147 WARN  org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/lib32/gconv/
> 	at<init>(
> 	at<init>(
> 	at<init>(
> 	at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(
> 	at
> Caused by: java.lang.NullPointerException
> }}
> Please note that the URL was not configured by me. Instead it was obtained by crawling my
NAS. Also, the URL looks perfectly fine to me. Even if the file did not exist, I would not expect
a MalformedURLException to occur. Moreover, using Firefox with the same authentication data
on the same URL retrieves the file successfully.
> How come Nutch cannot get the file?
