nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2281) Support non-default FileSystem
Date Tue, 21 Jun 2016 12:35:57 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341680#comment-15341680 ]

Sebastian Nagel commented on NUTCH-2281:
----------------------------------------

I tried to fix all tools but haven't tested all of them yet.  Yes, there may be some I've
overlooked :(.  I didn't fix unit tests, rarely used tools (Benchmark, DmozParser), or
main() methods which are intended for debugging or explicitly take the file system as an
argument (ParseData, ParseText).  I'll continue testing over the next few days, but help is
welcome!

> Support non-default FileSystem
> ------------------------------
>
>                 Key: NUTCH-2281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2281
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>             Fix For: 1.13
>
>
> If a path (input or output) does not belong to the configured default FileSystem various
> Nutch tools may raise an exception like
> {noformat}
>   Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., expected: hdfs://...
> {noformat}
> This is fixed by getting a reference to the FileSystem from the Path object
> {noformat}
>   FileSystem fs = path.getFileSystem(getConf());
> {noformat}
> instead of
> {noformat}
>   FileSystem fs = FileSystem.get(getConf());
> {noformat}
> A given path (e.g., {{s3a://...}}) may not belong to the default file system ({{hdfs://}}
> or {{file://}} in local mode), and simple checks such as {{fs.exists(path)}} will then fail.
> Cf. [FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
> and [FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
> vs. [FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)],
> which is called by [Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
> Note that the FileSystem for input and output may be different, e.g., read from HDFS
> and write to S3.
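
To make the failure mode concrete, here is a minimal sketch that models the scheme/authority check using only java.net.URI. It is not Hadoop's code: the real FileSystem.checkPath also handles default ports and canonicalization, and the class name WrongFsSketch, its methods, and the hosts "namenode:8020" and "bucket" are hypothetical names chosen for illustration.

```java
import java.net.URI;

/** Simplified model of the "Wrong FS" check; NOT Hadoop's implementation. */
final class WrongFsSketch {

    /** True if a file system rooted at fsUri could serve the given path. */
    static boolean belongsTo(URI fsUri, URI path) {
        String scheme = path.getScheme();
        if (scheme == null) {
            // A path without a scheme is resolved against this file system.
            return true;
        }
        if (!scheme.equalsIgnoreCase(fsUri.getScheme())) {
            return false;
        }
        String authority = path.getAuthority();
        return authority == null || authority.equalsIgnoreCase(fsUri.getAuthority());
    }

    /** Models the failure: reject a path that belongs to another file system. */
    static void checkPath(URI fsUri, URI path) {
        if (!belongsTo(fsUri, path)) {
            throw new IllegalArgumentException(
                "Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    /**
     * Models the idea behind path.getFileSystem(conf): derive the file
     * system from the path itself, falling back to the default only
     * when the path carries no scheme.
     */
    static URI fileSystemFor(URI defaultFs, URI path) {
        String scheme = path.getScheme();
        if (scheme == null) {
            return defaultFs;
        }
        String authority = path.getAuthority();
        return URI.create(authority == null
            ? scheme + ":///"
            : scheme + "://" + authority);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020"); // hypothetical default FS
        URI output = URI.create("s3a://bucket/crawl/crawldb"); // hypothetical output path

        // Resolving the file system from the path passes the check.
        checkPath(fileSystemFor(defaultFs, output), output);

        // Checking the same path against the default file system fails.
        try {
            checkPath(defaultFs, output);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, a path is only ever checked against the file system derived from that same path, so input and output paths can each resolve to a different file system without tripping the check.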



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
