nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2281) Support non-default FileSystem
Date Thu, 06 Apr 2017 10:53:41 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958729#comment-15958729 ]

Hudson commented on NUTCH-2281:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3420 (See [https://builds.apache.org/job/Nutch-trunk/3420/])
NUTCH-2281 Support non-default FileSystem (snagel: [https://github.com/apache/nutch/commit/faed27af5b2c471610af93e2cb45f551615bd922])
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbMerger.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/NodeReader.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/crawl/LinkDb.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) src/java/org/apache/nutch/tools/FileDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/java/org/apache/nutch/util/LockUtil.java
Adapt NUTCH-2336 to NUTCH-2281 (snagel: [https://github.com/apache/nutch/commit/330532175f751e7c977fb8549c048fc9cf4bd10d])
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
NUTCH-2281 Support non-default file system - fix install of CrawlDb for (snagel: [https://github.com/apache/nutch/commit/5dcd7b13f450561a7b34bb6761041150c84bfdab])
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Support non-default FileSystem
> ------------------------------
>
>                 Key: NUTCH-2281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2281
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>             Fix For: 1.14
>
>
> If a path (input or output) does not belong to the configured default FileSystem, various
> Nutch tools may raise an exception like
> {noformat}
>   Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., expected: hdfs://...
> {noformat}
> This is fixed by getting a reference to the FileSystem from the Path object
> {noformat}
>   FileSystem fs = path.getFileSystem(getConf());
> {noformat}
> instead of
> {noformat}
>   FileSystem fs = FileSystem.get(getConf());
> {noformat}
> A given path (e.g., {{s3a://...}}) may not belong to the default file system ({{hdfs://}}
> or {{file://}} in local mode), and simple checks such as {{fs.exists(path)}} will then fail.
> Cf. [FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
> and [FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
> vs. [FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)]
> which is called by [Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
>
> Note that the FileSystem for input and output may be different, e.g., read from HDFS
> and write to S3.
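
To illustrate why the file system has to be taken from the path, here is a minimal sketch in plain Java that models the check behind the "Wrong FS" error. It is not Hadoop code: the class WrongFsSketch, the nested ModelFs, and the helpers getDefault and forPath are hypothetical stand-ins for FileSystem, FileSystem.get(conf), and Path.getFileSystem(conf), and the host and bucket names are made up. The comparison of scheme and authority is a simplification of what the real checkPath does.

```java
import java.net.URI;

public class WrongFsSketch {

    /** Hypothetical, simplified stand-in for a file system bound to one scheme and authority. */
    public static final class ModelFs {
        private final URI root;

        ModelFs(URI root) {
            this.root = root;
        }

        /** Rejects paths whose scheme or authority differs from this file system's (simplified). */
        public void checkPath(URI path) {
            if (path.getScheme() == null) {
                return; // a path without a scheme is resolved against this file system
            }
            boolean sameScheme = path.getScheme().equalsIgnoreCase(root.getScheme());
            String a = path.getAuthority();
            String b = root.getAuthority();
            boolean sameAuthority = (a == null) ? (b == null) : a.equalsIgnoreCase(b);
            if (!sameScheme || !sameAuthority) {
                throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + root);
            }
        }
    }

    /** Models FileSystem.get(conf): always the configured default, whatever the path is. */
    public static ModelFs getDefault(URI defaultFsUri) {
        return new ModelFs(defaultFsUri);
    }

    /** Models path.getFileSystem(conf): the file system is derived from the path itself. */
    public static ModelFs forPath(URI path) {
        return new ModelFs(path.resolve("/")); // keeps scheme and authority, drops the rest
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020/");             // assumed default FS
        URI output = URI.create("s3a://example-bucket/crawl/crawldb");   // assumed output path

        forPath(output).checkPath(output); // accepted: the file system matches the path
        try {
            getDefault(defaultFs).checkPath(output); // rejected: default FS is hdfs://
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, a file system obtained from the path always accepts that path, while the default file system only accepts paths that share its scheme and authority. The same reasoning applies separately to input and output paths when they live on different file systems.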



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
