nutch-dev mailing list archives

From "Joseph Naegele (JIRA)" <>
Subject [jira] [Created] (NUTCH-2279) LinkRank fails when using Hadoop MR output compression
Date Tue, 14 Jun 2016 19:34:09 GMT
Joseph Naegele created NUTCH-2279:

             Summary: LinkRank fails when using Hadoop MR output compression
                 Key: NUTCH-2279
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.11
            Reporter: Joseph Naegele

When MapReduce job output compression is enabled, i.e. {{mapreduce.output.fileoutputformat.compress=true}},
LinkRank can't read the results of its {{Counter}} MR job, because the compression codec appends
an extra extension to the generated output filename.

For example, using the default compression codec (which appears to be DEFLATE), the counter
file is written to {{crawl/webgraph/_num_nodes_/part-00000.deflate}}. Then, the LinkRank job
attempts to manually read this file to obtain the number of links using the following code:
FSDataInputStream readLinks = fs.open(new Path(numLinksPath, "part-00000"));
which fails because the file {{part-00000}} doesn't exist:
LinkAnalysis: File crawl/webgraph/_num_nodes_/part-00000 does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(
        at org.apache.nutch.scoring.webgraph.LinkRank.main(
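One possible fix (a sketch only, not an actual Nutch patch) is to locate the counter file by name prefix rather than by exact name, so a codec-generated extension like {{.deflate}} still resolves. The self-contained sketch below uses {{java.nio}} to illustrate the lookup logic; inside LinkRank itself the equivalent would presumably be {{fs.globStatus()}} on a {{part-00000*}} pattern, plus Hadoop's {{CompressionCodecFactory}} to wrap the opened stream in the matching decompressor.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class PartFileResolver {

    // Find the single reducer output, whether it is "part-00000" or a
    // compressed variant such as "part-00000.deflate" or "part-00000.gz".
    static Path resolvePart(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(p -> p.getFileName().toString().startsWith("part-00000"))
                .findFirst()
                .orElseThrow(() -> new IOException("no part-00000* file in " + dir));
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a compressed counter output directory like _num_nodes_.
        Path dir = Files.createTempDirectory("_num_nodes_");
        Files.createFile(dir.resolve("part-00000.deflate"));
        System.out.println(resolvePart(dir).getFileName()); // prints part-00000.deflate
    }
}
```

After the file is resolved this way, the stream would still need to be decompressed before the link count is read, which is where {{CompressionCodecFactory.getCodec(path)}} comes in.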

To reproduce, add {{-D mapreduce.output.fileoutputformat.compress=true}} to the properties
when running {{bin/nutch linkrank ...}}.
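For example, a reproduction might look like the following (assuming an existing webgraph at {{crawl/webgraph}}; {{-webgraphdb}} is LinkRank's standard argument, and the {{-D}} property is passed through Hadoop's ToolRunner):

```shell
# Hypothetical reproduction: enable MR output compression for the LinkRank run.
bin/nutch linkrank -Dmapreduce.output.fileoutputformat.compress=true -webgraphdb crawl/webgraph
```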

This message was sent by Atlassian JIRA
