nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable
Date Fri, 01 Jun 2007 09:14:16 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ]

Andrzej Bialecki commented on NUTCH-392:
-----------------------------------------

Good point. We can change it to use the following pattern (the one Hadoop uses internally), e.g.:

contentOut = new MapFile.Writer(job, fs, content.toString(),
    Text.class, Content.class,
    SequenceFile.getCompressionType(job), progress);
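
For context, here is a minimal sketch of an OutputFormat that forwards the Progressable it receives from the framework down to the MapFile.Writer, so slow block writes keep reporting progress and the reduce task is not killed. This is not the committed patch: it assumes a Hadoop 1.x-era org.apache.hadoop.mapred API, and the class name and output layout are illustrative.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;
import org.apache.nutch.protocol.Content;

public class ContentOutputFormatSketch {

  public RecordWriter<Text, Content> getRecordWriter(FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {
    // Illustrative layout: one MapFile per reduce task under the job output dir.
    Path dir = new Path(FileOutputFormat.getOutputPath(job), name);
    final MapFile.Writer contentOut = new MapFile.Writer(job, fs, dir.toString(),
        Text.class, Content.class,
        SequenceFile.getCompressionType(job),
        progress); // the Progressable handed in by the framework

    return new RecordWriter<Text, Content>() {
      public void write(Text key, Content value) throws IOException {
        contentOut.append(key, value);
      }
      public void close(Reporter reporter) throws IOException {
        contentOut.close();
      }
    };
  }
}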

However, the original patch had some merits, too. Some types of data are not very compressible
in themselves (using RECORD compression), i.e. compressing and decompressing them costs more
than the space savings are worth. In the case of crawl_parse and crawl_fetch it would make sense
to enforce the BLOCK or NONE compression type, and disallow RECORD.
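
A hedged sketch of that enforcement (the class and method names are illustrative, not from the patch): read the configured type and downgrade RECORD to BLOCK before opening the writer.

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapred.JobConf;

public class CompressionPolicy {
  // Returns the configured compression type, but never RECORD:
  // per-record compression costs more CPU than the space it saves on this data.
  public static CompressionType enforce(JobConf job) {
    CompressionType type = SequenceFile.getCompressionType(job);
    return (type == CompressionType.RECORD) ? CompressionType.BLOCK : type;
  }
}

The result would then replace the SequenceFile.getCompressionType(job) argument in the MapFile.Writer call above.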

I know that BLOCK compression gives better space savings and, incidentally, may increase
writing speed. But I'm not sure what the performance impact of BLOCK-compressed MapFiles
is when doing random reads - this is the scenario in LinkDbInlinks, FetchedSegments
and similar places. Could you perhaps test it? The original patch used RECORD compression
for MapFiles, probably for this reason.
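
Such a test could look roughly like the micro-benchmark below (not part of the issue; the class name is illustrative). It collects the keys of an existing MapFile sequentially, then times random gets over a shuffled key order; running it once against a RECORD-compressed and once against a BLOCK-compressed copy of the same segment would show the difference.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class RandomReadBench {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = args[0]; // path to an existing MapFile directory

    // First pass: collect all keys sequentially.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    List<Text> keys = new ArrayList<Text>();
    Text key = new Text();
    Content value = new Content();
    while (reader.next(key, value)) {
      keys.add(new Text(key));
    }

    // Second pass: time random lookups over a shuffled key order.
    Collections.shuffle(keys);
    long start = System.currentTimeMillis();
    for (Text k : keys) {
      reader.get(k, value);
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(keys.size() + " random reads in " + elapsed + " ms");
    reader.close();
  }
}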

> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-392.patch
>
>
> OutputFormat implementations should pass the Progressable they are passed to underlying
> SequenceFile implementations. This will keep reduce tasks from timing out when block writes
> are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

