nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2635) Generator writes unneeded temporary output
Date Thu, 16 Aug 2018 19:27:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582971#comment-16582971 ]

ASF GitHub Bot commented on NUTCH-2635:
---------------------------------------

sebastian-nagel opened a new pull request #376: NUTCH-2635 Generator writes unneeded temporary output
URL: https://github.com/apache/nutch/pull/376
 
 
   - output is written to MultipleOutputs, skip context.write(...)
   - fix comment wrapping
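
The effect of the first change can be illustrated with a minimal sketch. It is a
simplified model and deliberately does not use the Hadoop API: the class
SelectorOutputSketch, its Output type, and the methods writeDefault/writeNamed are
hypothetical stand-ins for context.write(...) and MultipleOutputs, and the sink name
"fetchlist-1" is borrowed from the directory listing in the issue. The sketch only
shows why writing each record to both sinks doubles the number of records in the
temporary output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model, NOT the Hadoop API: every name here is a hypothetical stand-in.
final class SelectorOutputSketch {

  /** Models a job's output: one default sink plus named side outputs. */
  static final class Output {
    // Stand-in for the default output file (part-r-00000 in the issue).
    final List<String> defaultSink = new ArrayList<>();
    // Stand-in for the named outputs (fetchlist-N/ in the issue).
    final Map<String, List<String>> namedSinks = new HashMap<>();

    // Plays the role of context.write(...).
    void writeDefault(String record) {
      defaultSink.add(record);
    }

    // Plays the role of a write through MultipleOutputs.
    void writeNamed(String name, String record) {
      namedSinks.computeIfAbsent(name, k -> new ArrayList<>()).add(record);
    }

    // Total number of records written across all sinks.
    int totalRecords() {
      int n = defaultSink.size();
      for (List<String> sink : namedSinks.values()) {
        n += sink.size();
      }
      return n;
    }
  }

  /** Before the change: each record goes to the named sink and the default sink. */
  static Output reduceWritingTwice(List<String> urls) {
    Output out = new Output();
    for (String url : urls) {
      out.writeNamed("fetchlist-1", url);
      out.writeDefault(url); // redundant second write
    }
    return out;
  }

  /** After the change: each record goes to the named sink only. */
  static Output reduceWritingOnce(List<String> urls) {
    Output out = new Output();
    for (String url : urls) {
      out.writeNamed("fetchlist-1", url);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList("http://a.example/", "http://b.example/");
    System.out.println("before: " + reduceWritingTwice(urls).totalRecords() + " records");
    System.out.println("after:  " + reduceWritingOnce(urls).totalRecords() + " records");
  }
}
```

In this model the default sink stays empty once the second write is dropped, which
corresponds to the unneeded plain-text file described in the issue.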

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Generator writes unneeded temporary output
> ------------------------------------------
>
>                 Key: NUTCH-2635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2635
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> Generator writes the temporary output of the Selector job/step twice (see [line 516|https://github.com/apache/nutch/blob/branch-1.15/src/java/org/apache/nutch/crawl/Generator.java#L516]).
> This is not a big issue when generating small fetch lists, but it may be when working on
> large data. The temporary output looks like:
> {noformat}
> % tree -h generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
> |-- [4.0K]  fetchlist-1
> |   `-- [ 25M]  part-r-00000
> `-- [ 77M]  part-r-00000
> 1 directory, 2 files
> % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000: ASCII text
> % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000: Apache Hadoop Sequence file version 6
> {noformat}
> The unneeded output is plain text, which explains its larger size compared to the Hadoop
> Sequence file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
