nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
Date Wed, 25 Apr 2012 21:26:18 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262121#comment-13262121 ]

Lewis John McGibbney commented on NUTCH-1340:
---------------------------------------------

Hi Ferdy. I am +1 for this going into 2.0. If you could do your usual and provide a small
Javadoc comment for the new method you introduce, that would be great.
                
> Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1340
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1340
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1340-v1.txt
>
>
> After applying GORA-120 (which is already a huge performance boost by itself), one of the
> major bottlenecks of the DbUpdaterReducer is the deletion of the markers. The update reducer
> simply sets every row to delete its markers. A lot of rows do not actually have the markers,
> but the deletes are fired off in any case. Because the markers are already part of the
> input, a simple check to see whether they exist greatly improves performance.
> In particular, this is very expensive in HBase, because every single Delete immediately
> triggers a connection to the regionservers. (Deletes ignore the "autoflush=false" directive.)
> Although deletes can be done in batch, this is currently not supported by Gora. For one, it
> is very difficult to implement in the current HBaseStore with regard to multithreading, and
> secondly, I noticed performance did not increase significantly.
> Performance debugging on a real-life cluster shows that this currently seems to be the
> biggest bottleneck of the DbUpdaterReducer. (Remember: only after applying GORA-120.)
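
The check described in the issue can be illustrated with a minimal, self-contained sketch.
This is not the attached patch and it does not use the Nutch, Gora or HBase APIs. The class
MarkerCleanupSketch, the CountingStore stand-in and the marker names "generate", "fetch" and
"parse" are all hypothetical, and a row is modelled as a plain Map. It only shows how checking
the input row first reduces the number of delete requests sent to the store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarkerCleanupSketch {

    // Hypothetical marker column names, chosen for illustration only.
    public static final String[] MARKERS = {"generate", "fetch", "parse"};

    /** Stand-in for a data store: it only records each delete it is asked to do. */
    public static class CountingStore {
        public final List<String> deletesIssued = new ArrayList<>();

        public void deleteMarker(String rowKey, String marker) {
            deletesIssued.add(rowKey + "/" + marker);
        }
    }

    /** Behaviour described as the bottleneck: request a delete for every marker on every row. */
    public static void removeAllMarkers(CountingStore store, String rowKey, Map<String, String> row) {
        for (String marker : MARKERS) {
            row.remove(marker);
            store.deleteMarker(rowKey, marker);
        }
    }

    /** Proposed behaviour: request a delete only for markers present on the input row. */
    public static void removeExistingMarkers(CountingStore store, String rowKey, Map<String, String> row) {
        for (String marker : MARKERS) {
            if (row.containsKey(marker)) {
                row.remove(marker);
                store.deleteMarker(rowKey, marker);
            }
        }
    }

    public static void main(String[] args) {
        // One row carries a single marker, the other carries none.
        Map<String, String> withMarker = new HashMap<>();
        withMarker.put("fetch", "batch-1");
        Map<String, String> withoutMarkers = new HashMap<>();

        CountingStore unconditional = new CountingStore();
        removeAllMarkers(unconditional, "row1", new HashMap<>(withMarker));
        removeAllMarkers(unconditional, "row2", new HashMap<>(withoutMarkers));

        CountingStore conditional = new CountingStore();
        removeExistingMarkers(conditional, "row1", new HashMap<>(withMarker));
        removeExistingMarkers(conditional, "row2", new HashMap<>(withoutMarkers));

        System.out.println("unconditional deletes: " + unconditional.deletesIssued.size());
        System.out.println("conditional deletes:   " + conditional.deletesIssued.size());
    }
}
```

With one marked row and one unmarked row, the unconditional version requests six deletes and
the conditional version requests one. In a store where each delete costs a round trip, as the
issue reports for HBase, that difference is where the saving would come from.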

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
