nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2463) Enable sampling CrawlDB
Date Tue, 28 Nov 2017 11:07:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268573#comment-16268573 ]

ASF GitHub Bot commented on NUTCH-2463:
---------------------------------------

YossiTamari commented on issue #243: NUTCH-2463 - Enable sampling CrawlDB
URL: https://github.com/apache/nutch/pull/243#issuecomment-347489658
 
 
   Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Enable sampling CrawlDB
> -----------------------
>
>                 Key: NUTCH-2463
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2463
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>            Reporter: Yossi Tamari
>            Priority: Minor
>             Fix For: 1.14
>
>
> CrawlDB can grow to contain billions of records. When that happens *readdb -dump* is
> pretty useless, and *readdb -topN* can run for ages (and does not provide a statistically
> correct sample).
> We should add a parameter *-sample* to *readdb -dump* which is followed by a number between
> 0 and 1, and only that fraction of records from the CrawlDB will be processed.
> The sample should be statistically random, and all the other filters should be applied
> on the sampled records.
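
For illustration only, the sampling behaviour requested in the description above could be sketched roughly as follows. This is not the code from the pull request: Nutch itself is written in Java, and the names here (`sample_records`, `Record`, the `keep` filter, the `seed` argument) are hypothetical. The sketch assumes each record is kept independently with probability equal to the given fraction, and that any other filter runs only on the records that survive sampling.

```python
import random
from typing import Callable, Iterable, Iterator, Optional, Tuple

# (url, status): a hypothetical stand-in for a CrawlDB entry.
Record = Tuple[str, str]


def sample_records(
    records: Iterable[Record],
    fraction: float,
    keep: Optional[Callable[[Record], bool]] = None,
    seed: Optional[int] = None,
) -> Iterator[Record]:
    """Yield each record with probability `fraction`, then apply `keep`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for record in records:
        # Independent coin flip per record: a single streaming pass,
        # with no sorting and no need to know the total record count.
        if rng.random() >= fraction:
            continue
        # Other filters are applied only to the sampled records.
        if keep is None or keep(record):
            yield record
```

With this approach the number of sampled records is only approximately `fraction` times the input size, since every record is decided independently. In exchange, the sample needs one pass over the data and no global state, which is what makes it practical for very large inputs.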



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
