nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "CommonCrawlDataDumper" by darrencheng
Date Wed, 01 Apr 2015 17:08:18 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "CommonCrawlDataDumper" page has been changed by darrencheng:

  bin/nutch commoncrawldump -outputDir outCommonCrawl -segment testCrawl/segments
+ If you later get an {{{OutOfMemoryError}}} while running the script,
try raising the {{{JAVA_HEAP_MAX}}} variable on line 128 of {{{bin/nutch}}} to an appropriate value.
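
The heap change described above can be sketched as follows. This is only an illustration: the value {{{-Xmx4000m}}} is an assumed example, not a recommendation, and the line number may differ between Nutch versions. The commented-out alternative assumes your copy of {{{bin/nutch}}} reads a {{{NUTCH_HEAPSIZE}}} environment variable (in MB); check the script before relying on it.

```shell
# In bin/nutch: raise the maximum JVM heap size.
# -Xmx4000m (roughly 4 GB) is an illustrative value; pick one that fits your machine.
JAVA_HEAP_MAX=-Xmx4000m

# Assumption: if your bin/nutch honors NUTCH_HEAPSIZE (in MB), the same effect
# may be possible without editing the script:
# NUTCH_HEAPSIZE=4000 bin/nutch commoncrawldump -outputDir outCommonCrawl -segment testCrawl/segments
```

Setting the value through the environment, where supported, keeps the script unmodified and makes the setting easy to change per run.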

  The {{{bin/nutch commoncrawldump}}} program dumps all Nutch segments in {{{testCrawl/segments}}}
to the {{{outCommonCrawl}}} folder, writing one CBOR-encoded file for each crawled file. The tool
prints a short report as follows:
