nutch-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
Date Fri, 29 May 2009 02:42:45 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714277#action_12714277 ]

Ken Krugler commented on NUTCH-739:
-----------------------------------

There's another approach that works well here, and that's to start up a thread that calls
the Hadoop reporter while the optimize is happening.

We ran into the same issue when optimizing large Lucene indexes from our Bixo IndexScheme
tap for Cascading. You can find that code on GitHub, but the skeleton is to do something like
this in the reducer's close() method - assuming you've stashed the reporter from the reduce()
call:

{code:java}
// Keep-alive thread: Hadoop needs to know we're still working on it,
// so ping the reporter every 10 seconds until interrupted.
// ("reporter" is the org.apache.hadoop.mapred.Reporter stashed from reduce().)
Thread reporterThread = new Thread() {
	public void run() {
		while (!isInterrupted()) {
			reporter.progress();
			try {
				sleep(10 * 1000);
			} catch (InterruptedException e) {
				// Re-set the interrupt flag so the loop exits.
				interrupt();
			}
		}
	}
};
reporterThread.start();

indexWriter.optimize();
// ... and other lengthy tasks here ...
reporterThread.interrupt();
{code}
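
In case it helps, here's a rough sketch (not the actual Bixo code) of how the reporter might be stashed from reduce() for later use in close(), assuming the old org.apache.hadoop.mapred API that Nutch 1.x uses; the class name, key/value types and field are just placeholders:

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OptimizingReducer extends MapReduceBase
		implements Reducer<Text, Text, Text, Text> {

	// Stashed so the keep-alive thread in close() can call progress().
	private volatile Reporter reporter;

	public void reduce(Text key, Iterator<Text> values,
			OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
		this.reporter = reporter;
		// ... normal reduce work (e.g. adding documents to the index) ...
	}

	public void close() throws IOException {
		// Start the reporter thread shown above, run the lengthy optimize,
		// then interrupt the thread.
	}
}
{code}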



> SolrDeleteDuplications too slow when using hadoop
> -------------------------------------------------
>
>                 Key: NUTCH-739
>                 URL: https://issues.apache.org/jira/browse/NUTCH-739
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: hadoop cluster with 3 nodes
> Map Task Capacity: 6
> Reduce Task Capacity: 6
> Indexer: one instance of solr server (on the one of slave nodes)
>            Reporter: Dmitry Lihachev
>             Fix For: 1.1
>
>         Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch
>
>
> In my environment I always have many warnings like this on the dedup step:
> {noformat}
> Task attempt_200905270022_0212_r_000003_0 failed to report status for 600 seconds. Killing!
> {noformat}
> solr logs:
> {noformat}
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741
> May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {optimize=} 0 173599
> May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599
> May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close
> INFO: Closing Searcher@2ad9ac58 main
> May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo
> WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher
> org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
> ....
> {noformat}
> So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications (solr.optimize()), because we have several job tasks and each of them tries to optimize the Solr index before closing.
> The simplest way to avoid this bug is to remove that line and send an "<optimize/>" message directly to the Solr server after the dedup step.
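
For illustration only, a single post-job optimize along those lines could look roughly like this with SolrJ (a sketch, not the attached patch; the URL is a placeholder and CommonsHttpSolrServer is the SolrJ client class of that era):

{code:java}
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAfterDedup {
	public static void main(String[] args) throws Exception {
		// Placeholder URL: point at the Solr instance used by the dedup job.
		CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://solr-host:8983/solr");
		// Equivalent to posting an <optimize/> message to the update handler,
		// done once after the dedup job instead of from every reduce task.
		solr.optimize();
	}
}
{code}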

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

