From: "Greg Padiasek (JIRA)"
To: dev@nutch.apache.org
Date: Sat, 31 May 2014 16:30:02 +0000 (UTC)
Subject: [jira] [Updated] (NUTCH-1790) solrdedup causes OutOfMemoryError in Solr

     [ https://issues.apache.org/jira/browse/NUTCH-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Padiasek updated NUTCH-1790:
---------------------------------
    Environment:
Nutch 1.7 in local mode.
Solr 4.7 with 2M docs under Jetty with 1GB RAM.

  was:
Nutch 1.7 in local mode.
Solr 4.7 with 2M docs under Jetty with 2GB RAM.

> solrdedup causes OutOfMemoryError in Solr
> -----------------------------------------
>
>                 Key: NUTCH-1790
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1790
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7, 2.2
>         Environment: Nutch 1.7 in local mode.
> Solr 4.7 with 2M docs under Jetty with 1GB RAM.
>            Reporter: Greg Padiasek
>         Attachments: SolrDeleteDuplicates.patch
>
>
> Nutch 1.7 and 2.2.1 use Hadoop 1.2. In this version Hadoop overwrites the "mapred.map.tasks" variable set in mapred-site.xml and, in local mode, always sets it to 1. As a result Nutch creates a single giant query that reads ALL Solr documents at once. This in turn causes Solr to consume all available RAM when the number of documents is high. I found this issue with Solr running with 2M+ docs and 1GB of JVM RAM, 20% of which is used under normal conditions. When running "solrdedup", memory usage exceeds available RAM, Solr throws OutOfMemoryError and the dedup job fails.
> I think this could be solved in one of two ways: either by upgrading Nutch to a later version of the Hadoop lib (which hopefully does not hard-code the "mapred.map.tasks" value anymore), or by changing the SolrDeleteDuplicates class to "stream" documents in batches. The latter would make Nutch less dependent on the Hadoop version, and this was my choice. Attached is a patch that implements batch reading in local mode with a user-defined batch size. The "streaming" is potentially also applicable in distributed mode.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
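
For readers without the attachment, the batching idea described above can be illustrated roughly as follows. This is only a sketch and not the code in SolrDeleteDuplicates.patch: the class name BatchWindows, the Window type and the plan method are hypothetical, and the batch size of 50,000 in the usage note is an arbitrary example value. It assumes each batch would be fetched with Solr's standard "start" and "rows" query parameters, so that no single request asks for all documents at once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or of the attached patch.
// It splits "read numFound documents" into bounded page requests.
final class BatchWindows {

    /** One page request: zero-based offset ("start") and page size ("rows"). */
    static final class Window {
        final long start;
        final int rows;

        Window(long start, int rows) {
            this.start = start;
            this.rows = rows;
        }
    }

    /**
     * Plans the page requests needed to read numFound documents
     * in batches of at most batchSize documents each.
     */
    static List<Window> plan(long numFound, int batchSize) {
        if (numFound < 0) {
            throw new IllegalArgumentException("numFound must be >= 0");
        }
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be > 0");
        }
        List<Window> windows = new ArrayList<>();
        for (long start = 0; start < numFound; start += batchSize) {
            // The last window may be shorter than batchSize.
            int rows = (int) Math.min((long) batchSize, numFound - start);
            windows.add(new Window(start, rows));
        }
        return windows;
    }
}
```

With the 2M documents from the report and a batch size of 50,000, BatchWindows.plan(2_000_000L, 50_000) yields 40 windows, so each Solr request returns at most 50,000 documents instead of the whole index.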