nutch-dev mailing list archives

From "Greg Padiasek (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1790) solrdedup in local mode causes OutOfMemoryError in Solr
Date Sat, 31 May 2014 16:16:02 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Padiasek updated NUTCH-1790:
---------------------------------

    Description: 
Nutch 1.7 and 2.2.1 use Hadoop 1.2. In this version, Hadoop overrides the "mapred.map.tasks" variable
set in mapred-site.xml and, in local mode, always sets it to 1. As a result, Nutch creates a single
giant query that reads ALL Solr documents at once. This in turn causes Solr to consume all available
RAM when the number of documents is high. I found this issue with Solr holding 2M+ docs and 1GB of
JVM RAM, 20% of which is used under normal conditions. When running "solrdedup", memory usage
exceeds the available RAM, Solr throws an OutOfMemoryError, and the dedup job fails.
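
As a rough illustration of the arithmetic, assume the dedup job divides the Solr
result set evenly across the requested map tasks (an assumption about how the
input is split, inferred from the description above). The class and method names
below are hypothetical and are not taken from Nutch; the value 20 is purely
illustrative.

```java
public class SplitSizing {
    /**
     * Hypothetical helper: number of rows each query must fetch when
     * numDocs documents are divided evenly across numSplits map tasks.
     */
    public static long rowsPerQuery(long numDocs, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        // Ceiling division so that no document is left out.
        return (numDocs + numSplits - 1) / numSplits;
    }

    public static void main(String[] args) {
        long numDocs = 2_000_000L; // roughly the index size from the report

        // Local mode: the task count is forced to 1, so one query covers everything.
        System.out.println(rowsPerQuery(numDocs, 1));  // prints 2000000

        // With more tasks (illustrative value only), each query is far smaller.
        System.out.println(rowsPerQuery(numDocs, 20)); // prints 100000
    }
}
```

With a single task, the one query has to return every document in the index,
which matches the memory spike described above.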

I think this could be solved in one of two ways: either by upgrading Nutch to a later version
of the Hadoop library (which hopefully no longer hard-codes the "mapred.map.tasks" value), or by
changing the SolrDeleteDuplicates class to "stream" documents in batches. The latter makes Nutch
less dependent on the Hadoop version, and it is the approach I chose. Attached is a patch that
implements batch reading in local mode with a user-defined batch size. The "streaming" approach is
potentially also applicable in distributed mode.
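
The batch-reading idea can be sketched roughly as follows. This is not the
attached patch; it is a minimal, self-contained illustration that assumes
documents can be fetched in pages by start offset and row count. PageFetcher,
readInBatches and inMemory are hypothetical names, and the in-memory fetcher
merely stands in for a real paged Solr query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedReader {
    /** Hypothetical stand-in for a paged query: start offset plus row count. */
    public interface PageFetcher {
        List<String> fetch(long start, int rows);
    }

    /**
     * Reads all documents in pages of at most batchSize and hands each
     * document to the consumer. Returns the number of non-empty pages read.
     */
    public static int readInBatches(PageFetcher fetcher, int batchSize,
                                    Consumer<String> consumer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int pages = 0;
        long start = 0;
        while (true) {
            List<String> page = fetcher.fetch(start, batchSize);
            if (page.isEmpty()) {
                break; // nothing left to read
            }
            page.forEach(consumer);
            pages++;
            start += page.size();
            if (page.size() < batchSize) {
                break; // a short page means the end was reached
            }
        }
        return pages;
    }

    /** In-memory fetcher over n fake document ids, used only for the demo. */
    public static PageFetcher inMemory(int n) {
        return (start, rows) -> {
            List<String> out = new ArrayList<>();
            long end = Math.min((long) n, start + rows);
            for (long i = start; i < end; i++) {
                out.add("doc-" + i);
            }
            return out;
        };
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        int pages = readInBatches(inMemory(10), 4, seen::add);
        System.out.println(pages + " pages, " + seen.size() + " docs"); // prints "3 pages, 10 docs"
    }
}
```

In this sketch, the number of documents held in memory at any one time is
bounded by the batch size rather than by the size of the index.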

  was:
Nutch 1.7 and 2.2.1 use Hadoop 1.2. In this version Hadoop overwrites "mapred.map.tasks" variable
set in mapred-site.xml and in local mode always sets it to 1. As a result Nutch creates a
giant query to read ALL Solr documents at once. This in turn causes Solr to consume all RAM
given number of documents is high. I found this issue with Solr running with 2M+ docs, 1GB
JVM RAM, 20% of which is used under normal conditions. When running "solrdedup", memory usage
exceeds available RAM, solr throws OutOfMemoryError and the dedup job fails.

Think this could be solved in one of two ways. Either by upgrading Nutch to a later version
of Hadoop lib (which hopefully does not hard-coded "mapred.map.tasks" value anymore), or changing
the SolrDeleteDuplicates class to "stream" documents in batches. The later would make Nutch
less dependent on Hadoop version and this was my choice. Attached is a patch that implements
batch reading in local mode with user defined batch size. The "streaming" is potentially also
applicable in distributed mode.


> solrdedup in local mode causes OutOfMemoryError in Solr
> -------------------------------------------------------
>
>                 Key: NUTCH-1790
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1790
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.7, 2.2
>         Environment: Nutch in local mode.
>            Reporter: Greg Padiasek



--
This message was sent by Atlassian JIRA
(v6.2#6252)
