lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Slow indexing speed when collection size is large
Date Sun, 07 May 2017 13:14:23 GMT
On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> For my rich document handling, I'm using the Extracting Request Handler, and
> it requires OCR.
>
> However, the slow indexing speed I'm currently experiencing occurs when the
> indexing is done directly from the Sybase database. I fetch about 1000 records
> at a time from Sybase and store them in a CacheRowSet to be indexed. The query
> to the Sybase database is quite fast, and most of the time is spent on
> processing in the CacheRowSet.
<snip>
> A) 384 GB
<snip>
> A) 22 GB
<snip>
> A) 5 TB
<snip>
> A) A virtual machine with Sybase database is running on the server

The discussion about the drawbacks of the Extracting Request Handler has
already taken place.  Tika should be running on separate hardware, not
embedded in Solr.  Having high-impact Tika processing run on the Solr
server is going to slow everything down.

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?

As soon as you mention virtual machines, my mental picture of the setup
becomes much less clear.  You'll need to fully describe the OS and
hardware setup, at both the hypervisor and virtual machine level.  Then
I will know what questions to ask for more detailed information.

Is Solr in a virtual machine?
Is the 384GB at the hypervisor level, or the virtual machine level?
Is the 22GB heap the total heap memory, or is that per Solr instance?

If the 5TB is Solr index data, then there's no way you're going to get
fast performance.  Putting enough memory in one machine to effectively
cache that much data is impractically expensive, and most server
hardware doesn't have enough memory slots even if you do have the
money.  384GB wouldn't be enough for 5TB of index, and that's not even
taking into account the memory needed by your software, including Solr
and Sybase.
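
To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the full 384GB belongs to the machine running Solr, that 22GB is the total Java heap, and that 1TB means 1024GB. None of those are confirmed yet. The class and method names are made up for illustration, and the memory used by Sybase and the OS is left at zero only because it is unknown, so the real figure would be lower.

```java
public class PageCacheEstimate {

    // Memory (in GB) left over for the OS disk cache once the Java heap
    // and any other software on the machine have taken their share.
    static double cacheAvailableGb(double totalRamGb, double heapGb, double otherSoftwareGb) {
        return Math.max(0.0, totalRamGb - heapGb - otherSoftwareGb);
    }

    // Fraction of the on-disk index that could sit in the disk cache at once.
    static double cachedFraction(double cacheGb, double indexGb) {
        if (indexGb <= 0) {
            return 1.0; // nothing to cache
        }
        return Math.min(1.0, cacheGb / indexGb);
    }

    public static void main(String[] args) {
        double totalRamGb = 384.0;      // from the thread; hypervisor or VM level is unknown
        double heapGb = 22.0;           // from the thread; assumed to be the total heap
        double otherSoftwareGb = 0.0;   // ASSUMPTION: Sybase and OS usage unknown, so optimistic
        double indexGb = 5.0 * 1024.0;  // 5TB, ASSUMING 1TB = 1024GB and that it is all index data

        double cacheGb = cacheAvailableGb(totalRamGb, heapGb, otherSoftwareGb);
        double fraction = cachedFraction(cacheGb, indexGb);

        System.out.printf("Disk cache available: %.0f GB%n", cacheGb);
        System.out.printf("Share of index that fits in cache: %.1f%%%n", fraction * 100.0);
    }
}
```

Under those optimistic assumptions, about 362GB is left for caching, which covers roughly 7% of a 5TB index.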

Thanks,
Shawn

