lucene-solr-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Solr on HDFS: increase in query time with increase in data
Date Fri, 16 Dec 2016 14:52:45 GMT
On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).

Query times will increase as the index size increases, but significant
jumps in the query time may be an indication of a performance problem. 
Performance problems are usually caused by insufficient resources,
memory in particular.

With HDFS, I am honestly not sure *where* the cache memory is needed.  I
would assume that it's needed on the HDFS hosts, and that a lot of spare
memory on the Solr (HDFS client) hosts probably won't make much
difference.  I could be wrong -- I have no idea what kind of caching
HDFS does.  If the HDFS client can cache data, then you probably would
want extra memory on the Solr machines.

> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?

If actual disk access is required to satisfy a query, Solr is going to
be slow.  Caching is absolutely required for good performance.  If your
query times are really long but used to be short, chances are that your
index size has exceeded your system's ability to cache it effectively.
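
To put a rough number on that, here is a minimal Python sketch that
estimates what fraction of an index could sit in the OS disk cache on
one machine.  The function name and the RAM and heap figures are
hypothetical; only the 70 GB index size comes from the question.  It
also assumes the OS on the Solr host is what caches the index, which may
not hold for HDFS.

```python
def cacheable_fraction(index_gb, total_ram_gb, jvm_heap_gb, other_gb=0.0):
    """Estimate the share of an index that could fit in the OS disk cache.

    Memory left for caching is taken as total RAM minus the Solr JVM heap
    minus anything else running on the machine. This is a simplification.
    """
    if index_gb <= 0:
        raise ValueError("index_gb must be positive")
    free_for_cache = max(total_ram_gb - jvm_heap_gb - other_gb, 0.0)
    return min(free_for_cache / index_gb, 1.0)


# Hypothetical node: 64 GB RAM, 16 GB heap, 4 GB for other processes.
# Only the 70 GB index size is taken from the original question.
fraction = cacheable_fraction(index_gb=70, total_ram_gb=64,
                              jvm_heap_gb=16, other_gb=4)
print(f"About {fraction:.0%} of the index could be cached")
```

A result well below 100% suggests that many queries will have to go to
the underlying storage instead of memory.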

One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
the sustained transfer rate of a single modern SATA magnetic disk, so if
the data has to traverse a gigabit network, it probably will be nearly
as slow as it would be if it were coming from a single disk.  Having a
10gig network for your storage is probably a good idea ... but current
fast memory chips can leave 10gig in the dust, so if the data can come
from a local memory cache and the chips are new enough, reads will be
faster than they would be from network storage.
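
The arithmetic behind that comparison can be sketched in a few lines of
Python.  The link rates are theoretical line rates that ignore protocol
overhead, and the 150 MB/s disk figure is an assumed ballpark, not a
measurement; the 70 GB size is the per-shard index size from the
question.

```python
def link_rate_mb_per_s(gigabits_per_s):
    """Theoretical line rate in MB/s (decimal units, no protocol overhead)."""
    return gigabits_per_s * 1000 / 8


def transfer_seconds(size_gb, rate_mb_per_s):
    """Seconds needed to move size_gb gigabytes at rate_mb_per_s MB/s."""
    return size_gb * 1000 / rate_mb_per_s


# Assumption: ballpark sustained rate for one SATA magnetic disk.
ASSUMED_DISK_MB_PER_S = 150

sources = [
    ("1 Gb Ethernet", link_rate_mb_per_s(1)),
    ("10 Gb Ethernet", link_rate_mb_per_s(10)),
    ("single SATA disk (assumed)", ASSUMED_DISK_MB_PER_S),
]

for label, rate in sources:
    seconds = transfer_seconds(70, rate)
    print(f"{label}: {rate:.0f} MB/s, {seconds:.0f} s to read 70 GB")
```

At line rate, gigabit works out to 125 MB/s, so reading all 70 GB once
would take 560 seconds, and a tenth of that on 10gig.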

Because the network can be a potential bottleneck, I strongly recommend
putting index data on local disks.  If you have enough memory, the disk
doesn't even need to be super-fast.

> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?

Caching the index files is not handled by Solr, so Solr won't wait for
the entire index to be cached unless the underlying storage waits for
some reason.  The caching is usually handled by the OS.  For HDFS, it
might be handled by a combination of the OS and Hadoop, but I don't
know enough about HDFS to comment.  Solr makes a request for the parts
of the index files that it needs to satisfy the request.  That data
gets cached only if the underlying system is capable of caching it,
that feature is enabled, and there's memory available for the purpose.
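
The access pattern described above, asking for one byte range instead of
the whole file, can be illustrated with a small Python sketch.  The
`read_range` helper and the temporary file are hypothetical stand-ins;
this is not Lucene's actual I/O code, and a real index file looks
nothing like the sample data.

```python
import os
import tempfile


def read_range(path, offset, length):
    """Read `length` bytes starting at `offset` without reading the rest."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an index file: 1 MiB of sample bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    sample_path = tmp.name

try:
    chunk = read_range(sample_path, offset=512 * 1024, length=16)
    total = os.path.getsize(sample_path)
    print(f"read {len(chunk)} of {total} bytes")
finally:
    os.remove(sample_path)
```

Only the requested range has to come from storage, and whether it stays
in memory afterwards is up to the layers underneath, not the reader.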

Thanks,
Shawn

