lucene-solr-user mailing list archives

From Jay Potharaju <jspothar...@gmail.com>
Subject Re: solr multicore vs sharding vs 1 big collection
Date Sun, 02 Aug 2015 14:29:21 GMT
Each document contains around 30 fields, and almost 15 of them have stored
set to true. These stored fields are queried and updated all the time. You
will notice that deleted documents make up almost 30% of the docs. That
share has stayed at around the same level and has not come down.
I did try an optimize, but that was disruptive as it caused search errors.
I have been experimenting with the merge factor to see whether it helps
with the deleted documents. It is currently set to 5.
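
As a quick sanity check on the "almost 30%" figure, the sketch below derives
the deleted-document share from the Max doc and Doc count values quoted from
the original post further down. The function name is only illustrative and
is not part of Solr.

```python
def deleted_fraction(max_doc: int, num_docs: int) -> float:
    """Return the share of index entries that are deleted documents.

    max_doc counts live plus deleted documents; num_docs counts live ones.
    """
    if max_doc <= 0:
        raise ValueError("max_doc must be positive")
    if not 0 <= num_docs <= max_doc:
        raise ValueError("num_docs must be between 0 and max_doc")
    return (max_doc - num_docs) / max_doc


# Figures from the original post: Max doc 40 million, Doc count 29 million.
fraction = deleted_fraction(40_000_000, 29_000_000)
print(f"deleted share: {fraction:.1%}")  # deleted share: 27.5%
```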

The server has 24 GB of memory, of which around 23 GB is normally in use,
and the JVM is set to 6 GB. I have also noticed that the available memory
on the server drops to 100 MB at times during the day.
All the updates are run through DIH.
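
A rough check on these memory figures: the sketch below estimates how much
RAM is left for the operating system's disk cache once the JVM has taken its
share, and what fraction of the index that could hold. It assumes the 6 GB
is the Java heap and that each server holds a full 25 GB copy of the index.
It ignores non-heap JVM overhead and other processes, so the real headroom
would be smaller. The function name is illustrative only.

```python
GB = 1024 ** 3


def page_cache_headroom(total_ram: int, jvm_heap: int, index_size: int):
    """Return (bytes left for the OS disk cache, fraction of index it covers)."""
    if index_size <= 0:
        raise ValueError("index_size must be positive")
    available = max(total_ram - jvm_heap, 0)
    return available, available / index_size


# Assumed inputs: 24 GB RAM, 6 GB heap, 25 GB index on each server.
available, coverage = page_cache_headroom(24 * GB, 6 * GB, 25 * GB)
print(f"left for disk cache: {available / GB:.0f} GB")  # 18 GB
print(f"index coverage: {coverage:.0%}")  # 72%
```

Under these assumptions the whole index cannot stay cached in memory.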

At least once every day I see the following error, which results in search
errors on the front end of the site.

ERROR org.apache.solr.servlet.SolrDispatchFilter -
null:org.eclipse.jetty.io.EofException

From what I have read, these are mainly due to timeouts. My timeout is set
to 30 seconds and I can't set it to a higher number. I was thinking that
the high memory usage may sometimes lead to bad performance and errors.

My objective is to stop the errors. Adding more memory to the server is not
a good scaling strategy, which is why I was thinking there may be an issue
with the way things are set up that needs to be revisited.

Thanks


On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey <apache@elyograg.org> wrote:

> On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> > I currently have a single collection with 40 million documents and index
> > size of 25 GB. The collections gets updated every n minutes and as a
> > result the number of deleted documents is constantly growing. The data in
> > the collection is an amalgamation of more than 1000+ customer records.
> > The number of documents per each customer is around 100,000 records on
> > average.
> >
> > Now that being said, I 'm trying to get an handle on the growing deleted
> > document size. Because of the growing index size both the disk space and
> > memory is being used up. And would like to reduce it to a manageable
> > size.
> >
> > I have been thinking of splitting the data into multiple core, 1 for each
> > customer. This would allow me manage the smaller collection easily and
> > can create/update the collection also fast. My concern is that number of
> > collections might become an issue. Any suggestions on how to address this
> > problem. What are my other alternatives to moving to a multicore
> > collections.?
> >
> > Solr: 4.9
> > Index size:25 GB
> > Max doc: 40 million
> > Doc count:29 million
> >
> > Replication:4
> >
> > 4 servers in solrcloud.
>
> Creating 1000+ collections in SolrCloud is definitely problematic.  If
> you need to choose between a lot of shards and a lot of collections, I
> would definitely go with a lot of shards.  I would also want a lot of
> servers for an index with that many pieces.
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> I don't think it would matter how many collections or shards you have
> when it comes to how many deleted documents are in your index.  If you
> want to clean up a large number of deletes in an index, the best option
> is an optimize.  An optimize requires a large amount of disk I/O, so it
> can be extremely disruptive if the query volume is high.  It should be
> done when the query volume is at its lowest.  For the index you
> describe, a nightly or weekly optimize seems like a good option.
>
> Aside from having a lot of deleted documents in your index, what kind of
> problems are you trying to solve?
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju
