lucene-solr-user mailing list archives

From Perrin Bignoli <>
Subject Delta Import Failed to Complete
Date Mon, 12 Feb 2018 22:04:47 GMT

A couple of weeks ago, I ran into an unusual problem with Solr for which I could not find any previous reports.

I have a 4 node Solr cluster with 2 collections, ‘A’ and ‘B’.  Each of the collections
has 1 shard and 3 replicas.  Both collections are updated with a delta-import that pulls from
a Postgres database every 5 minutes.  Collection ‘A’ is very small (~1.5k documents, ~7
MB) and there are no queries run against it.  Collection ‘B’ is ~90k documents, about
500 MB, and has a heavy query load during certain parts of the day.  There is an auto hard
commit every 15 seconds.  Both collections run a nightly full import during low query load
without issue.
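
For reference, a minimal sketch of how a scheduled delta-import request could be built, assuming the DataImportHandler is registered at the conventional /dataimport path. The base URL, the `dataimport_url` helper name, and the `clean`/`commit` values are assumptions for illustration, not details of the setup described above.

```python
from urllib.parse import urlencode


def dataimport_url(base_url, collection, command, **params):
    """Build a DataImportHandler URL (handler path /dataimport is assumed)."""
    query = urlencode({"command": command, "wt": "json", **params})
    return f"{base_url}/{collection}/dataimport?{query}"


# Hypothetical base URL; collection 'B' is the one from the description.
BASE = "http://localhost:8983/solr"
delta_url = dataimport_url(BASE, "B", "delta-import", clean="false", commit="true")
status_url = dataimport_url(BASE, "B", "status")
print(delta_url)
print(status_url)
```

A scheduler (cron or similar) would issue a GET against `delta_url` every 5 minutes and could use `status_url` to check whether the previous run is still busy.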

There was a large delta on Collection ‘B’ that caused nearly every document to be updated.
 This occurred while the query load was high.  Collection ‘B’ has 2 different entity types,
‘1’ and ‘2’, which are in a ~1:3 ratio.  The delta included both “adds” and “deletes”.

Looking at the logs, the data import process completed for entity ‘1’, but not for entity
‘2’.  There were no errors, exceptions, or warnings in the log, and the telemetry did not
show that any of the cluster nodes ran out of heap or disk space.  A full import (or large
delta) usually finishes well within 20 minutes, but this particular import ran for at
least an hour.
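
A rough sketch of how an overrunning import like this might be flagged automatically from the handler's `command=status` response. It assumes the JSON response carries a top-level `status` of `busy` or `idle` and a `Time Elapsed` entry under `statusMessages` in `h:m:s.ms` form; those field names and the 20-minute threshold are assumptions to verify against the Solr version in use.

```python
def elapsed_seconds(time_elapsed):
    """Convert an 'h:m:s.ms' string (assumed format) to seconds."""
    hours, minutes, seconds = time_elapsed.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def import_overrunning(status_response, limit_seconds=20 * 60):
    """Return True if an import is still busy past the expected duration."""
    if status_response.get("status") != "busy":
        return False
    elapsed = status_response.get("statusMessages", {}).get("Time Elapsed")
    return elapsed is not None and elapsed_seconds(elapsed) > limit_seconds


# Hypothetical status payloads, shaped as assumed above.
stuck = {"status": "busy", "statusMessages": {"Time Elapsed": "1:2:3.5"}}
normal = {"status": "busy", "statusMessages": {"Time Elapsed": "0:5:0.0"}}
idle = {"status": "idle", "statusMessages": {}}
print(import_overrunning(stuck), import_overrunning(normal), import_overrunning(idle))
```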

More concerning, soon after the data import began to process entity ‘2’, all of the nodes
in the cluster began to continuously send a high volume of /update add requests, each
containing up to 200 document ids.  This went on for at least 15 minutes and appears to have
spiked CPU usage and GC activity on the cluster nodes and led to a high volume of query
timeouts.  Typically, an /update add request contains 1 (or rarely 2) documents.

The cluster was restarted in a rolling fashion (one node at a time), but this didn’t appear
to resolve all of the issues.  Only after all of the replicas were deleted and then re-added
(through the Admin console) did the flood of /update requests subside.
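
The same delete and re-add can also be scripted against the Collections API instead of the Admin console. The sketch below only builds the request URLs; the base URL, the shard name `shard1`, and the replica name `core_node3` are hypothetical placeholders, and the real names would come from the cluster state.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL for the given action."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# Hypothetical base URL, shard name, and replica name.
BASE = "http://localhost:8983/solr"
delete_url = collections_api_url(
    BASE, "DELETEREPLICA", collection="B", shard="shard1", replica="core_node3"
)
add_url = collections_api_url(BASE, "ADDREPLICA", collection="B", shard="shard1")
print(delete_url)
print(add_url)
```

Replacing one replica at a time, and waiting for the new replica to become active before removing the next, avoids leaving the shard without a live copy.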

Has anyone ever observed this kind of behavior?  Is there a known issue or a procedure to
follow for getting a cluster out of this state?

I was able to reproduce the /update “adds” flood by starting a large delta, putting the
cluster under heavy load, and then forcing a second delta immediately after the first delta
finished.  However, this is not exactly the same event, because the large deltas
ran to completion for both entity ‘1’ and entity ‘2’.  In this case, forcing
a commit seemed to reduce the volume of the large /update add requests, but didn’t completely
eliminate them.  Deleting and re-adding the replicas seemed to fix this issue as well.
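
For anyone trying to repeat the forced commit, it can be issued as an explicit commit on the update handler. This is a minimal sketch that only builds the URL; the base URL and the `commit_url` helper are assumptions, and whether `waitSearcher` should be set depends on the setup.

```python
from urllib.parse import urlencode


def commit_url(base_url, collection, wait_searcher=True):
    """Build an explicit hard-commit URL for a collection's /update handler."""
    query = urlencode(
        {"commit": "true", "waitSearcher": str(wait_searcher).lower()}
    )
    return f"{base_url}/{collection}/update?{query}"


# Hypothetical base URL; collection 'B' is the one from the description.
forced_commit = commit_url("http://localhost:8983/solr", "B")
print(forced_commit)
```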

Any insight into this would be very helpful.  Thanks!

Perrin Bignoli