lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Amrit Sarkar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-11278) CdcrBootstrapTest failing in branch_6_6
Date Fri, 01 Sep 2017 13:02:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150469#comment-16150469
] 

Amrit Sarkar commented on SOLR-11278:
-------------------------------------

Alright!

There are two issues in the troubled CDCR test :

1. Why bootstrap is called two seperate times in single life cycle

-> first time bootstrap is called when we enable CDCR, and 1000 documents are boostrapped
to target

-> the second one is when bootstrapping is happening in background and new updates are
getting shoved down to source. Once the first boostrap is successfully completed, target sees
(I speculate) it is much behind and issues bootstrap again, which I don't really understand
and putting extra logging doesn't really helps too. I am seeing more. OR all this is wrong,
and boostrap is happening just one time in a weird manner.

-> as above listed, the first bootstrap and indexing into source cluster happens simulteneouly
at one point. How it is suppose to behave that time.

according to code: CdcrRequestHandler:BootstrapCallable:call()::

{code}
........
 @Override
    public Boolean call() throws Exception {
      boolean success = false;
      Throwable exception = null;
      UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
      // we start buffering updates as a safeguard however we do not expect
      // to receive any updates from the source during bootstrap
      ulog.bufferUpdates();
      try {
        commitOnLeader(masterUrl);
........
{code}

2. Why bootstrap issues at the second time (if it is) fails miserably

-> when the bootstrap is issued, it calls boostrapCallable to apply buffer updates and
if "fetchIndex is not successful OR boostrapCallable is closed", drop those updates from UpdateLog
in {{finally}} block

-> in this case, when bootstrap gets issued :: it checks whether it can get the LOCK ::

{code}
private void handleBootstrapAction(SolrQueryRequest req, SolrQueryResponse rsp) throws IOException,
SolrServerException {
    .......................

    Runnable runnable = () -> {
      Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock();
      boolean locked = recoveryLock.tryLock();
      SolrCoreState coreState = core.getSolrCoreState();
      try {
        if (!locked)  {
          log.info("we reached this point :: CANCEL BOOTSTRAP, locked :: " + locked);
          handleCancelBootstrap(req, rsp);
        } 
{code}

and if not, it issues CANCEL BOOTSTRAP, which happend exactly in this case. CANCEL_BOOSTRAP
is issued and applied QUIETLY.

->  in the background within the bootstrap,  applyBufferUpdates, in ulog, {{tlog}} is mysteriously
NULL, suggesting no updates were recieved, umm why? and return bootstrapFuture as null and
set the STATE of replication as ACTIVE.

according to code :: UpdatesLog.java ::

{code}
/** Returns the Future to wait on, or null if no replay was needed */
  public Future<RecoveryInfo> applyBufferedUpdates() {
    // recovery trips this assert under some race - even when
    // it checks the state first
    // assert state == State.BUFFERING;

    // block all updates to eliminate race conditions
    // reading state and acting on it in the update processor
    versionInfo.blockUpdates();
    try {
      cancelApplyBufferUpdate = false;
      log.info("applyBufferedUpdates :: state :: start :: " + state);
      if (state != State.BUFFERING){
        return null;
      }
      operationFlags &= ~FLAG_GAP;

      // handle case when no log was even created because no updates
      // were received.
      if (tlog == null) {
        log.info("applyBufferedUpdates :: state :: middle 1 :: " + state);
        state = State.ACTIVE;
        return null;
      }
      log.info("applyBufferedUpdates :: state :: middle 2 :: " + state);
{code}

-> now in finally block, the fetchIndex is successfull but "closed" is true (DUE TO CANCEL
BOOTSTRAP apprently) which is initialised to false:: it tries to drop the buffer updates from
ulog where ::

{code}
 /** Returns true if we were able to drop buffered updates and return to the ACTIVE state
*/
  public boolean dropBufferedUpdates() {
    versionInfo.blockUpdates();
    try {
      log.info("dropBufferedUpdates :: state :: start " + state);
      if (state != State.BUFFERING){
        return false;
      }
.......
{code}

STATE is ACTIVE and returns {{false}}.

faling the "assert dropped" in BootstrapCallable and eventually emitting: "bootstrap operation
failed".

Once bootstrap fails, in CdcrBootsrapTest::waitTargetToSync :: it waits for 120 seconds, and
recieves constant numFound from target, == 1000 / 1001 / 1100, any number basically before
boostrap failed and eventually Assertion at CdcrBootsrapTest fails.

There are still some unanswered questions like:

1. Why in the first place two boostrap command are issued OR we think we two are issued and
it one wrapped up in an awkward life time. Why the conventional forwarding not happening?
2. When it tries to drop the buffer updates, why the tlog is null? It has recieved updates
from source, but why it is NULL in target cluster. Is there a synchrnization issue?

so the matter of the question is this lock ::

{code}
Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock();
      boolean locked = recoveryLock.tryLock();
{code}

More insights will be deeply appreciated. I believe we are very close.

> CdcrBootstrapTest failing in branch_6_6
> ---------------------------------------
>
>                 Key: SOLR-11278
>                 URL: https://issues.apache.org/jira/browse/SOLR-11278
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: CDCR
>            Reporter: Amrit Sarkar
>            Assignee: Varun Thacker
>         Attachments: SOLR-11278-cancel-bootstrap-on-stop.patch, SOLR-11278.patch, test_results
>
>
> I ran beast for 10 rounds:
> ant beast -Dtestcase=CdcrBootstrapTest -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=vi
-Dtests.timezone=Asia/Yekaterinburg -Dtests.asserts=true -Dtests.file.encoding=US-ASCII -Dbeast.iters=10
> and seeing following failure:
> {code}
>   [beaster] [01:37:16.282] FAILURE  153s | CdcrBootstrapTest.testBootstrapWithSourceCluster
<<<
>   [beaster]    > Throwable #1: java.lang.AssertionError: Document mismatch on target
after sync expected:<2000> but was:<1000>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message