cassandra-commits mailing list archives

From "David Capwell (Jira)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-16909) ☂ Medium Term Repair Improvements
Date Thu, 02 Sep 2021 00:44:00 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-16909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Capwell updated CASSANDRA-16909:
--------------------------------------
    Description: 
This is to track the repair improvement work defined on the dev list.

Email was: [DISCUSS] Repair Improvement Proposal

This JIRA will track the related tasks and be used for documentation.

TODO: document Preview and IR
{code}
Repair Coordinator State
1. ActiveRepairService.recordRepairStatus
    1. Maps JMX command (int) to (status, msg)
2. Get/validate ColumnFamilies
    1. Skip if no ColumnFamilies to repair
3. Get replicas for ranges
    1. Ignore ranges not covered by this instance IFF --ignore-unreplicated-keyspaces, else Fail
    2. Skip if no ranges to repair
    3. If --force, filter out non-alive (FailureDetector.isAlive) participants
4. [Not PREVIEW] Update SystemDistributedKeyspace's parent_repair_history
5. ActiveRepairService.prepareForRepair
    1. TODO Should this be under PREPARE or still part of validation?
        1. If CompactionsPendingThreshold triggered, Fail
        2. registerParentRepairSession - update map of UUID -> ParentRepairSession
    2. Send PREPARE_MSG awaiting responses [possible failures: timeout waiting, participant failure]
        1. [improvement] PREPARE_MSG should be idempotent and if no reply within X, retry Y times (see the retry sketch after this block)
        2. Known Failures; all are retryable at this stage
            1. Timeout
            2. Participant goes from alive to down
            3. CompactionsPendingThreshold triggered
                1. Not included in org.apache.cassandra.exceptions.RequestFailureReason#forException, so it looks the same as an unexpected error
                    1. If updated and in mixed-mode, the update falls back to UNKNOWN, which then matches the unexpected error behavior
            4. Unexpected error (this removes the session)
6. Run RepairTask (normal, IR, preview); see coordinator state for each type
7. On Success
    1. Update SystemDistributedKeyspace.parent_repair_history to show the successful ranges
    2. If any sub-session failed, fail the job
    3. ActiveRepairService.cleanUp - message to all participants to clean up
        1. TODO: why is this only on success and not failures as well?
8. On Exception
    1. fail

Normal/Preview Repair Coordinator State
1. For each common range
    1. ActiveRepairService.submitRepairSession
        1. Creates/runs a RepairSession for each CommonRange
    2. Once all RepairSessions done
        1. [not consistent across types] handle session errors

RepairSession
1. [Preview Repair - kind=REPAIRED] register with LocalSessions for IR state changes
2. RepairSession.start
    1. [Not Preview Repair] Register the session in SystemDistributedKeyspace's repair_history table
    2. If the set of endpoints is empty
        1. [UNHANDLED - downstream logic does not handle this case] Set future with empty state (which is later seen as Failed... but not a failed future)
        2. [Not Preview Repair] Mark session failed in repair_history
    3. Check all endpoints, if any is down and hasSkippedReplicas=false, Fail the session
    4. For each table
        1. Create a RepairJob
        2. Execute job in RepairTask's executor
        3. await all jobs
            1. If all success
                1. Set session result to include the job results
            2. If any fail
                1. Fail the session future
                2. [Question] why does this NOT update repair_history like other failures?

RepairJob
1. [parallelism in [SEQUENTIAL, DATACENTER_AWARE] and not IR] send SNAPSHOT_MSG to all participants
    1. [improvement] SNAPSHOT_MSG should be idempotent so the coordinator can retry if no ACK.  This calls org.apache.cassandra.repair.TableRepairManager#snapshot with the RepairRunnable ID; this makes sure to snapshot once unless force=true (RepairOptions !(dataCenters.isEmpty() and hosts.isEmpty()))
    2. [improvement] This task is short lived, so rather than running in a cached pool (which may allocate a thread), inline it or add a notion of cooperative sharing if retries are implemented
2. Await snapshot success
3. Send VALIDATION_REQ to all participants (all at once, or batched based on parallelism)
    1. [bug] MerkleTree may be large, even if off-heap; this can cause the coordinator to OOM (heap or direct); there is no bound on the number of MerkleTrees that may be in flight (see the bounding sketch after this block)
    2. [improvement] VALIDATION_REQ could be made idempotent.  Right now we create a Validator and submit to ValidationManager, but could dedupe based on session id
4. Await validation success
5. Stream any/all conflicting ranges (2 modes: optimiseStreams && not pullRepair = optimisedSyncing, else standardSyncing)
    1. Create a SyncTask (Local, Asymmetric, Symmetric) for each conflicting range
        1. Local: create stream plan and use streaming
        2. Asymmetric/Symmetric: send SYNC_REQ
            1. [bug][improvement] No ACK is done, so if this message is dropped, streaming does not start on the participant
            2. [improvement] AsymmetricRemoteSyncTask and SymmetricRemoteSyncTask are effectively the same class; they are copy/paste clones of each other; the only difference is that AsymmetricRemoteSyncTask creates the SyncRequest with asymmetric=true (see the merged-class sketch after this block)
            3. [improvement] Can be idempotent (when remote); currently just starts streaming right away, would need to dedupe on the session (see the dedupe sketch after this block)
6. Await streaming complete
7. onSuccess
    1. [not preview repair] update repair_history, marking the session successful
    2. Set the future as success
8. onFailure
    1. Abort validation tasks
    2. [not preview repair] update repair_history, marking the session failed
    3. Set the future to a failure


Repair Participant State
1. Receive PREPARE_MSG
    1. [improvement] PREPARE_MSG should be idempotent
        1. [current state] If parentRepairSessions contains the session, it ignores the request and no-ops; but does NOT validate that the sessions match
        2. [current state] mostly idempotent, assuming CompactionsPendingThreshold does not trigger
2. [parallelism in [SEQUENTIAL, DATACENTER_AWARE] and not IR] Receive SNAPSHOT_MSG
    1. Create a table snapshot using the RepairRunnable ID.  If !RepairOption.isGlobal, then overwrite the snapshot if present
3. Receive VALIDATION_REQ
    1. Creates a Validator and submits to CompactionManager's validationExecutor
    2. Core logic: org.apache.cassandra.repair.ValidationManager#doValidation
    3. Iterate over each partition/row, updating a MerkleTree
    4. When done, switch to the ANTI_ENTROPY stage
    5. If coordinator is remote
        1. Send a VALIDATION_RSP back with the MerkleTree (or null if failed)
    6. Else
        1. Switch to ANTI_ENTROPY again
        2. Attempt to move MerkleTree off-heap
        3. Forward message to ActiveRepairService.handleMessage
4. [SyncTask in [AsymmetricRemoteSyncTask, SymmetricRemoteSyncTask]] Receive SYNC_REQ
    1. Creates a stream plan and uses streaming (StreamingRepairTask)
{code}
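
The retry improvement in coordinator step 5.2.1 could look roughly like the sketch below. This is a minimal illustration, not Cassandra's actual messaging API: Transport, sendPrepare, and the timeout/retry parameters are hypothetical stand-ins. It is only safe because the participant treats a duplicate PREPARE_MSG for a known parent session as a no-op (participant step 1 is already mostly there).

{code}
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class PrepareRetry
{
    // Hypothetical transport hook; stands in for the real messaging service.
    interface Transport
    {
        CompletableFuture<Boolean> sendPrepare(UUID parentSessionId, String participant);
    }

    // Send PREPARE_MSG to one participant, resending on timeout up to maxRetries times.
    // Only safe if the participant treats a duplicate PREPARE_MSG for the same
    // parentSessionId as a no-op, i.e. the message is idempotent.
    static boolean prepareWithRetry(Transport transport, UUID parentSessionId,
                                    String participant, long timeoutMillis, int maxRetries)
        throws InterruptedException
    {
        for (int attempt = 0; attempt <= maxRetries; attempt++)
        {
            try
            {
                return transport.sendPrepare(parentSessionId, participant)
                                .get(timeoutMillis, TimeUnit.MILLISECONDS);
            }
            catch (TimeoutException e)
            {
                // No reply within timeoutMillis; resend. Duplicates are harmless
                // because the participant dedupes on parentSessionId.
            }
            catch (ExecutionException e)
            {
                return false; // participant reported a real failure; retrying will not help
            }
        }
        return false; // retries exhausted; treat as a prepare failure
    }
}
{code}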
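For the unbounded in-flight MerkleTree problem flagged in RepairJob step 3.1, one simple shape is a permit-based cap at the coordinator. A minimal sketch, with hypothetical ValidationRequest/MerkleTrees types; a real fix would more likely bound bytes (tree size) than tree count:

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

public final class BoundedValidations
{
    // Hypothetical stand-ins for the real request/response types.
    interface MerkleTrees {}
    interface ValidationRequest
    {
        CompletableFuture<MerkleTrees> send();
    }

    private final Semaphore inFlight;

    public BoundedValidations(int maxInFlightTrees)
    {
        inFlight = new Semaphore(maxInFlightTrees);
    }

    // Issue a VALIDATION_REQ only when a permit is free, so at most
    // maxInFlightTrees responses can be outstanding at the coordinator.
    public CompletableFuture<MerkleTrees> submit(ValidationRequest req) throws InterruptedException
    {
        inFlight.acquire(); // blocks the submitter while too many trees are in flight
        return req.send().whenComplete((trees, failure) -> inFlight.release());
    }
}
{code}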
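The AsymmetricRemoteSyncTask/SymmetricRemoteSyncTask duplication called out in RepairJob step 5.1.2.2 could collapse into a single type that carries the asymmetric flag as state. Names and fields here are illustrative only; the real SyncRequest also carries ranges, session ids, etc.:

{code}
import java.net.InetSocketAddress;

// One class replacing both copy/paste clones; the asymmetric flag is the
// only behavioral difference, so it becomes constructor state.
public final class RemoteSyncTask
{
    private final InetSocketAddress src;
    private final InetSocketAddress dst;
    private final boolean asymmetric;

    private RemoteSyncTask(InetSocketAddress src, InetSocketAddress dst, boolean asymmetric)
    {
        this.src = src;
        this.dst = dst;
        this.asymmetric = asymmetric;
    }

    public static RemoteSyncTask symmetric(InetSocketAddress a, InetSocketAddress b)
    {
        return new RemoteSyncTask(a, b, false); // both sides stream to each other
    }

    public static RemoteSyncTask asymmetric(InetSocketAddress from, InetSocketAddress to)
    {
        return new RemoteSyncTask(from, to, true); // one-way stream only
    }

    public boolean isAsymmetric()
    {
        return asymmetric;
    }
}
{code}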
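Making SYNC_REQ idempotent (RepairJob steps 5.1.2.1 and 5.1.2.3, participant step 4) amounts to ACKing every receipt and deduping on the session id, so a coordinator resend cannot start streaming twice. A sketch with a hypothetical StreamStarter hook:

{code}
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public final class SyncRequestHandler
{
    // Hypothetical hook that kicks off streaming for one sync session.
    interface StreamStarter
    {
        void startStreaming(UUID sessionId);
    }

    private final Set<UUID> started = ConcurrentHashMap.newKeySet();
    private final StreamStarter streams;

    public SyncRequestHandler(StreamStarter streams)
    {
        this.streams = streams;
    }

    // Handle a SYNC_REQ: start streaming on first receipt, no-op on duplicates.
    // Always returning an ACK lets the coordinator resend a dropped SYNC_REQ
    // without risking a second stream for the same session.
    public boolean onSyncRequest(UUID sessionId)
    {
        if (started.add(sessionId))  // Set.add is atomic; only the first receipt wins
            streams.startStreaming(sessionId);
        return true;                 // ACK every receipt, duplicate or not
    }
}
{code}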

  was:
This is to track the repair improvement work defined on the dev list.

Email was: [DISCUSS] Repair Improvement Proposal

This JIRA will track the related tasks and be used for documentation.


> ☂ Medium Term Repair Improvements
> ---------------------------------
>
>                 Key: CASSANDRA-16909
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16909
>             Project: Cassandra
>          Issue Type: Epic
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


