cassandra-commits mailing list archives

From "Benjamin Roth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12888) Incremental repairs broken for MVs and CDC
Date Fri, 02 Dec 2016 10:09:58 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714685#comment-15714685 ]

Benjamin Roth commented on CASSANDRA-12888:
-------------------------------------------

Let me explain my patch:

Sending streams of tables with MVs (or CDC) through the regular write path has several very
significant negative impacts:

1. Bootstrap
During a bootstrap, all ranges from all keyspaces and all CFs that will belong to the new node
are streamed. MVs are treated like all other CFs, so all ranges that will move to the new node
are also streamed during bootstrap.
Sending streams of the base tables through the write path has the following negative impacts
(see the sketch after this list):

   - Writes are sent to the commit log. This is not necessary: if the node is stopped
   during bootstrap, the bootstrap simply starts over, so there is no need to recover
   from commit logs. Non-MV tables don't go through the commit log for streams anyway.
   - MV mutations are not applied instantly but sent to the batch log. This is of course
   necessary during the range movement (if the PK of the MV differs from the base table),
   but the consequence is that the batchlog gets completely flooded. This leads to
   ridiculously large batchlogs (I observed batchlogs of 60 GB in size), zillions of
   compactions and quadrillions of tombstones. It is a pure resource killer, especially
   because the batchlog uses a CF as a queue.
   - Applying every mutation separately causes read-before-writes during MV updates.
   This is of course an order of magnitude slower than simply streaming down an SSTable.
   The effect becomes even worse as the bootstrap progresses and creates more and more
   (uncompacted) SSTables, many of which will never be compacted because the batchlog
   eats all the resources available for compaction.
   - Streaming down the MV tables AND applying the mutations of the base tables leads to
   redundant writes. The redundant writes are local if the PK of the MV equals the PK of
   the base table and - even worse - remote if not. Remote MV updates will impact nodes
   that are not even part of the bootstrap.
   - CDC should also not be necessary during bootstrap. A bootstrap is not a data change;
   it is a data relocation, and all data changes must already have been logged on the
   source node.
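
To make the distinction concrete, here is a deliberately simplified sketch of the decision
I am referring to (hypothetical class and method names, not the actual Cassandra code):
streamed data for tables with MVs or CDC is replayed through the regular write path, while
everything else simply adopts the streamed SSTables.

{code:java}
import java.util.List;

// Simplified stand-in for table metadata and the two possible paths a
// streamed SSTable can take on the receiving node.
interface StreamedTable
{
    boolean hasMaterializedViews();
    boolean isCdcEnabled();

    // Cheap path: the streamed files are just added to the live SSTable set.
    void addSSTablesDirectly(List<String> sstablePaths);

    // Expensive path: every partition is turned back into mutations and goes
    // through the commit log, the batchlog (for remote MV replicas) and the
    // per-update read-before-write for MVs.
    void replayThroughWritePath(List<String> sstablePaths);
}

public class StreamReceiveSketch
{
    static void onStreamCompleted(StreamedTable table, List<String> streamedSSTables)
    {
        boolean requiresWritePath = table.hasMaterializedViews() || table.isCdcEnabled();
        if (requiresWritePath)
            table.replayThroughWritePath(streamedSSTables); // the costs listed above
        else
            table.addSSTablesDirectly(streamedSSTables);    // plain tables skip all of it
    }
}
{code}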

2. Repair
The negative impact is similar to bootstrap, but in addition:

   - Sending repair streams through the write path means the streamed data is not marked
   as repaired. Simply NOT sending it through the write path instantly solves this issue,
   and it is much simpler than any other solution (see the sketch after this list).
   - It changes the "repair design" a bit: repairing a base table will no longer
   automatically repair the MV. But is that bad at all? To be honest, it was very hard
   for me to understand what I had to do to be sure that everything is repaired
   correctly. Recently I was told NOT to repair MV CFs but only the base tables (see the
   comment above from [~tjake]). This means one cannot just call
   "nodetool repair $keyspace" - that is complicated, not transparent, and it sucks.
   I changed the behaviour in my own branch and ran the dtests for MVs. 2 tests failed:
      - base_replica_repair_test fails, of course, due to the design change
      - really_complex_repair_test fails because it intentionally times out the batch
      log. IMHO this is a bearable situation. It is comparable to resurrected tombstones
      when running a repair after GCGS has expired - you would not expect that to be
      magically fixed either. The GCGS default is 10 days, and I can expect that anybody
      also repairs their MVs during that period, not only the base table. I'd suggest
      simply deleting these 2 tests; they don't prove anything any more.
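
To illustrate what "not marked as repaired" means, here is a minimal, purely illustrative
sketch (hypothetical types, not the actual Cassandra code): a directly added SSTable keeps
the repairedAt the sender assigned to it, while data replayed through the write path ends up
in brand-new SSTables that are flushed as unrepaired.

{code:java}
public class RepairedAtSketch
{
    static final long UNREPAIRED = 0L;

    // Stand-in for a streamed SSTable carrying the sender's repair metadata.
    static class IncomingSSTable
    {
        final long repairedAt;
        IncomingSSTable(long repairedAt) { this.repairedAt = repairedAt; }
    }

    // Direct add: the file keeps the repairedAt the sender assigned to it.
    static long addDirectly(IncomingSSTable sstable)
    {
        return sstable.repairedAt;
    }

    // Write-path replay: the data is rewritten via memtables into brand-new
    // SSTables, which are always flushed as unrepaired.
    static long replayThroughWritePath(IncomingSSTable sstable)
    {
        return UNREPAIRED;
    }

    public static void main(String[] args)
    {
        IncomingSSTable streamed = new IncomingSSTable(1480672198000L); // repaired on the sender
        System.out.println("direct add keeps repairedAt = " + addDirectly(streamed));
        System.out.println("write-path replay resets it to " + replayThroughWritePath(streamed)
                           + ", so the next incremental repair streams the same data back");
    }
}
{code}

That is exactly the back-and-forth streaming described in the issue description quoted below.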

3. Rebuild + Decommission
The impacts are similar to bootstrap + repair.

I rolled out these changes on our production cluster (including CASSANDRA-12905 + CASSANDRA-12984).
Before, I was not able to bootstrap a node with a load of roughly 280 GB. Either it failed
due to WTE (see 12905), or it flooded the logs completely with hint delivery failures (also
fixed in 12905); and after having fixed that, the bootstrap still didn't finish within 24h,
which is why I cancelled it.
After applying the mentioned changes, the bootstrap finished in under 5:30h. Repairs also seem
to run quite smoothly so far, even though this does not fix CASSANDRA-12730, which is a
different story.

Any thoughts on this?
Would anybody like to review it?

> Incremental repairs broken for MVs and CDC
> ------------------------------------------
>
>                 Key: CASSANDRA-12888
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12888
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Stefan Podkowinski
>            Assignee: Benjamin Roth
>            Priority: Critical
>             Fix For: 3.0.x, 3.x
>
>
> SSTables streamed during the repair process are first written locally and afterwards
> either simply added to the pool of existing sstables or, in case of existing MVs or active
> CDC, replayed on a per-mutation basis.
> As described in {{StreamReceiveTask.OnCompletionRunnable}}:
> {quote}
> We have a special path for views and for CDC.
> For views, since the view requires cleaning up any pre-existing state, we must put all
> partitions through the same write path as normal mutations. This also ensures any 2is are
> also updated.
> For CDC-enabled tables, we want to ensure that the mutations are run through the CommitLog
> so they can be archived by the CDC process on discard.
> {quote}
> Using the regular write path turns out to be an issue for incremental repairs, as we
> lose the {{repaired_at}} state in the process. Eventually the streamed rows will end up in
> the unrepaired set, in contrast to the rows on the sender side, which were moved to the
> repaired set. The next repair run will stream the same data back again, causing rows to
> bounce back and forth between nodes on each repair.
> See the linked dtest for steps to reproduce. An example of reproducing this manually using
> ccm can be found [here|https://gist.github.com/spodkowinski/2d8e0408516609c7ae701f2bf1e515e8]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
