cassandra-commits mailing list archives

From "Stefan Podkowinski (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra
Date Thu, 05 Apr 2018 11:35:00 GMT


Stefan Podkowinski commented on CASSANDRA-14346:

If we keep the scope of this ticket to scheduling repairs in Cassandra, we should really talk
a bit more about the different requirements users have and what using the described solution
would look like in practice.

There are several aspects to consider when coming up with a working repair schedule:
 * number of tables (from a single table per cluster to hundreds of tables)
 * priority in repairing tables (some tables should be repaired more often, others never at all)
 * data size per table (large tables should not block repairs for smaller, more important ones)
 * predictable cluster load (try to schedule repairs off hours)
 * sustainable repair intensity (repair sessions should not leak into peak hours)
 * different gc_grace periods (plan intervals for each table so we can tolerate missing a
repair run)
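
To make the gc_grace point concrete, here is a minimal sketch of how a per-table interval
could be derived. It rests on my own assumptions, not on anything in the design document:
a repair has to finish within gc_grace, and if k consecutive runs are missed, the gap
between two completed repairs grows to (k + 1) intervals plus the run time. The function
name and parameters are hypothetical.

```python
from datetime import timedelta


def max_repair_interval(gc_grace: timedelta,
                        repair_duration: timedelta,
                        tolerated_misses: int = 1) -> timedelta:
    """Longest interval between repair starts that still fits within gc_grace.

    Assumed model (not from the design document): with `tolerated_misses`
    skipped runs, the worst-case gap between two completed repairs is
    (tolerated_misses + 1) * interval + repair_duration.
    """
    if tolerated_misses < 0:
        raise ValueError("tolerated_misses must be >= 0")
    budget = gc_grace - repair_duration
    if budget <= timedelta(0):
        raise ValueError("repair takes longer than gc_grace allows")
    return budget / (tolerated_misses + 1)


# A table with a 10 day gc_grace whose repair takes about a day
# would need a run roughly every 4.5 days to survive one missed run.
interval = max_repair_interval(timedelta(days=10), timedelta(days=1))
print(interval)
```

Tables with different gc_grace settings or sizes end up with different intervals, which is
exactly why a single cluster-wide setting is hard to get right.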

Repair schedules that take these aspects into account require a certain flexibility
and some more careful configuration. Tools such as Reaper already allow you to put together such
plans. Looking at the configuration options described in the design document, I'd
probably still want to use such an external tool. That would be mostly due to the use of delays
instead of recurring repair times and the way you'd have to configure repairs at the table level,
which probably gets a bit "messy" fast when you have a lot of tables. The lack of any reporting
also makes it harder to tune these config options afterwards.
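
As a rough illustration of why delays and recurring repair times behave differently, the
sketch below compares the two. It assumes that a "delay" means waiting a fixed time after
the previous run finishes, which is my reading and may not match the design document
exactly; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def delay_based_starts(first_start, duration, delay, runs):
    """Start times when each run begins `delay` after the previous one ends."""
    starts = []
    start = first_start
    for _ in range(runs):
        starts.append(start)
        start = start + duration + delay
    return starts


def recurring_starts(first_start, period, runs):
    """Start times when runs are pinned to a fixed recurring period."""
    return [first_start + i * period for i in range(runs)]


first = datetime(2018, 4, 2, 1, 0)  # 01:00, off hours
by_delay = delay_based_starts(first, timedelta(hours=5), timedelta(hours=24), 3)
by_clock = recurring_starts(first, timedelta(hours=24), 3)

print([s.strftime("%a %H:%M") for s in by_delay])
print([s.strftime("%a %H:%M") for s in by_clock])
```

Under that assumption the delay-based start time moves later by the run time on every
cycle, so a repair that starts off hours will eventually start during peak hours, while
the recurring schedule stays at the same hour.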

I think the intention is to keep the scope of this ticket to "integrated repair scheduling
and execution", so I'll spare you any of my thoughts about how we should coordinate and execute
repairs differently in a post CASSANDRA-9143 world. But if we want to solve scheduling on
top of our existing repair implementation, we have to make sure that we can compete with existing
3rd party solutions.

It has already been suggested that we move on incrementally. But then we also have to think about
how improvements could be implemented on top of the proposed solution. I'd assume that optimizations
would be easier to implement in external tools or sidecars that communicate via an IPC interface
than in a baked-in solution that uses the yaml config or table properties, or has
to deal with upgrade paths. My impression is that 3rd party projects are probably also a better
place to quickly iterate on these kinds of problems.

> Scheduled Repair in Cassandra
> -----------------------------
>                 Key: CASSANDRA-14346
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>         Attachments: ScheduledRepairV1_20180327.pdf
> There have been many attempts to automate repair in Cassandra, which makes sense given
that it is necessary to give our users eventual consistency. Most recently CASSANDRA-10070,
CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), which
we spoke about last year at NGCC. Given the positive feedback at NGCC we focussed on getting
it production ready and have now been using it in production to repair hundreds of clusters,
tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback
at NGCC we have invested effort in figuring out how to integrate this natively into Cassandra
rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our implementation into
Cassandra, and have created a [design document|]
showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would be greatly
appreciated about the interface or v1 implementation features. I have tried to call out in
the document features which we explicitly consider future work (as well as a path forward
to implement them in the future) because I would very much like to get this done before the
4.0 merge window closes, and to do that I think aggressively pruning scope is going to be
a necessity.
