cassandra-commits mailing list archives

From "Nate McCall (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-14346) Scheduled Repair in Cassandra
Date Tue, 03 Apr 2018 20:36:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424554#comment-16424554
] 

Nate McCall edited comment on CASSANDRA-14346 at 4/3/18 8:35 PM:
-----------------------------------------------------------------

{quote}Do you think it is good for the community that every user is inventing this (complex)
functionality again and again with different requirements on external tools?
{quote}
 
 Absolutely! This gets folks involved in the ecosystem, gaining an understanding of a critical
piece of functionality while allowing them to do so in an environment in which they are
comfortable. 
  
 We saw this with Thrift-based drivers early on. There were at one point eight Java drivers,
but Astyanax eventually won out because it was a better design that catered to the most common
Java programming paradigms. The net effect was that we trained a whole lot of devs
on how to use the APIs effectively, and we reached the point where we as a community answered
Thrift API and data modeling questions in minutes, regardless of the time of day or the channel in
which they came in. 
  
{quote}We continue doing nothing and the community just solves this in different ways.
{quote}
 
 So we find ourselves again in a spot where opinionated designs are competing and a vendor
is offering a commercial solution. I don't call that nothing. We have multiple working solutions
_right now_ that folks will pick based on the needs of their environments. Operating distributed
systems (at any scale) is quite different from one shop to another. 
  
 We _are_ duplicating effort, but it is fundamentally broader community effort, not necessarily
core Cassandra development resources. The largest benefit of this is that we will be stress- and
soak-testing all the recent work done on repair, so the mechanism itself will be solid. Once we
figure out what works best for most users, we build from there, perhaps focusing our efforts
in the meantime on a meaningful feedback and control mechanism to make this a whole lot easier
when we do. 
  
 I want to be clear that from what I have read and seen so far, I think [~jolynch] and [~vinaykumarcse] have
done excellent work on thinking this through. I'm calling into question the timing and prioritization
(vs. CASSANDRA-12944 and/or a general-purpose management plumbing revamp) and maybe still whether
we are in-process, side-car'ed or externally managed (or some combination?), but I'll admit
there is a quite debatable set of pros and cons for each when these are all listed out.
  
 My thoughts at this point are that, for 4.0, we ensure repair works really well out of the box,
invoked much as it is today, and provide links to external options, and that we continue
this ticket and the general discussion targeting 'trunk' once 4.0 has been released. 

EDIT: to be clear, I *do* think the status quo is the way to go for shipping 4.0. Not for beyond.


> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes sense given
that it is necessary to give our users eventual consistency. Most recently CASSANDRA-10070,
CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), which
we spoke about last year at NGCC. Given the positive feedback at NGCC we focussed on getting
it production ready and have now been using it in production to repair hundreds of clusters,
tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback
at NGCC we have invested effort in figuring out how to integrate this natively into Cassandra
rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our implementation into
Cassandra, and have created a [design document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would be greatly
appreciated about the interface or v1 implementation features. I have tried to call out in
the document features which we explicitly consider future work (as well as a path forward
to implement them in the future) because I would very much like to get this done before the
4.0 merge window closes, and to do that I think aggressively pruning scope is going to be
a necessity.




