spark-user mailing list archives

From kant kodali <>
Subject Re: What do I loose if I run spark without using HDFS or Zookeeper?
Date Sun, 28 Aug 2016 01:49:06 GMT
I understand now that I cannot use Spark Streaming window operations without
checkpointing to HDFS, as pointed out by @Ofir, but without window operations I
don't think we can do much with Spark Streaming. Since they are so essential,
can I use Cassandra as the distributed storage? If so, can I see an example of how
I can tell spark cluster to use Cassandra for checkpointing and others if at

On Fri, Aug 26, 2016 9:50 AM, Steve Loughran wrote:

On 26 Aug 2016, at 12:58, kant kodali wrote:
@Steve your arguments make sense; however, a good majority of people who have
extensive experience with ZooKeeper prefer to avoid it, and given the ease of
Consul (which, by the way, uses Raft for leader election) and etcd, a lot of us
are more inclined to avoid ZK.
And yes, any technology needs time to mature, but that shouldn't stop us from
transitioning. For example, people started using Spark when it was first
released instead of waiting for Spark 2.0, which has a lot of optimizations
and bug fixes.

One way to look at the problem is "what is the cost if something doesn't work?"
If it's some HA consensus system, one failure mode is "consensus failure,
everything goes into minority mode and offline": service lost, data fine.
Another is "partition with both groups thinking they are in charge", which is
more dangerous. Then there's "partitioning event not detected", which may be
so: consider the failure modes and then consider not so much whether the tech
you are using is vulnerable to them, but "if it goes wrong, does it matter?"
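
To make the first two failure modes concrete, here is a minimal sketch in plain Python. It is not taken from ZooKeeper, etcd or Consul; the function names and the "threshold" parameter are hypothetical, chosen only for illustration. It assumes a group of nodes may elect a leader only when it contains at least `threshold` members, and shows that a strict-majority threshold can leave the cluster offline but never with two leaders, while a weaker threshold allows both sides of a partition to think they are in charge.

```python
def majority(cluster_size: int) -> int:
    """Smallest group size that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def count_leaders(groups, threshold: int) -> int:
    """Number of partitioned groups large enough to elect a leader."""
    return sum(1 for size in groups if size >= threshold)


def outcome(groups, threshold: int) -> str:
    """Classify what happens when the cluster splits into `groups`."""
    leaders = count_leaders(groups, threshold)
    if leaders == 0:
        return "offline"      # service lost, data fine
    if leaders == 1:
        return "one leader"   # the majority side keeps running
    return "split brain"      # both groups think they are in charge


if __name__ == "__main__":
    five = 5
    print(outcome([3, 2], majority(five)))     # 3-node side keeps quorum
    print(outcome([2, 2, 1], majority(five)))  # nobody has a majority
    print(outcome([3, 2], 2))                  # weak rule: two leaders
```

With a strict majority, two disjoint groups can never both reach the threshold, which is why the worst case there is an outage rather than conflicting writes.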

Even before HDFS had HA with ZK/bookkeeper it didn't fail very often. And if you
looked at the causes of those failures, things like backbone switch failure are
so traumatic that things like ZK/etcd failures aren't going to make much of a
difference. The filesystem is down.
Generally, integrity gets priority over availability. That said, S3 and the like
have put availability ahead of consistency; Cassandra can offer that
too. Sometimes it is the right strategy.
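
As a rough illustration of how a store with tunable consistency trades integrity against availability, the sketch below encodes the commonly cited quorum-overlap rule: with N replicas, a read that waits for R replicas is guaranteed to overlap a write acknowledged by W replicas only when R + W > N. The function and parameter names are hypothetical and this is not Cassandra's API, only the arithmetic behind the choice.

```python
def read_overlaps_latest_write(replicas: int, write_acks: int, read_acks: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return read_acks + write_acks > replicas


if __name__ == "__main__":
    n = 3
    # Quorum writes and quorum reads (2 of 3 each): sets always overlap.
    print(read_overlaps_latest_write(n, write_acks=2, read_acks=2))
    # Single-replica writes and reads: more available, but a read may miss
    # the latest write.
    print(read_overlaps_latest_write(n, write_acks=1, read_acks=1))
```

Lower values of R and W keep the service answering when replicas are unreachable, at the cost of possibly stale reads, which is the availability-first choice described above.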