spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: What do I loose if I run spark without using HDFS or Zookeeper?
Date Fri, 26 Aug 2016 12:46:05 GMT
And yes any technology needs time for maturity but that said it shouldn't
stop us from transitioning...

That depends on the application and how mission-critical the business it is
deployed for is. If you are using a tool for a bank's credit risk
(surveillance, anti-money laundering, employee compliance, anti-fraud etc.)
and the tool misses a big chunk for whatever reason, then the first thing
that happens is the bank gets fined ($millions), and I will be looking for a
new job at London Transport.

On the other hand, if the tool is used for social media, sentiment analysis
and all that sort of stuff, I don't think anyone is going to lose sleep.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 August 2016 at 12:58, kant kodali <kanth909@gmail.com> wrote:

> @Steve your arguments make sense; however, a good majority of people who
> have extensive experience with ZooKeeper prefer to avoid it, and given the
> ease of Consul (which, btw, uses Raft for leader election) and etcd, a lot
> of us are more inclined to avoid ZK.
>
> And yes any technology needs time for maturity but that said it shouldn't
> stop us from transitioning. For example, people started using Spark when it
> was first released instead of waiting for Spark 2.0, where there are a lot
> of optimizations and bug fixes.
>
>
>
> On Fri, Aug 26, 2016 2:50 AM, Steve Loughran stevel@hortonworks.com wrote:
>
>>
>> On 25 Aug 2016, at 22:49, kant kodali <kanth909@gmail.com> wrote:
>>
>> Yeah, so it seems like it's a work in progress. At the very least, Mesos
>> took the initiative to provide alternatives to ZK. I am really looking
>> forward to this.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>>
>> I worry about any attempt to implement distributed consensus systems:
>> they take time in production to get right.
>>
>> 1. There's the need to prove that what you are building is valid if the
>> implementation matches the specification. That has apparently been done for
>> ZK, though given the complexity of maths involved, I cannot vouch for that
>> myself:
>> https://blog.acolyer.org/2015/03/09/zab-high-performance-broadcast-for-primary-backup-systems/
>>
>> 2. You need to run it in production to find the problems. Google's Chubby
>> paper hints at the things they found went wrong there. As far as ZK
>> goes, Jepsen suggests it's robust:
>>
>> https://aphyr.com/posts/291-jepsen-zookeeper
>>
>> If it has weaknesses, I'd point at
>>  - its security model,
>>  - its lack of helpfulness when there are Kerberos/SASL auth problems (ZK
>> server closes connection; client sees connection failure and retries),
>>  - the fact that its failure modes aren't always understood by people
>> coding against it.
>>
>> http://blog.cloudera.com/blog/2014/03/zookeeper-resilience-at-pinterest/
>>
>> The Raft algorithm appears to be easier to implement than Paxos; there
>> are things built on it and I look forward to seeing what works/doesn't work
>> in production.
>>
>> Certainly Aphyr found problems when pointing Jepsen at etcd, though
>> being a 2014 piece of work, I expect those specific problems to have been
>> addressed. The main thing is: it shows how hard it is to get things right
>> in the presence of complex failures.
>>
>> Finally, regarding S3
>>
>> You can use the S3 object store as a source of data in queries/streaming,
>> and, if done carefully, as a destination. Performance is variable; that is
>> something some of us are working on, across S3a, Spark and Hive.
>>
>> Conference placement: I shall be talking on that topic at Spark Summit
>> Europe if you want to find out more: https://spark-summit.org/eu-2016/
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgummelt@mesosphere.io
>> wrote:
>>
>> Mesos also uses ZK for leader election.  There seems to be some effort in
>> supporting etcd, but it's in progress:
>> https://issues.apache.org/jira/browse/MESOS-1806
>>
>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali <kanth909@gmail.com> wrote:
>>
>> @Ofir @Sean very good points.
>>
>> @Mike We don't use Kafka or Hive, and I understand that ZooKeeper can do
>> many things, but for our use case all we need it for is high availability.
>> Given the frustrations of the devops people here in our company, who had
>> extensive experience managing large clusters in the past, we would be very
>> happy to avoid ZooKeeper. I also heard that Mesos can provide high
>> availability through etcd and Consul, and if that is true I will be left
>> with the following stack:
>>
>>
>>
>>
>>
>> Spark + Mesos scheduler + distributed file system (or, to be precise, I
>> should say distributed storage, since S3 is an object store; I guess this
>> will be HDFS for us) + etcd & Consul. Now the big question for me is: how
>> do I set all this up?
