spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: What do I loose if I run spark without using HDFS or Zookeeper?
Date Thu, 25 Aug 2016 19:39:18 GMT
Without a distributed storage system, your application can only create data
on the driver and send it out to the workers, and collect data back from
the workers. You can't read or write data in a distributed way. There are
use cases for this, but they are pretty limited (unless you're running on
1 machine).

I can't really imagine a serious use of (distributed) Spark without
(distributed) storage, in the same way that I don't think many apps exist
that don't read/write data.

The premise here is not just replication, but partitioning data across
compute resources. With a distributed file system, your big input exists
across a bunch of machines and you can send the work to the pieces of data.
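
As a rough illustration of "sending the work to the pieces of data", here is
a toy model in plain Python. It is not Spark code and does not use any Spark
API; `Node`, `partition_input` and `run_on_partitions` are made-up names, and
the round-robin placement is only an assumption standing in for whatever a
real distributed file system does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical worker that holds one partition of the input locally."""
    name: str
    partition: list = field(default_factory=list)


def partition_input(records, nodes):
    """Spread records round-robin across the nodes (assumed placement policy)."""
    for i, record in enumerate(records):
        nodes[i % len(nodes)].partition.append(record)


def run_on_partitions(nodes, work):
    """Apply the function to each node's local partition; return one result per node."""
    return [work(node.partition) for node in nodes]


nodes = [Node("worker-1"), Node("worker-2"), Node("worker-3")]
partition_input(list(range(1, 11)), nodes)

# The work runs where each piece of data lives.
partial_sums = run_on_partitions(nodes, sum)
# The driver only combines one small value per node.
total = sum(partial_sums)
```

In this model the driver never holds the full input: it ships a small
function out and gets one number back per node. Without distributed storage,
the driver would have to hold all the records itself and send them to the
workers first.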

On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth909@gmail.com> wrote:

> @Mich I understand why I would need Zookeeper. It is there for fault
> tolerance, given that Spark has a master-slave architecture: when a master
> goes down, Zookeeper runs a leader election algorithm to elect a new
> leader. However, DevOps hate Zookeeper; they would be much happier to go
> with etcd & consul, and it looks like if we use the Mesos scheduler we
> should be able to drop Zookeeper.
>
> HDFS I am still trying to understand why I would need for Spark. I
> understand the purpose of distributed file systems in general, but I don't
> understand it in the context of Spark, since many people say you can run a
> distributed Spark cluster in standalone mode, and I am not sure what the
> pros/cons are if we do it that way. In the Hadoop world I understand that
> one of the reasons HDFS is there is replication; in other words, if we
> write some data to HDFS it will store that block across different nodes,
> such that if one of the nodes goes down it can still retrieve that block
> from other nodes. In the context of Spark I am not really sure, because
> 1) I am new and 2) the Spark paper says it doesn't replicate data; instead
> it stores the lineage (all the transformations) so that it can
> reconstruct it.
>
> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebzadeh@gmail.com
> wrote:
>
>> You can use Spark on Oracle as a query tool.
>>
>> It all depends on the mode of the operation.
>>
>> If you are running Spark with yarn-client/cluster then you will need Yarn.
>> It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>
>> I have not gone and installed Yarn without installing Hadoop.
>>
>> What is the overriding reason to have the Spark on its own?
>>
>> You can use Spark in Local or Standalone mode if you do not want Hadoop
>> core.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 August 2016 at 21:54, kant kodali <kanth909@gmail.com> wrote:
>>
>> What do I lose if I run Spark without using HDFS or Zookeeper? Which of
>> them is almost a must in practice?
>>
>>
>>
