spark-user mailing list archives

From Ofir Manor <ofir.ma...@equalum.io>
Subject Re: What do I loose if I run spark without using HDFS or Zookeeper?
Date Thu, 25 Aug 2016 20:35:27 GMT
Just to add one concrete example regarding the HDFS dependency.
Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you cannot do any window operation in a
cluster without checkpointing to HDFS (or S3).
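
To make the checkpointing point concrete, here is a minimal PySpark sketch,
not a definitive setup. As far as the linked guide goes, a checkpoint
directory is mandatory for stateful transformations such as updateStateByKey
and reduceByKeyAndWindow with an inverse function, and on a cluster that
directory has to live on shared, fault-tolerant storage. The checkpoint URI,
the socket source on localhost:9999 and the app name are assumptions made up
for illustration.

```python
# Hypothetical checkpoint location; any HDFS-compatible URI (hdfs://, s3a://) fits here.
CHECKPOINT_DIR = "hdfs://namenode:8020/spark/checkpoints/windowed-wordcount"


def add(a, b):
    """Fold counts that enter the window into the running total."""
    return a + b


def subtract(a, b):
    """Inverse of add: remove counts that slid out of the window."""
    return a - b


def create_context():
    """Build a fresh StreamingContext; used only when no checkpoint exists yet."""
    # PySpark is imported lazily so the helpers above can be tried without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="WindowedWordCount")  # hypothetical app name
    ssc = StreamingContext(sc, 10)                  # 10-second batches
    ssc.checkpoint(CHECKPOINT_DIR)                  # shared storage on a cluster

    lines = ssc.socketTextStream("localhost", 9999)  # assumed test source
    counts = (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        # 60-second window sliding every 20 seconds. Passing the inverse
        # function makes the window incremental, which requires checkpointing.
        .reduceByKeyAndWindow(add, subtract, 60, 20)
    )
    counts.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Recover from the checkpoint if one exists, otherwise build a new context.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling main() from a script submitted with spark-submit would start the
job. Pointing CHECKPOINT_DIR at a local path only works when everything runs
on one machine, which is the dependency being discussed in this thread.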

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io

On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Hi Kant,
>
> I trust the following would be of use.
>
> Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.
>
> At the heart of it, and with reference to the points you raised about HDFS,
> one needs a working knowledge of the Hadoop core system, including HDFS,
> the MapReduce algorithm and YARN, whether one uses them or not. After all,
> Big Data is all about horizontal scaling with a master and nodes (as opposed
> to vertical scaling, like SQL Server running on a single host) and
> distributed data (by default, HDFS replicates data three times on different
> nodes for scalability and availability).
>
> Other members, including Sean, described the limits on how far one can
> operate Spark on its own. If you are going to deal with data (data in motion
> and data at rest), then you will need to interact with some form of storage,
> and HDFS and compatible file systems like S3 are the natural choices.
>
> Zookeeper is not just about high availability. It is used in Spark
> Streaming with Kafka, and it is also used with Hive for concurrency. It is
> also a distributed locking system.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 20:52, Mark Hamstra <mark@clearstorydata.com> wrote:
>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <mark@clearstorydata.com>
>> wrote:
>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality so that the non-distributed data that you are
>>> sending out from the driver is actually paying a role more like code (or at
>>> least parameters.)  What is sent from the driver to an Executor is then
>>> used (typically as seeds or parameters) to execute some procedure on the
>>> Worker node that generates the actual data on the Workers.  After that, you
>>> proceed to execute in a more typical fashion with Spark using the
>>> now-instantiated distributed data.
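
A rough PySpark sketch of that seed-driven style follows, under stated
assumptions rather than as anyone's actual job: the driver ships only a
handful of integers, and each task expands its seed into data on the worker.
The function name generate_partition, the sample size and the app name are
hypothetical.

```python
import random


def generate_partition(seed, n=5):
    """Runs on an executor: expand one small seed into n pseudo-random samples."""
    rng = random.Random(seed)  # seeded, so a recomputed task yields the same data
    return [rng.random() for _ in range(n)]


def main():
    # Imported lazily so generate_partition can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="SeedExpansion")  # hypothetical app name
    seeds = range(8)  # tiny, non-distributed data created on the driver
    # One partition per seed; the bulk data only ever exists on the executors.
    data = sc.parallelize(seeds, numSlices=8).flatMap(generate_partition)
    print(data.count())
    sc.stop()
```

Seeding the generator keeps the partitions deterministic, so a lost
partition can be rebuilt from its seed without any distributed storage.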
>>>
>>> But I don't get the sense that this meta-programming-ish style is really
>>> what the OP was aiming at.
>>>
>>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <sowen@cloudera.com> wrote:
>>>
>>>> Without a distributed storage system, your application can only create
>>>> data on the driver and send it out to the workers, and collect data back
>>>> from the workers. You can't read or write data in a distributed way. There
>>>> are use cases for this, but pretty limited (unless you're running on 1
>>>> machine).
>>>>
>>>> I can't really imagine a serious use of (distributed) Spark without
>>>> (distributed) storage; in a way, I don't think many apps exist that don't
>>>> read/write data.
>>>>
>>>> The premise here is not just replication, but partitioning data across
>>>> compute resources. With a distributed file system, your big input exists
>>>> across a bunch of machines and you can send the work to the pieces of data.
>>>>
>>>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth909@gmail.com>
>>>> wrote:
>>>>
>>>>> @Mich I understand why I would need Zookeeper. It is there for fault
>>>>> tolerance, given that Spark is a master-slave architecture: when a
>>>>> master goes down, Zookeeper will run a leader election algorithm to
>>>>> elect a new leader. However, DevOps hate Zookeeper; they would be much
>>>>> happier to go with etcd & consul, and it looks like if we use the Mesos
>>>>> scheduler we should be able to drop Zookeeper.
>>>>>
>>>>> HDFS I am still trying to understand why I would need for Spark. I
>>>>> understand the purpose of distributed file systems in general, but I
>>>>> don't understand it in the context of Spark, since many people say you
>>>>> can run a distributed Spark cluster in standalone mode, and I am not
>>>>> sure what the pros/cons are if we do it that way. In the Hadoop world, I
>>>>> understand that one of the reasons HDFS is there is for replication; in
>>>>> other words, if we write some data to HDFS, it will store that block
>>>>> across different nodes such that if one of the nodes goes down it can
>>>>> still retrieve that block from other nodes. In the context of Spark I am
>>>>> not really sure, because 1) I am new and 2) the Spark paper says it
>>>>> doesn't replicate data; instead it stores the lineage (all the
>>>>> transformations) such that it can reconstruct it.
>>>>>
>>>>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh
>>>>> mich.talebzadeh@gmail.com wrote:
>>>>>
>>>>>> You can use Spark on Oracle as a query tool.
>>>>>>
>>>>>> It all depends on the mode of the operation.
>>>>>>
>>>>>> If you are running Spark in yarn-client/yarn-cluster mode then you will
>>>>>> need YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN).
>>>>>>
>>>>>> I have not gone and installed Yarn without installing Hadoop.
>>>>>>
>>>>>> What is the overriding reason to run Spark on its own?
>>>>>>
>>>>>> You can use Spark in Local or Standalone mode if you do not want
>>>>>> Hadoop core.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 24 August 2016 at 21:54, kant kodali <kanth909@gmail.com> wrote:
>>>>>>
>>>>>> What do I lose if I run Spark without using HDFS or Zookeeper? Which
>>>>>> of them is almost a must in practice?
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>
