spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: Spark resilience
Date Wed, 16 Apr 2014 06:49:58 GMT
1. Spark prefers to run tasks where the data is, but it can move
cached data between executors if no cores are free where the data was
initially cached (which is often much faster than recomputing the data from
scratch). The result is that data automatically spreads out across the
cluster after a few uses, stabilizing performance.

2. The default replication factor within Spark is 1, but it is easy to
change:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L146
Note that Spark even ships levels like StorageLevel.MEMORY_ONLY_2 out of
the box, which provides a replication factor of 2, and you can simply
construct a StorageLevel with higher replication on demand.
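As a minimal sketch (assuming an existing SparkContext named `sc`; the
StorageLevel factory's exact argument list varies across Spark versions, and
the four-argument form below follows the 0.9-era signature):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object ReplicationExample {
  // Built-in level: in-memory, deserialized, replicated to 2 executors.
  val twoCopies = StorageLevel.MEMORY_ONLY_2

  // Custom level with replication = 3. Argument order here
  // (useDisk, useMemory, deserialized, replication) follows the
  // 0.9-era factory; newer versions insert a useOffHeap parameter.
  val threeCopies = StorageLevel(
    false, // useDisk
    true,  // useMemory
    true,  // deserialized
    3      // replication
  )

  // `sc` is an assumed, already-constructed SparkContext.
  def cacheWithReplication(sc: SparkContext): Unit = {
    val rdd = sc.parallelize(1 to 1000)
    rdd.persist(threeCopies) // each cached partition is stored 3 times
  }
}
```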


On Tue, Apr 15, 2014 at 11:18 PM, Arpit Tak <arpit.sparkuser@gmail.com> wrote:

>
> 1. If we add more executors to a cluster where data is already cached
> (the RDDs are already in memory on the existing executors), will the job
> run tasks on the new executors even though the RDD partitions are not
> present there? If yes, how is performance on those new executors?
>
> 2. What is the in-memory replication factor in Spark (Hadoop's default
> is 3), and can we change it for Spark as well?
>
>
>
>
> On Tue, Apr 15, 2014 at 9:53 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
>
>> Thanks Aaron, this is useful !
>>
>> - Manoj
>>
>>
>> On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
>>
>>> Launching drivers inside the cluster was a feature added in 0.9, for
>>> standalone cluster mode:
>>> http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
>>>
>>> Note the "supervise" flag, which will cause the driver to be restarted
>>> if it fails. This is a rather low-level mechanism which by default will
>>> just cause the whole job to rerun from the beginning. Special recovery
>>> would have to be implemented by hand, via some sort of state checkpointing
>>> into a globally visible storage system (e.g., HDFS), which, for example,
>>> Spark Streaming already does.
>>>
>>> Currently, this feature is not supported in YARN or Mesos fine-grained
>>> mode.
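For reference, the in-cluster launch described above looked roughly like
this in 0.9's standalone mode (the cluster URL, jar location, and main class
are illustrative placeholders; check the standalone docs linked above for
the exact flags):

```shell
# Launch a driver inside a standalone cluster, with automatic restart
# on failure via --supervise (0.9-era syntax; hosts/paths illustrative).
./bin/spark-class org.apache.spark.deploy.Client launch \
  --supervise \
  spark://masterhost:7077 \
  hdfs://namenode:8020/user/me/my-app.jar \
  com.example.MyApp
```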
>>>
>>>
>>> On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
>>>
>>>> Could you please elaborate how drivers can be restarted automatically ?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson <ilikerps@gmail.com> wrote:
>>>>
>>>>> Master and slave are somewhat overloaded terms in the Spark ecosystem
>>>>> (see the glossary:
>>>>> http://spark.apache.org/docs/latest/cluster-overview.html#glossary).
>>>>> Are you actually asking about the Spark "driver" and "executors", or the
>>>>> standalone cluster "master" and "workers"?
>>>>>
>>>>> To briefly answer for either possibility:
>>>>> (1) Drivers are not fault tolerant but can be restarted automatically.
>>>>> Executors may be removed at any point without failing the job (though
>>>>> losing an Executor may slow the job significantly), and Executors may be
>>>>> added at any point and will be used immediately.
>>>>> (2) Standalone cluster Masters are fault tolerant: a Master failure will
>>>>> only temporarily stall new jobs from starting or acquiring new resources,
>>>>> but does not affect currently-running jobs. Workers can fail and will
>>>>> simply cause jobs to lose their current Executors. New Workers can be
>>>>> added at any point.
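As an illustration of that last point, adding a Worker to a running
standalone cluster is a one-liner (the master host and port below are
illustrative):

```shell
# Start a new Worker process and register it with the running Master;
# jobs can schedule Executors on it as soon as it registers.
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://masterhost:7077
```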
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira <ianferreira@hotmail.com> wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> I was wondering what the failure support modes were for Spark while
>>>>>> running jobs:
>>>>>>
>>>>>>
>>>>>>    1. What happens when a master fails?
>>>>>>    2. What happens when a slave fails?
>>>>>>    3. Can you add and remove slaves mid-job?
>>>>>>
>>>>>>
>>>>>> Regarding the install on Mesos: if I understand correctly, the Spark
>>>>>> master is behind a ZooKeeper quorum, so that isolates the slaves from a
>>>>>> master failure, but what about the masters behind the quorum?
>>>>>>
>>>>>> Cheers
>>>>>> - Ian
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
