spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: Is HDFS required for Spark streaming?
Date Tue, 08 Sep 2015 18:54:31 GMT
You can use local directories in that case, but it is not recommended and
not a well-tested code path (so I have no idea what can happen).
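
[For readers finding this thread later: a minimal sketch of the pattern Nikunj
describes, under local[n] with a local checkpoint directory. Names, paths, and
the socket source are placeholders, not part of the original application; on a
real cluster the checkpoint path would normally point at HDFS, per TD's earlier
reply.]

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalCheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Local path; use an HDFS URI for fault-tolerant recovery on a cluster.
    val checkpointDir = "/tmp/spark-checkpoints"

    // Blow the checkpoint directory away first, as described in the thread,
    // instead of recovering the StreamingContext from it.
    val dir: Path = Paths.get(checkpointDir)
    if (Files.exists(dir)) {
      Files.walk(dir)
        .sorted(Comparator.reverseOrder[Path]())
        .forEach((p: Path) => Files.delete(p))
    }

    val conf = new SparkConf().setMaster("local[4]").setAppName("WindowedCounts")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpointing must be enabled for reduceByKeyAndWindow with an
    // inverse-reduce function, even when recovery is never used.
    ssc.checkpoint(checkpointDir)

    // Placeholder source; the thread's application reads from network/Flume.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Deleting the directory up front means a restart always builds a fresh
StreamingContext rather than calling StreamingContext.getOrCreate against
stale checkpoint data.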

On Tue, Sep 8, 2015 at 6:59 AM, Cody Koeninger <cody@koeninger.org> wrote:

> Yes, local directories will be sufficient
>
> On Sat, Sep 5, 2015 at 10:44 AM, N B <nb.nospam@gmail.com> wrote:
>
>> Hi TD,
>>
>> Thanks!
>>
>> So our application does turn on checkpoints but we do not recover upon
>> application restart (we just blow the checkpoint directory away first and
>> re-create the StreamingContext) as we don't have a real need for that type
>> of recovery. However, because the application does reduceeByKeyAndWindow
>> operations, checkpointing has to be turned on. Do you think this scenario
>> will also only work with HDFS or having local directories suffice?
>>
>> Thanks
>> Nikunj
>>
>>
>>
>> On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <tdas@databricks.com>
>> wrote:
>>
>>> Shuffle spills will use local disk; HDFS is not needed.
>>> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
>>> fault-tolerance, so that state can be recovered even if the Spark cluster
>>> nodes go down.
>>>
>>> TD
>>>
>>> On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nospam@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a Spark Streaming program that is currently running on a single
>>>> node in "local[n]" master mode. We currently give it local directories for
>>>> Spark's own state management etc. The input streams in from network/Flume
>>>> and the output also goes to network/Kafka etc., so the process as such
>>>> does not need any distributed file system.
>>>>
>>>> Now, we do want to start distributing this processing across a few
>>>> machines and make a real cluster out of it. However, I am not sure if HDFS
>>>> is a hard requirement for that to happen. I am thinking about the Shuffle
>>>> spills, DStream/RDD persistence and checkpoint info. Do any of these
>>>> require the state to be shared via HDFS? Are there other alternatives that
>>>> can be utilized if state sharing is accomplished via the file system only?
>>>>
>>>> Thanks
>>>> Nikunj
>>>>
>>>>
>>>
>>
>
