spark-user mailing list archives

From Cody Koeninger <>
Subject Re: Is HDFS required for Spark streaming?
Date Tue, 08 Sep 2015 13:59:28 GMT
Yes, local directories will be sufficient
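For concreteness, a minimal sketch of the pattern Nikunj describes — deleting the checkpoint directory up front rather than recovering from it, then recreating the StreamingContext with a local checkpoint path. The paths, port, and durations below are hypothetical, not from the thread:

```scala
import java.io.File
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "/tmp/spark-checkpoint" // hypothetical local path

    // Blow away any previous checkpoint state instead of recovering from it,
    // as described in the thread.
    def deleteRecursively(f: File): Unit = {
      Option(f.listFiles).foreach(_.foreach(deleteRecursively))
      f.delete()
    }
    deleteRecursively(new File(checkpointDir))

    val conf = new SparkConf().setMaster("local[4]").setAppName("WindowedCounts")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // reduceByKeyAndWindow (with an inverse reduce function) requires
    // checkpointing to be enabled, so point it at the local directory.
    ssc.checkpoint(checkpointDir)

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val counts = lines
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

When the application never attempts recovery via `StreamingContext.getOrCreate`, this checkpoint directory only has to be readable by the driver and executors that wrote it, which is why a local path suffices here.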

On Sat, Sep 5, 2015 at 10:44 AM, N B <> wrote:

> Hi TD,
> Thanks!
> So our application does turn on checkpointing, but we do not recover upon
> application restart (we just blow the checkpoint directory away first and
> re-create the StreamingContext), as we don't have a real need for that type
> of recovery. However, because the application does reduceByKeyAndWindow
> operations, checkpointing has to be turned on. Do you think this scenario
> will also work only with HDFS, or will local directories suffice?
> Thanks
> Nikunj
> On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <> wrote:
>> Shuffle spills will use local disk, HDFS not needed.
>> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
>> fault-tolerance, so that state can be recovered even if the Spark cluster
>> nodes go down.
>> TD
>> On Fri, Sep 4, 2015 at 2:45 PM, N B <> wrote:
>>> Hello,
>>> We have a Spark Streaming program that is currently running on a single
>>> node in "local[n]" master mode. We currently give it local directories for
>>> Spark's own state management etc. The input is streaming from network/Flume
>>> and output is also to network/Kafka etc., so the process as such does not
>>> need any distributed file system.
>>> Now, we do want to start distributing this processing across a few
>>> machines and make a real cluster out of it. However, I am not sure if HDFS
>>> is a hard requirement for that to happen. I am thinking about the shuffle
>>> spills, DStream/RDD persistence, and checkpoint info. Do any of these
>>> require the state to be shared via HDFS? Are there other alternatives that
>>> can be utilized if state sharing is accomplished via the file system only?
>>> Thanks
>>> Nikunj
