spark-issues mailing list archives

From "Hari Shreedharan (JIRA)" <>
Subject [jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
Date Tue, 29 Apr 2014 20:21:14 GMT


Hari Shreedharan commented on SPARK-1645:

Yes, so I have a rough design for that in mind. The idea is to add a sink that plugs into
Flume and gets polled by the Spark receiver. That way, even if the node on which the worker
is running fails, the receiver on another node can poll the sink and pull the data. From the
Flume point of view, the sink does not "conform" to the definition of standard sinks (all
Flume sinks are push-only), but it can be written so that we don't lose data. Later, if/when
Flume adds support for pollable sinks, this sink can be ported.
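
The pull model described above can be sketched roughly as follows. This is a minimal illustration only, not the actual Flume or Spark Streaming APIs: the names `PollableSink` and `PollingReceiverDemo` are hypothetical, and a real implementation would sit on Flume's channel/transaction machinery. The key property is that the sink keeps a delivered batch until the receiver acknowledges it, so a receiver that dies before acking loses nothing; a replacement receiver on another node simply re-polls the same batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of a pollable sink: events are buffered, handed out
// in batches, and a batch is only discarded once the receiver acks it.
class PollableSink<E> {
    private final Queue<E> buffer = new ArrayDeque<>();
    private List<E> pending = null; // batch delivered but not yet acked

    synchronized void put(E event) {
        buffer.add(event);
    }

    // Hand out a batch but keep it until acknowledged, so a receiver
    // failure before ack cannot lose data: the next poll re-delivers it.
    synchronized List<E> poll(int maxBatch) {
        if (pending == null) {
            List<E> batch = new ArrayList<>();
            while (batch.size() < maxBatch && !buffer.isEmpty()) {
                batch.add(buffer.remove());
            }
            pending = batch;
        }
        return pending;
    }

    // Receiver has stored the batch reliably: commit (discard) it.
    synchronized void ack() {
        pending = null;
    }
}

public class PollingReceiverDemo {
    public static void main(String[] args) {
        PollableSink<String> sink = new PollableSink<>();
        sink.put("e1");
        sink.put("e2");
        sink.put("e3");

        List<String> first = sink.poll(2);  // receiver pulls a batch
        // Simulate the receiver's node failing before it sends an ack;
        // a replacement receiver re-polls and gets the same batch back.
        List<String> replay = sink.poll(2);
        if (!first.equals(replay)) {
            throw new AssertionError("batch should be re-delivered, not lost");
        }
        sink.ack();                         // data is safe: commit the batch

        System.out.println(sink.poll(2));   // prints [e3]
    }
}
```

In Flume proper, the buffer-until-ack behavior would map onto a channel transaction that is rolled back on failure rather than an in-memory queue, but the ordering guarantee is the same: discard only after acknowledgment.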

> Improve Spark Streaming compatibility with Flume
> ------------------------------------------------
>                 Key: SPARK-1645
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: Hari Shreedharan
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, else Flume
cannot send data to it. We can fix this by adding a Spark receiver that polls Flume, and
a Flume sink that supports this.
> * Receiver sends acks to Flume before the driver knows about the data. The new receiver
should also handle this case.
> * Data loss when driver goes down - This is true for any streaming ingest, not just Flume.
I will file a separate jira for this and we should work on it there. This is a longer term
project and requires considerable development work.
> I intend to start working on these soon. Any input is appreciated. (It'd be great if
someone can add me as a contributor on jira, so I can assign the jira to myself).

This message was sent by Atlassian JIRA
