kafka-users mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Architecture recommendations for a tricky use case
Date Thu, 29 Sep 2016 18:51:05 GMT
The OP didn't say anything about Yarn, and why are you contemplating
putting Kafka or Spark on public networks to begin with?

Gwen's right, absent any actual requirements this is kind of pointless.

On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
<msegel_hadoop@hotmail.com> wrote:
> Spark standalone is not Yarn… or secure for that matter… ;-)
>
>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <cody@koeninger.org> wrote:
>>
>> Spark streaming helps with aggregation because
>>
>> A. raw kafka consumers have no built in framework for shuffling
>> amongst nodes, short of writing into an intermediate topic (I'm not
>> touching Kafka Streams here, I don't have experience), and
>>
>> B. it deals with batches, so you can transactionally decide to commit
>> or rollback your aggregate data and your offsets.  Otherwise your
>> offsets and data store can get out of sync, leading to lost /
>> duplicate data.
>>
>> Regarding long running spark jobs, I have streaming jobs in the
>> standalone manager that have been running for 6 months or more.
>>
>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>> <msegel_hadoop@hotmail.com> wrote:
>>> Ok… so what’s the tricky part?
>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in
>>> processing… it would work.
>>>
>>> The drawback is that you now have a long running Spark Job (assuming under
>>> YARN) and that could become a problem in terms of security and resources.
>>> (How well does Yarn handle long running jobs these days in a secured
>>> Cluster? Steve L. may have some insight… )
>>>
>>> Raw HDFS would become a problem because Apache HDFS is still WORM (write
>>> once, read many). (Do you want to write your own compaction code? Or use
>>> Hive 1.x+?)
>>>
>>> HBase? Depending on your admin… stability could be a problem.
>>> Cassandra? That would be a separate cluster and that in itself could be a problem…
>>>
>>> YMMV so you need to address the pros/cons of each tool specific to your
>>> environment and skill level.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac200@gmail.com> wrote:
>>>>
>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>
>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>>>> data into Kafka.
>>>>
>>>> I need to:
>>>>
>>>> - Do ETL on the data, and standardize it.
>>>>
>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>>>> ElasticSearch / Postgres)
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web UI
>>>> which will be the front-end to the data, and will show the reports)
>>>>
>>>> Java is being used as the backend language for everything (backend of the
>>>> web UI, as well as the ETL layer)
>>>>
>>>> I'm considering:
>>>>
>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
>>>> raw data from Kafka, standardize & store it)
>>>>
>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
>>>> and to allow queries
>>>>
>>>> - In the backend of the web UI, I could either use Spark to run queries
>>>> across the data (mostly filters), or directly run queries against
>>>> Cassandra / HBase
>>>>
>>>> I'd appreciate some thoughts / suggestions on which of these alternatives
>>>> I should go with (e.g., using raw Kafka consumers vs Spark for ETL, which
>>>> persistent data store to use, and how to query that data store in the
>>>> backend of the web UI, for displaying the reports).
>>>>
>>>>
>>>> Thanks.
>>>
>
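
Point B in the quoted reply above (committing or rolling back the aggregate
data and the offsets together) can be sketched roughly as follows. This is
only an illustration under stated assumptions, not code from the thread:
SQLite stands in for whatever transactional store is actually used, and the
helper names (open_store, apply_batch) and the table layout are hypothetical.
No Kafka or Spark API is called; a batch is just a list of (key, value) pairs
covering the offset range [from_offset, until_offset) of one partition.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) store holding aggregates and offsets side by side."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "agg_key TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "topic TEXT, partition_id INTEGER, next_offset INTEGER NOT NULL, "
        "PRIMARY KEY (topic, partition_id))"
    )
    conn.commit()
    return conn


def apply_batch(conn, topic, partition_id, from_offset, until_offset, records):
    """Fold one batch into the aggregates and advance the offset atomically.

    Returns False (and changes nothing) if the stored offset does not match
    from_offset, which is what happens when a batch is replayed.
    """
    # Pre-aggregate in memory so the transaction stays short.
    sums = {}
    for key, value in records:
        sums[key] = sums.get(key, 0) + value

    # "with conn" commits if the block succeeds and rolls back on an
    # exception, so aggregates and offsets cannot drift apart.
    with conn:
        row = conn.execute(
            "SELECT next_offset FROM offsets WHERE topic = ? AND partition_id = ?",
            (topic, partition_id),
        ).fetchone()
        stored = row[0] if row else 0
        if stored != from_offset:
            return False  # already applied (or out of order): skip it
        for key, subtotal in sums.items():
            conn.execute(
                "INSERT OR IGNORE INTO counts (agg_key, total) VALUES (?, 0)",
                (key,),
            )
            conn.execute(
                "UPDATE counts SET total = total + ? WHERE agg_key = ?",
                (subtotal, key),
            )
        conn.execute(
            "INSERT OR REPLACE INTO offsets (topic, partition_id, next_offset) "
            "VALUES (?, ?, ?)",
            (topic, partition_id, until_offset),
        )
    return True
```

Because the offset row is written in the same transaction as the counts, a
crash before commit loses both and the batch is simply reprocessed, while a
replay after commit is detected by the offset check and skipped. On restart,
the consumer would resume from the next_offset values read back from the
store rather than from offsets committed elsewhere.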
