kafka-users mailing list archives

From Grant Henke <ghe...@cloudera.com>
Subject Re: Topic per entity
Date Mon, 02 Nov 2015 16:44:03 GMT
Hi Alex & Andrew,

There was a discussion with some pointers on this mailing list a while ago
titled "mapping events to topics". I suggest taking a look at that thread:

If you still have questions, don't hesitate to ask.
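In the meantime, one pattern worth mentioning: instead of one topic per entity, use one topic per entity *type* and set the entity ID as the message key. Kafka's default partitioner hashes the key, so every event for a given entity lands on the same partition in publication order, and a consumer can filter by key. A minimal Python sketch of the idea (the `dataset-*` IDs and partition count are made up for illustration, and CRC32 stands in for the murmur2 hash the real default partitioner uses):

```python
import zlib

NUM_PARTITIONS = 8  # partition count of a hypothetical "datasets" topic

def partition_for(entity_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map an entity ID (the message key) to a partition.

    CRC32 is a simplified stand-in for Kafka's murmur2-based default
    partitioner; the property that matters is determinism."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_partitions

events = [
    ("dataset-42", "uploaded"),
    ("dataset-7", "uploaded"),
    ("dataset-42", "analysis-started"),
    ("dataset-42", "analysis-finished"),
]

# Group events by partition: all dataset-42 events share one partition,
# so a consumer of that partition sees them in the order they were produced.
by_partition = {}
for entity_id, event in events:
    by_partition.setdefault(partition_for(entity_id), []).append((entity_id, event))
```

With the real Java client this is just `new ProducerRecord<>(topic, entityId, payload)` — no custom partitioner needed — and you stay well clear of the per-topic/partition limits a topic-per-entity design runs into.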


On Sat, Oct 31, 2015 at 3:19 AM, Andrew Stevenson <astevenson@outlook.com>

> I too would be interested in any responses to this question.
> I'm using Kafka for event notification, and once it is secured I will put
> the real payload in it and take advantage of the durable commit log. I want
> to let users describe a DAG in OrientDB and have the Kafka client processor
> load and execute it. Each processor would then attach its lineage and
> provenance back to OrientDB's graph store.
> This way I can let users replay stress scenarios, calculate VaR, etc., with
> one source of replayable truth. Compliance and regulatory authorities like
> this.
> Regards
> Andrew
> ________________________________
> From: Alex Buchanan<mailto:buchanae@gmail.com>
> Sent: 31/10/2015 05:30
> To: users@kafka.apache.org<mailto:users@kafka.apache.org>
> Subject: Topic per entity
> Hey Kafka community.
> I'm researching possible architectures for a distributed data processing
> system. In this system, there's a close relationship between a specific
> dataset and the processing code. A user might upload a few datasets and
> write code to run analysis on that data. In other words, the analysis code
> frequently pulls data from a specific entity.
> Kafka is attractive for lots of reasons:
> - I'll need messaging anyway
> - I want a model for immutability of data (mutable state and potential job
> failure don't mix)
> - cross-language clients
> - the change stream concept could have some nice uses (such as updating
> visualizations without rebuilding)
> - Samza's model of state management is a simple way to think about external
> data without worrying too much about network-based RPC
> - as a source-of-truth data store, it's really simple: no mutability,
> complex queries, etc. Just a log. To me, that helps prevent abuse and
> mistakes.
> - it fits well with the concept of pipes, frequently found in data analysis
> But most of the Kafka examples are about processing a large stream of a
> specific _type_, not so much about processing specific entities. And I
> understand there are limits on topic counts (file/node limits on the
> filesystem and in ZooKeeper), and that modeling topics on characteristics
> of the data is discouraged. Still, in this system it feels more natural to
> have a topic per entity, so the processing code can connect directly to the
> data it wants.
> So I need a little guidance from smart people. Am I lost in the rabbit
> hole? Maybe I'm trying to force Kafka into this territory it's not suited
> for. Have I been reading too many (awesome) articles about the role of the
> log and streaming in distributed computing? Or am I on the right track and
> I just need to put in some work to jump the hurdles (such as topic storage
> and coordination)?
> It sounds like Cassandra might be another good option, but I don't know
> much about it yet.
> Thanks guys!

Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
