kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Garcia <dav...@spiceworks.com>
Subject Re: Partitioning at the edges
Date Wed, 07 Sep 2016 20:41:43 GMT
The “simplest” way to solve this is to “repartition” your data (i.e. the streams you
wish to join) with the partition key you wish to join on.  This obviously introduces redundancy,
but it will solve your problem.  For example.. suppose you want to join topic T1 and topic
T2…but they aren’t partitioned on the key you need.  You could write two “simple”
repartition jobs for each topic (you can actually do this with one job):

T1 -> Job_T1 -> T1’
T2 -> Job_T2 -> T2’

T1’ and T2’ would be partitioned on your join key and would have the same number of partitions
so that you have the guarantees you need to do the join.  (i.e. join T1’ and T2’).


On 9/2/16, 8:43 PM, "Andy Chambers" <achambers.home@gmail.com> wrote:

    Hey Folks,
    We are having quite a bit trouble modelling the flow of data through a very
    kafka centric system
    As I understand it, every stream you might want to join with another must
    be partitioned the same way. But often streams at the edges of a system
    *cannot* be partitioned the same way because they don't have the partition
    key yet (often the work for this process is to find the key in some lookup
    table based on some other key we don't control).
    We have come up with a few solutions but everything seems to add complexity
    and backs our designs into a corner.
    What is frustrating is that most of the data is not really that big but we
    have a handful of topics we expect to require a lot of throughput.
    Is this just unavoidable complexity asociated with scale or am I thinking
    about this in the wrong way. We're going all in on the "turning the
    database inside out" architecture but we end up spending more time thinking
    about how stuff gets broken up into tasks and distributed than we are about
    our business.
    Do these problems seem familiar to anyone else?  Did you find any patterns
    that helped keep the complexity down.

View raw message