kafka-users mailing list archives

From David Garcia <dav...@spiceworks.com>
Subject Re: Partitioning at the edges
Date Wed, 07 Sep 2016 20:43:51 GMT
Obviously for the keys you don’t have, you would have to look them up…sorry, I kinda missed
that part.  That is indeed a pain.  The job that looks those keys up would probably have to
batch queries to the external system.  Maybe you could use kafka-connect-jdbc to stream in
updates to that system?
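
Roughly, a kafka-connect-jdbc source connector for that lookup table could look
like the sketch below.  The connection URL, table, and column names are just
placeholders for whatever your external system actually has, and the mode
settings depend on how changes are recorded in that table:

    name=lookup-table-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://lookup-db:5432/lookups
    table.whitelist=key_mapping
    mode=timestamp+incrementing
    timestamp.column.name=updated_at
    incrementing.column.name=id
    topic.prefix=lookup-

That would give you a lookup-key_mapping topic you could read into a table and
join against locally, instead of querying the external system per record.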

-David


On 9/7/16, 3:41 PM, "David Garcia" <davidg@spiceworks.com> wrote:

    The “simplest” way to solve this is to “repartition” your data (i.e. the streams
    you wish to join) on the key you wish to join on.  This obviously introduces
    redundancy, but it will solve your problem.  For example, suppose you want to
    join topic T1 and topic T2, but they aren’t partitioned on the key you need.
    You could write a “simple” repartition job for each topic (you can actually do
    this with one job):
    
    T1 -> Job_T1 -> T1’
    T2 -> Job_T2 -> T2’
    
    T1’ and T2’ would be partitioned on your join key and would have the same
    number of partitions, so that you have the guarantees you need to do the join
    (i.e. join T1’ and T2’).
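    
    Roughly, with the Kafka Streams DSL the whole thing could look like the sketch
    below.  Topic names, serdes, and the extractJoinKey() helper are placeholders
    for whatever your data actually looks like, and you’d create T1-rekeyed and
    T2-rekeyed (playing the role of T1’ and T2’) with the same partition count up
    front:
    
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;
    import java.time.Duration;
    import java.util.Properties;
    
    public class RepartitionAndJoin {
    
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "repartition-join");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    
            StreamsBuilder builder = new StreamsBuilder();
    
            // Re-key each input topic on the join key and write it back out.
            builder.<String, String>stream("T1")
                   .selectKey((k, v) -> extractJoinKey(v))
                   .to("T1-rekeyed");
            builder.<String, String>stream("T2")
                   .selectKey((k, v) -> extractJoinKey(v))
                   .to("T2-rekeyed");
    
            // Both sides are now co-partitioned on the join key, so the join is safe.
            KStream<String, String> left = builder.stream("T1-rekeyed");
            KStream<String, String> right = builder.stream("T2-rekeyed");
            left.join(right,
                      (v1, v2) -> v1 + "|" + v2,
                      JoinWindows.of(Duration.ofMinutes(5)))
                .to("joined-output");
    
            new KafkaStreams(builder.build(), props).start();
        }
    
        // Placeholder: however the join key is derived from the record value.
        private static String extractJoinKey(String value) {
            return value.split(",")[0];
        }
    }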
    
    -David
    
    
    On 9/2/16, 8:43 PM, "Andy Chambers" <achambers.home@gmail.com> wrote:
    
        Hey Folks,
        
        We are having quite a bit of trouble modelling the flow of data through a
        very Kafka-centric system.
        
        As I understand it, every stream you might want to join with another must
        be partitioned the same way. But often streams at the edges of a system
        *cannot* be partitioned the same way because they don't have the partition
        key yet (often the work for this process is to find the key in some lookup
        table based on some other key we don't control).
        
        We have come up with a few solutions but everything seems to add complexity
        and backs our designs into a corner.
        
        What is frustrating is that most of the data is not really that big but we
        have a handful of topics we expect to require a lot of throughput.
        
        Is this just unavoidable complexity associated with scale, or am I
        thinking about this in the wrong way? We're going all in on the "turning
        the database inside out" architecture, but we end up spending more time
        thinking about how stuff gets broken up into tasks and distributed than
        about our business.
        
        Do these problems seem familiar to anyone else?  Did you find any patterns
        that helped keep the complexity down?
        
        Cheers
        
    
    
    

