kafka-users mailing list archives

From Andy Chambers <achambers.h...@gmail.com>
Subject Re: Partitioning at the edges
Date Sat, 03 Sep 2016 16:57:53 GMT
Hi Eno,

I'll try. We have a feed of transaction data from the bank, each record of
which we must try to associate with a customer in our system. Unfortunately
the transaction data doesn't include the customer-id itself, but rather a
variety of other identifiers that we can use to look up the customer-id in a
mapping table (this mapping table is also in Kafka).
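
In Kafka Streams terms that lookup is essentially a KStream-KTable join. A
rough sketch of what I mean, with topic names and plain String types standing
in for our real ones:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// raw feed, keyed by whichever identifier the bank gives us
KStream<String, String> txns = builder.stream("bank-transactions");

// identifier -> customer-id mapping, materialized as a table
KTable<String, String> idMap = builder.table("customer-id-mapping");

// left join: customerId comes back null when no mapping exists yet
KStream<String, String> enriched =
    txns.leftJoin(idMap, (txn, customerId) -> customerId + "|" + txn);

The catch is that this join only works if both topics are partitioned the
same way on that identifier, which is exactly the constraint we keep running
into.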

Once we have the customer-id, we produce some records to a "ledger-request"
topic (which is partitioned by customer-id). The exact sequence of records
produced depends on the type of transaction and on whether we were able to
find the customer-id, but if all goes well, the ledger will produce an event
on a "ledger-result" topic for each request record.

This is where we have a problem. The ledger is the bit that has to be super
performant, so these topics must be highly partitioned on customer-id. But
then we'd like to join the results with the original bank transaction (which,
remember, doesn't even have a customer-id) so we can mark it as handled or
not when the result comes back.
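
The join we'd want looks roughly like the sketch below. It assumes the ledger
echoes back a transaction-level id on each result (txnIdOf() and the stream
names are hypothetical):

import java.time.Duration;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

// re-key the results by the original bank-transaction id so they
// co-partition with the raw transaction feed again
KStream<String, String> resultsByTxnId =
    ledgerResults.selectKey((customerId, result) -> txnIdOf(result));

// windowed stream-stream join to mark the original transaction handled
KStream<String, String> marked = txnsByTxnId.join(
    resultsByTxnId,
    (txn, result) -> txn + "|handled",
    JoinWindows.of(Duration.ofMinutes(10)));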

The current plan is to just use a single partition for most topics, but then
"wrap" the ledger system with a process that re-partitions its input as
necessary for scale.
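
The "wrapper" would be little more than a re-key followed by a pipe through a
highly partitioned intermediate topic. A sketch with made-up names, where
"ledger-input-scaled" is created up front with as many partitions as the
ledger needs:

import org.apache.kafka.streams.kstream.KStream;

// re-key on customer-id, then write through a pre-created, highly
// partitioned topic so the ledger's work is spread across the cluster
KStream<String, String> ledgerInput = singlePartitionInput
    .selectKey((key, value) -> customerIdOf(value))
    .through("ledger-input-scaled");

where customerIdOf() stands in for the mapping-table lookup described above.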

Cheers,
Andy

On Sat, Sep 3, 2016 at 6:13 AM, Eno Thereska <eno.thereska@gmail.com> wrote:

> Hi Andy,
>
> Could you share a bit more info or pseudocode so that we can understand
> the scenario a bit better? Especially around the streams at the edges. How
> are they created and what is the join meant to do?
>
> Thanks
> Eno
>
> > On 3 Sep 2016, at 02:43, Andy Chambers <achambers.home@gmail.com> wrote:
> >
> > Hey Folks,
> >
> > We are having quite a bit of trouble modelling the flow of data through
> > a very Kafka-centric system.
> >
> > As I understand it, every stream you might want to join with another
> > must be partitioned the same way. But often streams at the edges of a
> > system *cannot* be partitioned the same way, because they don't have the
> > partition key yet (often the first job of such a process is to find the
> > key in some lookup table based on some other key we don't control).
> >
> > We have come up with a few solutions, but everything seems to add
> > complexity and back our designs into a corner.
> >
> > What is frustrating is that most of the data is not really that big,
> > but we have a handful of topics we expect to require a lot of
> > throughput.
> >
> > Is this just unavoidable complexity associated with scale, or am I
> > thinking about this in the wrong way? We're going all in on the
> > "turning the database inside out" architecture, but we end up spending
> > more time thinking about how stuff gets broken up into tasks and
> > distributed than we do thinking about our business.
> >
> > Do these problems seem familiar to anyone else?  Did you find any
> > patterns that helped keep the complexity down?
> >
> > Cheers
>
>
