kafka-users mailing list archives

From Gwen Shapira <gshap...@cloudera.com>
Subject Re: Pulling Snapshots from Kafka, Log compaction last compact offset
Date Sun, 10 May 2015 06:48:44 GMT
Hi Jonathan,

I agree we can have topic-per-table, but some transactions span multiple
tables, and their changes will then be applied partially out of order. For
example, a transaction that inserts an order and its order lines could
become visible on the consumer side with the lines but not the order. I
suspect this is a consistency issue that can create a state different from
the state in the original database, but I don't have good proof of it.

I know that Oracle Streams has a "Parallel Apply" feature where it figures
out whether transactions have dependencies and applies them in parallel
only if they don't. So it sounds like dependencies may be an issue.
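
Roughly, I imagine the dependency check working something like this (an
untested Java sketch; all names are made up, and a real system would track
dependencies far more carefully):

    import java.util.*;
    import java.util.concurrent.*;

    // Sketch: apply a transaction in parallel only if it touches no rows
    // that an in-flight transaction is already touching; otherwise wait,
    // which preserves apply order for conflicting transactions.
    class DependencyAwareApplier {
        private final Set<String> inFlightRowKeys = new HashSet<>();
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        // txRowKeys: "table:pk" ids for every row the transaction touches
        synchronized void apply(List<String> txRowKeys, Runnable applyFn)
                throws InterruptedException {
            while (!Collections.disjoint(inFlightRowKeys, txRowKeys)) {
                wait();  // a conflicting transaction is still being applied
            }
            inFlightRowKeys.addAll(txRowKeys);
            pool.submit(() -> {
                try {
                    applyFn.run();
                } finally {
                    synchronized (DependencyAwareApplier.this) {
                        inFlightRowKeys.removeAll(txRowKeys);
                        DependencyAwareApplier.this.notifyAll();
                    }
                }
            });
        }
    }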

Planning to give this more thought :)

Gwen

On Fri, May 1, 2015 at 7:56 PM, Jonathan Hodges <hodgesz@gmail.com> wrote:

> Hi Gwen,
>
> As you said, I see Bottled Water and Sqoop as managing slightly different
> use cases, so I don't see this feature as a Sqoop killer. However, I did
> have a question about your comment that the transaction log (CDC) approach
> will have problems with very large, very active databases.
>
> I get that you need a single producer transmitting the transaction log
> changes to Kafka in order. However, on the consumer side you can have a
> topic per table and then partition those topics by primary key to achieve
> nice parallelism. So it seems the producer is the potential bottleneck,
> but I imagine you can scale it vertically and put proper HA in place.
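>
> Something like this sketch is what I have in mind on the producer side
> (broker address, topic naming, and the sample event are just placeholders):
>
>     import java.util.Properties;
>     import org.apache.kafka.clients.producer.KafkaProducer;
>     import org.apache.kafka.clients.producer.ProducerRecord;
>
>     public class PerTableCdcProducer {
>         public static void main(String[] args) {
>             Properties props = new Properties();
>             props.put("bootstrap.servers", "broker1:9092");
>             props.put("key.serializer",
>                 "org.apache.kafka.common.serialization.StringSerializer");
>             props.put("value.serializer",
>                 "org.apache.kafka.common.serialization.StringSerializer");
>             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>                 // one made-up change event from the log reader
>                 String table = "orders", pk = "order-42", payload = "{\"status\":\"shipped\"}";
>                 // topic per table; keying by primary key hashes all versions
>                 // of a row into the same partition, so per-row order is kept
>                 producer.send(new ProducerRecord<>("cdc." + table, pk, payload));
>             }
>         }
>     }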
>
> Would love to hear your thoughts on this.
>
> Jonathan
>
>
>
> On Thu, Apr 30, 2015 at 5:09 PM, Gwen Shapira <gshapira@cloudera.com>
> wrote:
>
> > I feel a need to respond to the Sqoop-killer comment :)
> >
> > 1) Note that most databases have a single transaction log per DB, and in
> > order to get the correct view of the DB, you need to read it in order
> > (otherwise transactions will get messed up). This means you are limited
> > to a single producer reading data from the log, writing it to a single
> > partition, and getting it read by a single consumer. If the database is
> > very large and very active, you may run into some issues there...
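> >
> > To make the constraint concrete, the ordered pipeline ends up looking
> > roughly like this (a sketch; the list stands in for reading the DB
> > transaction log in commit order):
> >
> >     import java.util.*;
> >     import org.apache.kafka.clients.producer.KafkaProducer;
> >     import org.apache.kafka.clients.producer.ProducerRecord;
> >
> >     public class OrderedLogShipper {
> >         public static void main(String[] args) throws Exception {
> >             Properties props = new Properties();
> >             props.put("bootstrap.servers", "broker1:9092");
> >             props.put("key.serializer",
> >                 "org.apache.kafka.common.serialization.StringSerializer");
> >             props.put("value.serializer",
> >                 "org.apache.kafka.common.serialization.StringSerializer");
> >             List<String> committedChanges = Arrays.asList("tx1:...", "tx2:...");
> >             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
> >                 for (String change : committedChanges) {
> >                     // everything goes to partition 0 of one topic: global
> >                     // order is kept, but all parallelism is lost
> >                     producer.send(new ProducerRecord<>("db-changelog", 0, null, change))
> >                             .get();  // block per send so order survives retries
> >                 }
> >             }
> >         }
> >     }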
> >
> > Because Sqoop doesn't try to catch up with all the changes, but takes a
> > snapshot (from multiple mappers in parallel), we can very rapidly Sqoop
> > 10TB databases.
> >
> > 2) If HDFS is the target of getting data from Postgres, then postgresql ->
> > kafka -> HDFS seems less optimal than postgresql -> HDFS directly (in
> > parallel). There are good reasons to get Postgres data to Kafka, but if
> > the eventual goal is HDFS (or HBase), I suspect Sqoop still has a place.
> >
> > 3) Due to its parallelism and general-purpose JDBC connector, I suspect
> > that Sqoop is even a very viable way of getting data into Kafka.
> >
> > Gwen
> >
> >
> > On Thu, Apr 30, 2015 at 2:27 PM, Jan Filipiak <Jan.Filipiak@trivago.com>
> > wrote:
> >
> > > Hello Everyone,
> > >
> > > I am quite excited about the recent example of replicating PostgreSQL
> > > changes to Kafka. My view of the log compaction feature had always been
> > > a very sceptical one, but now, with its great potential exposed to the
> > > wide public, I think it's an awesome feature. Especially when pulling
> > > this data into HDFS as a snapshot, it is (IMO) a Sqoop killer. So I want
> > > to thank everyone who had the vision of building these kinds of systems
> > > at a time when I could not imagine them.
> > >
> > > There is one open question that I would like people to help me with.
> > > When pulling a snapshot of a partition into HDFS using a Camus-like
> > > application, I feel the need to keep a set of all keys read so far and
> > > to stop as soon as I find a key that is already in my set. I use this as
> > > an indicator of how far log compaction has already progressed and only
> > > pull up to this point. This works quite well, as I only need to keep the
> > > keys in memory, not the messages.
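> > >
> > > In sketch form it looks like this (using the Java consumer API just for
> > > illustration; broker, topic, and the HDFS sink are placeholders):
> > >
> > >     import java.util.*;
> > >     import org.apache.kafka.clients.consumer.ConsumerRecord;
> > >     import org.apache.kafka.clients.consumer.KafkaConsumer;
> > >     import org.apache.kafka.common.TopicPartition;
> > >
> > >     public class CompactedSnapshotPuller {
> > >         public static void main(String[] args) {
> > >             Properties props = new Properties();
> > >             props.put("bootstrap.servers", "broker1:9092");
> > >             props.put("group.id", "snapshot-puller");
> > >             props.put("key.deserializer",
> > >                 "org.apache.kafka.common.serialization.StringDeserializer");
> > >             props.put("value.deserializer",
> > >                 "org.apache.kafka.common.serialization.StringDeserializer");
> > >             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
> > >                 TopicPartition tp = new TopicPartition("db-changelog", 0);
> > >                 consumer.assign(Collections.singletonList(tp));
> > >                 consumer.seekToBeginning(Collections.singletonList(tp));
> > >                 Set<String> seenKeys = new HashSet<>();  // keys only, not messages
> > >                 boolean done = false;
> > >                 while (!done) {
> > >                     for (ConsumerRecord<String, String> rec : consumer.poll(1000)) {
> > >                         if (!seenKeys.add(rec.key())) {
> > >                             done = true;  // duplicate key: we are past the
> > >                             break;        // fully compacted prefix, stop here
> > >                         }
> > >                         writeToHdfs(rec); // stand-in for the Camus-style sink
> > >                     }
> > >                 }
> > >             }
> > >         }
> > >
> > >         static void writeToHdfs(ConsumerRecord<String, String> rec) { /* ... */ }
> > >     }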
> > >
> > > The question I want to raise with the community is:
> > >
> > > How do you prevent pulling the same record twice (in different versions),
> > > and would it be beneficial if the "OffsetResponse" also returned the last
> > > offset that has been compacted so far, so that the application could just
> > > pull up to that point?
> > >
> > > Looking forward to your recommendations and comments.
> > >
> > > Best
> > > Jan
> > >
> > >
> >
>
