kafka-users mailing list archives

From Daniel Schierbeck <daniel.schierb...@gmail.com>
Subject Re: Using Kafka as a persistent store
Date Sat, 11 Jul 2015 20:22:24 GMT
Radek: I don't see how data could be stored more efficiently than in Kafka
itself. It's optimized for cheap storage and offers high-performance bulk
export, which is exactly what you want for long-term archival.
On Fri, 10 Jul 2015 at 23:16 Rad Gruchalski <radek@gruchalski.com> wrote:

> Hello all,
>
> This is a very interesting discussion. I’ve been thinking of a similar use
> case for Kafka over the last few days.
> The usual data workflow with Kafka is most likely something like this:
>
> - ingest with Kafka
> - process with Storm / Samza / what have you
>   - put some processed data back on Kafka
>   - at the same time, store the raw data somewhere in case everything
>     has to be reprocessed in the future (HDFS or similar?)
>
> Currently Kafka offers two types of topics: a regular stream
> (non-compacted topic) and a compacted topic (key/value). In the case of a
> stream topic, when retention kicks in, the “old” data is trimmed and lost
> from Kafka. What if there were an additional cleanup setting: cold-store?
> Instead of trimming old data, Kafka would compile the old data into a
> separate log with its own index. The user would be free to decide what to
> do with such files: put them on NFS / S3 / Swift / HDFS… (a sketch of that
> step follows the list below). Actually, the index file is not needed. The
> only three things needed are:
>
>  - the folder name / partition index
>  - the log itself
>  - topic metadata at the time of taking the data out of the segment
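>
> A rough sketch of that cold-store step (the paths, file names, and metadata
> contents here are made up; a real implementation would live inside the
> broker rather than in a little script like this):
>
>     import java.io.File
>     import java.nio.file.{Files, StandardCopyOption}
>
>     val partitionDir = new File("/var/kafka-logs/kafka-test-0")
>     val archiveDir = new File("/coldstore/kafka-test-0")
>     archiveDir.mkdirs()
>
>     // Copy the segment log files out of the partition directory.
>     partitionDir.listFiles().filter(_.getName.endsWith(".log")).foreach { f =>
>       Files.copy(f.toPath, new File(archiveDir, f.getName).toPath,
>         StandardCopyOption.REPLACE_EXISTING)
>     }
>
>     // Topic metadata as it looked when the data was taken out of the segment.
>     Files.write(new File(archiveDir, "topic-metadata.properties").toPath,
>       "topic=kafka-test\npartition=0\n".getBytes("UTF-8"))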
>
> With all this info, reading the data back is fairly easy, even without
> starting Kafka. A sample program goes like this (Scala-ish):
>
>     import java.io.File
>     import java.util.Properties
>     import kafka.log.{Log, LogConfig}
>
>     val props = new Properties()
>     props.put("segment.bytes", "1073741824")     // 1 GB log segments
>     props.put("segment.index.bytes", "10485760") // 10 MB index files
>     val cfg = LogConfig.fromProps(props)         // 0.8.x-style helper
>
>     // Open the partition directory directly; no broker needs to be running.
>     // The scheduler argument is only used for background deletes, so null
>     // is acceptable for read-only access.
>     val log = new Log(
>       new File("/somestorage/kafka-test-0"),
>       cfg,
>       0L,   // recovery point
>       null)
>
>     // Key/payload are ByteBuffer slices, so copy them out instead of
>     // calling .array() on them (that would return the whole backing buffer).
>     def toBytes(bb: java.nio.ByteBuffer): Array[Byte] = {
>       val arr = new Array[Byte](bb.remaining())
>       bb.duplicate().get(arr)
>       arr
>     }
>
>     // Read up to ~1 MB from the active (most recent) segment.
>     val fdi = log.activeSegment.read(log.activeSegment.baseOffset,
>       Some(log.logEndOffset), 1000000)
>     var msgs = 1
>     fdi.messageSet.iterator.foreach { msgoffset =>
>       println(s" ${msgoffset.message.hasKey} ::: > $msgs ::::> " +
>         s"${msgoffset.offset} :::::: ${msgoffset.nextOffset}")
>       msgs += 1
>       val key =
>         if (msgoffset.message.hasKey) new String(toBytes(msgoffset.message.key), "UTF-8")
>         else "<no key>"
>       val msg = new String(toBytes(msgoffset.message.payload), "UTF-8")
>       println(s" === $key")
>       println(s" === $msg")
>     }
>
>
> This reads from the active segment (the most recent one), but it's easy to
> make it read from all segments. The interesting thing is that as long as
> the backup files are well formed, they can be read without having to put
> them back into Kafka itself.
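>
> For example, walking all the segments rather than just the active one could
> look roughly like this (reusing the log value and the toBytes helper from
> the snippet above; assumes uncompressed messages):
>
>     // Iterate every segment, oldest first, and decode each message value.
>     log.logSegments.foreach { segment =>
>       segment.log.iterator.foreach { msgoffset =>
>         val value = new String(toBytes(msgoffset.message.payload), "UTF-8")
>         println(s"segment ${segment.baseOffset}, offset ${msgoffset.offset}: $value")
>       }
>     }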
>
> One advantage: what was once the raw data (as it came in) stays the raw
> data forever, without having to introduce another format for storing it.
> Another advantage: in case of reprocessing, there is no need to write a
> producer to ingest the data back into Kafka (it's possible, but not
> necessary). Such raw Kafka files can easily be processed by Storm / Samza
> (it would need another stream definition) / Hadoop.
>
> This sounds like a very useful addition to Kafka. But I could be
> overthinking this...
>
> Kind regards,
> Radek Gruchalski
> radek@gruchalski.com
> de.linkedin.com/in/radgruchalski/
>
> Confidentiality:
> This communication is intended for the above-named person and may be
> confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
>
> >
> > > On 10 Jul 2015, at 15:16, Shayne S <shaynest113@gmail.com> wrote:
> > >
> > > There are two ways you can configure your topics: log compaction and
> > > no cleaning. The choice depends on your use case. Are the records
> > > uniquely identifiable and will they receive updates? Then log
> > > compaction is the way to go. If they are truly read only, you can go
> > > without log compaction.
> > >
> >
> >
> > I'd rather be free to use the key for partitioning, and the records are
> > immutable — they're event records — so disabling compaction altogether
> > would be preferable. How is that accomplished?
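> >
> > Is it just a matter of creating the topic with the delete cleanup policy
> > and a huge retention, something like the following? (A rough sketch with
> > the 0.8.x AdminUtils API; the topic name, partition/replica counts, and
> > ZooKeeper address are placeholders.)
> >
> >     import java.util.Properties
> >     import kafka.admin.AdminUtils
> >     import kafka.utils.ZKStringSerializer
> >     import org.I0Itec.zkclient.ZkClient
> >
> >     val zk = new ZkClient("localhost:2181", 30000, 30000, ZKStringSerializer)
> >
> >     // Immutable event records: no compaction, age-based retention pushed
> >     // out effectively forever, no size-based limit.
> >     val cfg = new Properties()
> >     cfg.put("cleanup.policy", "delete")             // the default; no compaction
> >     cfg.put("retention.ms", Long.MaxValue.toString) // never expire by age
> >     // retention.bytes already defaults to -1, i.e. no size-based limit
> >
> >     AdminUtils.createTopic(zk, "events", 8, 2, cfg)
> >     zk.close()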
> > >
> > > We have small processes which consume a topic and perform upserts to
> > > our various database engines. It's easy to change how it all works and
> > > simply consume the single source of truth again.
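> > >
> > > Roughly, each of those processes is little more than this (a simplified
> > > sketch with the 0.8.x high-level consumer; the group id, topic name, and
> > > the upsert stub are placeholders rather than our real code):
> > >
> > >     import java.util.Properties
> > >     import kafka.consumer.{Consumer, ConsumerConfig}
> > >
> > >     // Stand-in for the real, engine-specific upsert.
> > >     def upsertIntoDatabase(key: String, value: String): Unit =
> > >       println(s"UPSERT $key -> $value")
> > >
> > >     val props = new Properties()
> > >     props.put("zookeeper.connect", "localhost:2181")
> > >     props.put("group.id", "db-upserter")
> > >     props.put("auto.offset.reset", "smallest") // replay from the start on first run
> > >
> > >     val connector = Consumer.create(new ConsumerConfig(props))
> > >     val stream = connector.createMessageStreams(Map("events" -> 1))("events").head
> > >
> > >     // Blocks and consumes indefinitely; keys may be null for unkeyed records.
> > >     stream.foreach { mam =>
> > >       val key = Option(mam.key()).map(bytes => new String(bytes, "UTF-8")).getOrElse("")
> > >       val value = new String(mam.message(), "UTF-8")
> > >       upsertIntoDatabase(key, value)
> > >     }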
> > >
> > > I've written a bit about log compaction here:
> > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > >
> > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck
> > > <daniel.schierbeck@gmail.com> wrote:
> > >
> > > > I'd like to use Kafka as a persistent store – sort of as an
> > > > alternative to HDFS. The idea is that I'd load the data into various
> > > > other systems in order to solve specific needs such as full-text
> > > > search, analytics, indexing by various attributes, etc. I'd like to
> > > > keep a single source of truth, however.
> > > >
> > > > I'm struggling a bit to understand how I can configure a topic to
> > > > retain messages indefinitely. I want to make sure that my data isn't
> > > > deleted. Is there a guide to configuring Kafka like this?
