samza-dev mailing list archives

From Benjamin
Subject Re: Reprocessing old events no longer in Kafka
Date Fri, 29 May 2015 21:16:12 GMT
Why not run a MapReduce job on the data in HDFS? That is what it was made for.
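
A minimal sketch of that batch approach, in plain Python rather than an actual Hadoop job, just to show the shape of the map and reduce steps. It assumes the archived events are JSON lines with a "type" field; the field name and the function names (map_event, reduce_counts, run_batch) are hypothetical and not taken from the original thread.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_event(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (event_type, 1) for each well-formed event line."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return  # skip malformed lines instead of failing the whole batch
    if not isinstance(event, dict):
        return  # only JSON objects are treated as events
    event_type = event.get("type")  # "type" is an assumed field name
    if isinstance(event_type, str):
        yield event_type, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the emitted values per key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def run_batch(lines: Iterable[str]) -> Counter:
    """Run the map step over every archived line, then reduce the results."""
    return reduce_counts(pair for line in lines for pair in map_event(line))
```

In a real job the lines would come from the files archived in HDFS, and the reduce output would be written to the data store instead of being returned in memory.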
On May 29, 2015 2:13 PM, "Zach Cox" wrote:

> Hi -
> Let's say one day a company wants to start doing all of this awesome data
> integration/near-real-time stream processing stuff, so they start sending
> their user activity events (e.g. pageviews, ad impressions, etc) to Kafka.
> Then they hook up Camus to copy new events from Kafka to HDFS every hour.
> They use the default Kafka log retention period of 7 days. So after a few
> months, Kafka has the last 7 days of events, and HDFS has all events except
> the newest events not yet transferred by Camus.
> Then the company wants to build out a system that uses Samza to process the
> user activity events from Kafka and output them to some queryable data store.
> If standard Samza reprocessing [1] is used, then only the last 7 days of
> events in Kafka get processed and put into the data store. Of course, then
> all future events also seamlessly get processed by the Samza jobs and put
> into the data store, which is awesome.
> But let's say this company needs all of the historical events to be
> processed by Samza and put into the data store (i.e. the events older than
> 7 days that are in HDFS but no longer in Kafka). It's a Business Critical
> thing and absolutely must happen. How should this company achieve this?
> I'm sure there are many potential solutions to this problem, but has anyone
> actually done this? What approach did you take?
> Any experiences or thoughts would be hugely appreciated.
> Thanks,
> Zach
> [1]
