samza-dev mailing list archives

From Felix GV <fville...@linkedin.com.INVALID>
Subject RE: Reprocessing old events no longer in Kafka
Date Fri, 29 May 2015 21:40:19 GMT
Even if reading directly from HDFS, the matter of transitioning from reprocessing back to
real-time is a bit problematic unless you are tapping into data ingested by Camus (or something
else that records offset metadata alongside the data).

Granted, Kafka already provides only "at least once" delivery guarantees, so you might argue it
doesn't matter if you process all historical data up to 1 hour ago plus the 7 previous days of
real-time data. If your stream processing use case is idempotent, then it indeed does not
matter. But if your use case is really one that prefers exactly-once delivery and can only
tolerate the imprecision of an occasional duplicate or three, then the 7-days-of-dupes approach
falls a bit short...
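The difference between the two situations can be sketched roughly like this (plain Python, not the Samza or Camus API; the function names and data shapes here are invented purely for illustration):

```python
def live_start_offsets(recorded):
    """If the highest Kafka offset ingested for each partition was recorded
    alongside the historical (HDFS) data, live consumption can resume exactly
    one past it -- no gap, no duplicates."""
    return {partition: offset + 1 for partition, offset in recorded.items()}

def overlap_hours(ingest_lag_hours, retention_days):
    """Without offset metadata, the safe fallback is to replay history up to
    (now - ingest lag) and also consume the full retention window live,
    accepting up to this many hours of duplicated events."""
    return retention_days * 24 - ingest_lag_hours
```

With hourly Camus runs and 7-day retention, the fallback approach can redeliver up to 167 hours of events, which is the "7 days of dupes" problem described above.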

--

Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

fgv@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Zach Cox [zcox522@gmail.com]
Sent: Friday, May 29, 2015 2:33 PM
To: dev@samza.apache.org
Subject: Re: Reprocessing old events no longer in Kafka

Hi Navina,

I did see that JIRA and it would definitely be useful. I was thinking of
maybe trying to build a composite stream that would first read old events
from HDFS and then switch over to Kafka.

Do you know if there has been any movement on treating HDFS as a Samza
stream?

Thanks,
Zach
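The composite stream Zach describes could look roughly like this (plain Python generators standing in for system consumers; the offset-based cutover assumes the historical records carry their original Kafka offsets, e.g. as recorded by Camus -- all names are hypothetical):

```python
def composite_stream(historical, live):
    """Drain the historical source first, then switch to the live source,
    skipping any live messages already seen during the historical replay.
    Each message is an (offset, payload) pair from a single partition."""
    last_offset = -1
    for offset, payload in historical:
        last_offset = offset
        yield offset, payload
    for offset, payload in live:
        if offset > last_offset:  # drop the overlap between HDFS and Kafka
            yield offset, payload
```

In an actual Samza job this switch would presumably live inside a SystemConsumer (along the lines of SAMZA-263), and would have to handle the cutover per partition rather than for the stream as a whole.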

On Fri, May 29, 2015 at 4:27 PM Navina Ramesh <nramesh@linkedin.com.invalid>
wrote:

> Hi Zach,
>
> It sounds like you are asking for a SystemConsumer for hdfs. Does
> SAMZA-263 match your requirements?
>
> Thanks!
> Navina
>
> On 5/29/15, 2:23 PM, "Zach Cox" <zcox522@gmail.com> wrote:
>
> > (continuing from previous email) in addition to not wanting to duplicate
> > code, say that some of the Samza jobs need to build up state, and it's
> > important to build up this state from all of those old events no longer
> > in Kafka. If that state was only built from the last 7 days of events,
> > some things would be missing and the data would be incomplete.
> >
> >On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox522@gmail.com> wrote:
> >
> >> Let's also add to the story: say the company wants to only write code
> >> for Samza, and not duplicate the same code in MapReduce jobs (or any
> >> other framework).
> >>
> >> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b@b3k.us> wrote:
> >>
> >>> Why not run a MapReduce job on the data in HDFS? That's what it was made for.
> >>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox522@gmail.com> wrote:
> >>>
> >>> > Hi -
> >>> >
> >>> > Let's say one day a company wants to start doing all of this awesome
> >>> > data integration/near-real-time stream processing stuff, so they
> >>> > start sending their user activity events (e.g. pageviews, ad
> >>> > impressions, etc) to Kafka. Then they hook up Camus to copy new
> >>> > events from Kafka to HDFS every hour. They use the default Kafka log
> >>> > retention period of 7 days. So after a few months, Kafka has the
> >>> > last 7 days of events, and HDFS has all events except the newest
> >>> > events not yet transferred by Camus.
> >>> >
> >>> > Then the company wants to build out a system that uses Samza to
> >>> > process the user activity events from Kafka and output it to some
> >>> > queryable data store. If standard Samza reprocessing [1] is used,
> >>> > then only the last 7 days of events in Kafka get processed and put
> >>> > into the data store. Of course, then all future events also
> >>> > seamlessly get processed by the Samza jobs and put into the data
> >>> > store, which is awesome.
> >>> >
> >>> > But let's say this company needs all of the historical events to be
> >>> > processed by Samza and put into the data store (i.e. the events
> >>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
> >>> > Business Critical thing and absolutely must happen. How should this
> >>> > company achieve this?
> >>> >
> >>> > I'm sure there are many potential solutions to this problem, but has
> >>> > anyone actually done this? What approach did you take?
> >>> >
> >>> > Any experiences or thoughts would be hugely appreciated.
> >>> >
> >>> > Thanks,
> >>> > Zach
> >>> >
> >>> > [1]
> >>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >>> >
> >>>
> >>
>
>
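For reference, the 7-day retention discussed throughout the thread corresponds to Kafka's broker-side default; raising it buys time but cannot cover arbitrarily old history:

```properties
# server.properties (Kafka broker) -- 168 hours = 7 days is the default
log.retention.hours=168
```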
