samza-dev mailing list archives

From Gustavo Anatoly <gustavoanat...@gmail.com>
Subject Re: Reprocessing old events no longer in Kafka
Date Fri, 29 May 2015 23:08:49 GMT
Hi Zach,

Just my humble opinion, since I don't have much experience with Samza: a
possible solution could be a hybrid one, using an HBase coprocessor
together with a Samza application to process the events and store the
results in HBase. That way you have the history available to query.
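
Just to illustrate the Samza side of that idea (the coprocessor piece is
omitted here; the "events" table, the column names, and string-serde keys
and messages are only assumptions of mine):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.samza.config.Config;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class EventToHBaseTask implements StreamTask, InitableTask {
      private HTable table;

      @Override
      public void init(Config config, TaskContext context) throws Exception {
        // "events" is a hypothetical table name; use your own schema.
        table = new HTable(HBaseConfiguration.create(), "events");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) throws Exception {
        // Key each row by the event key so the history stays queryable.
        Put put = new Put(((String) envelope.getKey()).getBytes("UTF-8"));
        put.add("d".getBytes("UTF-8"), "event".getBytes("UTF-8"),
                ((String) envelope.getMessage()).getBytes("UTF-8"));
        table.put(put);
      }
    }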

Thanks.

2015-05-29 19:36 GMT-03:00 Navina Ramesh <nramesh@linkedin.com.invalid>:

> Hi Zach,
>
> Yes. Bootstrapping from hdfs directly is possible, provided you have an
> hdfs consumer defined (SAMZA-263 is a pre-requisite). :)
>
> If you have a way of getting hdfs data into Kafka, then you can use it as
> a bootstrap stream. Note that a bootstrap stream is only temporarily
> privileged and always consumes all messages in the stream.
>
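> For reference, a minimal sketch of that configuration, with hypothetical
> topic names (the bootstrap keys are the ones documented for 0.9):
>
>     task.inputs=kafka.historical-events,kafka.live-events
>     systems.kafka.streams.historical-events.samza.bootstrap=true
>     systems.kafka.streams.historical-events.samza.reset.offset=true
>     systems.kafka.streams.historical-events.samza.offset.default=oldest
>
> The job then reads historical-events up to its current head before any
> messages from live-events are delivered.
>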
> If you want to use the offsets stored by Camus to determine the right
> place to transition to realtime Kafka, then you might consider
> implementing a custom MessageChooser [1]. This will give you more
> fine-grained control over message selection from your input streams.
>
> Cheers!
> Navina
>
> [1]
> https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/system/chooser/MessageChooser.java#L26
>
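> As a rough sketch (the topic name is hypothetical, and checking against
> the actual Camus offsets is left out), a chooser that always prefers the
> historical topic while it still has buffered messages might look like:
>
>     import java.util.ArrayDeque;
>     import java.util.Deque;
>     import org.apache.samza.system.IncomingMessageEnvelope;
>     import org.apache.samza.system.chooser.BaseMessageChooser;
>
>     public class HistoricalFirstChooser extends BaseMessageChooser {
>       private final Deque<IncomingMessageEnvelope> historical =
>           new ArrayDeque<IncomingMessageEnvelope>();
>       private final Deque<IncomingMessageEnvelope> live =
>           new ArrayDeque<IncomingMessageEnvelope>();
>
>       @Override
>       public void update(IncomingMessageEnvelope envelope) {
>         // The container offers at most one buffered message per partition.
>         if ("historical-events".equals(
>             envelope.getSystemStreamPartition().getStream())) {
>           historical.add(envelope);
>         } else {
>           live.add(envelope);
>         }
>       }
>
>       @Override
>       public IncomingMessageEnvelope choose() {
>         // Prefer historical messages while any are buffered; returning
>         // null tells the container there is nothing to process right now.
>         return historical.isEmpty() ? live.poll() : historical.poll();
>       }
>     }
>
> You would then wire it in through a MessageChooserFactory implementation
> referenced by the task.chooser.class property.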
>
> On 5/29/15, 3:07 PM, "Zach Cox" <zcox522@gmail.com> wrote:
>
> >Hi Navina,
> >
> >Do you mean bootstrapping from hdfs as in [1]? That is an interesting idea
> >I hadn't thought of. Maybe that could be combined with the offsets stored
> >by Camus to determine the right place to transition to the real-time kafka
> >stream?
> >
> >Thanks,
> >Zach
> >
> >[1]
> >
> >http://samza.apache.org/learn/documentation/0.9/container/streams.html#bootstrapping
> >
> >
> >On Fri, May 29, 2015 at 4:53 PM Navina Ramesh
> ><nramesh@linkedin.com.invalid>
> >wrote:
> >
> >> Hi Zach,
> >>
> >> I agree. It is not a good idea to keep the entire set of historical data
> >> in Kafka.
> >> The retention period in Kafka does make it trickier to synchronize with
> >> your hadoop data pump. I am not very familiar with the Camus2Kafka
> >> project, but that sounds like a workable solution.
> >>
> >> The ideal solution would be to consume/bootstrap directly from HDFS :)
> >>
> >> Cheers!
> >> Navina
> >>
> >> On 5/29/15, 2:44 PM, "Zach Cox" <zcox522@gmail.com> wrote:
> >>
> >> >Hi Navina,
> >> >
> >> >A similar approach I considered was using an infinite/very large
> >> >retention period on those event kafka topics, so they would always
> >> >contain all historical events. Then standard Samza reprocessing goes
> >> >through all old events.
> >> >
> >> >I'm hesitant to pursue that though, as those topic partitions then grow
> >> >unbounded over time, which seems problematic.
> >> >
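> >> >If you did go that route, the per-topic override would look something
> >> >like this (the topic name is hypothetical; newer brokers accept -1 as
> >> >"no time-based deletion", older ones would need a very large value):
> >> >
> >> >    bin/kafka-topics.sh --zookeeper localhost:2181 --alter \
> >> >      --topic user-activity-events --config retention.ms=-1
> >> >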
> >> >Thanks,
> >> >Zach
> >> >
> >> >On Fri, May 29, 2015 at 4:34 PM Navina Ramesh
> >> ><nramesh@linkedin.com.invalid>
> >> >wrote:
> >> >
> >> >> That said, since we don't yet support consuming from hdfs, one
> >> >> workaround would be to periodically read from hdfs and pump the data
> >> >> to a kafka topic (say topic A) using a hadoop / yarn based job. Then,
> >> >> in your Samza job, you can bootstrap from topic A and then continue
> >> >> processing the latest messages from the other Kafka topic.
> >> >>
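> >> >> As a rough sketch of the pump side (the broker address, hdfs path,
> >> >> and topic name are hypothetical, and a real implementation would be
> >> >> a proper hadoop / yarn job rather than a single loop), using the
> >> >> 0.8-era producer API:
> >> >>
> >> >>     import java.io.BufferedReader;
> >> >>     import java.io.InputStreamReader;
> >> >>     import java.util.Properties;
> >> >>     import kafka.javaapi.producer.Producer;
> >> >>     import kafka.producer.KeyedMessage;
> >> >>     import kafka.producer.ProducerConfig;
> >> >>     import org.apache.hadoop.conf.Configuration;
> >> >>     import org.apache.hadoop.fs.FileSystem;
> >> >>     import org.apache.hadoop.fs.Path;
> >> >>
> >> >>     public class HdfsToKafkaPump {
> >> >>       public static void main(String[] args) throws Exception {
> >> >>         Properties props = new Properties();
> >> >>         props.put("metadata.broker.list", "localhost:9092");
> >> >>         props.put("serializer.class", "kafka.serializer.StringEncoder");
> >> >>         Producer<String, String> producer =
> >> >>             new Producer<String, String>(new ProducerConfig(props));
> >> >>
> >> >>         FileSystem fs = FileSystem.get(new Configuration());
> >> >>         BufferedReader reader = new BufferedReader(new InputStreamReader(
> >> >>             fs.open(new Path("/events/2015/05/29/part-00000")), "UTF-8"));
> >> >>         String line;
> >> >>         while ((line = reader.readLine()) != null) {
> >> >>           // Topic A is the topic the Samza job bootstraps from.
> >> >>           producer.send(new KeyedMessage<String, String>("topic-A", line));
> >> >>         }
> >> >>         reader.close();
> >> >>         producer.close();
> >> >>       }
> >> >>     }
> >> >>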
> >> >> Thanks!
> >> >> Navina
> >> >>
> >> >> On 5/29/15, 2:26 PM, "Navina Ramesh" <nramesh@linkedin.com> wrote:
> >> >>
> >> >> >Hi Zach,
> >> >> >
> >> >> >It sounds like you are asking for a SystemConsumer for hdfs. Does
> >> >> >SAMZA-263 match your requirements?
> >> >> >
> >> >> >Thanks!
> >> >> >Navina
> >> >> >
> >> >> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox522@gmail.com> wrote:
> >> >> >
> >> >> >>(continuing from previous email) in addition to not wanting to
> >> >> >>duplicate code, say that some of the Samza jobs need to build up
> >> >> >>state, and it's important to build up this state from all of those
> >> >> >>old events no longer in Kafka. If that state was only built from
> >> >> >>the last 7 days of events, some things would be missing and the
> >> >> >>data would be incomplete.
> >> >> >>
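> >> >> >>To make that concrete, here is a minimal sketch of such a stateful
> >> >> >>task (a changelog-backed store named "event-counts" and string-serde
> >> >> >>keys are assumptions):
> >> >> >>
> >> >> >>    import org.apache.samza.config.Config;
> >> >> >>    import org.apache.samza.storage.kv.KeyValueStore;
> >> >> >>    import org.apache.samza.system.IncomingMessageEnvelope;
> >> >> >>    import org.apache.samza.task.InitableTask;
> >> >> >>    import org.apache.samza.task.MessageCollector;
> >> >> >>    import org.apache.samza.task.StreamTask;
> >> >> >>    import org.apache.samza.task.TaskContext;
> >> >> >>    import org.apache.samza.task.TaskCoordinator;
> >> >> >>
> >> >> >>    public class EventCountTask implements StreamTask, InitableTask {
> >> >> >>      private KeyValueStore<String, Integer> counts;
> >> >> >>
> >> >> >>      @Override
> >> >> >>      @SuppressWarnings("unchecked")
> >> >> >>      public void init(Config config, TaskContext context) {
> >> >> >>        counts = (KeyValueStore<String, Integer>)
> >> >> >>            context.getStore("event-counts");
> >> >> >>      }
> >> >> >>
> >> >> >>      @Override
> >> >> >>      public void process(IncomingMessageEnvelope envelope,
> >> >> >>                          MessageCollector collector,
> >> >> >>                          TaskCoordinator coordinator) {
> >> >> >>        String userId = (String) envelope.getKey();
> >> >> >>        Integer current = counts.get(userId);
> >> >> >>        // If the input only reaches back 7 days, these counts are
> >> >> >>        // silently incomplete -- hence the need for full history.
> >> >> >>        counts.put(userId, current == null ? 1 : current + 1);
> >> >> >>      }
> >> >> >>    }
> >> >> >>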
> >> >> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox522@gmail.com> wrote:
> >> >> >>
> >> >> >>> Let's also add to the story: say the company wants to only write
> >> >> >>> code for Samza, and not duplicate the same code in MapReduce jobs
> >> >> >>> (or any other framework).
> >> >> >>>
> >> >> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b@b3k.us> wrote:
> >> >> >>>
> >> >> >>>> Why not run a map reduce job on the data in hdfs? That's what it
> >> >> >>>> was made for.
> >> >> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox522@gmail.com> wrote:
> >> >> >>>>
> >> >> >>>> > Hi -
> >> >> >>>> >
> >> >> >>>> > Let's say one day a company wants to start doing all of this
> >> >> >>>> > awesome data integration/near-real-time stream processing
> >> >> >>>> > stuff, so they start sending their user activity events (e.g.
> >> >> >>>> > pageviews, ad impressions, etc) to Kafka. Then they hook up
> >> >> >>>> > Camus to copy new events from Kafka to HDFS every hour. They
> >> >> >>>> > use the default Kafka log retention period of 7 days. So after
> >> >> >>>> > a few months, Kafka has the last 7 days of events, and HDFS has
> >> >> >>>> > all events except the newest events not yet transferred by
> >> >> >>>> > Camus.
> >> >> >>>> >
> >> >> >>>> > Then the company wants to build out a system that uses Samza to
> >> >> >>>> > process the user activity events from Kafka and output them to
> >> >> >>>> > some queryable data store. If standard Samza reprocessing [1]
> >> >> >>>> > is used, then only the last 7 days of events in Kafka get
> >> >> >>>> > processed and put into the data store. Of course, all future
> >> >> >>>> > events also seamlessly get processed by the Samza jobs and put
> >> >> >>>> > into the data store, which is awesome.
> >> >> >>>> >
> >> >> >>>> > But let's say this company needs all of the historical events
> >> >> >>>> > to be processed by Samza and put into the data store (i.e. the
> >> >> >>>> > events older than 7 days that are in HDFS but no longer in
> >> >> >>>> > Kafka). It's a Business Critical thing and absolutely must
> >> >> >>>> > happen. How should this company achieve this?
> >> >> >>>> >
> >> >> >>>> > I'm sure there are many potential solutions to this problem,
> >> >> >>>> > but has anyone actually done this? What approach did you take?
> >> >> >>>> >
> >> >> >>>> > Any experiences or thoughts would be hugely appreciated.
> >> >> >>>> >
> >> >> >>>> > Thanks,
> >> >> >>>> > Zach
> >> >> >>>> >
> >> >> >>>> > [1]
> >> >> >>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >> >> >>>> >
> >> >> >>>>
> >> >> >>>
> >> >> >
> >> >>
> >> >>
> >>
> >>
>
>
