samza-dev mailing list archives

From Chen Song <chen.song...@gmail.com>
Subject Re: Re: Samza processing reference data
Date Thu, 29 Oct 2015 03:01:39 GMT
Thanks Yan & Yi.

On Wed, Oct 28, 2015 at 11:00 AM, Yi Pan <nickpan47@gmail.com> wrote:

> Hi, Chen,
>
>
>
> On Wed, Oct 28, 2015 at 4:05 AM, Yan Fang <yanfangwork@163.com> wrote:
>
> >
> >
> > * Is there a tentative date for the 0.10.0 release?
> >     I think it's coming out soon. @Yi Pan should know more about that.
> >
>
> There is a bit of a delay in the release date due to a recent bug we
> discovered in testing. The targeted date is now November.
>
> >
> >
> > * I checked the checkpoint topic for the Samza job and it seems the
> > checkpoint topic is created with 1 partition by default. Given that each
> > Samza task will need to read from the checkpoint topic, it is similar to
> > what I need (each Samza task reading from the same partition of a
> > topic). I am wondering how that is achieved?
> >     In the current implementation, only the AM reads the checkpoint
> > stream and distributes the information to all the nodes via its HTTP
> > server. Not all the nodes consume the checkpoint stream. Correct me if I
> > am wrong.
> >
>
> The checkpoint topic is a special one that the containers only read during
> the start-up phase. Hence, it is not considered part of the
> SystemStreamPartitions that are assigned to the tasks. As Yan mentioned,
> the broadcast stream in 0.10 is the solution to your use case.
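For reference, the broadcast stream mentioned above is configured in 0.10
via the `task.broadcast.inputs` property. A minimal sketch, assuming a Kafka
system named `kafka` and the thread's *topicR*:

```properties
# Normal partitioned input.
task.inputs=kafka.topicD
# Deliver partition 0 of topicR to every task, in addition to its own inputs.
task.broadcast.inputs=kafka.topicR#0
```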
>
> Thanks!
>
>
> >
> >
> > Thanks,
> > Yan
> >
> >
> >
> >
> >
> >
> > At 2015-10-28 02:49:23, "Chen Song" <chen.song.82@gmail.com> wrote:
> > >Thanks Yan.
> > >
> > >* Is there a tentative date for the 0.10.0 release?
> > >* I checked the checkpoint topic for the Samza job and it seems the
> > >checkpoint topic is created with 1 partition by default. Given that each
> > >Samza task will need to read from the checkpoint topic, it is similar to
> > >what I need (each Samza task reading from the same partition of a
> > >topic). I am wondering how that is achieved?
> > >
> > >Chen
> > >
> > >On Sat, Oct 24, 2015 at 5:52 AM, Yan Fang <yanfangwork@163.com> wrote:
> > >
> > >> Hi Chen Song,
> > >>
> > >>
> > >> Sorry for the late reply. What you describe is a typical bootstrap use
> > >> case. Check the bootstrap configuration at
> > >> http://samza.apache.org/learn/documentation/0.9/container/streams.html
> > >> By using it, Samza will always read *topicR* from the beginning when
> > >> it restarts, and it then treats *topicR* as a normal topic after
> > >> reading the existing messages in *topicR*.
> > >>
> > >>
> > >> == can we configure each individual Samza task to read data from all
> > >> partitions of a topic?
> > >> It works in 0.10.0 by using the broadcast stream. In 0.9.0, you have
> > >> to "create *topicR* with the same number of partitions as *topicD*,
> > >> and replicate data to all partitions".
> > >>
> > >>
> > >> Hope this still helps.
> > >>
> > >>
> > >> Thanks,
> > >> Yan
> > >>
> > >>
> > >> At 2015-10-22 04:44:41, "Chen Song" <chen.song.82@gmail.com> wrote:
> > >> >In our Samza app, we need to read reference data from MySQL alongside
> > >> >a stream. So the requirements are:
> > >> >
> > >> >* Read the reference data into each Samza task before processing any
> > >> >message.
> > >> >* The Samza task should be able to listen to updates happening in
> > >> >MySQL.
> > >> >
> > >> >I did some research after scanning through some relevant
> > >> >conversations and JIRAs in the community but did not find a solution
> > >> >yet, nor a recommended way to do this.
> > >> >
> > >> >If my data stream comes from a topic called *topicD*, the options in
> > >> >my mind are:
> > >> >
> > >> >   - Use Kafka
> > >> >      1. Use one of the CDC-based solutions to replicate data in
> > >> >      MySQL to a Kafka topic
> > >> >      (https://github.com/wushujames/mysql-cdc-projects/wiki). Say
> > >> >      the topic is called *topicR*.
> > >> >      2. In my Samza app, read the reference table from *topicR* and
> > >> >      persist it in a cache in each Samza task's local storage.
> > >> >         - If the data in *topicR* is NOT partitioned in the same way
> > >> >         as *topicD*, can we configure each individual Samza task to
> > >> >         read data from all partitions of a topic?
> > >> >         - If the answer to the above question is no, do I need to
> > >> >         create *topicR* with the same number of partitions as
> > >> >         *topicD*, and replicate data to all partitions?
> > >> >         - On start, how to make the Samza task block processing the
> > >> >         first message from *topicD* until it has read all data from
> > >> >         *topicR*?
> > >> >      3. Any new updates/deletes to *topicR* will be consumed to
> > >> >      update the local cache of each Samza task.
> > >> >      4. On failure or restart, each Samza task will read *topicR*
> > >> >      from the beginning.
> > >> >   - Do not use Kafka
> > >> >      - Each Samza task reads a snapshot of the database and builds
> > >> >      its local cache, then reads periodically to update that cache.
> > >> >      I have read a few blogs, and this doesn't sound like a solid
> > >> >      approach in the long term.
> > >> >
> > >> >Any thoughts?
> > >> >
> > >> >Chen
> > >> >
> > >> >--
> > >> >Chen Song
> > >>
> > >
> > >
> > >
> > >--
> > >Chen Song
> >
>



-- 
Chen Song
