samza-dev mailing list archives

From Manohar Reddy <Manohar.Re...@happiestminds.com>
Subject Re: Regarding my use case to explore with Samza-
Date Wed, 02 Mar 2016 07:31:48 GMT


Hi Ramesh & Jagadish,

You have understood my use case correctly. Please find my inline comments below in response to your questions.

Thank you very much for your inputs; we will definitely consider your adapter concept and make small batched JDBC calls instead of a call for each and every event.





~~Manohar

-----Original Message-----
From: Navina Ramesh [mailto:nramesh@linkedin.com.INVALID]
Sent: Wednesday, March 2, 2016 12:42 PM
To: dev@samza.apache.org
Subject: Re: Regarding my use case to explore with Samza-



Hi Manohar,

On a side note regarding your use-case, I have a question.

After consuming the DML changes from the kafka topic, why do you have to query back? Are you
trying to decorate the event or perform some kind of join?

[Manohar] We are performing joins. The events carry only the primary tables' keys, so to get the whole data set we perform a set of joins against the related tables.





The point I am trying to make is that if you perform a remote lookup with every event you
consume, it's going to be hard to keep to "realtime" (then again, realtime really depends
on your SLA).



Instead, I would suggest that you have an adapter that periodically takes a snapshot of the entire table and pushes it to another topic in Kafka (not sure how hard it is going to be to write such an adapter). This way, when your job starts, it can partition and cache the entire data set in the Samza task (by using RocksDB with a changelog, as Jagadish suggested). Samza provides a "bootstrap" stream option: a bootstrap stream is read during job startup until no more messages are available. Essentially, you can configure your snapshot stream to be a bootstrap stream.
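For reference, the bootstrap option is set per input stream in the job configuration. A minimal sketch, assuming a hypothetical Kafka system named `kafka` and a hypothetical snapshot topic named `table-snapshot`:

```properties
# Consume both the DML-change topic and the snapshot topic
task.inputs=kafka.dml-changes,kafka.table-snapshot

# Read the snapshot topic fully before processing any other input
systems.kafka.streams.table-snapshot.samza.bootstrap=true

# Always restart from the beginning so the full snapshot is replayed
systems.kafka.streams.table-snapshot.samza.reset.offset=true
systems.kafka.streams.table-snapshot.samza.offset.default=oldest
```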

Once your job is "bootstrapped", you can process events by looking up the local partitioned store rather than the remote store. Please note that the DML-change topic and the snapshot topic need to be partitioned with the same key; otherwise, the local lookups won't find the matching rows.
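To make the co-partitioning requirement concrete: with the same key and the same partition count, hash partitioning routes both streams' records to the same partition, so the task that sees a DML change also holds the matching snapshot rows in its local store. A stand-alone sketch in plain Java, with a simplified stand-in for Kafka's default partitioner (`student-42` is a hypothetical row key):

```java
public class CoPartitioningSketch {
    // Simplified stand-in for Kafka's default hash partitioner.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 8;               // both topics must use the same count
        String primaryKey = "student-42"; // hypothetical row key

        int dmlPartition = partitionFor(primaryKey, partitions);
        int snapshotPartition = partitionFor(primaryKey, partitions);

        // Same key + same partition count => same partition,
        // so the task's local store lookup finds the row.
        System.out.println(dmlPartition == snapshotPartition); // prints true
    }
}
```

If the two topics differ in partition count or keying, the matching rows can land in a different task and the local lookups will miss.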





Another alternative is to make a remote call to fetch the data set on demand and cache it locally with RocksDB. This is much simpler to implement; however, it depends on how you configure your cache, and the job will only eventually get close to "realtime".



Hope my suggestions make sense. Apologies if I have misunderstood your use-case.



Feel free to ask any questions you may have.



Cheers!

Navina



On Tue, Mar 1, 2016 at 10:43 PM, Jagadish Venkatraman <jagadish1989@gmail.com> wrote:



> Please take a look at the hello-world example. You can implement your
> business logic in the process() callback.
>
> What kind of transformation are you doing? Are you doing a group
> by/count style aggregation to generate the report? If so, you could
> use the embedded rocksdb store in Samza and potentially batch your
> writes to the database.
>
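For reference, the embedded store is declared in the job configuration. A minimal sketch, with a hypothetical store name `report-agg` (the changelog topic makes the store recoverable after a container restart):

```properties
stores.report-agg.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.report-agg.changelog=kafka.report-agg-changelog
stores.report-agg.key.serde=string
stores.report-agg.msg.serde=string
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
```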

> How many QPS do you process at peak? Do you expect to buffer any state
> per message? What's the ratio of input to output messages on average?
>
> There's nothing that stops you from using JDBC with Samza.
>
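Batching the database writes can be as simple as buffering rows and flushing once per N events. A stand-alone sketch in plain Java, with the actual JDBC batch (`PreparedStatement.addBatch`/`executeBatch`) stubbed out as a callback (names and the batch size are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedWriter {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> flush; // stand-in for a JDBC batch execute

    BatchedWriter(int batchSize, Consumer<List<String>> flush) {
        this.batchSize = batchSize;
        this.flush = flush;
    }

    void write(String row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) {
            flush.accept(new ArrayList<>(buffer)); // one DB round-trip for N rows
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> flushSizes = new ArrayList<>();
        BatchedWriter writer = new BatchedWriter(100, batch -> flushSizes.add(batch.size()));
        for (int i = 0; i < 250; i++) {
            writer.write("row-" + i);
        }
        // 250 events caused only 2 flushes of 100 rows; 50 rows remain buffered
        System.out.println("flushes: " + flushSizes.size()); // prints "flushes: 2"
    }
}
```

In a real Samza job you would also flush the remainder periodically, for example from the `window()` callback of a `WindowableTask`, so buffered rows don't sit unwritten during quiet periods.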

> On Tue, Mar 1, 2016 at 8:58 PM, Manohar Reddy <Manohar.Reddy@happiestminds.com> wrote:
>
> > Hello Team,
> >
> > We are part of a services company and are trying to explore the
> > available real-time streaming technologies, and Samza is the first
> > option we are trying.
> > Let me briefly explain my use case.
> >
> > We are trying to build a real-time reporting dashboard for the
> > e-learning domain. The input for this dashboard is an RDBMS: whenever
> > any DML (insert/update/delete) hits the source RDBMS, an adapter
> > immediately publishes an event to Kafka containing the RDBMS table
> > name and primary keys in JSON format.
> > Samza then has to consume the Kafka event and query back to the
> > source RDBMS to get the whole data set from the related tables, using
> > the information in the JSON event. It then performs some
> > transformations per the business rules and loads the result into the
> > target (reporting) RDBMS.
> > More or less we are handling a few JDBC calls through Samza, and the
> > daily data load is small, at most 2 GB, but we need a real-time
> > processing ecosystem in place.
> > That's a brief description of my use case. Please provide your inputs
> > on how we can approach this requirement with Samza. Is there any
> > utility API in Samza for JDBC calls?
> >
> > Thank you very much in advance.
> >
> > ~~Manohar

> > ________________________________
> > Happiest Minds Disclaimer
> >
> > This message is for the sole use of the intended recipient(s) and
> > may contain confidential, proprietary or legally privileged
> > information. Any unauthorized review, use, disclosure or
> > distribution is prohibited. If you are not the original intended
> > recipient of the message, please contact the sender by reply email
> > and destroy all copies of the original message.
> >
> > Happiest Minds Technologies <http://www.happiestminds.com>
> >
> > ________________________________
> >
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>

--

Navina R.

