samza-dev mailing list archives

From Yi Pan <>
Subject Re: Samza Join example
Date Thu, 14 Jan 2016 17:47:03 GMT
Hi, Stanislav,

That's awesome! It would be great to have this integrated into the Samza
tutorials. Would you mind creating a tutorial page for the join job
implementation in Samza?

Thanks a lot!


On Thu, Jan 14, 2016 at 7:28 AM, Stanislav Los <> wrote:

> If anyone is interested, I did a quick PoC that joins two data sets,
> using hello-samza as a starting point.
> One point to note: I did it in Scala.
> Our target was to keep at least a 1-hour window of recent data at any given
> point in time, i.e. ~200,000,000 records/h throughput for the first data set
> (ad auction bids) and ~20,000,000/h for the other data set (ad
> impressions). That way, we're not constrained by the order of events as
> much, and the data streams can be quite out of sync in case of replay from
> archive storage.
> You can find the PoC that runs on a local Samza grid here, or as a pull
> request (not for merging, just to keep the changes in one place).
> I can't brush it up for a proper merge with master, since I'm being pulled
> onto another task, but at least it's not lost and someone may find it
> useful.
> See src/main/scala/README for details.
> I have another branch that runs on CDH at scale, but I think it's overkill
> for the current topic. Anyway, if you don't mind Magnetic-specific stuff
> (no legal obligations), it's here
> Overall, we were very impressed with Samza's performance: it took just 30
> containers (30 partitions on each Kafka topic) with default settings to do
> a reliable join on our Hadoop cluster. Just for the record, on Spark
> Streaming I was able to keep only a couple of minutes of the bids window,
> with lots of other constraints and workarounds.
> Samza is our way to go for large RT joins.
> Regards,
> Stan
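
For readers who want a concrete picture of the windowed join described in the quoted message, here is a minimal in-memory sketch. It is not the PoC's code (which was written in Scala and is not reproduced in this thread), and it does not use the Samza API. The class name WindowedJoin, the methods on_bid and on_impression, the string join key, and the one-to-one matching of a bid to an impression are all assumptions made for illustration. Only the 1-hour retention target comes from the message.

```python
from collections import OrderedDict

WINDOW_SECONDS = 3600  # the 1-hour retention target mentioned in the message


class WindowedJoin:
    """Hypothetical sketch: join two keyed streams, buffering each side."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        # Arrival-ordered buffers: join key -> (timestamp, record).
        self.bids = OrderedDict()
        self.impressions = OrderedDict()

    def _evict(self, buffer, now):
        # Drop entries from the front while they are older than the window.
        # Approximation: assumes arrival order roughly follows timestamps.
        while buffer:
            _, (ts, _) = next(iter(buffer.items()))
            if now - ts <= self.window:
                break
            buffer.popitem(last=False)

    def _process(self, own, other, key, ts, record):
        self._evict(own, ts)
        self._evict(other, ts)
        # Assumption: one-to-one join, so a matched record is consumed.
        match = other.pop(key, None)
        if match is None:
            # No partner yet: buffer this record until one arrives or it expires.
            own.pop(key, None)  # re-insert at the end if the key repeats
            own[key] = (ts, record)
        return match

    def on_bid(self, key, ts, bid):
        match = self._process(self.bids, self.impressions, key, ts, bid)
        return None if match is None else (bid, match[1])

    def on_impression(self, key, ts, imp):
        match = self._process(self.impressions, self.bids, key, ts, imp)
        return None if match is None else (match[1], imp)
```

Because both sides are buffered, a pair is emitted whichever record arrives first, which is the property that makes out-of-sync streams tolerable. For example, join.on_bid("a1", 0, bid) returns None and a later join.on_impression("a1", 1800, imp) returns (bid, imp), while an impression arriving more than 3600 seconds after its bid finds nothing to join. At the volumes quoted above, a real job would presumably keep this state in a partitioned, task-local store rather than in a Python dictionary.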
