samza-dev mailing list archives

From Benjamin Edwards <edwards.b...@gmail.com>
Subject Re: Truncating rocks db
Date Tue, 17 Feb 2015 08:10:51 GMT
Having followed along with the other thread, I think my initial approach
was flawed. We use Cassandra heavily in prod (the classic Cassandra /
Spark combo) at my job and have been running into a few issues with
streaming / local state, hence my wanting to have a look at Samza. A very
long way round to say that we use TTLs for lots of things! Thanks for the
write-up about the interaction between the db and the changelog. Very
thorough. I might come back with a request about the fresh store feature,
but it definitely needs a bit more baking / experience with Samza.

Ben

On Tue Feb 17 2015 at 01:59:03 Chris Riccomini <criccomini@apache.org>
wrote:

> Hey Ben,
>
> The problem with TTL is that it's handled entirely internally in RocksDB.
> There's no way for us to know when a key's been deleted. You can work
> around this if you also configure your Kafka changelog topic to use
> time-based retention instead of log compaction; then the two should
> roughly match. For example, if you have a 1h TTL in RocksDB and a 1h
> TTL in your Kafka changelog topic, then the semantics are ROUGHLY
> equivalent. I say ROUGHLY because the two are going to be GC'ing expired
> keys independently of one another.
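
As a concrete sketch of the matching-TTL setup Chris describes (the topic
name is illustrative, not from the thread; 3600000 ms = 1h to match the
RocksDB TTL), the changelog topic can be switched from log compaction to
time-based retention with the stock Kafka tooling:

```shell
# Hypothetical topic name; adjust --zookeeper for your cluster.
kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic my-job-my-store-changelog \
  --config cleanup.policy=delete \
  --config retention.ms=3600000
```

Both sides then expire data on roughly a 1h clock, though, as Chris notes,
each GCs expired keys independently of the other.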
>
> Also, during a restart, the TTLs in the RocksDB store are fully reset.
> For example, if you restart the job at minute 59 of a key's lifetime,
> the Kafka topic will restore that key when the job starts, and its TTL
> will reset back to 0 minutes in the RocksDB store (though, a minute
> later, Kafka will drop it from the changelog). If you don't need EXACT
> TTL guarantees, this should be fine. If you do need exact, then .all()
> is probably the way to go.
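
The .all()-and-delete route looks like the sketch below. It uses a plain
TreeMap as a stand-in for Samza's KeyValueStore (which exposes the same
all()/delete() shape), so the class and names here are illustrative, not
Samza API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StoreTruncate {
    // Drop every key, as at the end of a window. Keys are collected first
    // so we don't mutate the map while iterating; the same care is needed
    // when walking a real KeyValueIterator from all().
    static int truncate(Map<String, String> store) {
        List<String> keys = new ArrayList<>(store.keySet());
        for (String key : keys) {
            store.remove(key); // store.delete(key) with a real Samza store
        }
        return keys.size();
    }

    public static void main(String[] args) {
        Map<String, String> store = new TreeMap<>();
        store.put("user:1", "3");
        store.put("user:2", "7");
        store.put("user:3", "1");
        int deleted = truncate(store);
        System.out.println("deleted=" + deleted + " remaining=" + store.size());
    }
}
```

With a changelog attached, each delete also produces a tombstone in the
changelog topic, which is why truncation can't be skipped there.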
>
> Cheers,
> Chris
>
> On Mon, Feb 16, 2015 at 1:39 PM, Benjamin Edwards <edwards.benj@gmail.com>
> wrote:
>
> > Yes, I was using a changelog. You bring up a good point. I think I need
> > to think harder about what I am trying to do. Maybe deleting all the
> > keys isn't that bad. Especially if I amortise it over the life of the
> > next period.
> >
> > It seems like waiting for TTLs is probably the right thing to do
> > ultimately.
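
Ben's amortisation idea can be sketched as follows (the batch size and
names are made up for illustration): each window() tick deletes at most a
fixed budget of stale keys instead of all of them at once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AmortisedClear {
    public static void main(String[] args) {
        Map<String, Integer> store = new TreeMap<>();
        for (int i = 0; i < 10; i++) store.put("k" + i, i);

        final int batchSize = 4; // made-up deletion budget per window() call
        int ticks = 0;
        while (!store.isEmpty()) {
            // One "window() tick": collect up to batchSize keys, then delete.
            List<String> batch = new ArrayList<>();
            Iterator<String> it = store.keySet().iterator();
            while (it.hasNext() && batch.size() < batchSize) batch.add(it.next());
            for (String k : batch) store.remove(k);
            ticks++;
        }
        System.out.println("ticks=" + ticks); // 10 keys at 4 per tick
    }
}
```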
> >
> > Thanks for the timely response!
> >
> > Ben
> >
> > On Sun Feb 15 2015 at 23:43:27 Chris Riccomini <criccomini@apache.org>
> > wrote:
> >
> > > Hey Benjamin,
> > >
> > > You're right. Currently you have to call .all(), and delete everything.
> > >
> > > RocksDB just committed TTL support to their Java library. This
> > > feature allows data to be expired automatically. Once RocksDB
> > > releases their TTL patch (I believe in a few weeks, according to
> > > Igor), we'll update Samza 0.9.0. Our tracker patch is here:
> > >
> > >   https://issues.apache.org/jira/browse/SAMZA-537
> > >
> > > > Is there no way to just say I don't care about the old data, gimme
> > > > a new store?
> > >
> > > We don't have this feature right now, but we could add it. This
> > > feature is a bit more complicated when a changelog is attached, since
> > > we will have to execute deletes for every key (we still need to call
> > > .all()). Are you running with a changelog?
> > >
> > > Cheers,
> > > Chris
> > >
> > > On Sun, Feb 15, 2015 at 10:41 AM, Benjamin Edwards <
> > > edwards.benj@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trialling Samza for some windowed stream processing. Typically
> > > > I want to aggregate a bunch of state over some window of messages,
> > > > process the data, then drop the current state. The only way that I
> > > > can see to do that at the moment is to delete every key. This seems
> > > > expensive. Is there no way to just say I don't care about the old
> > > > data, gimme a new store?
> > > >
> > > > Ben
> > > >
> > >
> >
>
