samza-dev mailing list archives

From Guozhang Wang <guw...@linkedin.com>
Subject RE: state management docs
Date Tue, 10 Sep 2013 06:00:07 GMT
1. How about "sequential logging stream"?

Guozhang
________________________________________
From: Jay Kreps [jay.kreps@gmail.com]
Sent: Monday, September 09, 2013 9:40 PM
To: dev@samza.incubator.apache.org
Subject: Re: state management docs

Thanks for the feedback Guozhang! Those are excellent points!

1. Well technically none of the output kafka topics are necessarily
collocated with the task producing to that topic. The little databases in
the boxes are meant to be the leveldbs, perhaps I should label them? I do
think the terminology (stream vs. log) is quite confusing. Unfortunately the
term "log" tends to be very confusing to non-database people, who think
logs are text files! What do you think would be the least confusing set of
terms?

2. This is a good point, I will add that.

3. Yes, to give exact consistency in the presence of faults a few things are
required, one of which is restoring to the same point between changelog,
inputs, and output.

4. Yes but when the task fails it will restart on another machine that
doesn't have the leveldb image on disk. The one exception to this is when
the user kills their own job. In this case it would be nice to have an
optimization that lets us preferentially reuse the same machine to avoid
rebuilding the leveldb.
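
To make the restore path in points 3 and 4 concrete, here is a minimal sketch
of rebuilding a task's store by replaying its changelog. It is only an
illustration, not Samza's implementation: the entry layout (key, value,
input offset), the use of None as a delete marker, and the restore_store
function are all assumptions made up for this example. Carrying the input
offset in each entry is the idea suggested in point 3, not something the docs
describe.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical changelog entry: (key, value, input_offset).
# A value of None marks a delete. This layout is an assumption for the sketch.
ChangelogEntry = Tuple[str, Optional[str], int]


def restore_store(changelog: Iterable[ChangelogEntry]) -> Tuple[Dict[str, str], int]:
    """Replay changelog entries in order to rebuild the key-value state.

    Returns the rebuilt store and the input offset recorded in the last
    entry, which is where the task would resume reading its input.
    """
    store: Dict[str, str] = {}
    last_input_offset = -1  # -1 means nothing has been replayed yet
    for key, value, input_offset in changelog:
        if value is None:
            store.pop(key, None)  # delete marker: drop the key if present
        else:
            store[key] = value    # later entries overwrite earlier ones
        last_input_offset = input_offset
    return store, last_input_offset


# Replaying a small changelog for one partition.
changelog = [("a", "1", 0), ("b", "2", 1), ("a", "3", 2), ("b", None, 3)]
store, resume_offset = restore_store(changelog)
```

Because later entries for a key overwrite earlier ones, replaying a compacted
changelog (only the latest entry per key) yields the same store as replaying
the full one.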

-Jay


On Mon, Sep 9, 2013 at 6:43 PM, Guozhang Wang <guwang@linkedin.com> wrote:

> Well written doc. I think I can understand every point even with limited
> knowledge of Samza :)
>
> A few comments:
>
> 1) stateful_job.png
>
> Could you move the changelog streams into the task boxes to show that they
> are collocated? And could we also add the KV stores into the box? At first
> glance I thought the changelog was actually the store. Also the name
> "changelog stream" is a little confusing. At first I wondered why this
> is called a stream rather than just a "log". I only understood it while
> reading the fault tolerance section.
>
> 2) "This does not impact consistency—a task always reads what it wrote
> (since it checks the cache first)"
>
> I think another main reason is that a task only reads and writes its own
> partition.
>
> 3) Do we need to guarantee consistency between the incoming stream and the
> changelog stream? For example, the changelog should indicate that this
> change entry is for the state processed until offset X of the incoming
> stream, so that on failover we know where we should set the offset of the
> incoming stream?
>
> 4) Would reading directly from LevelDB be a better approach than replaying
> the whole (although compacted) changelog? Writes to LevelDB should be
> persistent, and we only need to make sure we have checkpoints, for example,
> that the current state (excluding writes that are still in cache) represents
> all the changes up to the offset of the changelog?
>
> Guozhang
>
>
>
> ________________________________________
> From: Jay Kreps [jay.kreps@gmail.com]
> Sent: Thursday, September 05, 2013 5:14 PM
> To: dev@samza.incubator.apache.org
> Subject: state management docs
>
> I took a pass at improving the state management documentation (talking to
> people, I don't think anyone understood what we were saying):
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
>
> I would love to get some feedback on this, especially from anyone who
> doesn't already know Samza. Does this make any sense? Does it tell you what
> you need to know in the right order?
>
> -Jay
>
