samza-dev mailing list archives

From Chris Riccomini <criccom...@apache.org>
Subject Re: Dealing with partitioning mismatches between bootstrap and input streams
Date Tue, 07 Apr 2015 21:32:34 GMT
Hey Tommy,

Your summary sounds pretty accurate. One other way, which requires no
change to Samza, would be to repartition the input topic properly for each
task. This is kind of hacky, though.

(2) is the ideal solution. It is a bit of work, but it might not be so bad.
I think most of the changes would be isolated to the TaskStorageManager.
We'd also need to make the KV store read-only, which is pretty easy to do.
If you're not comfortable with it, though, then (1) would be your next-best
bet.
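
To make the mismatch concrete, here is a toy model in Python. It is not Samza code: it only mimics the idea of grouping stream partitions onto tasks by partition id, which is the behavior Tommy describes. The function name `group_by_partition`, the `broadcast` parameter, and the stream names `input` and `reference-data` are all made up for illustration. The `broadcast` argument stands in for what option (1) would do, assuming it hands every partition of the bootstrap stream to every task.

```python
from collections import defaultdict


def group_by_partition(streams, broadcast=()):
    """Toy model: map task id -> list of (stream, partition) pairs.

    streams   -- dict of stream name -> partition count
    broadcast -- stream names whose partitions go to every task
                 (hypothetical stand-in for option (1))
    """
    assignment = defaultdict(list)

    # Regular streams: partition p of every stream lands on task p.
    for stream, count in streams.items():
        if stream in broadcast:
            continue
        for partition in range(count):
            assignment[partition].append((stream, partition))

    # Broadcast streams: every task gets every partition.
    for stream in broadcast:
        for task in list(assignment):
            for partition in range(streams[stream]):
                assignment[task].append((stream, partition))

    return dict(assignment)


def tasks_reading(assignment, stream):
    """Return the sorted task ids that consume any partition of `stream`."""
    return sorted(
        task
        for task, ssps in assignment.items()
        if any(name == stream for name, _ in ssps)
    )


# Assumed setup: a 4-partition input topic and a 1-partition bootstrap topic.
streams = {"input": 4, "reference-data": 1}

default = group_by_partition(streams)
shared = group_by_partition(streams, broadcast={"reference-data"})

print(tasks_reading(default, "reference-data"))
print(tasks_reading(shared, "reference-data"))
```

In the first grouping only task 0 ever sees the bootstrap data, so tasks 1 through 3 have no state to join against. In the second, every task reads the single bootstrap partition, which is why option (1) costs a full copy of the state per task.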

Cheers,
Chris

On Tue, Apr 7, 2015 at 10:16 AM, Tommy Becker <tobecker@tivo.com> wrote:

> We have a Kafka topic containing data needed by several Samza jobs. These
> jobs will essentially read the data and build up state that will be used
> for processing their inputs. Ideally, we would use the topic as a bootstrap
> stream to build up this state. The problem with that is the topic
> containing the data has a single partition but the topics these jobs are
> processing as input have multiple partitions. So my understanding is that
> only one task instance in the job would actually process the bootstrap
> stream, and therefore any state it built up would be local to that task. So
> I'm thinking my options are the following:
>
> 1) Implement SAMZA-353 and allow the bootstrap SSP to be assigned to each
> task instance
> 2) Implement the shared state store component of SAMZA-402
> 3) Layer the shared state on top of Samza in our tasks themselves, maybe
> by using something like RocksDB directly.
>
> Number 1 seems easiest to implement, at the cost of having the entire state
> duplicated for each task. I'd prefer not to do number 3 given the
> existence of this feature on Samza's roadmap, but I am a bit concerned
> about the scope of work with number 2, and the fact that this is mostly
> Scala code.
>
> Are there any alternatives that I'm missing?  Note that we need to process
> the data stream as a bootstrap stream.  Using it as a changelog is
> insufficient because we need to be able to manipulate the data before
> building up the state store.
>
> --
> Tommy Becker
> Senior Software Engineer
>
> Digitalsmiths
> A TiVo Company
>
> www.digitalsmiths.com
> tobecker@tivo.com
