samza-dev mailing list archives

From santhosh venkat <>
Subject Re: Periodic cleanup of unused local stores
Date Fri, 09 Sep 2016 02:17:24 GMT
Hi Navina,

Thanks for the review and the comments. Please find my replies inline.

1. "It is always very useful to provide more context to the reader, esp. in
explaining what the different terms mean (like host-affinity, tombstone,
etc.) and how they relate to the problem being described."

>> Updated the design doc with a glossary section, where the
terms are described briefly.

2. "The Host Affinity feature in Samza enables it to restore local state
from disk instead of bootstrapping the entire changelog" -> host-affinity
as a feature only tries to bring up the container on the same host as
before. This helps Samza leverage the locally persisted store data; it
doesn't actually restore state in any way.

>> I've rephrased it accordingly in the design doc.

3. "To achieve this, Samza stores local state for change logged stores in a
shared directory so it is not tied to a resource manager’s storage
structure and cleanup schedule." -> I think by shared directory, you are
referring to the yarn application's workspace. This shared workspace is
part of the NM (NodeManager), not the RM (ResourceManager). You can
rephrase this and, additionally, provide the logical path to the state
stores.

>> Yes, it was mentioned incorrectly. I've fixed it in the design doc.

4. " Expose an API in samza­rest that" -> Can you elaborate what the API
looks like ?

>> This API would take in jobId and jobName as parameters
and return the preferred host for all the tasks in the job.

Request URL:  http://Host:Port/v1/jobs/{jobName}/{jobId}/containers

Sample JSON response:

  {
    "jobName" : "Job name",
    "jobId" : "Job id",
    "containers" : [
      {
        "name" : "Container name",
        "id" : "1",
        "tasks" : [
          {
            "name" : "Task name",
            "partitions" : ["Id 1", "Id 2"],
            "preferredHost" : "Host name"
          }
        ]
      }
    ]
  }

Alternatively, granular APIs at the task and container levels could be
exposed rather than a single API returning the complete job model
hierarchy. However, to construct the complete job hierarchy with the
granular APIs, the job's coordinator stream has to be queried multiple
times (once for each of the containers and tasks), leading to performance
problems.

5. Is the rest-api to be invoked by the monitor for all jobs in the cluster
or all running jobs ? What is the criteria there? Please do mention them,
if any.

>> The monitor will use the REST API for all the jobs in the cluster
that have host-affinity enabled.
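To make the selection criterion concrete, here is a minimal sketch of how the monitor might filter the cluster's jobs. The job-listing shape is an assumption for illustration; `job.host-affinity.enabled` is the Samza config key that controls host-affinity.

```python
def jobs_to_monitor(jobs):
    """Keep only the jobs whose config enables host-affinity.
    `jobs` is assumed to be a list of {"name": ..., "config": {...}} dicts."""
    return [
        job["name"]
        for job in jobs
        if job.get("config", {}).get("job.host-affinity.enabled", "false") == "true"
    ]

all_jobs = [
    {"name": "wikipedia-feed", "config": {"job.host-affinity.enabled": "true"}},
    {"name": "wikipedia-stats", "config": {}},
]
print(jobs_to_monitor(all_jobs))  # only the host-affinity-enabled job
```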

Updated Design doc is here:

Please let me know your thoughts.

