samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Pettitt <cpett...@linkedin.com>
Subject Re: Review Request 44920: Remove tight coupling of Samza with Yarn. Define APIs for resource manager integration
Date Thu, 17 Mar 2016 23:00:12 GMT


> On March 16, 2016, 8:48 p.m., Chris Pettitt wrote:
> > samza-core/src/main/java/org/apache/samza/clustermanager/SamzaAppState.java, line
36
> > <https://reviews.apache.org/r/44920/diff/1/?file=1301270#file1301270line36>
> >
> >     Agreed. This is a bit mix of atomic and mutable state. Either make it thread
safe (e.g. lock down mutability of fields), ensure that remaining mutable state can be updated
atomically, or decide it is not intended to be thread safe and get rid of additional locks
/ volatile accesses that add cost with no gain. Former is preferred :).
> 
> Jagadish Venkatraman wrote:
>     Great feedback Chris! I could not agree more! :-)
>     
>     SamzaAppState class is a currently a source of major problems. I did not want to
touch it (as it was not scoped in this refactoring). Upon digging further, I realize the problem
of making this thread-safe/private is slightly involved.
>     
>     1. There is a jobCoordinator object that is exposed publicly as a part of SamzaAppState.
The jobCoordinator inturn exposes a nested jobModel instance directly thorough its accessor.
The JobModel embeds a LocalityManager that mutates state during some public method calls.
Hence, The jobModel instance is *not* thread-safe and is shared concurrently across the UI
threads, the HTTP server threads in the queued thread pool,the SamzaAppMaster thread. (Created
SAMZA-899 to make the JobModel immutable)
>     
>     2. There are a bunch of state data structures that are publicly exposed in SamzaAppState.
These must be made thread-safe into accessors. These public global variables could be mutated
everywhere in Samza without regard for safety/visibility or correctness. For example, there
is an integer containerCount that is public which is manipulated by both the metrics reporter
and the callback threads. (I created SAMZA-901 to track this)
>     
>     I will work on both of these as these ASAP.
> 
> Jagadish Venkatraman wrote:
>     Just a clarification:
>     1. The JobModel instance is shared concurrently as stated in [1]. This presents a
source of *potential* problems. (I believe there is not an actual bug in the JobModel )
>     2. The containerCount is a public int that *could* be manipulated by both the reporter
and callback. I believe the current interaction does not have any races (since count is just
set at the startup once). But, having as public non-final int *could* be a source of potential
problems if it was modified elsewhere.

For #2, the state of containerCount in onContainerCompleted is undefined unless start transitively
"happens-before" onContainerCompleted. It may be that that holds, but it is not obvious to
me (without spending more time working out the call graph). If it's not a problem now it could
become one with a seemingly innocent refactoring.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44920/#review123902
-----------------------------------------------------------


On March 16, 2016, 6:23 p.m., Jagadish Venkatraman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44920/
> -----------------------------------------------------------
> 
> (Updated March 16, 2016, 6:23 p.m.)
> 
> 
> Review request for samza, Boris Shkolnik, Chris Pettitt, Jake Maes, Yi Pan (Data Infrastructure),
Navina Ramesh, and Xinyu Liu.
> 
> 
> Repository: samza
> 
> 
> Description
> -------
> 
> Samza currently has tight coupling with Yarn. This makes it impossible to integrate with
other resource managers like Mesos, or to run standalone without any resource manager. This
RB is a step to implementing SAMZA-881.
> 
> Design Doc: https://issues.apache.org/jira/secure/attachment/12790540/SamzaJobCoordinatorRe-designProposal.pdf
> 
> 1.Proposed new APIs for a resource manager to integrate with Samza. (SAMZA-881)
>    - Defined the ContainerProcessManager abstraction, SamzaResource, SamzaResourceRequest.

>    - Re-wrote the SamzaAppMaster into a ClusterBasedJobCoordinator.
>    - Re-wrote yarn specific request logic by abstracting it into a YarnContainerManager.

> 2.Defined a ClusterManagerConfig to handle configurations independent of Yarn/Mesos.
> 3.Made Samza's cluster interaction independent of Yarn. This separates Samza specific
components into samza-core and Yarn components into samza-yarn.
> 4.Readability improvements to the existing code base.
>    -Added docs for most methods, member variables and classes (including on thread-safety)
>    - Made internal variables final to document intent, visibility across threads. (trivially
by adding modifiers, or by changing where they're initialized.)
> 5.Refactored JobCoordinator into a JobModelReader.
> 
> TODO: Can go into the upcoming release. (P0)
> 1.Refactor the UI state variables and tests. Port some method re-orgs from SAMZA-867
into here.
> 2.Revise packaging structure.
> 4.Document new configs.
> 5.Rename run-am.sh to run-coordinator.sh, Delete all files in the non-refactored namespace.
(For unit-testing, these files continue to exist)
> 
> TODO: (P1)
> 1.Build Mesos integration for Samza. Should be simpler to integrate with the newer APIs.
>   - I started on this, and I plan to refine and post an RB in one of the hack-days.
> 2.Refactor the SamzaAppState class to provide more accessors and eliminate public variables.
(This was
> a consequence of the already existing design which I've tried to be compatible with)
> 
> TODO: I plan to track these with JIRAs so that they can be done later. (P2)
> 1.Get rid of the HTTP Server in the JobCoordinator
> 2.Make YarnJobCoordinator implement the JobCoordinator API as SAMZA-881.
> 
> 
> Diffs
> -----
> 
>   samza-core/src/main/java/org/apache/samza/clustermanager/AbstractContainerAllocator.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/ClusterBasedJobCoordinator.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/ContainerAllocator.java PRE-CREATION

>   samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManagerFactory.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/ContainerRequestState.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/HostAwareContainerAllocator.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/ResourceFailure.java PRE-CREATION

>   samza-core/src/main/java/org/apache/samza/clustermanager/SamzaAppState.java PRE-CREATION

>   samza-core/src/main/java/org/apache/samza/clustermanager/SamzaContainerLaunchException.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/SamzaResource.java PRE-CREATION

>   samza-core/src/main/java/org/apache/samza/clustermanager/SamzaResourceRequest.java
PRE-CREATION 
>   samza-core/src/main/java/org/apache/samza/clustermanager/SamzaResourceStatus.java PRE-CREATION

>   samza-core/src/main/java/org/apache/samza/clustermanager/SamzaTaskManager.java PRE-CREATION

>   samza-core/src/main/java/org/apache/samza/config/ClusterManagerConfig.java PRE-CREATION

>   samza-core/src/main/scala/org/apache/samza/coordinator/JobModelReader.scala PRE-CREATION

>   samza-core/src/main/scala/org/apache/samza/metrics/SamzaAppMasterMetrics.scala PRE-CREATION

>   samza-shell/src/main/bash/run-am.sh 9545a96953baaff17ad14962e02bc12aadbb1101 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/refactor/YarnAppState.java PRE-CREATION

>   samza-yarn/src/main/java/org/apache/samza/job/yarn/refactor/YarnContainerManager.java
PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/refactor/YarnContainerManagerFactory.java
PRE-CREATION 
>   samza-yarn/src/main/java/org/apache/samza/job/yarn/refactor/YarnContainerRunner.java
PRE-CREATION 
>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/YarnAppMasterListener.scala 6bf3046a1ae4ed8f57500acae763184084ad0e09

>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/refactor/SamzaAppMasterService.scala
PRE-CREATION 
>   samza-yarn/src/main/scala/org/apache/samza/job/yarn/refactor/SamzaYarnAppMasterLifecycle.scala
PRE-CREATION 
>   samza-yarn/src/test/scala/org/apache/samza/job/yarn/TestSamzaAppMasterLifecycle.scala
30cf34fe1fd3f74537d16e8a51b467cd50835357 
>   samza-yarn/src/test/scala/org/apache/samza/job/yarn/TestSamzaAppMasterService.scala
7f5d9f4af088589d4287c26737bae25567c157d7 
> 
> Diff: https://reviews.apache.org/r/44920/diff/
> 
> 
> Testing
> -------
> 
> 1.Tested with running test jobs in Yarn clusters of varying sizes from 1 node to 36 nodes.
> 2.Tested for failures by killing containers, and ensuring they were brought up again.
> 
> TODO:
> 1.Refactor all unit tests, and ensure updates to state variables are consistent with
the current design.
> 2.More testing with test jobs on clusters.
> 
> 
> Thanks,
> 
> Jagadish Venkatraman
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message