hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9639) truly shared cache for jars (jobjar/libjar)
Date Mon, 09 Dec 2013 11:22:11 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843068#comment-13843068 ]

Steve Loughran commented on HADOOP-9639:

Some quick comments on this

# The upload mechanism assumes that rename() is atomic. This should be spelled out, to avoid
people trying to use blobstores (where rename is typically not atomic) as their cache infrastructure.
# Obviously: add a specific exception to indicate some kind of race condition.
# The shared cache enabled flags are obviously things that admins would have the right to set
and mark final in yarn-site.xml files; clients need to handle this without problems.
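
The first two points can be sketched roughly as below. This is only an illustration under stated assumptions: it uses java.nio.file on a local filesystem as a stand-in for the cache store rather than the Hadoop FileSystem API, and the names {{AtomicPublish}} and {{CacheUploadRaceException}} are hypothetical, not taken from the design document. The store's inability to rename atomically is surfaced as an explicit failure, and a losing upload gets its own exception type.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical exception: another client already published this cache entry. */
class CacheUploadRaceException extends IOException {
    CacheUploadRaceException(String message) {
        super(message);
    }
}

/** Sketch only: a local filesystem stands in for the shared cache store. */
class AtomicPublish {

    /** Writes to a temporary file, then renames it into place in one step. */
    static void publish(byte[] content, Path finalPath) throws IOException {
        Path dir = finalPath.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "upload-", ".tmp");
        try {
            Files.write(temp, content);
            if (Files.exists(finalPath)) {
                // Advisory check only: a competing writer could still appear
                // between this check and the move below.
                throw new CacheUploadRaceException("entry already published: " + finalPath);
            }
            // Request an atomic rename explicitly, so a store that cannot
            // provide one fails loudly instead of silently copying.
            Files.move(temp, finalPath, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            throw new IOException("store cannot rename atomically: " + dir, e);
        } finally {
            // No-op after a successful move; cleans up after any failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

Readers never see a half-written entry because the final name only appears once the content is complete. Whether an existing target is replaced by an atomic move is implementation specific in java.nio.file, which is exactly the kind of behaviour the design would need to pin down for its own store.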

Security:

# You also have to think about preserving the security of files I don't want to
share with others, either by allowing me to mix cached with uncached files (the latter holding
configuration resources with sensitive information), or even by not letting others in the cluster
know what binaries I'm pushing around. Presumably clusters that care about such things will just
disable the cache altogether, but there is the use case of "shared cache for most data, some
private resources". If that use case is not to be supported, we should at least call it out.

co-ordination wise

#  I (personally) think we should all just embrace the presence of 1+ ZK quorum on the cluster,
as the core infrastructure HA systems need it, and it would avoid everyone trying to write
their own "let's use the filesystem as a way to synchronize clients based on the assumption that
FileSystem.create() with overwrite==false guarantees unique access" code. But that's just an opinion:
I don't think a side-feature should force the action, but at the same time, if the cache
is optional, ZK could be made a prerequisite for caching. It would fundamentally change
how confident we could be that the system would be correct, even on filesystems that break
the assumptions of POSIX more significantly than HDFS does.
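
For reference, the "filesystem as a lock" pattern described above looks roughly like the sketch below. It is an assumption-laden illustration: java.nio.file's {{Files.createFile()}} on a local filesystem stands in for FileSystem.create() with overwrite==false, and {{CreateExclusiveLock}} is a hypothetical name. The pattern is only as sound as the store's guarantee that create-if-absent is atomic.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch only: a marker file used as a mutual-exclusion token. */
class CreateExclusiveLock {

    /** Returns true if this caller created the marker and so holds the lock. */
    static boolean tryAcquire(Path marker) throws IOException {
        try {
            // Existence check and creation are a single atomic operation here;
            // the whole scheme breaks on stores that cannot promise that.
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Releases the lock. Nothing does this automatically if the holder crashes. */
    static void release(Path marker) throws IOException {
        Files.deleteIfExists(marker);
    }
}
```

The weak spots are visible in the comments: correctness depends entirely on the store's create semantics, and a crashed holder leaves a stale marker behind that something else has to clean up.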

* [HADOOP-9361|https://github.com/steveloughran/hadoop-trunk/tree/stevel/HADOOP-9361-filesystem-contract/hadoop-common-project/hadoop-common/src/site/markdown/filesystem]
is attempting to formally define the semantics of a Hadoop-compatible filesystem. If you could
use that as the foundation for your assumptions, and perhaps even its [notation|https://github.com/steveloughran/hadoop-trunk/blob/stevel/HADOOP-9361-filesystem-contract/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/notation.md]
for defining your own behavior, the analysis on P7 could be proved more rigorously.

* The semantics of {{happens-before}} come from [Lamport78], [Time, Clocks and the Ordering
of Events in a Distributed System|http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf],
so that should be used as the citation, as it is more appropriate than the memory models of Java or
out-of-order CPUs.

* Script-wise, I've been evolving a [generic YARN service launcher|https://github.com/hortonworks/hoya/tree/master/hoya-core/src/main/java/org/apache/hadoop/yarn/service/launcher],
which is nearly ready to submit as [YARN-679]: if the cleaner service were implemented as
a YARN service it could be invoked as a run-once command line, or deployed in a YARN container
service which provides cron-like services.
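
On the {{happens-before}} point above, the logical clock from the Lamport paper is small enough to sketch. This is a minimal illustration of the clock rules only, not anything from the design document; {{LamportClock}} and its method names are hypothetical.

```java
/** Sketch of a Lamport logical clock for a single process. */
class LamportClock {
    private long time = 0;

    /** Local event or message send: advance the clock and return the timestamp. */
    synchronized long tick() {
        return ++time;
    }

    /** Message receipt: move past both local time and the sender's timestamp. */
    synchronized long receive(long messageTime) {
        time = Math.max(time, messageTime) + 1;
        return time;
    }

    /** Current clock value, without advancing it. */
    synchronized long now() {
        return time;
    }
}
```

If event a happens-before event b, these rules give a a smaller timestamp than b. The converse does not hold: a smaller timestamp alone does not show that one event happened before another.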

> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>                 Key: HADOOP-9639
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9639
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: filecache
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf,
> Currently there is the distributed cache that enables you to cache jars and files so
> that attempts from the same job can reuse them. However, sharing is limited with the distributed
> cache because it is normally on a per-job basis. On a large cluster, sometimes copying of
> jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth,
> not to speak of defeating the purpose of "bringing compute to where data is". This is wasteful
> because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared cache so that
> multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion.

This message was sent by Atlassian JIRA