hadoop-common-issues mailing list archives

From "Elek, Marton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14898) Create official Docker images for development and testing features
Date Thu, 01 Mar 2018 16:59:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382311#comment-16382311 ]

Elek, Marton commented on HADOOP-14898:

Just some minor clarifications:

 * Images are built by Dockerhub. INFRA can register the branches on Dockerhub, and Dockerhub
will automatically fetch the branch and rebuild the images on every change. (The builds come
from the docker-* branches, which contain only the Dockerfile files, not from the real source
branches.)

 * I proposed to use the binary releases, which are voted on by the PMC members, for the
image creation. The image will contain exactly the same Hadoop as the downloadable release.
There *won't be an mvn build* during the docker image creation, just some packaging (download
Hadoop and extract it).

 * The hadoop-runner image won't be released with every hadoop release. Ideally we need to
create it only once. As it contains only a startup script, I think we can handle it outside
of the release process (this is how other Apache projects handle it).

 * HADOOP-15259 will provide a Dockerfile for developer images with a snapshot version of
Hadoop, built *locally*. They won't be uploaded to Dockerhub.

 * apache/hadoop-runner:latest -- contains only the base OS + a very simple starter script.
Only the latest version is available, no releases.
 * apache/hadoop:2 -- contains the latest stable hadoop 2 release (official binary release,
approved by the PMC) on top of hadoop-runner:latest.
 * apache/hadoop:3 -- same, but with the latest hadoop 3 release. (It should be updated *after*
the release vote with the official download link.)
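
To make the packaging-only build above concrete, a minimal Dockerfile could look roughly like
this (a sketch only: the base image name, download URL, and layout are assumptions, not the
final design):

{code}
# Sketch: package an already-voted binary release; no mvn build happens here.
# Assumes an apache/hadoop-runner base image and the standard ASF archive URL.
FROM apache/hadoop-runner:latest

ENV HADOOP_VERSION=3.0.0
# Download the official binary release and extract it; the image then
# contains exactly the bits the PMC voted on.
RUN curl -LSs -o /tmp/hadoop.tar.gz \
      https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz \
 && tar -xzf /tmp/hadoop.tar.gz -C /opt \
 && rm /tmp/hadoop.tar.gz \
 && ln -s /opt/hadoop-${HADOOP_VERSION} /opt/hadoop
{code}

With this shape, updating an image for a new release only means bumping HADOOP_VERSION; no
compiler or mvn is ever involved.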

I would be more than happy to participate in a call if you are interested in how it could
be used, or to discuss the implementation. Please propose a time, or we can also continue
the discussion here.

> Create official Docker images for development and testing features 
> -------------------------------------------------------------------
>                 Key: HADOOP-14898
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14898
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>         Attachments: HADOOP-14898.001.tar.gz, HADOOP-14898.002.tar.gz, HADOOP-14898.003.tgz,
> This is the original mail from the mailing list:
> {code}
> TL;DR: I propose to create official hadoop images and upload them to the dockerhub.
> GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use docker-based
> recipes to start hadoop clusters with various configurations.
> The images could also be used to test experimental features. For example, Ozone could
> be tested easily with this compose file and configuration:
> https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
> Or even the configuration could be included in the compose file:
> https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
> I would like to create separate example compose files for federation, HA, metrics usage,
> etc., to make it easier to try out and understand the features.
> CONTEXT: There is an existing Jira https://issues.apache.org/jira/browse/HADOOP-13397
> But it’s about a tool to generate production-quality docker images (multiple types,
> in a flexible way). If there are no objections, I will create a separate issue to create
> simplified docker images for rapid prototyping and investigating new features, and register
> the branch on Dockerhub to create the images automatically.
> MY BACKGROUND: I have been working with docker-based hadoop/spark clusters for quite a
> while and have run them successfully in different environments (Kubernetes, docker-swarm,
> Nomad-based scheduling, etc.). My work is available from here: https://github.com/flokkr,
> but it handles more complex use cases (e.g. instrumenting java processes with btrace, or
> reading/reloading configuration from Consul).
>  And IMHO it's better for the official hadoop documentation to suggest the official Apache
> docker images and not external ones (which could change at any time).
> {code}
> The following list enumerates the key decision points regarding docker image creation:
> A. automated dockerhub build  / jenkins build
> Docker images could be built on Dockerhub (a branch pattern and the location of the
> Dockerfiles must be defined for a github repository), or they could be built on a CI
> server and pushed.
> The second option is more flexible (it's easier to create a matrix build, for example).
> The first one has the advantage that we get an additional flag on Dockerhub indicating
> that the build is automated (and built from the source by Dockerhub).
> The decision is easy, as the ASF supports the first approach (see https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096).
> B. source: binary distribution or source build
> The second question is about how to create the docker image. One option is to build the
> software on the fly during the creation of the docker image; the other is to use the
> binary releases.
> I suggest using the second approach, as:
> 1. In that case hadoop:2.7.3 could contain exactly the same hadoop distribution as the
> downloadable one.
> 2. We don't need to add development tools to the image, so the image can be much smaller
> (which is important, as the goal of this image is getting started as fast as possible).
> 3. The docker definition will be simpler (and easier to maintain).
> This approach is usually used in other projects (I checked Apache Zeppelin and Apache
> C. branch usage
> Another question is the location of the Dockerfile. It could live on the official
> source-code branches (branch-2, trunk, etc.), or we can create separate branches for
> Dockerhub (e.g. docker/2.7, docker/2.8, docker/3.0).
> With the first approach it's easier to find the docker images, but it's less flexible.
> For example, if we had a Dockerfile on the source-code branch, it would have to be used
> for every release (for example, the Dockerfile from the tag release-3.0.0 would be used
> for the 3.0 hadoop docker image). In that case the release process becomes much harder:
> in case of a Dockerfile error (which can be tested on Dockerhub only after the tagging),
> a new release would be needed after fixing the Dockerfile.
> Another problem is that with tags it's not possible to improve the Dockerfiles. I can
> imagine that we would like to improve, for example, the hadoop:2.7 images (e.g. by adding
> smarter startup scripts) while using exactly the same hadoop 2.7 distribution.
> Finally, with the tag-based approach we can't create images for the older releases
> (2.8.1, for example).
> So I suggest creating separate branches for the Dockerfiles.
> D. Versions
> We can create a separate branch for every version (2.7.1/2.7.2/2.7.3) or just for the
> main versions (2.8/2.7). As these docker images are not for production but for prototyping,
> I suggest using (at least as a first step) just 2.7/2.8 and updating the images on every
> bugfix release.
> E. Number of images
> There are two options here, too: create a separate image for every component (namenode,
> datanode, etc.), or just one image, where the command must be defined manually everywhere.
> The second seems more complex (to use), but I think the maintenance is easier, and it's
> more visible what should be started.
> F. Snapshots
> According to the spirit of the Release policy:
> https://www.apache.org/dev/release-distribution.html#unreleased
> we should distribute only final releases to Dockerhub and not snapshots. But we can also
> create an empty hadoop-runner image, which contains the starter scripts but not hadoop.
> It would be used for local development, where the newly built distribution could be
> mapped into the image with docker volumes.
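
The single-image idea from point E and the volume mapping from point F could be sketched
together in a compose file like this (service names, the entrypoint contract, the local
distribution path, and the port are illustrative assumptions, not a final layout):

{code}
# Sketch: one hadoop-runner image, the component chosen per service via
# "command", and a locally built snapshot distribution mapped in with a
# volume so that no snapshot artifact is ever published to Dockerhub.
version: "3"
services:
  namenode:
    image: apache/hadoop-runner:latest
    command: ["hdfs", "namenode"]
    volumes:
      - ./hadoop-dist/target/hadoop-3.1.0-SNAPSHOT:/opt/hadoop
    ports:
      - 9870:9870
  datanode:
    image: apache/hadoop-runner:latest
    command: ["hdfs", "datanode"]
    volumes:
      - ./hadoop-dist/target/hadoop-3.1.0-SNAPSHOT:/opt/hadoop
{code}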

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
