spark-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
Date Wed, 13 Feb 2019 11:41:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767077#comment-16767077 ]

Steve Loughran commented on SPARK-23534:
----------------------------------------

bq. am curious to know if hadoop3 offers much performance benefit

If you are using S3 as the destination of work, you get an output committer which is O(files),
not O(data), and which can cope with an inconsistent store (HADOOP-13786). I'm not sure what else
you can point to and say "tangible speedup", though you can point to stuff and say "tangible
functionality improvement".
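
As a rough illustration of what switching to one of those committers involves, the sketch below collects the options usually set for it and renders them as spark-submit arguments. The option names and classes are the ones described in the Hadoop S3A committer and Spark cloud-integration documentation; they assume a Spark build that includes the spark-hadoop-cloud module, and the choice of the "directory" committer is only an example. The helper name `to_submit_args` is made up for this sketch.

```python
# Sketch only: settings commonly used to select an S3A committer from Spark.
# Assumes the spark-hadoop-cloud classes are on the classpath.
S3A_COMMITTER_CONF = {
    # Which S3A committer to use; "directory" is just an example choice.
    "spark.hadoop.fs.s3a.committer.name": "directory",
    # Route s3a:// output through the S3A committer factory.
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    # Spark-side bindings to the Hadoop PathOutputCommitter API.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_submit_args(conf):
    """Render a dict of settings as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(S3A_COMMITTER_CONF)))
```

The same keys could equally go into spark-defaults.conf; building them as a dict just makes it easy to see the whole set in one place.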

With Hadoop 3.2, Spark can generate delegation tokens for an S3 filesystem during spark-submit
(HADOOP-14556) and include them in the YARN app launch. This lets you deploy a cluster in
EC2 with the VMs running in an IAM role with lower privileges than you: a generated session
login and your encryption secrets will come with the job. This is very slick. And if you ask
for role delegation tokens, the generated token is limited to the specific S3 bucket and
DynamoDB table you are working with. Video of distcp in action: https://www.youtube.com/watch?v=rpyLkDEzIxI
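
For a feel of what asking for session or role tokens looks like in configuration, here is a hedged sketch. It assumes the option names and binding classes given in the S3A delegation token documentation (`fs.s3a.delegation.token.binding`, `fs.s3a.assumed.role.arn`), passed through Spark's `spark.hadoop.` prefix. The function name and the ARN in the usage line are hypothetical.

```python
# Sketch only: choose an S3A delegation token binding for spark-submit.
# Class names are taken from the S3A delegation token docs (an assumption here).
BINDINGS = {
    "session": "org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding",
    "role": "org.apache.hadoop.fs.s3a.auth.delegation.RoleTokenBinding",
}


def delegation_token_conf(kind="session", role_arn=None):
    """Return the Spark settings for the requested kind of S3A delegation token."""
    if kind not in BINDINGS:
        raise ValueError(f"unknown token kind: {kind!r}")
    conf = {"spark.hadoop.fs.s3a.delegation.token.binding": BINDINGS[kind]}
    if kind == "role":
        # Role tokens need to know which IAM role to assume.
        if not role_arn:
            raise ValueError("role tokens need the ARN of the role to assume")
        conf["spark.hadoop.fs.s3a.assumed.role.arn"] = role_arn
    return conf


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    for key, value in delegation_token_conf(
        "role", "arn:aws:iam::123456789012:role/example"
    ).items():
        print(f"--conf {key}={value}")
```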

It also ships with the abfs:// connector to Azure Data Lake Storage Gen2, Microsoft's latest
iteration of Azure storage.
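
Paths for that connector follow the pattern `abfs://<container>@<account>.dfs.core.windows.net/<path>`, with `abfss://` as the TLS variant. A minimal sketch of building one follows; the helper name and the container and account values are made up for illustration.

```python
# Sketch only: build an abfs:// (or abfss://) URI for an ADLS Gen2 location.
def abfs_uri(container, account, path="", secure=False):
    """Return the connector URI for a path inside a storage container."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical container and account names.
    print(abfs_uri("data", "myaccount", "/output/run1"))
```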

> Spark run on Hadoop 3.0.0
> -------------------------
>
>                 Key: SPARK-23534
>                 URL: https://issues.apache.org/jira/browse/SPARK-23534
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.3.0
>            Reporter: Saisai Shao
>            Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make sure Spark
> can run with Hadoop 3.0. This Jira tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).




