spark-issues mailing list archives

From "angerszhu (Jira)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-29018) Build spark thrift server on it's own code based on protocol v11
Date Wed, 18 Dec 2019 14:13:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

angerszhu updated SPARK-29018:
------------------------------
    Description: 
h2. Background

    With the development of Spark and Hive, the current sql/hive-thriftserver module
requires a lot of work to resolve code conflicts across different built-in Hive versions.
This is tedious, never-ending work, and these issues have limited our ability
and convenience to develop new features for Spark’s thrift server.

    We propose to implement a new thrift server and JDBC driver based on Hive’s latest
TCLIService.thrift protocol (v11). The new thrift server will have the following features:
 # Build a new module, spark-service, as Spark’s thrift server
 # No need for as much reflection and inherited code as the current `hive-thriftserver` module
 # Support all functions the current `sql/hive-thriftserver` supports
 # All code maintained by Spark itself, with no dependency on Hive
 # Support the existing functions in Spark’s own way, no longer limited by Hive’s code
 # Support running with or without a Hive metastore
 # Support user impersonation for multi-tenancy by splitting Hive authentication from DFS authentication
 # Support session hooks with Spark’s own code
 # Add a new JDBC driver, spark-jdbc, with Spark’s own connection URL `jdbc:spark://<host>:<port>/<db>`
 # Support both hive-jdbc and spark-jdbc clients, so most clients and BI platforms keep working

h2. How to start?

     Start the new thrift server with *sbin/start-spark-thriftserver.sh*
and stop it with *sbin/stop-spark-thriftserver.sh*. We no longer need HiveConf’s configurations
to determine the behavior of the spark thrift server: all needed configuration is implemented
by Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is
only used to connect to the Hive metastore. All needed conf can be written in *conf/spark-defaults.conf*
or passed on the startup command with *--conf*.
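
As a concrete illustration, a *conf/spark-defaults.conf* fragment might look like the sketch below. Only `spark.sql.thriftserver.proxy.user` is named elsewhere in this proposal; the final key set is defined in `ServiceConf`, and the user value is hypothetical.

```
# Illustrative settings for the new spark thrift server in conf/spark-defaults.conf
spark.sql.thriftserver.proxy.user   etl_user

# Equivalently, the same conf can be passed at startup:
#   sbin/start-spark-thriftserver.sh --conf spark.sql.thriftserver.proxy.user=etl_user
```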
h2. How to connect through jdbc?

   We now support both hive-jdbc and spark-jdbc; users can choose whichever they prefer.
h3. spark-jdbc
 # Use `SparkDriver` as the JDBC driver class
 # Connection URL `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`,
mostly the same as Hive’s but with Spark’s own URL prefix `jdbc:spark`
 # For proxying, users of SparkDriver should set the proxy conf `spark.sql.thriftserver.proxy.user=username`
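
The items above can be sketched as follows. The URL scheme and the proxy conf key come from this proposal, but since the spark-jdbc module is not yet merged, the `SparkDriver` package name is an assumption and the actual connection calls are left as comments.

```java
// Sketch of the proposed spark-jdbc connection URL.
public class SparkJdbcExample {

    // Assemble jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName
    static String buildUrl(String[] hostPorts, String db) {
        return "jdbc:spark://" + String.join(",", hostPorts) + "/" + db;
    }

    public static void main(String[] args) {
        String url = buildUrl(new String[]{"node1:10000", "node2:10000"}, "default");
        System.out.println(url); // jdbc:spark://node1:10000,node2:10000/default

        // Against a running server, the connection would look roughly like:
        //   Class.forName("org.apache.spark.sql.jdbc.SparkDriver"); // assumed class name
        //   java.util.Properties props = new java.util.Properties();
        //   props.setProperty("spark.sql.thriftserver.proxy.user", "etl_user"); // proxy conf
        //   java.sql.Connection conn = java.sql.DriverManager.getConnection(url, props);
    }
}
```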

h3. hive-jdbc
 # Use `HiveDriver` as the JDBC driver class
 # Connection URL `jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`,
unchanged from before
 # For proxying, users of HiveDriver should set the proxy conf `hive.server2.proxy.user=username`;
the current server supports both proxy configs
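
For comparison, a hive-jdbc URL against the new server can carry the proxy user as a session variable in the URL. `org.apache.hive.jdbc.HiveDriver` is the existing Hive driver class; the connection calls are left as comments since they need a running server.

```java
// Sketch of a hive-jdbc connection URL, optionally with a proxy user.
public class HiveJdbcExample {

    // jdbc:hive2://<hosts>/<db>[;hive.server2.proxy.user=<user>]
    static String buildUrl(String hostPorts, String db, String proxyUser) {
        String url = "jdbc:hive2://" + hostPorts + "/" + db;
        if (proxyUser != null) {
            url += ";hive.server2.proxy.user=" + proxyUser;
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("node1:10000,node2:10000", "default", "etl_user"));
        // Against a running server:
        //   Class.forName("org.apache.hive.jdbc.HiveDriver");
        //   java.sql.Connection conn = java.sql.DriverManager
        //       .getConnection(buildUrl("node1:10000", "default", null), "user", "");
    }
}
```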

h2. How is it done today, and what are the limits of current practice?
h3. Current practice

We have completed the two modules `spark-service` & `spark-jdbc`. They run well: we ported
the original unit tests to these two modules and they all pass. For impersonation, we have
written the code and tested it in our kerberized environment; it works well and is waiting for
review. We will now raise PRs against the apache/spark master branch step by step.
h3. Here are some known changes:
 # No Hive code is used in the `spark-service` and `spark-jdbc` modules
 # In the new service, the default rc-file suffix `.hiverc` is replaced by `.sparkrc`
 # When using SparkDriver as the JDBC driver class, the URL should be jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list
 # When using SparkDriver as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name`
 # `hiveconf` and `hivevar` session confs are still supported through hive-jdbc connections

h2. What are the risks?

    This is an entirely new module; it does not change other modules’ code except to support
impersonation. Apart from impersonation, we have ported a large number of the original unit
tests (adapted to the grammar without Hive), and they all pass. I have tested impersonation in
our kerberized environment, but it still needs detailed review since it changes a lot.
h2. How long will it take?

       We have finished all of this work in our own repo; we now plan to merge the code into
master step by step.
 # Phase 1: PR to build the new module *spark-service* in folder *sql/service*
 # Phase 2: PR with the thrift protocol and the generated thrift protocol Java code
 # Phase 3: PR with all *spark-service* module code, a description of the design, and unit tests
 # Phase 4: PR to build the new module *spark-jdbc* in folder *sql/jdbc*
 # Phase 5: PR with all *spark-jdbc* module code and unit tests
 # Phase 6: PR to support thrift server impersonation
 # Phase 7: PR to build Spark's own beeline client, *spark-beeline*
 # Phase 8: PR with Spark's own CLI client code to support *Spark SQL CLI*, in a module named *spark-cli*

h3. Appendix A. Proposed API Changes. Optional section defining API changes, if any. Backward
and forward compatibility must be taken into account.

Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as below:

 # Add a new class org.apache.spark.sql.service.internal.ServiceConf containing all needed
configuration for the spark thrift server
 # ServiceSessionXxx classes correspond to the original HiveSessionXxx classes
 # In ServiceSessionImpl, remove code Spark won’t use
 # In ServiceSessionImpl, set session conf directly on sqlConf, like [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69]
 # Remove SparkSQLSessionManager and move its logic into SessionManager
 # Move all OperationManager logic into SparkSQLOperationManager and rename it to OperationManager
 # Add SQLContext to ServiceSessionImpl as its own variable; don’t pass it through SparkSQLOperationManager,
just get it via parentSession.getSqlContext(). Session conf is set on this sqlContext.sqlConf
 # Remove HiveServer2, since we don’t need its logic
 # Remove the Hive impersonation logic, since it won’t be useful in the spark thrift server,
and remove the delegationTokenStr parameter in ServiceSessionImplWithUGI [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353];
we will use a new way for Spark’s impersonation
 # Remove ThriftserverShimUtils, since we don’t need it
 # Remove SparkSQLCLIService and just use CLIService
 # Remove ReflectionUtils and ReflectedCompositeService, since we don’t need inheritance and
reflection
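
The session/context wiring described above (items 4 and 7) can be sketched as follows. All types here are simplified stand-ins for the proposed spark-service classes, not the real API: the session owns its SQLContext, and the operation manager fetches it via parentSession.getSqlContext() instead of having it passed in.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionWiringSketch {
    // Stand-in for Spark's SQLContext holding per-session conf.
    static class SqlContext {
        final Map<String, String> sqlConf = new HashMap<>();
    }

    // Stand-in for the proposed ServiceSessionImpl.
    static class ServiceSession {
        private final SqlContext ctx = new SqlContext();
        SqlContext getSqlContext() { return ctx; }
        // Session conf is written directly into the session's sqlConf.
        void setConf(String key, String value) { ctx.sqlConf.put(key, value); }
    }

    // Stand-in for the proposed OperationManager: no SQLContext is passed in;
    // it is obtained from the parent session.
    static class OperationManager {
        SqlContext contextFor(ServiceSession parentSession) {
            return parentSession.getSqlContext();
        }
    }

    public static void main(String[] args) {
        ServiceSession session = new ServiceSession();
        session.setConf("spark.sql.shuffle.partitions", "8");
        SqlContext ctx = new OperationManager().contextFor(session);
        System.out.println(ctx.sqlConf.get("spark.sql.shuffle.partitions")); // prints 8
    }
}
```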


> Build spark thrift server on it's own code based on protocol v11
> ----------------------------------------------------------------
>
>                 Key: SPARK-29018
>                 URL: https://issues.apache.org/jira/browse/SPARK-29018
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

