Hi Dharmin
With the first approach, you will have to read the properties from the file shipped via --files, along these lines:
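(The snippet was not included in the original message; below is a minimal sketch of what it could look like. It assumes the file was shipped as --files /home/siddesh/hbase-site.xml, so each YARN container sees it under its bare file name.)

```scala
// Sketch only: a file passed with --files is copied into each YARN
// container's working directory under its original file name, so the
// executor-side code can open it by bare name.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hbaseConf = new Configuration()
hbaseConf.addResource(new Path("hbase-site.xml")) // bare name = the --files name
val quorum = hbaseConf.get("hbase.zookeeper.quorum")
```

On the driver side (client mode) you can resolve the local copy with org.apache.spark.SparkFiles.get("hbase-site.xml") instead of relying on the working directory.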

Alternatively, you can copy the file to HDFS, read it using sc.textFile, and use the properties from it.
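A minimal sketch of that approach, assuming a simple key=value properties file at a hypothetical HDFS path and an existing Hadoop Configuration named hbaseConf:

```scala
// Hypothetical path; "hbaseConf" is an org.apache.hadoop.conf.Configuration
// created elsewhere. Each non-comment line is assumed to be "key=value".
val props = sc.textFile("hdfs:///user/siddesh/hbase.properties")
  .filter(line => line.contains("=") && !line.trim.startsWith("#"))
  .map { line =>
    val Array(k, v) = line.split("=", 2)
    (k.trim, v.trim)
  }
  .collectAsMap()

// Apply the collected entries to the configuration on the driver.
props.foreach { case (k, v) => hbaseConf.set(k, v) }
```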

If you add files using --files, they get copied to each executor's working directory, but you still have to read the file yourself and set its properties in the configuration.

On Fri, Feb 23, 2018 at 10:25 AM, Dharmin Siddesh J <siddeshjdharmin@gmail.com> wrote:

I am trying to write a Spark program that reads data from HBase and stores it in a DataFrame.

I am able to run it perfectly with hbase-site.xml in the $SPARK_HOME/conf folder, but I am facing a few issues here.

Issue 1

The first issue is passing the hbase-site.xml location with the --files parameter when submitting in client mode (it works in cluster mode).

When I removed hbase-site.xml from $SPARK_HOME/conf and tried to run in client mode on YARN, passing the file with the --files parameter, I kept getting the exception below (which I think means it is not picking up the ZooKeeper configuration from hbase-site.xml):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --files /home/siddesh/hbase-site.xml \
  --class com.orzota.rs.json.HbaseConnector \
  --packages com.hortonworks:shc:1.0.0-2.0-s_2.11 \
  --repositories http://repo.hortonworks.com/content/groups/public/ \



18/02/22 01:43:09 INFO ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
18/02/22 01:43:09 WARN ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)

However, it works fine when I run it in cluster mode.

Issue 2

The second issue is passing the HBase configuration details through the Spark session, which I can't get to work in either client or cluster mode.

Instead of passing the entire hbase-site.xml, I am trying to add the configuration directly in the code as configuration parameters on the SparkSession, e.g.:

val spark = SparkSession
  .builder()
  .config("hbase.zookeeper.property.clientPort", "2181")
  .config("hbase.zookeeper.quorum", "ip1,ip2,ip3")
  .getOrCreate()

val json_df = ...
This is not working in cluster mode either.
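For completeness, a read through the SHC connector referenced in the --packages option above typically looks like the sketch below; the catalog JSON here is a made-up example (table "json_table" with one column family "cf"), not from the original message.

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical SHC catalog mapping DataFrame columns to HBase cells.
val catalog = s"""{
  |"table":{"namespace":"default", "name":"json_table"},
  |"rowkey":"key",
  |"columns":{
    |"key":{"cf":"rowkey", "col":"key", "type":"string"},
    |"value":{"cf":"cf", "col":"value", "type":"string"}
  |}
}""".stripMargin

// "spark" is the SparkSession built above.
val json_df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
```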

Can anyone help me with a solution or an explanation of why this is happening? Are there any workarounds?