spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: In windows 10, accessing Hive from PySpark with PyCharm throws error
Date Thu, 03 Dec 2020 23:28:03 GMT
This is becoming a serious pain.

Using PowerShell, I am running spark-submit as follows:

PS C:\Users\admin> spark-submit.cmd
C:\Users\admin\PycharmProjects\pythonProject\main.py

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/D:/temp/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor
java.nio.DirectByteBuffer(long,int)

WARNING: Please consider reporting this to the maintainers of
org.apache.spark.unsafe.Platform

WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations

WARNING: All illegal access operations will be denied in a future release

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

20/12/03 23:13:59 INFO SparkContext: Running Spark version 3.0.1

20/12/03 23:13:59 INFO ResourceUtils:
==============================================================

20/12/03 23:13:59 INFO ResourceUtils: Resources for spark.driver:


20/12/03 23:13:59 INFO ResourceUtils:
==============================================================

20/12/03 23:13:59 INFO SparkContext: Submitted application: App1

20/12/03 23:13:59 INFO SecurityManager: Changing view acls to: admin

20/12/03 23:13:59 INFO SecurityManager: Changing modify acls to: admin

20/12/03 23:13:59 INFO SecurityManager: Changing view acls groups to:

20/12/03 23:13:59 INFO SecurityManager: Changing modify acls groups to:

20/12/03 23:13:59 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(admin);
groups with view permissions: Set(); users  with modify permissions:
Set(admin); groups with modify permissions: Set()

20/12/03 23:14:00 INFO Utils: Successfully started service 'sparkDriver' on
port 62327.

20/12/03 23:14:00 INFO SparkEnv: Registering MapOutputTracker

20/12/03 23:14:00 INFO SparkEnv: Registering BlockManagerMaster

20/12/03 23:14:01 INFO BlockManagerMasterEndpoint: Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology
information

20/12/03 23:14:01 INFO BlockManagerMasterEndpoint:
BlockManagerMasterEndpoint up

20/12/03 23:14:01 INFO SparkEnv: Registering BlockManagerMasterHeartbeat

20/12/03 23:14:01 INFO DiskBlockManager: Created local directory at
C:\Users\admin\AppData\Local\Temp\blockmgr-30e2019a-af60-44da-86e7-8a162d1e29da

20/12/03 23:14:01 INFO MemoryStore: MemoryStore started with capacity 434.4
MiB

20/12/03 23:14:01 INFO SparkEnv: Registering OutputCommitCoordinator

20/12/03 23:14:01 INFO Utils: Successfully started service 'SparkUI' on
port 4040.

20/12/03 23:14:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://w7:4040

20/12/03 23:14:01 INFO Executor: Starting executor ID driver on host w7

20/12/03 23:14:01 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 62373.

20/12/03 23:14:01 INFO NettyBlockTransferService: Server created on w7:62373

20/12/03 23:14:01 INFO BlockManager: Using
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication
policy

20/12/03 23:14:01 INFO BlockManagerMaster: Registering BlockManager
BlockManagerId(driver, w7, 62373, None)

20/12/03 23:14:01 INFO BlockManagerMasterEndpoint: Registering block
manager w7:62373 with 434.4 MiB RAM, BlockManagerId(driver, w7, 62373, None)

20/12/03 23:14:01 INFO BlockManagerMaster: Registered BlockManager
BlockManagerId(driver, w7, 62373, None)

20/12/03 23:14:01 INFO BlockManager: Initialized BlockManager:
BlockManagerId(driver, w7, 62373, None)

D:\temp\spark\python\lib\pyspark.zip\pyspark\context.py:225:
DeprecationWarning: Support for Python 2 and Python 3 prior to version 3.6
is deprecated as of Spark 3.0. See also the plan for dropping Python 2
support at
https://spark.apache.org/news/plan-for-dropping-python-2-support.html.

  DeprecationWarning)

*20/12/03 23:14:02 INFO SharedState: loading hive config file:
file:/D:/temp/spark/conf/hive-site.xml*

*20/12/03 23:14:02 INFO SharedState: spark.sql.warehouse.dir is not set,
but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to
the value of hive.metastore.warehouse.dir
('C:\Users\admin\PycharmProjects\pythonProject\spark-warehouse').*

*20/12/03 23:14:02 INFO SharedState: Warehouse path is
'C:\Users\admin\PycharmProjects\pythonProject\spark-warehouse'.*

*20/12/03 23:14:04 INFO HiveConf: Found configuration file
file:/D:/temp/spark/conf/hive-site.xml*

*20/12/03 23:14:04 INFO HiveUtils: Initializing HiveMetastoreConnection
version 2.3.7 using Spark classes.*

*Traceback (most recent call last):*

*  File "C:/Users/admin/PycharmProjects/pythonProject/main.py", line 79, in
<module>*

*    spark.sql("CREATE DATABASE IF NOT EXISTS test")*

*  File "D:\temp\spark\python\lib\pyspark.zip\pyspark\sql\session.py", line
649, in sql*

*  File
"D:\temp\spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line
1305, in __call__*

*  File "D:\temp\spark\python\lib\pyspark.zip\pyspark\sql\utils.py", line
134, in deco*

*  File "<string>", line 3, in raise_from*

*pyspark.sql.utils.AnalysisException: java.lang.UnsatisfiedLinkError:
org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V;*

20/12/03 23:14:04 INFO SparkContext: Invoking stop() from shutdown hook

20/12/03 23:14:04 INFO SparkUI: Stopped Spark web UI at http://w7:4040

20/12/03 23:14:04 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!

20/12/03 23:14:04 INFO MemoryStore: MemoryStore cleared

20/12/03 23:14:04 INFO BlockManager: BlockManager stopped

20/12/03 23:14:04 INFO BlockManagerMaster: BlockManagerMaster stopped

20/12/03 23:14:04 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!

20/12/03 23:14:04 INFO SparkContext: Successfully stopped SparkContext

20/12/03 23:14:04 INFO ShutdownHookManager: Shutdown hook called

20/12/03 23:14:04 INFO ShutdownHookManager: Deleting directory
C:\Users\admin\AppData\Local\Temp\spark-2ccc7f91-3970-42e4-b564-6621215dd446

20/12/03 23:14:04 INFO ShutdownHookManager: Deleting directory
C:\Users\admin\AppData\Local\Temp\spark-8015fc12-eff7-4d2e-b4c3-f864bf4b00ce\pyspark-12b6b74c-09a3-447f-be8b-b5aa26fa274d

20/12/03 23:14:04 INFO ShutdownHookManager: Deleting directory
C:\Users\admin\AppData\Local\Temp\spark-8015fc12-eff7-4d2e-b4c3-f864bf4b00ce


So basically it finds hive-site.xml under the %SPARK_HOME%/conf directory and
tries to initialise the HiveMetastoreConnection, but it fails with the error:


pyspark.sql.utils.AnalysisException: java.lang.UnsatisfiedLinkError:
org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V;


winutils.exe is placed under the %SPARK_HOME%/bin directory:


where winutils.exe

D:\temp\spark\bin\winutils.exe


and permissions have been set with chmod -R 777.
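
As an aside, this particular UnsatisfiedLinkError on NativeIO$Windows usually
means the JVM found the NativeIO class but could not load the Windows native
library (hadoop.dll) that implements it, so winutils.exe on its own may not be
enough. Below is a minimal pre-flight check, as a sketch only: it assumes
HADOOP_HOME points at D:\temp\spark (the directory above where winutils.exe
lives), and that hadoop.dll should sit next to winutils.exe or be reachable via
PATH, which is the usual convention rather than something the log above confirms.

# Sketch of a pre-flight check for the Hadoop native library on Windows.
# Assumption: HADOOP_HOME (or the winutils location D:\temp\spark) is the
# directory whose bin folder should contain both winutils.exe and hadoop.dll;
# the NativeIO$Windows methods live in hadoop.dll, which the JVM resolves
# through java.library.path / PATH.
import os

hadoop_home = os.environ.get("HADOOP_HOME", r"D:\temp\spark")
bin_dir = os.path.join(hadoop_home, "bin")

for native_file in ("winutils.exe", "hadoop.dll"):
    full_path = os.path.join(bin_dir, native_file)
    status = "found" if os.path.exists(full_path) else "MISSING"
    print(f"{full_path}: {status}")

on_path = bin_dir.lower() in os.environ.get("PATH", "").lower()
print(f"{bin_dir} on PATH: {on_path}")

If hadoop.dll turns out to be missing or not on PATH, adding it next to
winutils.exe (or putting %HADOOP_HOME%\bin on PATH) would be the usual first
thing to try.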


Also, this is the hive-site.xml:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>C:\Users\admin\PycharmProjects\pythonProject\hive-localscratchdir</value>
    <description>Local scratch space for Hive jobs</description>
  </property>

  <property>
    <name>hive.exec.scratchdir</name>
    <value>C:\Users\admin\PycharmProjects\pythonProject\hive-scratchdir</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with
    write all (733) permission. For each connecting user, an HDFS scratch dir:
    ${hive.exec.scratchdir}/&lt;username&gt; is created, with
    ${hive.scratch.dir.permission}.</description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>C:\Users\admin\PycharmProjects\pythonProject\spark-warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

  <property>
    <name>spark.sql.warehouse.dir</name>
    <value>C:\Users\admin\PycharmProjects\pythonProject\spark-warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>d:\temp\hive\</value>
    <description>A base for other temporary directories.</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:C:\Users\admin\PycharmProjects\pythonProject\metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

</configuration>
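
Note also that the log above treats spark.sql.warehouse.dir as not set, even
though it appears in hive-site.xml, and falls back to
hive.metastore.warehouse.dir. As a small sanity check, a sketch like the
following (run from the same interpreter) would show which warehouse directory
and catalog implementation the session actually resolved; both property names
are standard Spark configuration keys already mentioned in the log, nothing new.

# Sketch: confirm what the running session resolved from hive-site.xml.
# "spark.sql.catalogImplementation" should read "hive" when Hive support is on.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App1").enableHiveSupport().getOrCreate()

print(spark.conf.get("spark.sql.warehouse.dir"))
print(spark.sparkContext.getConf().get("spark.sql.catalogImplementation", "unknown"))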


LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 2 Dec 2020 at 23:11, Artemis User <artemis@dtechspace.com> wrote:

> Apparently this is an OS dynamic lib link error.  Make sure you have the
> LD_LIBRARY_PATH (on Linux) or PATH (on Windows) set up properly for the right
> .so or .dll file...
> On 12/2/20 5:31 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> I have a simple code that tries to create Hive derby database as follows:
>
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.sql import HiveContext
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.sql.types import StringType, ArrayType
> from pyspark.sql.functions import udf, col, max as max, to_date, date_add, add_months
> from datetime import datetime, timedelta
> import os
> from os.path import join, abspath
> from typing import Optional
> import logging
> import random
> import string
> import math
>
> warehouseLocation = 'c:\\Users\\admin\\PycharmProjects\\pythonProject\\spark-warehouse'
> local_scrtatchdir = 'c:\\Users\\admin\\PycharmProjects\\pythonProject\\hive-localscratchdir'
> scrtatchdir = 'c:\\Users\\admin\\PycharmProjects\\pythonProject\\hive-scratchdir'
> tmp_dir = 'd:\\temp\\hive'
> metastore_db = 'jdbc:derby:C:\\Users\\admin\\PycharmProjects\\pythonProject\\metastore_db;create=true'
> ConnectionDriverName = 'org.apache.derby.EmbeddedDriver'
>
> spark = SparkSession \
>     .builder \
>     .appName("App1") \
>     .config("hive.exec.local.scratchdir", local_scrtatchdir) \
>     .config("hive.exec.scratchdir", scrtatchdir) \
>     .config("spark.sql.warehouse.dir", warehouseLocation) \
>     .config("hadoop.tmp.dir", tmp_dir) \
>     .config("javax.jdo.option.ConnectionURL", metastore_db) \
>     .config("javax.jdo.option.ConnectionDriverName", ConnectionDriverName) \
>     .enableHiveSupport() \
>     .getOrCreate()
>
> print(os.listdir(warehouseLocation))
> print(os.listdir(local_scrtatchdir))
> print(os.listdir(scrtatchdir))
> print(os.listdir(tmp_dir))
>
> sc = SparkContext.getOrCreate()
> sqlContext = SQLContext(sc)
> HiveContext = HiveContext(sc)
> spark.sql("CREATE DATABASE IF NOT EXISTS test")
>
>
> Now this comes back with the following:
>
>
> C:\Users\admin\PycharmProjects\pythonProject\venv\Scripts\python.exe
> C:/Users/admin/PycharmProjects/pythonProject/main.py
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
>
> Setting default log level to "WARN".
>
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> []
>
> []
>
> []
>
> ['hive-localscratchdir', 'hive-scratchdir', 'hive-warehouse']
>
> Traceback (most recent call last):
>
>   File "C:/Users/admin/PycharmProjects/pythonProject/main.py", line 76, in
> <module>
>
>     spark.sql("CREATE DATABASE IF NOT EXISTS test")
>
>   File "D:\temp\spark\python\pyspark\sql\session.py", line 649, in sql
>
>     return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "D:\temp\spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line
> 1305, in __call__
>
>   File "D:\temp\spark\python\pyspark\sql\utils.py", line 134, in deco
>
>     raise_from(converted)
>
>   File "<string>", line 3, in raise_from
>
> *pyspark.sql.utils.AnalysisException: java.lang.UnsatisfiedLinkError:
> org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V;*
>
>
> Process finished with exit code 1
>
> Also, under %SPARK_HOME%/conf I have the hive-site.xml file. It is not
> obvious to me why it is throwing this error.
>
> Thanks
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
