spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: PyCharm, Running spark-submit calling jars and a package at run time
Date Sat, 09 Jan 2021 01:46:19 GMT
Well, I decided to have a go at this.

As I understand it, PyCharm is the glue that holds all these modules together
and resolves dependencies internally, so imports between modules are taken
care of.

When one runs a module alone through the command line in the virtual
environment at the terminal, those dependencies are not resolved, things
don't work, and imports from within modules throw errors. In short,
spark-submit has no knowledge of the module dependencies (please correct me
if this is wrong).

After thinking about it (this is purely a runtime issue with Spark), I
looked around under the %SPARK_HOME% directory (you will have set this up
in the Windows user environment variables).



This is the Spark version that PyCharm/PySpark uses. Go to the jars
directory, C:\spark-3.0.1-bin-hadoop2.7\jars, and put the jar files you
need there (here, the spark-bigquery connector jar).





Then go back to PyCharm and run the module inside PyCharm itself the usual
way (right-click the module). It picks up that jar file and uses it to read
the BigQuery table from PyCharm with Spark.


spark.conf.set("GcpJsonKeyFile",v.jsonKeyFile)
spark.conf.set("BigQueryProjectId",v.projectId)
spark.conf.set("BigQueryDatasetLocation",v.datasetLocation)
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("fs.gs.project.id", v.projectId)
spark.conf.set("fs.gs.impl",
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.conf.set("fs.AbstractFileSystem.gs.impl",
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.conf.set("temporaryGcsBucket", v.tmp_bucket)

sqltext = ""
from pyspark.sql.window import Window

# read data from the Bigquery table in staging area
print("\nreading data from "+v.projectId+":"+v.inputTable)
source_df = spark.read. \
              format("bigquery"). \
              option("dataset", v.sourceDataset). \
              option("table", v.sourceTable). \
              load()
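
A quick sanity check (illustrative, not part of the original script) to
confirm the read before any transformations:

# verify the BigQuery read: schema plus a few rows
source_df.printSchema()
source_df.show(5, truncate=False)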

This is the output (screenshot of the run omitted).
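
An aside: if you would rather not copy jars into %SPARK_HOME%\jars, a
minimal sketch of an alternative is to set spark.jars on the session
builder, assuming the connector jar sits in the project's lib directory
(adjust the path to wherever your copy lives):

from pyspark.sql import SparkSession

# spark.jars must be set before the JVM starts, so it goes on the
# builder rather than via spark.conf.set afterwards
spark = SparkSession.builder \
    .appName("analyze_house_prices_GCP") \
    .config("spark.jars",
            r"C:\Users\admin\PycharmProjects\pythonProject2\DS\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar") \
    .getOrCreate()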


The next step for me is to figure out how to run a package at runtime, i.e.
spark-submit --packages <PACKAGE>
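
For example, a sketch (the Maven coordinates below are my assumption of how
the connector is published, so verify the artifact and version):

spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0 analyze_house_prices_GCP.py

spark-submit then resolves the package and its transitive dependencies from
Maven Central into the local ivy cache, so no jar has to be copied by hand.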


HTH,


Mich

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 8 Jan 2021 at 17:32, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

>
> Just to clarify, are you referring to module dependencies in PySpark?
>
>
> With Scala I can create an uber jar file, inclusive of all bits and
> pieces, built with Maven or sbt, that works in a cluster and can be
> submitted to spark-submit.
>
>
> What alternatives would you suggest for PySpark, a zip file?
>
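> A sketch of the zip route (paths illustrative): put the importable
> packages at the root of a zip and ship it with --py-files, which is
> PySpark's rough analogue of the uber jar; Spark adds the zip to the
> PYTHONPATH on the driver and executors.
>
> cd C:\Users\admin\PycharmProjects\packages
> python -m zipfile -c ..\deps.zip sparkutils
> spark-submit --py-files C:\Users\admin\PycharmProjects\deps.zip analyze_house_prices_GCP.py
>
> The Java side (the BigQuery connector jar) still needs --jars or
> --packages; --py-files only covers the Python modules.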
>
> cheers,
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
>
>
>
>
> On Fri, 8 Jan 2021 at 17:18, Sean Owen <srowen@gmail.com> wrote:
>
>> This isn't going to help submitting to a remote cluster, though. You need
>> to explicitly include dependencies in your submit.
>>
>> On Fri, Jan 8, 2021 at 11:15 AM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi Riccardo
>>>
>>> This is the env variables at runtime
>>>
>>> PYTHONUNBUFFERED=1;PYTHONPATH=C:\Users\admin\PycharmProjects\packages\;C:\Users\admin\PycharmProjects\pythonProject2\DS\;C:\Users\admin\PycharmProjects\pythonProject2\DS\conf\;C:\Users\admin\PycharmProjects\pythonProject2\DS\lib\;C:\Users\admin\PycharmProjects\pythonProject2\DS\src
>>>
>>> This is the run configuration set up for analyze_house_prices_GCP
>>> (screenshot omitted).
>>>
>>>
>>>
>>>
>>> So, as in Linux, I created a Windows environment variable, and in the
>>> PyCharm terminal I can see it:
>>>
>>>
>>>
>>> (venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>echo %PYTHONPATH%
>>> PYTHONPATH=C:\Users\admin\PycharmProjects\packages\;C:\Users\admin\PycharmProjects\pythonProject2\DS\;C:\Users\admin\PycharmProjects\pythonProject2\DS\conf\;C:\Users\admin\PycharmProjects\pythonProject2\DS\lib\;C:\Users\admin\PycharmProjects\pythonProject2\DS\src
>>>
>>> It picks up sparkstuff.py
>>>
>>>
>>> (venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>where sparkstuff.py
>>>
>>> C:\Users\admin\PycharmProjects\packages\sparkutils\sparkstuff.py
>>>
>>> But run via spark-submit, the import within the code is not found:
>>>
>>> (venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit
>>> --jars ..\spark-bigquery-with-dependencies_2.12-0.18.0.jar
>>> analyze_house_prices_GCP.py
>>> Traceback (most recent call last):
>>>   File
>>> "C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py",
>>> line 8, in <module>
>>>     import sparkstuff as s
>>> ModuleNotFoundError: No module named 'sparkutils'
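>>>
>>> A sketch of one way out (untested here): ship the module with the job
>>> via --py-files so the workers can import it regardless of PYTHONPATH:
>>>
>>> spark-submit --jars ..\spark-bigquery-with-dependencies_2.12-0.18.0.jar
>>> --py-files C:\Users\admin\PycharmProjects\packages\sparkutils\sparkstuff.py
>>> analyze_house_prices_GCP.py
>>>
>>> If sparkstuff.py itself imports other modules from the packages tree,
>>> ship the whole tree as a zip instead.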
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 8 Jan 2021 at 16:38, Riccardo Ferrari <ferrarir@gmail.com>
>>> wrote:
>>>
>>>> I think Spark checks the PYTHONPATH environment variable; you need to
>>>> provide that. Of course, that works in local mode only.
>>>>
>>>> On Fri, Jan 8, 2021, 5:28 PM Sean Owen <srowen@gmail.com> wrote:
>>>>
>>>>> I don't see anywhere that you provide 'sparkstuff'. How would the
>>>>> Spark app have this code otherwise?
>>>>>
>>>>> On Fri, Jan 8, 2021 at 10:20 AM Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Thanks Riccardo.
>>>>>>
>>>>>> I am well aware of how submission works.
>>>>>>
>>>>>> However, my question relates to doing submission within PyCharm
>>>>>> itself.
>>>>>>
>>>>>> This is what I do at the PyCharm terminal to invoke the Python module:
>>>>>> spark-submit --jars
>>>>>> ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar \
>>>>>>  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>>>>>> analyze_house_prices_GCP.py
>>>>>>
>>>>>> However, when run from the terminal it does not pick up the import
>>>>>> dependencies in the code!
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File
>>>>>> "C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py",
>>>>>> line 8, in <module>
>>>>>>     import sparkstuff as s
>>>>>> ModuleNotFoundError: No module named 'sparkstuff'
>>>>>>
>>>>>> The Python code is attached; it is pretty simple.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
