spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Structuring a PySpark Application
Date Fri, 02 Jul 2021 09:06:26 GMT
Hi Kartik,

If you run this shell script for multiple spark-submit jobs, you may end up
deleting the virtual environment while another job is still using it. Virtual
environments should not really change much except when packages are added
or updated.

So this script avoids deleting the virtual environment if it has already
been created.

#!/bin/bash
set -e

pyspark_venv="pyspark_venv"
source_code="/home/hduser/dba/bin/python/DSBQ"
if [[ ! -d ${pyspark_venv} ]]
then
        echo `date` ", ===> virtual environment $pyspark_venv does not exist, creating it"
        /usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 -m venv ${pyspark_venv}
else
        echo `date` ", ===> virtual environment $pyspark_venv already exists"
fi

echo `date` ", ===> sourcing virtual environment $pyspark_venv"
[ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
[ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
echo `date` ", ===> creating source zip directory from  ${source_code}"
zip -rq ${source_code}.zip . -i ${source_code}

source ${pyspark_venv}/bin/activate
echo `date` ", ===> Add additional packages as needed from requirements files"
pip install -r requirements.txt
pip install -r requirements_spark.txt
echo `date` ", ===> Create a gz file to be used in spark-submit"
pip install venv-pack
venv-pack -o ${pyspark_venv}.tar.gz

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./${pyspark_venv}/bin/python

echo `date` ", ===> Submitting spark job"
spark-submit \
        --master local[4] \
        --conf "spark.yarn.dist.archives"=${pyspark_venv}.tar.gz#${pyspark_venv} \
        --py-files ${source_code}.zip \
        /home/hduser/dba/bin/python/DSBQ/src/RandomData.py

echo `date` ", ===> Cleaning up files"
[ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
[ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
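
If you want to double check that the job really picks up the packed
environment, you could submit a small stub instead of the real job (this is
only a hypothetical example, not your actual RandomData.py). It prints the
Python interpreter seen by the driver and by each executor task:

#!/usr/bin/env python3
# venv_check.py -- hypothetical stub, adapt the path and name to your project.
# Prints the Python executable used on the driver and on the executors so
# you can confirm the packed virtual environment is the one being used.
import sys

from pyspark.sql import SparkSession


def executor_python(_):
    # runs inside the executor tasks
    import sys as executor_sys
    yield executor_sys.executable


if __name__ == "__main__":
    spark = SparkSession.builder.appName("venv_check").getOrCreate()
    print("driver python  :", sys.executable)
    paths = spark.sparkContext.parallelize(range(4), 4) \
        .mapPartitions(executor_python).collect()
    print("executor python:", set(paths))
    spark.stop()

One thing to bear in mind: as far as I know spark.yarn.dist.archives is only
honoured when the master is YARN. With --master local[4] the job simply runs
against the venv activated above. On Spark 3.1 and later the more general
--archives option (spark.archives) does the same job on other cluster
managers as well.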


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 12:37, Kartik Ohri <kartikohri13@gmail.com> wrote:

> Hi Mich!
>
> The shell script indeed looks more robust now :D
>
> Yes, the current setup works fine. I am wondering whether it is the right
> way to set up things? That is, should I run the program which accepts
> requests from the queue independently and have it invoke spark-submit cli
> or something else?
>
> Thanks again.
>
> Regards
>
> On Thu, Jul 1, 2021 at 4:44 PM Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
>> Hi Kartik,
>>
>> I parameterized your shell script and tested it on a stub python file and it
>> looks OK, making the shell script more robust.
>>
>>
>> #!/bin/bash
>> set -e
>>
>> #cd "$(dirname "${BASH_SOURCE[0]}")/../"
>>
>> pyspark_venv="pyspark_venv"
>> source_zip_file="DSBQ.zip"
>> [ -d ${pyspark_venv} ] && rm -r -d ${pyspark_venv}
>> [ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
>> [ -f ${source_zip_file} ] && rm -r -f ${source_zip_file}
>>
>> python3 -m venv ${pyspark_venv}
>> source ${pyspark_venv}/bin/activate
>> pip install -r requirements_spark.txt
>> pip install venv-pack
>> venv-pack -o ${pyspark_venv}.tar.gz
>>
>> export PYSPARK_DRIVER_PYTHON=python
>> export PYSPARK_PYTHON=./${pyspark_venv}/bin/python
>> spark-submit \
>>         --master local[4] \
>>         --conf "spark.yarn.dist.archives"=${pyspark_venv}.tar.gz#${pyspark_venv} \
>>         /home/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py
>>
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 30 Jun 2021 at 19:21, Kartik Ohri <kartikohri13@gmail.com> wrote:
>>
>>> Hi Mich!
>>>
>>> We use this in production but indeed there is much scope for
>>> improvements, configuration being one of those :).
>>>
>>> Yes, we have a private on-premise cluster. We run Spark on YARN (no
>>> airflow etc.) which controls the scheduling and use HDFS as a datastore.
>>>
>>> Regards
>>>
>>> On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Thanks for the details Kartik.
>>>>
>>>> Let me go through these. The code itself and indentation looks good.
>>>>
>>>> One minor thing I noticed is that you are not using a yaml file
>>>> (config.yml) for your variables and you seem to embed them in your
>>>> config.py code. That is what I used to do before :) until a friend advised me to
>>>> initialise with yaml and read them in the python file. However, I guess that is
>>>> a personal style.
>>>>
>>>> Overall looking neat. I believe you are running all these on-premises
>>>> and not using airflow or composer for your scheduling.
>>>>
>>>>
>>>> Cheers
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri <kartikohri13@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mich!
>>>>>
>>>>> Thanks for the reply.
>>>>>
>>>>> The zip file contains all of the spark related
>>>>> code, particularly contents of this folder
>>>>> <https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>.
>>>>> The requirements_spark.txt
>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt>
>>>>> is contained in the project and it contains the non-spark dependencies of
>>>>> the python code.
>>>>> The tar.gz file is created according to the Pyspark docs
>>>>> <https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv>
>>>>> for dependency management. The spark.yarn.dist.archives also comes from
>>>>> there.
>>>>>
>>>>> This is the python file
>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py>
>>>>> invoked by the spark-submit to start the "RequestConsumer".
>>>>>
>>>>> Regards,
>>>>> Kartik
>>>>>
>>>>>
>>>>> On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Hi Kartik,
>>>>>>
>>>>>> Can you explain how you create your zip file? Does that include all
>>>>>> in your top project directory as per PyCharm etc.
>>>>>>
>>>>>> The rest looks Ok as you are creating a Python Virtual Env
>>>>>>
>>>>>> python3 -m venv pyspark_venv
>>>>>> source pyspark_venv/bin/activate
>>>>>>
>>>>>> How do you create that requirements_spark.txt file?
>>>>>>
>>>>>> pip install -r requirements_spark.txt
>>>>>> pip install venv-pack
>>>>>>
>>>>>>
>>>>>> Where is this gz file used?
>>>>>> venv-pack -o pyspark_venv.tar.gz
>>>>>>
>>>>>> Because I am not clear about the line below
>>>>>>
>>>>>> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>>>>>>
>>>>>> It helps if you walk us through the shell itself for clarification
>>>>>> HTH,
>>>>>>
>>>>>> Mich
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohri13@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all!
>>>>>>>
>>>>>>> I am working on a Pyspark application and would like suggestions on
>>>>>>> how it should be structured.
>>>>>>>
>>>>>>> We have a number of possible jobs, organized in modules. There is
>>>>>>> also a "RequestConsumer
>>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
>>>>>>> class which consumes from a messaging queue. Each message contains the name
>>>>>>> of the job to invoke and the arguments to be passed to it. Messages are put
>>>>>>> into the message queue by cronjobs, manually etc.
>>>>>>>
>>>>>>> We submit a zip file containing all python files to a Spark cluster
>>>>>>> running on YARN and ask it to run the RequestConsumer. This
>>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
>>>>>>> is the exact spark-submit command for the interested. The results of the
>>>>>>> jobs are collected
>>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
>>>>>>> by the request consumer and pushed into another queue.
>>>>>>>
>>>>>>> My question is whether this type of structure makes sense. Should
>>>>>>> the Request Consumer instead run independently of Spark and invoke
>>>>>>> spark-submit scripts when it needs to trigger a job? Or is there another
>>>>>>> recommendation?
>>>>>>>
>>>>>>> Thank you all in advance for taking the time to read this email and
>>>>>>> helping.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Kartik.
>>>>>>>
>>>>>>>
>>>>>>>
