spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Structuring a PySpark Application
Date Wed, 30 Jun 2021 15:32:29 GMT
Hi Kartik,

Can you explain how you create your zip file? Does it include everything in
your top-level project directory, as laid out in PyCharm etc.?
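
For example, is it something roughly along these lines, run from the top of
the repo? (The archive name and the exclude patterns below are just my guess,
not taken from your setup.)

cd /path/to/listenbrainz-server   # hypothetical checkout location
zip -r listenbrainz_spark.zip listenbrainz_spark/ -x "*__pycache__*" "*.pyc"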

The rest looks OK, as you are creating a Python virtual environment:

python3 -m venv pyspark_venv
source pyspark_venv/bin/activate

How do you create that requirements_spark.txt file?

pip install -r requirements_spark.txt
pip install venv-pack
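
Is that requirements file hand-maintained, or frozen from a working
environment, for example:

pip freeze > requirements_spark.txt   # pins the exact versions you tested with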


Where is the tar.gz file produced by this command used?
venv-pack -o pyspark_venv.tar.gz

I ask because I am not clear about the line below:

--conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \

It would help if you could walk us through the shell script itself for
clarification. HTH,

Mich




   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohri13@gmail.com> wrote:

> Hi all!
>
> I am working on a PySpark application and would like suggestions on how it
> should be structured.
>
> We have a number of possible jobs, organized in modules. There is also a "
> RequestConsumer
> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
> class which consumes from a messaging queue. Each message contains the name
> of the job to invoke and the arguments to be passed to it. Messages are put
> into the message queue by cronjobs, manually etc.
>
> We submit a zip file containing all python files to a Spark cluster
> running on YARN and ask it to run the RequestConsumer. This
> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
> is the exact spark-submit command for the interested. The results of the
> jobs are collected
> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
> by the request consumer and pushed into another queue.
>
> My question is whether this type of structure makes sense. Should the
> Request Consumer instead run independently of Spark and invoke spark-submit
> scripts when it needs to trigger a job? Or is there another recommendation?
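>
> To make the second option concrete, I mean something roughly like this per
> message, where run_job.py is a hypothetical thin entry point that looks up
> the requested job module and runs it with the given arguments (the zip name
> is also just illustrative):
>
> # job name and args would come from the queued message
> spark-submit \
>   --master yarn \
>   --py-files listenbrainz_spark.zip \
>   run_job.py --job "$JOB_NAME" --args "$JOB_ARGS"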
>
> Thank you all in advance for taking the time to read this email and
> helping.
>
> Regards,
> Kartik.
>
>
>
