spark-user mailing list archives

From Gourav Sengupta <>
Subject Re: Structuring a PySpark Application
Date Thu, 01 Jul 2021 00:08:13 GMT

I think that reading Bill Chambers and Matei Zaharia's book "Spark: The
Definitive Guide" would be a good starting point.

Gourav Sengupta

On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri <> wrote:

> Hi all!
> I am working on a PySpark application and would like suggestions on how it
> should be structured.
> We have a number of possible jobs, organized into modules. There is also a
> "RequestConsumer" class which consumes from a messaging queue. Each message
> contains the name of the job to invoke and the arguments to pass to it.
> Messages are put into the queue by cronjobs, manually, etc.
> We submit a zip file containing all the Python files to a Spark cluster
> running on YARN and ask it to run the RequestConsumer. The results of the
> jobs are collected by the RequestConsumer and pushed into another queue.
> My question is whether this type of structure makes sense. Should the
> RequestConsumer instead run independently of Spark and invoke spark-submit
> when it needs to trigger a job? Or is there another recommended approach?
> Thank you all in advance for taking the time to read this email and to
> help.
> Regards,
> Kartik.
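
The dispatch pattern Kartik describes — a long-running consumer that maps
message payloads to job functions sharing one Spark application — could be
sketched roughly as below. All names here (the job registry, the message
shape, the `publish` callback) are hypothetical illustrations, not the actual
project code:

```python
import json

# Hypothetical job registry: each job is a function taking a SparkSession
# plus keyword arguments from the message, returning a serializable result.
JOBS = {}

def register(name):
    """Decorator registering a job function under a message name."""
    def wrap(fn):
        JOBS[name] = fn
        return fn
    return wrap

@register("daily_stats")
def daily_stats(spark, day):
    # Placeholder body; a real job would run Spark transformations here.
    return {"day": day, "rows": 0}

def handle_message(spark, raw, publish):
    """Dispatch one queue message of the form {"name": ..., "params": {...}}
    and push the outcome to the result queue via publish()."""
    msg = json.loads(raw)
    job = JOBS.get(msg["name"])
    if job is None:
        publish({"error": f"unknown job {msg['name']!r}"})
        return
    try:
        result = job(spark, **msg.get("params", {}))
        publish({"job": msg["name"], "result": result})
    except Exception as exc:
        # Report the failure instead of letting it kill the consumer loop.
        publish({"job": msg["name"], "error": str(exc)})
```

The trade-off the question raises shows up clearly here: keeping the dispatch
loop inside one long-lived Spark application means every job shares a single
YARN allocation and SparkSession (fast dispatch, coupled resource sizing),
whereas shelling out to spark-submit per message isolates each job's resources
and failures at the cost of per-job startup latency.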
