spark-user mailing list archives

From Christian Pfarr <z0lt...@pm.me.INVALID>
Subject Re: Benchmarks on Spark running on Yarn versus Spark on K8s
Date Mon, 05 Jul 2021 19:51:40 GMT

Does anyone know where the data for this benchmark was stored?

Spark on YARN gets its performance from data locality, via co-locating the YARN NodeManager
and the HDFS DataNode on the same hosts, not from the job scheduler itself, right?
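
A rough sketch of the locality knobs I have in mind (paths and values are made up, just to illustrate the point):

# On YARN, executors land on NodeManagers that also host HDFS DataNodes, so input
# splits can be read NODE_LOCAL; spark.locality.wait controls how long the scheduler
# waits for such a slot before falling back to RACK_LOCAL/ANY.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.locality.wait=3s \
  job.py hdfs:///data/input

# On k8s the executors are pods and the input typically sits in object storage
# (gs:// or s3a://), so there is no NODE_LOCAL read to benefit from in the first place.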

Regards,
z0ltrix

-------- Original Message --------
On 5 July 2021, 21:27, Madaditya .Maddy wrote:

>
>
>
> I came across an article by Datamechanics that benchmarks Spark on k8s vs YARN.
>
>
>
>
> Link : https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn
>
>
>
>
> -Regards
>
> Aditya
>
>
>
>
> On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh <[mich.talebzadeh@gmail.com][mich.talebzadeh_gmail.com]> wrote:
>
>
> > Thanks Yuri. Those are very valid points.
> >
> >
> >
> >
> > Let me clarify my point. Let us assume we run the same job on YARN and then on k8s:
> > spark-submit will target YARN in the first instance and will then be switched to k8s
> > for the same task.
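> >
> > As a minimal sketch of what "the same job on both" might look like (the API server
> > endpoint, container image and script path are placeholders, not a tested recipe):
> >
> > # same application, first on YARN
> > spark-submit --master yarn --deploy-mode cluster \
> >   --num-executors 4 --executor-cores 2 --executor-memory 4g \
> >   testme.py
> >
> > # ... then on k8s; here the script is assumed to be baked into the container image
> > spark-submit --master k8s://https://<k8s-api-server>:443 --deploy-mode cluster \
> >   --conf spark.executor.instances=4 --executor-cores 2 --executor-memory 4g \
> >   --conf spark.kubernetes.container.image=<spark-py-image> \
> >   local:///opt/spark/work-dir/testme.py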
> >
> >
> >
> >
> > 1.  Have there been such benchmarks?
> > 2.  When should I choose PaaS versus k8s, for example for small to medium size jobs?
> > 3.  I can see the flexibility of running Spark on Dataproc, but some may argue that
> >     k8s is the way forward.
> > 4.  Bear in mind that I am only considering Spark. For Kafka and ZooKeeper, for
> >     example, we opt for Docker containers as they perform a single function.
> >
> >
> >
> >
> > Cheers,
> >
> >
> >
> >
> > Mich
> >
> >
> >
> >
> > [view my Linkedin profile][]
> >
> > **Disclaimer:** Use it at your own risk. Any and all responsibility for any loss,
> > damage or destruction of data or any other property which may arise from relying on
> > this email's technical content is explicitly disclaimed. The author will in no case
> > be liable for any monetary damages arising from such loss, damage or destruction.
> >
> >
> >
> >
> >
> >
> >
> > On Mon, 5 Jul 2021 at 19:06, "Yuri Oleynikov (יורי אולייניקוב)" <[yurkao@gmail.com][yurkao_gmail.com]> wrote:
> >
> >
> > > Not a big expert on Spark, but I don't really understand what you are going to
> > > compare, and how. Reading and writing to and from HDFS? How is that related to YARN
> > > and k8s? These are resource managers (YARN: Yet Another Resource Negotiator): they
> > > decide what and how much to allocate, and when (CPU, RAM).
> > >
> > > Local Disk spilling? Depends on disk throughput…
> > >
> > > So what are you going to measure?
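> > >
> > > If the goal is to isolate the cluster manager, one would presumably pin the
> > > resource side down to identical values on both sides; a sketch with made-up
> > > numbers (the same flags apply whether --master is yarn or k8s://...):
> > >
> > > spark-submit ... \
> > >   --conf spark.executor.instances=10 \
> > >   --conf spark.executor.cores=4 \
> > >   --conf spark.executor.memory=8g \
> > >   --conf spark.executor.memoryOverhead=1g \
> > >   benchmark_job.py
> > > # any remaining delta then comes from scheduling, startup and the data path,
> > > # not from different CPU/RAM allocations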
> > >
> > >
> > > Best regards
> > >
> > >
> > >
> > >
> > > > On 5 Jul 2021, at 20:43, Mich Talebzadeh <[mich.talebzadeh@gmail.com][mich.talebzadeh_gmail.com]> wrote:
> > > >
> > > >
> > >
> > > > 
> > > >
> > > >
> > > >
> > > >
> > > > I was curious to know if there are any benchmarks around comparing Spark on YARN
> > > > with Spark on Kubernetes.
> > > >
> > > >
> > > >
> > > >
> > > > This question arose because traditionally in Google Cloud we have been using
> > > > Spark on Dataproc clusters. [Dataproc][Dataproc] provides Spark, Hadoop and other
> > > > optional components for data and analytics processing. It is PaaS.
> > > >
> > > >
> > > >
> > > >
> > > > Now they have GKE clusters as well, and have also introduced [Apache Spark with
> > > > Cloud Dataproc on Kubernetes][], which allows one to submit Spark jobs to k8s
> > > > using a Dataproc cluster as a stub platform, as below, from the Cloud console or
> > > > locally:
> > > >
> > > >
> > > >
> > > >
> > > > gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke" gs://bucket/testme.py --region="europe-west2" --py-files gs://bucket/DSBQ.zip
> > > > Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
> > > > Waiting for job output...
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > At the moment it is a struggle to see what merits using k8s instead of Dataproc,
> > > > bar notebooks etc. Actually, there is not much literature around on PySpark on k8s.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > For me, Spark on bare metal is the preferred option, as I cannot see how one can
> > > > pigeonhole Spark into a container and make it performant, but I may be totally wrong.
> > > >
> > > >
> > > >
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > >
> > > > [view my Linkedin profile][]
> > > >
> > > > **Disclaimer:** Use it at your own risk. Any and all responsibility for any loss,
> > > > damage or destruction of data or any other property which may arise from relying
> > > > on this email's technical content is explicitly disclaimed. The author will in no
> > > > case be liable for any monetary damages arising from such loss, damage or destruction.


[mich.talebzadeh_gmail.com]: mailto:mich.talebzadeh@gmail.com
[view my Linkedin profile]: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
[yurkao_gmail.com]: mailto:yurkao@gmail.com
[Dataproc]: https://cloud.google.com/dataproc
[Apache Spark with Cloud Dataproc on Kubernetes]: https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes