spark-user mailing list archives

From Gourav Sengupta <>
Subject Re: Scala vs Python for ETL with Spark
Date Sat, 17 Oct 2020 19:04:01 GMT

Maybe I am completely wrong, but I think this is what is happening
here. My sincere apologies in case it inadvertently offends anyone.

As Einstein is said to have observed, the quality of the problem or
question you are trying to solve determines the quality of the output. Does
the poor quality and misdirected, confusing content of this thread ring a
bell? Should we not spend time asking the most important thing: "is that
question not completely wrong and totally misleading?"

Should the question not be: "based on what I am trying to do with my data
(the use case), is Scala or Python the right choice?" Has polyglot
programming not been around for years now? Is SPARK 3.x not trying to make
data preparation feed directly into ML algorithms? Are advances in scalable
distributed solutions not already pushing SPARK to be just one of several
background data-processing engines, the way HIVE/PIG/SPARK displaced
bespoke JAVA map-reduce programs 5 years back?

A few things that I would like to ask first:
1. By ETL do you mean just plain SQL for creating datamarts/ data
warehouses/ cubes?
2. With respect to data lakes, and comparing schema-on-read and
schema-on-write should you also not be thinking about ELT as well?
3. Are you using streaming, batch, or mixed processing in your platform?
And what time-to-value are you looking for? All your pipelines cannot
have the same time-to-value requirement, unless you are running just 5
of them.
4. What are the different data formats you have? Do you use some bespoke
binary format from IoT or other devices?
5. Are you principally creating aggregates/ wide tables/ feature stores
targeted at specific ML algorithms?
6. What are the systems that you are trying to store your output to? Are
they databases, elastic caches/stores, object stores, etc? And why?
7. The list goes on and on and on
8. Once again, what is it that you are trying to do with your data? What is
the use case? I would always prefer to start with that question, with
obvious exceptions of course.
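To make questions 2, 5 and 6 concrete, here is a rough sketch of what an
ELT-style job might look like in Scala with Spark: land the raw data first,
impose the schema on read, aggregate towards a wide table / feature store,
and choose a sink. All paths, column names and the object-store target are
hypothetical, purely for illustration; this is not anyone's actual pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()
    import spark.implicits._

    // ELT, schema-on-read: the raw JSON already sits in the lake,
    // and the structure is inferred/imposed only at query time.
    val raw = spark.read.json("s3://my-lake/raw/events/") // hypothetical path

    // A wide-table / feature-store style aggregate (question 5).
    val features = raw
      .groupBy($"device_id")
      .agg(avg($"temperature").as("avg_temp"), count("*").as("n_readings"))

    // Sink choice (question 6): Parquet on an object store here;
    // it could just as well be a database or an elastic store.
    features.write.mode("overwrite").parquet("s3://my-lake/marts/device_features/")
    spark.stop()
  }
}
```

The point is that the shape of this job, not the language, falls out of the
answers to the questions above.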

Once again, I may be completely wrong, but to some this will eventually
make sense, and this answer should be kept in this post.


Gourav Sengupta
(This is an important question that many individuals come across on their
decision path, so I thought I would include my two bits in case it helps
someone)

On Sat, Oct 17, 2020 at 4:57 PM Magnus Nilsson <> wrote:

> Holy war is a bit dramatic, don't you think? 🙂 The difference between
> Scala and Python will always be very relevant when choosing between Spark
> and Pyspark. I wouldn't call it irrelevant to the original question.
> br,
> molotch
> On Sat, 17 Oct 2020 at 16:57, "Yuri Oleynikov (‫יורי אולייניקוב‬‎)"
>> wrote:
>> It seems that the thread has turned into a holy war that has nothing to
>> do with the original question. If so, it's super disappointing.
>> Sent from my iPhone
>> > On 17 Oct 2020, at 15:53, Molotch <> wrote:
>> >
>> > I would say the pros and cons of Python vs Scala come down to Spark,
>> > the languages themselves, and what kind of data engineer you will get
>> > when you try to hire for the different solutions.
>> >
>> > With Pyspark you get less functionality and increased complexity with
>> the
>> > py4j java interop compared to vanilla Spark. Why would you want that?
>> Maybe
>> > you want the Python ML tools and have a clear use case, then go for it.
>> If
>> > not, avoid the increased complexity and reduced functionality of
>> Pyspark.
>> >
>> > Python vs Scala? Idiomatic Python is a lesson in bad programming
>> > habits/ideas, there's no other way to put it. Do you really want
>> > programmers who enjoy coding in such a language hacking away at your
>> > system?
>> >
>> > Scala might be far from perfect with the plethora of ways to express
>> > yourself. But Python < 3.5 is not fit for anything except simple
>> scripting
>> > IMO.
>> >
>> > For doing exploratory data analysis in a Jupyter notebook, Pyspark
>> > seems like a fine idea. For coding an entire ETL library including
>> > state management, the whole kitchen including the sink: Scala every
>> > day of the week.
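One concrete instance of the functionality gap mentioned above is Spark's
typed Dataset API, which is available only on the JVM side and has no
PySpark equivalent. A minimal sketch in Scala; the Reading case class and
the sample values are illustrative only, not from the thread:

```scala
import org.apache.spark.sql.SparkSession

// A compile-time-checked schema: field names and types are verified
// by the Scala compiler, not discovered at runtime.
case class Reading(deviceId: String, temperature: Double)

object TypedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val readings = Seq(Reading("a", 20.5), Reading("b", 31.0)).toDS()

    // Typed filter and map over case-class objects: a field typo here
    // fails at compile time rather than at job runtime.
    val hot = readings.filter(_.temperature > 25.0).map(_.deviceId)
    hot.show()
    spark.stop()
  }
}
```

Whether that compile-time safety is worth the hiring and tooling trade-offs
is exactly the judgment call being debated in this thread.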