spark-user mailing list archives

From Femi Anthony <femib...@gmail.com>
Subject Re: Scala vs Python for ETL with Spark
Date Sat, 17 Oct 2020 21:16:30 GMT
The answer to your question, IMO, is: it depends. I've been using Pyspark
for about 4 years now, and it's served my needs very well for doing ETL with
Spark. As Chairman Mao once said, "let a hundred flowers bloom". I'd rather
be busy getting work done than arguing over whether using Scala makes one a
"purer" developer than using Python does.
I find Python very effective for what I need to get done at work, and very
productive to work in.
Fundamentalism leads to bad results in most cases - except maybe in
mathematics.

My 2 cents.

On Sat, Oct 17, 2020 at 4:29 PM Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hi,
>
> First, apologies to all members. I can assure you that my intention was
> not to start a turf war and get emotions running high. Anyway, as for
> myself, I managed to pick up a lot of useful comments here that enriched
> my understanding of these two popular tools in the service of the
> ubiquitous Spark, and of how to decide which language to deploy depending
> on the circumstances (pros/cons). Perhaps ironically, that is one reason
> we have these forums: no one can experience everything on their own, so
> we exchange ideas and agree to differ if needed :)
>
> Needless to say, every product has its fans and people rarely change
> their minds (based on their personal experience), but it is always good
> to consider the opposing arguments.
>
> In fairness to Python, it is heavily used in the world of workflow
> orchestration, for example with Apache Airflow
> <http://airflow.apache.org/docs/stable/> and its fully managed counterpart
> Cloud Composer <https://cloud.google.com/composer>, which I guess makes it
> a natural consideration for ETL as well.
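
As a rough illustration of that orchestration pattern, below is a minimal
Airflow DAG that submits a PySpark job on a daily schedule. It is only a
sketch: the DAG id, schedule and script path are made up, and the import
shown is the SparkSubmitOperator from Airflow's apache-spark provider
package (Airflow 1.10 installs expose it under airflow.contrib instead).

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="nightly_etl",                    # hypothetical DAG name
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",              # run the ETL once a day
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_spark_etl",
        application="/opt/jobs/etl_job.py",  # hypothetical PySpark script
        conn_id="spark_default",             # Airflow's default Spark connection
    )

Cloud Composer runs essentially the same DAG code, just on managed
infrastructure rather than a self-hosted Airflow deployment.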
>
> Cheers,
>
> Mich
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 17 Oct 2020 at 20:05, Gourav Sengupta <gourav.sengupta@gmail.com>
> wrote:
>
>>
>> Hi,
>>
>> Maybe I am completely wrong, but I think this is what is happening
>> here. My sincere apologies in case it inadvertently offends anyone.
>>
>> As Einstein suggested, the quality of the problem/question you are trying
>> to solve/answer bears directly on the quality of the output. Does the
>> poor quality and misdirected, confusing content of this chain ring a
>> bell? Should we not spend time asking the most important thing: "is that
>> question not completely wrong and totally misleading?"
>>
>> Should the question not be: "based on what I am trying to do with my
>> data (the use case), is Scala or Python the right choice?" Has polyglot
>> programming not been around for years now? Is Spark 3.x not trying to
>> make data preparation feed directly into ML algorithms? Are advances in
>> scalable distributed solutions not already pushing Spark to be just one
>> of the background data processing engines, much as Hive/Pig/Spark
>> displaced bespoke Java map-reduce programs 5 years back?
>>
>> A few things that I would like to ask first:
>> 1. By ETL do you mean just plain SQL for creating data marts/ data
>> warehouses/ cubes?
>> 2. With respect to data lakes, and comparing schema-on-read with
>> schema-on-write, should you not be thinking about ELT as well?
>> 3. Are you using streaming, batch, or mixed processing in your platform
>> (a sketch of the difference follows this list)? And what time to value
>> are you looking for? All your pipelines cannot have the same
>> time-to-value requirement, unless you are running just 5 pipelines.
>> 4. What are the different data formats you have? Do you use some bespoke
>> binary format in IoT, or from other devices?
>> 5. Are you principally creating aggregates/ wide tables/ feature stores
>> targeted at specific ML algorithms?
>> 6. What are the systems that you are trying to store your output to? Are
>> they databases, elastic caches/stores, object stores, etc.? And why?
>> 7. The list goes on and on and on.
>> 8. Once again, what is it that you are trying to do with your data? What
>> is the use case? I would always prefer to start with that question, with
>> obvious exceptions of course.
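
Question 3 above is less about syntax than about how a pipeline runs. As a
rough, hypothetical illustration (the paths, schema source and sink are
made up), the same aggregation can be expressed as a one-shot batch job or
as a continuously updating stream:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch: read what is there now, aggregate once, write once.
orders = spark.read.json("/data/landing/orders/")   # hypothetical path
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))
totals.write.mode("overwrite").parquet("/data/marts/customer_totals/")

# Streaming: the same aggregation, recomputed as new files land.
stream = (spark.readStream
          .schema(orders.schema)        # streaming file sources need a schema
          .json("/data/landing/orders/"))
running = stream.groupBy("customer_id").agg(F.sum("amount").alias("total"))
query = (running.writeStream
         .outputMode("complete")        # emit full updated totals each trigger
         .format("memory")              # in-memory sink, for illustration only
         .queryName("running_totals")
         .start())

The time-to-value question is then which outputs genuinely need the
streaming path, since it brings state management and sink constraints that
a nightly batch job never sees.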
>>
>> Once again, I may be completely wrong, but to some this will eventually
>> make sense, so this answer is worth keeping in this thread.
>>
>>
>> Regards,
>> Gourav Sengupta
>> (This is an important question that several individuals come across on
>> their decision path, so I thought I would include my two bits in case it
>> helps someone.)
>>
>> On Sat, Oct 17, 2020 at 4:57 PM Magnus Nilsson <magnn@kth.se> wrote:
>>
>>> Holy war is a bit dramatic, don't you think? 🙂 The difference between
>>> Scala and Python will always be very relevant when choosing between Spark
>>> and Pyspark. I wouldn't call it irrelevant to the original question.
>>>
>>> br,
>>>
>>> molotch
>>>
>>> On Sat, 17 Oct 2020 at 16:57, "Yuri Oleynikov (‫יורי אולייניקוב‬‎)" <
>>> yurkao@gmail.com> wrote:
>>>
>>>> It seems that this thread has turned into a holy war that has nothing
>>>> to do with the original question. If so, it's super disappointing.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On 17 Oct 2020, at 15:53, Molotch <magnn@kth.se> wrote:
>>>> >
>>>> > I would say the pros and cons of Python vs Scala come down to Spark
>>>> > itself, to the languages themselves, and to what kind of data engineer
>>>> > you will get when you try to hire for each solution.
>>>> >
>>>> > With Pyspark you get less functionality and increased complexity from
>>>> > the py4j Java interop compared to vanilla Spark. Why would you want
>>>> > that? Maybe you want the Python ML tools and have a clear use case;
>>>> > then go for it. If not, avoid the increased complexity and reduced
>>>> > functionality of Pyspark.
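
To make that trade-off concrete, below is a minimal PySpark sketch in which
the input path and column name are made up. The driver talks to the JVM over
py4j either way, but a Python UDF additionally ships row data out to Python
worker processes and back, whereas the equivalent built-in Column function
runs entirely inside the executor JVM.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.read.parquet("/data/events/")   # hypothetical input path

# Python UDF: values are serialized out to Python workers and back.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
with_udf = df.withColumn("country_uc", upper_udf(F.col("country")))

# Built-in function: the same transformation as a Catalyst expression,
# which never leaves the executor JVM.
with_builtin = df.withColumn("country_uc", F.upper(F.col("country")))

Where built-in functions or SQL expressions cover the logic, most of that
interop cost disappears, which is the crux of the complexity argument above.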
>>>> >
>>>> > Python vs Scala? Idiomatic Python is a lesson in bad programming
>>>> > habits/ideas; there's no other way to put it. Do you really want
>>>> > programmers who enjoy coding in such a language hacking away at your
>>>> > system?
>>>> >
>>>> > Scala might be far from perfect, with its plethora of ways to express
>>>> > yourself. But Python < 3.5 is not fit for anything except simple
>>>> > scripting, IMO.
>>>> >
>>>> > For doing exploratory data analysis in a Jupyter notebook, Pyspark
>>>> > seems like a fine idea. For coding an entire ETL library including
>>>> > state management, the whole kitchen including the sink: Scala, every
>>>> > day of the week.
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>> >
>>>>

-- 
http://dataphantik.com

"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
