spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Scala vs Python for ETL with Spark
Date Sat, 10 Oct 2020 20:38:39 GMT
Not quite sure how meaningful this discussion is, but in case someone is
really faced with this query the question still is 'what is the use case'?
I am just a bit confused by the one-size-fits-all deterministic approach
here; I thought those days were over almost 10 years ago.
Regards
Gourav

On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <javadba@gmail.com> wrote:

> I agree with Wim's assessment of data engineering / ETL vs Data Science.
> I wrote pipelines/frameworks for large companies and Scala was a much
> better choice. But for ad-hoc work interfacing directly with data science
> experiments, PySpark presents less friction.
>
> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
>> Many thanks everyone for their valuable contribution.
>>
>> We all started with Spark a few years ago, when Scala was the talk of the
>> town. I agree with the note that as long as Spark stayed niche and elite,
>> someone with Scala knowledge could command a premium. In fairness, in
>> 2014-2015 there was not much talk of Data Science input (I may be wrong).
>> But the world has moved on so to speak. Python itself has been around
>> a long time (long being relative here). Most people either knew UNIX Shell,
>> C, Python or Perl or a combination of all these. I recall we had a director
>> a few years ago who asked our Hadoop admin for root password to log in to
>> the edge node. Later he became head of machine learning somewhere else and
>> he loved C and Python. So Python was a blessing in disguise. I think Python
>> appeals to those who are very familiar with the CLI and shell programming (not
>> GUI fans). As some members alluded to, there are more people around with
>> Python knowledge. Most managers choose Python as the unifying development
>> tool because they feel comfortable with it. Frankly I have not seen a
>> manager who feels at home with Scala. So in summary it is a bit
>> disappointing to abandon Scala and switch to Python just for the sake of it.
>>
>> Disclaimer: These are opinions and not facts so to speak :)
>>
>> Cheers,
>>
>>
>> Mich
>>
>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>>> I have come across occasions when teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> that they are more familiar with Python. The fact that Spark is itself
>>> written in Scala is one reason I think Scala has an edge.
>>>
>>> I have not done a one-to-one comparison of Spark with Scala vs Spark with
>>> Python. I understand that for data science purposes most libraries like
>>> TensorFlow etc. are written in Python, but I am at a loss to understand the
>>> validity of using Python with Spark for ETL purposes.
>>>
>>> This is my understanding, not fact, so I would like to get
>>> some informed views on this if I can.
>>>
>>> Many thanks,
>>>
>>> Mich
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
