spark-user mailing list archives

From Sofia’s World <mmistr...@gmail.com>
Subject Re: Scala vs Python for ETL with Spark
Date Fri, 23 Oct 2020 17:37:14 GMT
Hey
My 2 cents on CI/CD for PySpark: you can use pytest plus Holden Karau's
Spark testing libraries for CI, which gives you `almost` the same
functionality as Scala. I say almost because in Scala you get nice,
descriptive FunSpec-style specs.
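
To make the pytest side concrete, here is a minimal sketch of one common
pattern: keep row-level ETL rules as plain Python functions so CI can test
them without starting Spark. The rule, the alias table and every name below
are hypothetical and only for illustration. The sketch does not use the
Spark testing libraries; those would cover tests that need a real
SparkSession and DataFrame comparisons.

```python
from typing import Optional

# Hypothetical alias table, for illustration only.
_COUNTRY_ALIASES = {
    "uk": "GB",
    "united kingdom": "GB",
    "usa": "US",
    "united states": "US",
}


def normalize_country(raw: Optional[str]) -> Optional[str]:
    """Map a free-text country value to a two-letter code, or None."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if not key:
        return None
    if key in _COUNTRY_ALIASES:
        return _COUNTRY_ALIASES[key]
    # Accept input that already looks like a two-letter code.
    return key.upper() if len(key) == 2 else None


def as_spark_udf():
    """Wrap the rule as a Spark UDF.

    PySpark is imported lazily so the tests below can run on a CI worker
    that has no Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(normalize_country, StringType())


# pytest collects functions named test_*; plain asserts are enough.
def test_known_aliases():
    assert normalize_country(" United Kingdom ") == "GB"
    assert normalize_country("usa") == "US"


def test_two_letter_codes_pass_through():
    assert normalize_country("fr") == "FR"


def test_missing_or_unknown_values():
    assert normalize_country(None) is None
    assert normalize_country("   ") is None
    assert normalize_country("Atlantis") is None
```

Running `pytest` on a file like this in the CI job exercises the rule on
every commit. A job would then call something like
`df.withColumn("country", as_spark_udf()("country"))`, which is an
assumption about how the rule gets wired in, not something from this thread.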

For me the choice is based on expertise. Having worked with teams that are
99% Python, the cost of retraining, or even hiring, is too high, especially
if you have an existing project and aggressive deadlines.
Please feel free to object.
Kind Regards

On Fri, Oct 23, 2020, 1:01 PM William R <rspwilliam@gmail.com> wrote:

> PySpark vs Scala is really a very big discussion. I have a little
> experience with automating CI/CD for a JVM-based language.
> I would like to take this as an opportunity to understand the end-to-end
> CI/CD flow for PySpark-based ETL pipelines.
>
> Could someone please list the steps of how pipeline automation works
> for PySpark-based pipelines in production?
>
> //William
>
> On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <
> wim.vanleuven@highestpoint.biz> wrote:
>
>> I think Sean is right, but in your argument you mention that 'functionality
>> is sacrificed in favour of the availability of resources'. That's where I
>> disagree with you but agree with Sean. That is mostly not true.
>>
>> In your previous posts you also mentioned this. The only reason we
>> sometimes have to bail out to Scala is for performance with certain UDFs.
>>
>> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>>> Thanks for the feedback Sean.
>>>
>>> Kind regards,
>>>
>>> Mich
>>>
>>> On Thu, 22 Oct 2020 at 20:34, Sean Owen <srowen@gmail.com> wrote:
>>>
>>>> I don't find this trolling; I agree with the observation that 'the
>>>> skills you have' are a valid and important determiner of what tools you
>>>> pick.
>>>> I disagree that you just have to pick the optimal tool for everything.
>>>> Sounds good until that comes in contact with the real world.
>>>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>>>> you're doing DataFrame operations. By design. So I can't see there being
>>>> one answer to this.
>>>>
>>>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
>>>> gourav.sengupta@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> this is turning into a troll now, can you please stop this?
>>>>>
>>>>> No one uses Scala where Python should be used, and no one uses Python
>>>>> where Scala should be used - it all depends on requirements. Everyone
>>>>> understands polyglot programming and how to use relevant technologies
>>>>> best to their advantage.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>>
>>>>>
>
> --
> Regards,
> William R
> +919037075164
>
>
>
