spark-user mailing list archives

From Simon Edelhaus <edel...@gmail.com>
Subject Re: Scala Vs Python
Date Sun, 04 Sep 2016 08:17:32 GMT
Any thoughts about Spark and Erlang?


-- ttfn
Simon Edelhaus
California 2016

On Sun, Sep 4, 2016 at 1:00 AM, ayan guha <guha.ayan@gmail.com> wrote:

> Hi
>
> This one is quite interesting. Is it possible to share a few toy examples?
>
> On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson <assaf.mendelson@rsa.com>
> wrote:
>
>> I am not aware of any official testing, but you can easily create your own.
>>
>> In the tests I ran, Python UDFs were more than 10 times slower than
>> Scala UDFs (and in some cases closer to 50 times slower).
>>
>> That said, it would depend on how you use your UDF.
>>
>> For example, let's say you have a 1-billion-row table that you aggregate
>> down to a 10K-row table. If you apply the Python UDF at the beginning, it
>> might take a hard hit, but if you apply it to the 10K-row table, the
>> overhead might be negligible.
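
To make the row-count argument above concrete, here is a minimal, Spark-free
sketch in plain Python. Everything in it is an assumption for illustration:
the toy table (100,000 rows over 10 keys instead of 1 billion rows and 10K
groups), the helper names aggregate and CountingUdf, and the "times 3"
function standing in for a slow Python UDF. It only counts how often the
per-row function runs in each ordering; it does not measure Spark itself.

```python
from collections import defaultdict


def aggregate(rows):
    """Sum amounts per key, similar in spirit to a group-by followed by a sum."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


class CountingUdf:
    """Hypothetical stand-in for a slow per-row UDF; counts its invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, amount):
        self.calls += 1
        return amount * 3  # assumed linear conversion, chosen for illustration


# Toy table: 100,000 (key, amount) rows spread over 10 keys.
rows = [(i % 10, i) for i in range(100_000)]

# Option 1: apply the function to every input row, then aggregate.
early = CountingUdf()
result_early = aggregate((key, early(amount)) for key, amount in rows)

# Option 2: aggregate first, then apply the function to each aggregated row.
late = CountingUdf()
result_late = {key: late(total) for key, total in aggregate(rows).items()}

print("calls when applied before aggregation:", early.calls)
print("calls when applied after aggregation: ", late.calls)
print("same result:", result_early == result_late)
```

The two orderings agree here only because the stand-in function is linear, so
it commutes with the sum. A real UDF may not allow this reordering, in which
case the point is simply to run it on as few rows as the logic permits.
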
>>
>> Furthermore, you can always write the UDF in Scala and wrap it.
>>
>> This is something my team did. We have data scientists working on Spark
>> in Python. Normally, they can use the existing functions to do what they
>> need (Spark already has a pretty nice spread of functions that cover most
>> of the common use cases). When they need a new UDF or UDAF, they simply ask
>> my team (which does the engineering), and we write a Scala one for them and
>> wrap it to be accessible from Python.
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:[hidden email]]
>> *Sent:* Friday, September 02, 2016 12:21 AM
>> *To:* kant kodali
>> *Cc:* Mendelson, Assaf; user
>> *Subject:* Re: Scala Vs Python
>>
>>
>>
>> Thanks All for your replies.
>>
>>
>>
>> Feature Parity:
>>
>>
>>
>> MLlib, RDD and DataFrame features are totally comparable. Streaming is
>> now on par in functionality too, I believe. However, what really worries me
>> is not having Dataset APIs at all in Python. I think that's a deal breaker.
>>
>>
>>
>> Performance:
>>
>> I do get this bit when RDDs are involved, but not when the DataFrame is the
>> only construct I am operating on. DataFrames are supposed to be
>> language-agnostic in terms of performance. So why do people think Python is
>> slower? Is it because of using UDFs? Any other reason?
>>
>>
>>
>> *Is there any kind of benchmarking or stats comparing Python UDFs with
>> Scala UDFs, like the ones out there for RDDs?*
>>
>>
>>
>> @Kant: I am not comparing ANY applications. I am comparing SPARK
>> applications only. I would be glad to hear your opinion on why PySpark
>> applications will not work. If you have any benchmarks, please share them
>> if possible.
>>
>>
>>
>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <[hidden email]> wrote:
>>
>> C'mon man, this is a no-brainer. Dynamically typed languages for large code
>> bases or large-scale distributed systems make absolutely no sense. I can
>> write a 10-page essay on why that wouldn't work so well. You might be
>> wondering why Spark would have it then. Well, probably because of its ease
>> of use for ML (that would be my best guess).
>>
>>
>>
>>
>>
>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson [hidden email] wrote:
>>
>> I believe this would greatly depend on your use case and your familiarity
>> with the languages.
>>
>>
>>
>> In general, Scala would have much better performance than Python, and
>> not all interfaces are available in Python.
>>
>> That said, if you are planning to use DataFrames without any UDFs, then the
>> performance hit is practically nonexistent.
>>
>> Even if you need UDFs, it is possible to write them in Scala and wrap
>> them for Python and still avoid the performance hit.
>>
>> Python does not have interfaces for UDAFs.
>>
>>
>>
>> I believe that if you have large structured data and do not generally
>> need UDFs/UDAFs, you can certainly work in Python without losing too much.
>>
>>
>>
>>
>>
>> *From:* ayan guha [mailto:[hidden email]]
>> *Sent:* Thursday, September 01, 2016 5:03 AM
>> *To:* user
>> *Subject:* Scala Vs Python
>>
>>
>>
>> Hi Users
>>
>>
>>
>> I thought I would ask (again and again) the question: when building a
>> production application, should I use Scala or Python?
>>
>>
>>
>> I have read many, if not most, articles, but they all seem to be pre-Spark 2.
>> Has anything changed with Spark 2, either in favor of Scala or of Python?
>>
>>
>>
>> I am thinking about performance, feature parity and future direction, not
>> so much about skillset or ease of use.
>>
>>
>>
>> Or, if you think it is a moot point, please say so as well.
>>
>>
>>
>> Real-life examples, production experience, anecdotes, personal taste,
>> profanity: all are welcome :)
>>
>>
>>
>> --
>>
>> Best Regards,
>> Ayan Guha
>>
>>
>> ------------------------------
>>
>> View this message in context: RE: Scala Vs Python
>> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>>
>>
>>
>>
>> --
>>
>> Best Regards,
>> Ayan Guha
>>
>> ------------------------------
>> View this message in context: RE: Scala Vs Python
>> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27650.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
