spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anastasios Zouzias <zouz...@gmail.com>
Subject Re: Can spark handle this scenario?
Date Sat, 17 Feb 2018 19:05:18 GMT
Hi Lian,

The remaining problem is:


Spark need all classes used in the fn() serializable for t.rdd.map{ k=>
fn(k) } to work. This could be hard since some classes in third party
libraries are not serializable. This restricts the power of using spark to
parallel an operation on multiple machines. Hope this is clear.


This is not entirely true. You can bypass the serialisation issue in most
cases, see the link below for an example.

https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/

In a nutshell, the non-serialisable code is available to all executors, so
there is no need for Spark to serialise from the driver to the executors.

Best regards,
Anastasios




On Sat, Feb 17, 2018 at 6:13 PM, Lian Jiang <jiangok2006@gmail.com> wrote:

> *Snehasish,*
>
> I got this in spark-shell 2.11.8:
>
> case class My(name:String, age:Int)
>
> import spark.implicits._
>
> val t = List(new My("lian", 20), new My("sh", 3)).toDS
>
> t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass])
>
>
> <console>:31: error: type getClass is not a member of object My
>
>        t.map{ k=> print(My) }(org.apache.spark.sql.
> Encoders.kryo[My.getClass])
>
>
>
> Using RDD can workaround this issue as mentioned in previous emails:
>
>
>  t.rdd.map{ k=> print(k) }
>
>
> *Holden,*
>
>
> The remaining problem is:
>
>
> Spark need all classes used in the fn() serializable for t.rdd.map{ k=>
> fn(k) } to work. This could be hard since some classes in third party
> libraries are not serializable. This restricts the power of using spark to
> parallel an operation on multiple machines. Hope this is clear.
>
>
> On Sat, Feb 17, 2018 at 12:04 AM, SNEHASISH DUTTA <
> info.snehasish@gmail.com> wrote:
>
>> Hi  Lian,
>>
>> This could be the solution
>>
>>
>> case class Symbol(symbol: String, sector: String)
>>
>> case class Tick(symbol: String, sector: String, open: Double, close:
>> Double)
>>
>>
>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>
>>
>>     symbolDs.map { k =>
>>
>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>
>>     }(org.apache.spark.sql.Encoders.kryo[Tick.getClass])
>>
>>
>> Thanks,
>>
>> Snehasish
>>
>>
>> Regards,
>> Snehasish
>>
>> On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau <holden.karau@gmail.com>
>> wrote:
>>
>>> I'm not sure what you mean by it could be hard to serialize complex
>>> operations?
>>>
>>> Regardless I think the question is do you want to parallelize this on
>>> multiple machines or just one?
>>>
>>> On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2006@gmail.com> wrote:
>>>
>>>> Thanks Ayan. RDD may support map better than Dataset/DataFrame.
>>>> However, it could be hard to serialize complex operation for Spark to
>>>> execute in parallel. IMHO, spark does not fit this scenario. Hope this
>>>> makes sense.
>>>>
>>>> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>>
>>>>> ** You do NOT need dataframes, I mean.....
>>>>>
>>>>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.ayan@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Couple of suggestions:
>>>>>>
>>>>>> 1. Do not use Dataset, use Dataframe in this scenario. There is no
>>>>>> benefit of dataset features here. Using Dataframe, you can write
an
>>>>>> arbitrary UDF which can do what you want to do.
>>>>>> 2. In fact you do need dataframes here. You would be better off with
>>>>>> RDD here. just create a RDD of symbols and use map to do the processing.
>>>>>>
>>>>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <
>>>>>> irving.duran@gmail.com> wrote:
>>>>>>
>>>>>>> Do you only want to use Scala? Because otherwise, I think with
>>>>>>> pyspark and pandas read table you should be able to accomplish
what you
>>>>>>> want to accomplish.
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Irving Duran
>>>>>>>
>>>>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a user case:
>>>>>>>
>>>>>>> I want to download S&P500 stock data from Yahoo API in parallel
>>>>>>> using Spark. I have got all stock symbols as a Dataset. Then
I used below
>>>>>>> code to call Yahoo API for each symbol:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> case class Symbol(symbol: String, sector: String)
>>>>>>>
>>>>>>> case class Tick(symbol: String, sector: String, open: Double,
close:
>>>>>>> Double)
>>>>>>>
>>>>>>>
>>>>>>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns
>>>>>>> Dataset[Tick]
>>>>>>>
>>>>>>>
>>>>>>>     symbolDs.map { k =>
>>>>>>>
>>>>>>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>>>>>>
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>> This statement cannot compile:
>>>>>>>
>>>>>>>
>>>>>>> Unable to find encoder for type stored in a Dataset.  Primitive
>>>>>>> types (Int, String, etc) and Product types (case classes) are
supported by
>>>>>>> importing spark.implicits._  Support for serializing other types
>>>>>>> will be added in future releases.
>>>>>>>
>>>>>>>
>>>>>>> My questions are:
>>>>>>>
>>>>>>>
>>>>>>> 1. As you can see, this scenario is not traditional dataset handling
>>>>>>> such as count, sql query... Instead, it is more like a UDF which
apply
>>>>>>> random operation on each record. Is Spark good at handling such
scenario?
>>>>>>>
>>>>>>>
>>>>>>> 2. Regarding the compilation error, any fix? I did not find a
>>>>>>> satisfactory solution online.
>>>>>>>
>>>>>>>
>>>>>>> Thanks for help!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>
>


-- 
-- Anastasios Zouzias
<azo@zurich.ibm.com>

Mime
View raw message