spark-user mailing list archives

From Lian Jiang <jiangok2...@gmail.com>
Subject Re: Can spark handle this scenario?
Date Sat, 17 Feb 2018 07:19:46 GMT
Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, it
could be hard to serialize a complex operation for Spark to execute in
parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.

On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.ayan@gmail.com> wrote:

> ** You do NOT need dataframes, I mean.....
>
> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.ayan@gmail.com> wrote:
>
>> Hi
>>
>> Couple of suggestions:
>>
>> 1. Do not use Dataset, use DataFrame in this scenario. There is no
>> benefit from the Dataset features here. Using a DataFrame, you can write an
>> arbitrary UDF which can do what you want to do.
>> 2. In fact you do need dataframes here. You would be better off with RDD
>> here. Just create an RDD of symbols and use map to do the processing.
>>
>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.duran@gmail.com>
>> wrote:
>>
>>> Do you only want to use Scala? Because otherwise, I think with pyspark
>>> and pandas read table you should be able to accomplish what you want to
>>> accomplish.
>>>
>>> Thank you,
>>>
>>> Irving Duran
>>>
>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>
>>> Hi,
>>>
>>> I have a use case:
>>>
>>> I want to download S&P500 stock data from the Yahoo API in parallel using
>>> Spark. I have all the stock symbols as a Dataset. Then I used the code
>>> below to call the Yahoo API for each symbol:
>>>
>>>
>>>
>>> case class Symbol(symbol: String, sector: String)
>>>
>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>
>>> // symbolDs is Dataset[Symbol]; pullSymbolFromYahoo returns Dataset[Tick]
>>>
>>>     symbolDs.map { k =>
>>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>>     }
>>>
>>>
>>> This statement does not compile:
>>>
>>>
>>> Unable to find encoder for type stored in a Dataset.  Primitive types
>>> (Int, String, etc) and Product types (case classes) are supported by
>>> importing spark.implicits._  Support for serializing other types will
>>> be added in future releases.
>>>
>>>
>>> My questions are:
>>>
>>>
>>> 1. As you can see, this scenario is not traditional dataset handling
>>> such as a count or an SQL query. Instead, it is more like a UDF that
>>> applies an arbitrary operation to each record. Is Spark good at handling
>>> such a scenario?
>>>
>>>
>>> 2. Regarding the compilation error, is there a fix? I did not find a
>>> satisfactory solution online.
>>>
>>>
>>> Thanks for help!
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
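
For reference, the encoder error quoted above most likely arises because the
function passed to map returns a Dataset[Tick], and Spark has no Encoder for
a Dataset nested inside another Dataset. Below is a minimal sketch of the RDD
and flatMap approach suggested in the thread. It assumes the download
function can be rewritten to return a plain Seq[Tick] instead of a Dataset.
The name fetchTicks and its placeholder values are hypothetical stand-ins for
the real Yahoo call, and master("local[*]") is only there for trying it out
on one machine.

```scala
import org.apache.spark.sql.SparkSession

// Same case classes as in the original question, defined at top level
// so that Spark can derive encoders for them.
case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

object PullTicks {
  // Hypothetical stand-in for the Yahoo API call. It returns a local
  // Seq[Tick], not a Dataset, so it can run inside an executor task.
  def fetchTicks(s: Symbol): Seq[Tick] =
    Seq(Tick(s.symbol, s.sector, 0.0, 0.0)) // placeholder values, no network I/O

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pull-ticks")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    val symbols = Seq(Symbol("AAPL", "Technology"), Symbol("XOM", "Energy"))

    // RDD variant: one task per partition calls fetchTicks for its symbols.
    val tickRdd = spark.sparkContext.parallelize(symbols).flatMap(s => fetchTicks(s))

    // Dataset variant: flatMap needs an Encoder[Tick], which
    // spark.implicits._ provides because Tick is a case class.
    val tickDs = symbols.toDS().flatMap(s => fetchTicks(s))

    println(tickRdd.count())
    tickDs.show()
    spark.stop()
  }
}
```

Whether this is a good fit still depends on the concern raised at the top of
this message: everything fetchTicks touches has to be serializable or created
inside the task, and the remote API may throttle parallel requests.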
