spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Can spark handle this scenario?
Date Sat, 17 Feb 2018 17:53:45 GMT
You may want to think about separating the import step from the processing step. It is not
very economical to download all the data again every time you want to calculate something.
So download it first and store it on a distributed file system. Schedule a download of the newest
information every day/hour etc. You can store it in a query-optimized format such as ORC
or Parquet. Then you can run queries over it.
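
For illustration, here is a minimal sketch of that two-step pattern, assuming a SparkSession
named spark and a hypothetical fetchTicks() helper standing in for the actual Yahoo API call
(the HDFS path and the schedule are assumptions as well):

    import org.apache.spark.sql.SparkSession

    object Ingest {
      case class Tick(symbol: String, sector: String, open: Double, close: Double)

      // Placeholder for the real downloader; returns plain case-class rows.
      def fetchTicks(): Seq[Tick] = ???

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ingest").getOrCreate()
        import spark.implicits._

        // Step 1 (run on a daily/hourly schedule): persist the raw data once,
        // in a query-optimized columnar format on the distributed file system.
        spark.createDataset(fetchTicks())
          .write
          .mode("append")
          .partitionBy("symbol")
          .parquet("hdfs:///data/ticks")

        // Step 2 (any time later): query the stored copy instead of re-downloading.
        val stored = spark.read.parquet("hdfs:///data/ticks").as[Tick]
        stored.groupBy("sector").avg("close").show()
      }
    }

The read/query step works unchanged if the files are written as ORC instead of Parquet
(write.orc / read.orc).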

> On 17. Feb 2018, at 01:10, Lian Jiang <jiangok2006@gmail.com> wrote:
> 
> Hi,
> 
> I have a use case:
> 
> I want to download S&P 500 stock data from the Yahoo API in parallel using Spark. I have
> all the stock symbols as a Dataset. Then I used the code below to call the Yahoo API for each symbol:
> 
>        
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
> 
> // symbolDs is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
> 
>     symbolDs.map { k =>
>       pullSymbolFromYahoo(k.symbol, k.sector)
>     }
> 
> This statement cannot compile:
> 
> Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc)
> and Product types (case classes) are supported by importing spark.implicits._  Support for
> serializing other types will be added in future releases.
> 
> 
> My questions are:
> 
> 1. As you can see, this scenario is not traditional dataset handling such as count, SQL
> query... Instead, it is more like a UDF that applies an arbitrary operation to each record. Is Spark
> good at handling such a scenario?
> 
> 2. Regarding the compilation error, any fix? I did not find a satisfactory solution online.
> 
> Thanks for help!
> 
> 
> 
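
As for the compilation error quoted above: the lambda returns a Dataset[Tick], so the map would
produce a Dataset of Datasets, for which no Encoder exists. Below is a minimal sketch of one
possible restructuring, assuming a hypothetical pullTicksFromYahoo() that returns plain
case-class rows, so that flatMap keeps the element type a Product covered by spark.implicits._
(the symbol source file is also an assumption):

    import org.apache.spark.sql.SparkSession

    case class Symbol(symbol: String, sector: String)
    case class Tick(symbol: String, sector: String, open: Double, close: Double)

    object Pull {
      // Hypothetical: one HTTP call per symbol, executed on the executors,
      // returning ordinary objects rather than a nested Dataset.
      def pullTicksFromYahoo(symbol: String, sector: String): Seq[Tick] = ???

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pull").getOrCreate()
        import spark.implicits._

        // Any Dataset[Symbol] will do as the input here.
        val symbolDs = spark.read.json("symbols.json").as[Symbol]

        // flatMap keeps the element type a case class (Tick), so the implicit
        // encoders apply and this compiles to a Dataset[Tick].
        val ticks = symbolDs.flatMap(k => pullTicksFromYahoo(k.symbol, k.sector))
        ticks.show()
      }
    }

If the per-symbol API calls are expensive to set up, mapPartitions can be used instead so that
one HTTP client is shared by all symbols in a partition.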
