** You do NOT need dataframes, I mean.....

On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.ayan@gmail.com> wrote:

Couple of suggestions:

1. Do not use Dataset, use Dataframe in this scenario. There is no benefit of dataset features here. Using Dataframe, you can write an arbitrary UDF which can do what you want to do.
2. In fact you do need dataframes here. You would be better off with RDD here. just create a RDD of symbols and use map to do the processing. 

On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.duran@gmail.com> wrote:

Do you only want to use Scala? Because otherwise, I think with pyspark and pandas read table you should be able to accomplish what you want to accomplish.

Thank you,

Irving Duran
On 02/16/2018 06:10 PM, Lian Jiang wrote:

I have a user case:

I want to download S&P500 stock data from Yahoo API in parallel using Spark. I have got all stock symbols as a Dataset. Then I used below code to call Yahoo API for each symbol:


case class Symbol(symbol: String, sector: String)

case class Tick(symbol: String, sector: String, open: Double, close: Double)

// symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]

    symbolDs.map { k =>

      pullSymbolFromYahoo(k.symbol, k.sector)


This statement cannot compile:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.

My questions are:

1. As you can see, this scenario is not traditional dataset handling such as count, sql query... Instead, it is more like a UDF which apply random operation on each record. Is Spark good at handling such scenario?

2. Regarding the compilation error, any fix? I did not find a satisfactory solution online.

Thanks for help!

Best Regards,
Ayan Guha

Best Regards,
Ayan Guha