spark-user mailing list archives

From Girish Vasmatkar
Subject Use SparkContext in Web Application
Date Mon, 01 Oct 2018 06:48:48 GMT
Hi All

We are very early into our Spark days so the following may sound like a
novice question :) I will try to keep this as short as possible.

We are trying to use Spark to build a recommendation engine that provides
product recommendations, and we need help with some design decisions before
moving forward. Ours is a web application running on Tomcat. So far, I have
created a simple POC (a standalone Java program) that reads a CSV file, fits
an FPGrowth model on the data, and runs transformations. I would like to be
able to do the following:

   - A scheduler runs nightly in Tomcat (it already does today) and reads
   everything from the DB to train/fit the model. The data can grow quite
   large, and we will have new data every day. Should I just use
   SparkContext here, within my scheduler, to fit the model? Is this the
   correct way to go about it? I am also planning to save the model on S3,
   which should be okay. We also thought about using HDFS. The scheduler's
   only job will be to create the model, save it, and be done.
   - On the product page, we can then use the saved model to display the
   product recommendations for a particular product.
   - My understanding is that I should be able to use SparkContext here in
   my web application just to load the saved model and use it to derive the
   recommendations. Is this a good design? The problem I see with this
   approach is that SparkContext takes time to initialize, which may be
   costly. Or should we keep a single SparkContext instance per web
   application? We could initialize it during the application context
   initialization phase.
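
To make the "single instance per web application" option concrete, here is a
minimal sketch in plain Java. It is only an illustration under assumptions,
not something taken from our POC: SharedResource is a hypothetical helper
name, and the Supplier stands in for whatever code would actually build the
SparkContext (or load the saved model). The idea is that the expensive object
is created once, on first use, and every later request reuses it.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper: holds one lazily created instance of an
 * expensive-to-build object for the lifetime of the web application.
 * In our case the factory would be the code that creates the
 * SparkContext; that part is deliberately left out of this sketch.
 */
final class SharedResource<T> {
    private final Supplier<T> factory;
    // volatile so that a fully constructed instance is visible to all threads
    private volatile T instance;

    SharedResource(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on the first call only. */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // Only the first caller pays the initialization cost.
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

In Tomcat, I assume one such holder would be created when the application
context starts and then shared by all request threads, so the startup cost is
paid once rather than per page view.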

Since I am fairly new to Spark, please help me decide whether the way I plan
to use it is the recommended way. I have also seen use cases where Kafka is
used to communicate with Spark, but can we not do this directly using
SparkContext? I am sure a lot of my understanding is wrong, so please feel
free to correct me.

Thanks and Regards,
Girish Vasmatkar
HotWax Systems
