spark-user mailing list archives

From Sachin Naik <sachin.u.n...@gmail.com>
Subject Re: Design patterns for Spark implementation
Date Thu, 08 Dec 2016 19:51:45 GMT
Not sure if you are aware of these....

1) edX/Berkeley/Databricks offers three Spark-related certifications. Might be a good start.

2) A fair understanding of Scala and distributed-collection patterns helps you better appreciate
the internals of Spark. Coursera has three Scala courses. I know there are other language bindings;
the edX course goes into great detail on those.

3) The book Advanced Analytics with Spark.

--sachin

Sent from my iPhone

> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi <pete.figliozzi@gmail.com> wrote:
> 
> Keeping in mind that Spark is a parallel computing engine, it does not change your data
> infrastructure/data architecture.  These days it's relatively convenient to read data from
> a variety of sources (S3, HDFS, Cassandra, ...) and ditto on the output side.
> 
> For example, for one of my use cases, I store tens of gigabytes of time-series data in Cassandra.
> It just so happens I like to analyze all of it at once using Spark, which writes a very nice,
> small text-file table of results that I look at using Python/Pandas, in a Jupyter notebook, on
> a laptop.
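> 
> A minimal sketch of that pipeline (hypothetical keyspace, table, and column names; assumes
> the spark-cassandra-connector is on the classpath and that spark is a SparkSession):
> 
>     import org.apache.spark.sql.functions.{avg, max}
> 
>     // Distributed read of the full time series from Cassandra
>     val ts = spark.read
>       .format("org.apache.spark.sql.cassandra")
>       .options(Map("keyspace" -> "metrics", "table" -> "timeseries"))
>       .load()
> 
>     // Big aggregation in, one small result table out
>     ts.groupBy("sensor_id")
>       .agg(avg("value"), max("value"))
>       .coalesce(1)  // a single small output file, easy to open in Pandas/Jupyter
>       .write.option("header", "true").csv("results/summary")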
> 
> If we didn't have Spark, I'd still be doing the input side (Cassandra) and the output side
> (small text file, ingestible by a laptop) the same way.  The only difference would be that, instead
> of importing and processing in Spark, my fictional group of 5,000 assistants would each download
> a portion of the data into their Excel spreadsheets, then hold a big meeting to produce my
> small text file.
> 
> So my view is that the nature of your data and your specific objectives determine your
> infrastructure and architecture, not the presence or absence of Spark.
> 
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina <vgouraba@gmail.com> wrote:
>> Hi,
>> 
>> I know this is a broad question. If this is not the right forum, I'd appreciate it if you
>> could point me to other sites/areas that may be helpful.
>> 
>> Before posing this question, I did use our friend Google, but sifting the results for my
>> particular needs hasn't been easy.
>> 
>> Who I am: 
>>    - Have done data processing and analytics, but am relatively new to the Spark world
>> 
>> What I am looking for:
>>   - Architecture/design of an ML system using Spark
>>   - In particular, looking for best practices that can support/bridge both Engineering
>>     and Data Science teams
>> 
>> Engineering:
>>    - Build a system that meets typical engineering needs: data processing, scalability,
>>      reliability, availability, fault tolerance, etc.
>>    - System monitoring etc.
>> Data Science:
>>    - Build a system for Data Science team to do data exploration activities
>>    - Develop models using supervised learning and tweak them
>> 
>> Data:
>>   - Batch and incremental updates - mostly structured or semi-structured (some data
>>     from transaction systems, weblogs, clickstream, etc.)
>>   - Streaming in the near term, but not to begin with
>> 
>> Data Storage:
>>   - Data is expected to grow on a daily basis...so the system should be able to support
>>     and handle big data
>>   - Maybe, after further analysis, there will be a possibility/need to archive some of
>>     the data...it all depends on how the ML models are built and how the results are
>>     stored/used going forward
>> 
>> Data Analysis:
>>   - Obvious data-related aspects, such as data cleansing, data transformation, data
>>     partitioning, etc.
>>   - Maybe run models on windows of data, for example the last 1 year, 2 years, etc.
>>     (see the sketch below)
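>> 
>>     A rough sketch of one such window (hypothetical: an events DataFrame with a timestamp
>>     column named ts):
>> 
>>       import org.apache.spark.sql.functions.{col, current_date, date_sub}
>> 
>>       // Keep only the last ~1 year of data before fitting models
>>       val lastYear = events.filter(col("ts") >= date_sub(current_date(), 365))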
>> 
>> ML models:
>>   - Ability to store model versions and previous results (see the sketch below)
>>   - Compare results of different variants of models
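>> 
>>     A minimal sketch of versioned model storage (hypothetical paths and version tag;
>>     fitted Spark ML pipelines support save/load):
>> 
>>       import org.apache.spark.ml.PipelineModel
>> 
>>       val version = "2016-12-08-a"  // hypothetical version tag
>>       fittedPipeline.write.overwrite().save(s"hdfs:///models/churn/$version")
>>       val reloaded = PipelineModel.load(s"hdfs:///models/churn/$version")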
>>  
>> Consumers:
>>   - RESTful web service clients to look at the results
>> 
>> So, the questions I have are:
>> 1) Are there architectural and design patterns I can use, based on industry best practices?
>> In particular:
>>       - data ingestion
>>       - data storage (e.g., whether or not to go with HDFS)
>>       - data partitioning, especially in the Spark world
>>       - running parallel ML models and combining results, etc. (see the sketch below)
>>       - consumption of final results by clients (e.g., by pushing results to Cassandra,
>>         NoSQL DBs, etc.)
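>> 
>>       A rough sketch of running model variants in parallel (hypothetical train/test
>>       DataFrames; a plain Scala parallel collection submits concurrent Spark jobs):
>> 
>>         import org.apache.spark.ml.classification.LogisticRegression
>>         import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>> 
>>         val evaluator = new BinaryClassificationEvaluator()  // default metric: areaUnderROC
>>         val results = List(0.01, 0.1, 1.0).par.map { reg =>
>>           val model = new LogisticRegression().setRegParam(reg).fit(train)
>>           (reg, evaluator.evaluate(model.transform(test)))
>>         }
>>         results.toList.foreach { case (reg, auc) => println(s"regParam=$reg AUC=$auc") }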
>> 
>> Again, I know this is a broad question....Pointers to best practices in some of these
>> areas, if not all, would be highly appreciated. I'm open to purchasing any books that may
>> have relevant information.
>> 
>> Thanks much folks,
>> Vasu.
>> 
> 
