spark-user mailing list archives

From Vasu Gourabathina <vgour...@gmail.com>
Subject Re: Design patterns for Spark implementation
Date Mon, 05 Dec 2016 20:36:11 GMT
Thanks Dr. Mich.

Haven't heard from anyone else. If anyone else wants to share their
opinion, that'd be much appreciated.

Thanks,
Vasu.


On Sun, Dec 4, 2016 at 12:01 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Spark is an important component in many big data design patterns, so it has
> to be considered within the context of the overall big data solution.
>
> From Spark Streaming to Spark SQL as a powerful query engine on top of
> data stores such as Hive, HBase etc., Spark plays a central role.
>
> Have a look at classic Big Data architecture designs such as the Lambda
> Architecture and others, and you will see where Spark fits in.
>
> I attach a typical Lambda Architecture for financial risk, where Spark is
> the central component for the speed layer, in case it helps.
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 December 2016 at 17:32, Pradeep Gaddam <Pradeep.Gaddam@viewglass.com>
> wrote:
>
>> I was hoping someone would answer this question, as it resonates with
>> many developers who are new to Spark and trying to adopt it at work.
>> Regards
>> Pradeep
>>
>> On Dec 3, 2016, at 9:00 AM, Vasu Gourabathina <vgouraba@gmail.com> wrote:
>>
>> Hi,
>>
>> I know this is a broad question. If this is not the right forum, I'd
>> appreciate it if you could point me to other sites/areas that may be helpful.
>>
>> Before posing this question, I did use our friend Google, but filtering
>> the search results down to what I actually need hasn't been easy.
>>
>> Who I am:
>>    - Have done data processing and analytics, but relatively new to the
>> Spark world
>>
>> What I am looking for:
>>   - Architecture/Design of a ML system using Spark
>>   - In particular, looking for best practices that can support/bridge
>> both Engineering and Data Science teams
>>
>> Engineering:
>>    - Build a system that has typical engineering needs, data processing,
>> scalability, reliability, availability, fault-tolerance etc.
>>    - System monitoring etc.
>> Data Science:
>>    - Build a system for Data Science team to do data exploration
>> activities
>>    - Develop models using supervised learning and tweak models
>>
>> Data:
>>   - Batch and incremental updates - mostly structured or semi-structured
>> (some data from transaction systems, weblogs, click stream etc.)
>>   - Streaming, in the near term, but not to begin with
>>
>> Data Storage:
>>   - Data is expected to grow on a daily basis...so, system should be able
>> to support and handle big data
>>   - Maybe, after further analysis, there might be a need to archive some
>> of the data...it all depends on how the ML models are built and how
>> results are stored for future use
>>
>> Data Analysis:
>>   - Obvious data related aspects, such as data cleansing, data
>> transformation, data partitioning etc
>>   - Possibly run models on windows of data, for example the last 1 year,
>> 2 years etc.
>>
>> ML models:
>>   - Ability to store model versions and previous results
>>   - Compare results of different variants of models
>>
>> Consumers:
>>   - RESTful webservice clients to look at the results
>>
>> *So, the questions I have are:*
>> 1) Are there architectural and design patterns that I can use, based on
>> industry best practices? In particular:
>>       - data ingestion
>>       - data storage (e.g. go with HDFS or not)
>>       - data partitioning, especially in Spark world
>>       - running parallel ML models and combining results etc.
>>       - consumption of final results by clients (e.g. by pushing
>> results to Cassandra, NoSQL dbs etc.)
>>
>> Again, I know this is a broad question. Pointers to best practices in
>> some of the areas, if not all, would be highly appreciated. Open to
>> purchasing any books that may have relevant information.
>>
>> Thanks much folks,
>> Vasu.
>>
>>
>>
>> This message and any attachments may contain confidential information of
>> View, Inc. If you are not the intended recipient you are hereby notified
>> that any dissemination, copying or distribution of this message, or files
>> associated with this message, is strictly prohibited. If you have received
>> this message in error, please notify us immediately by replying to the
>> message and delete the message from your computer.
>>
>
>
