spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Design patterns for Spark implementation
Date Sat, 10 Dec 2016 14:06:45 GMT
Hi Sachin,

Using Spark on top of an RDBMS to run complex queries is an interesting
idea, and it will mature as Spark SQL gets closer to ANSI SQL.

There are a number of challenges here:


   1. The application owners prefer to stay on RDBMS
   2. The application backend is based on a primary DB and multiple
   replicates in different geographical locations (US, UK, Singapore). These
   replicate databases provide massive reporting capabilities
   3. They don't particularly want to migrate to Big Data
   4. However, they would like faster performance
   5. The volume of data is too large for an in-memory database (IMDB)
   6. One choice is to use Spark with its MPP capabilities on RDBMS tables
   7. This requires performant JDBC with multi-threading
   8. The calculation is then done in Spark itself

There are a number of hurdles. One is whether the same SQL that runs on the
RDBMS will work equally well in Spark. Ideally, temp tables/views in Spark
can be used, or Spark functional programming with Scala. The performance of
JDBC also matters.
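
As a rough illustration of the parallel JDBC read in points 6 to 8, the
sketch below splits a numeric key range into one WHERE clause per reader, so
that each connection pulls a disjoint slice of the table. It is plain Python
and only approximates the kind of split Spark performs; the function name,
the column name "id" and the bounds are made up for the example.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split an integer key range into WHERE clauses, one per JDBC reader.

    Hypothetical helper for illustration; not Spark's own implementation.
    """
    if num_partitions <= 1 or upper <= lower:
        # Nothing to split: a single reader scans the whole table.
        return ["1=1"]
    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    bound = lower
    for i in range(num_partitions):
        lo = None if i == 0 else bound
        bound += stride
        hi = None if i == num_partitions - 1 else bound
        if lo is None:
            # First slice is open at the bottom and also picks up NULL keys.
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif hi is None:
            # Last slice is open at the top, so rows above `upper` are kept.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates


if __name__ == "__main__":
    for clause in partition_predicates("id", 0, 100, 4):
        print(clause)
```

In PySpark, a list like this could be passed as the predicates argument of
spark.read.jdbc, or the partitionColumn, lowerBound, upperBound and
numPartitions options could be set so that Spark works out a similar split
itself. Neither has been tested against the setup described here.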

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 December 2016 at 12:51, Sachin Naik <sachin.u.naik@gmail.com> wrote:

> Mich:
>
> I have some prior experience creating a custom massively parallel
> loader/extractor using ODBC/JDBC, and I am now starting to peek into Spark
> internals.
>
> I am extremely interested in your findings.
>
> Also feel free to reach out if you need help with the connector logic etc.
>
> --Sachin
>
> Sent from my iPhone
>
> On Dec 8, 2016, at 2:54 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Another use case for Spark is to use its in-memory and parallel processing
> on RDBMS data.
>
> This may sound a bit strange, but you can access your RDBMS table from
> Spark via JDBC with parallel processing and engage the speed of Spark to
> accelerate the queries.
>
> To do this you may need to parallelise your JDBC connection to the RDBMS
> table, and you will need to have a primary key on the table.
>
> I am going to test it to see how performant it is, with a view to offering
> Spark as a fast query engine for RDBMS.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
> On 8 December 2016 at 19:51, Sachin Naik <sachin.u.naik@gmail.com> wrote:
>
>> Not sure if you are aware of these....
>>
>> 1) edX/Berkeley/Databricks has three Spark-related certifications. Might
>> be a good start.
>>
>> 2) A fair understanding of Scala and distributed collection patterns helps
>> to better appreciate the internals of Spark. Coursera has three Scala
>> courses. I know there are other language bindings; the edX course goes into
>> great detail on those.
>>
>> 3) The book Advanced Analytics with Spark.
>>
>> --sachin
>>
>> Sent from my iPhone
>>
>> On Dec 8, 2016, at 11:38 AM, Peter Figliozzi <pete.figliozzi@gmail.com>
>> wrote:
>>
>> Keeping in mind Spark is a parallel computing engine, Spark does not
>> change your data infrastructure/data architecture.  These days it's
>> relatively convenient to read data from a variety of sources (S3, HDFS,
>> Cassandra, ...) and ditto on the output side.
>>
>> For example, for one of my use cases, I store tens of gigabytes of
>> time-series data in Cassandra.  It just so happens I like to analyze all of
>> it at once using Spark, which writes a very nice, small text file table of
>> results I look at using Python/Pandas, in a Jupyter notebook, on a laptop.
>>
>> If we didn't have Spark, I'd still be doing the input side (Cassandra)
>> and output side (small text file, ingestible by a laptop) the same way.
>> The only difference would be, instead of importing and processing in Spark,
>> my fictional group of 5,000 assistants would each download a portion of the
>> data into their Excel spreadsheet, then have a big meeting to produce my
>> small text file.
>>
>> So my view is the nature of your data and specific objectives determine
>> your infrastructure and architecture, not the presence or absence of Spark.
>>
>>
>>
>>
>>
>> On Sat, Dec 3, 2016 at 10:59 AM, Vasu Gourabathina <vgouraba@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I know this is a broad question. If this is not the right forum, I would
>>> appreciate it if you could point me to other sites/areas that may be helpful.
>>>
>>> Before posing this question, I did use our friend Google, but filtering
>>> the search results down to what I actually need hasn't been easy.
>>>
>>> Who I am:
>>>    - Have done data processing and analytics, but relatively new to
>>> Spark world
>>>
>>> What I am looking for:
>>>   - Architecture/Design of a ML system using Spark
>>>   - In particular, looking for best practices that can support/bridge
>>> both Engineering and Data Science teams
>>>
>>> Engineering:
>>>    - Build a system that has typical engineering needs, data processing,
>>> scalability, reliability, availability, fault-tolerance etc.
>>>    - System monitoring etc.
>>> Data Science:
>>>    - Build a system for Data Science team to do data exploration
>>> activities
>>>    - Develop models using supervised learning and tweak models
>>>
>>> Data:
>>>   - Batch and incremental updates - mostly structured or semi-structured
>>> (some data from transaction systems, weblogs, click stream etc.)
>>>   - Streaming, in the near term, but not to begin with
>>>
>>> Data Storage:
>>>   - Data is expected to grow on a daily basis...so, system should be
>>> able to support and handle big data
>>>   - Maybe, after further analysis, there will be a possibility/need to
>>> archive some of the data...it all depends on how the ML models are built
>>> and how results are stored/used in future
>>>
>>> Data Analysis:
>>>   - Obvious data related aspects, such as data cleansing, data
>>> transformation, data partitioning etc
>>>   - Maybe run models on windows of data, for example the last 1 year,
>>> 2 years etc.
>>>
>>> ML models:
>>>   - Ability to store model versions and previous results
>>>   - Compare results of different variants of models
>>>
>>> Consumers:
>>>   - RESTful webservice clients to look at the results
>>>
>>> *So, the questions I have are:*
>>> 1) Are there architectural and design patterns, based on industry best
>>> practices, that I can use? In particular:
>>>       - data ingestion
>>>       - data storage (e.g. go with HDFS or not)
>>>       - data partitioning, especially in Spark world
>>>       - running parallel ML models and combining results etc.
>>>       - consumption of final results by clients (e.g. by pushing
>>> results to Cassandra, NoSQL dbs etc.)
>>>
>>> Again, I know this is a broad question. Pointers to best practices in
>>> some of these areas, if not all, would be highly appreciated. I am open to
>>> purchasing any books that may have relevant information.
>>>
>>> Thanks much folks,
>>> Vasu.
>>>
>>>
>>
>
