spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Design patterns involving Spark
Date Mon, 29 Aug 2016 17:19:47 GMT
Interesting points Ayan.

What real-time dashboards are available for Spark? I don't believe
Zeppelin, QlikView or Tableau can be considered real-time dashboards.

Thanks




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 August 2016 at 14:02, ayan guha <guha.ayan@gmail.com> wrote:

> In addition to what Mich explained, Spark is the sole platform where you
> can extend batch semantics to real-time semantics within the same framework.
> I suggest you watch the latest Spark Summit videos, especially Michael A's
> demo on structured streaming, which shows how powerful such an easy
> transition between semantics can be.
> On 29 Aug 2016 22:20, "Ashok Kumar" <ashok34668@yahoo.com.invalid> wrote:
>
>> Hi,
>>
>> Thank you for your explanations. Very fruitful.
>>
>> Warmest
>>
>>
>> On Monday, 29 August 2016, 0:18, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> In terms of positioning, Spark is really the first Big Data platform to
>> integrate batch, streaming and interactive computations in a unified
>> framework. What this boils down to is that whichever way one looks at it,
>> there is somewhere that Spark can make a contribution. In general, there
>> are a few design patterns common to Big Data:
>>
>>
>>    - *ETL & Batch*
>>
>> The first one is the most common, with established tools like Sqoop and
>> Talend for ETL and HDFS for storage of some kind. Spark can be used as the
>> execution engine for Hive at the storage level, which actually makes it a
>> truly vendor-independent processing engine (by contrast, Impala, Tez and
>> LLAP are offered by specific vendors). Personally I use Spark at the ETL
>> layer by extracting data from sources through plug-ins (JDBC and others)
>> and storing it on HDFS in some format.
>>
>>
>>    - *Batch, real time plus Analytics*
>>
>> In this pattern you have data coming in in real time and you want to
>> query it in real time through a real-time dashboard. HDFS is not ideal
>> for updating data in real time, nor for random access. Sources could be
>> all sorts of web servers, ingested with Flume agents. At the storage
>> layer we are probably looking at something like HBase. The crucial point
>> is that saved data needs to be ready for queries immediately. The
>> dashboard requires HBase APIs. The analytics can be done through Hive,
>> again running on the Spark engine. Note that we ideally should process
>> batch and real time separately.
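>> To make the "immediately queryable" requirement concrete, here is a toy
>> stand-in for an HBase-style keyed store in plain Python (the
>> `sensor-id#timestamp` row-key scheme is just an illustrative convention,
>> not an HBase API):

```python
class RealtimeStore:
    """Toy keyed store: writes are visible to reads immediately,
    with no batch rebuild needed -- the property HBase gives you
    that plain HDFS files do not."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        # A write becomes queryable as soon as it lands
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        # Random access by row key, as a dashboard query would do
        return self._rows.get(row_key, {})

store = RealtimeStore()
store.put("sensor-42#2016-08-29T17:19", "temp", 21.5)
# A dashboard query issued right after the write sees the fresh value
print(store.get("sensor-42#2016-08-29T17:19")["temp"])  # 21.5
```

>> The point of the sketch is the read-your-own-write behaviour; everything
>> else about HBase (regions, column families, versions) is out of scope here.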
>>
>>
>>    - *Real time / Streaming*
>>
>> This is most relevant to Spark as we move to near real time, which is
>> where Spark excels. We need to capture the incoming events (logs, sensor
>> data, pricing, emails) through interfaces like Kafka, message queues
>> etc., and process these events with minimum latency. Again Spark is a
>> very good candidate here with its Spark Streaming and micro-batching
>> capabilities. There are others like Storm, Flink etc. that are
>> event-based, but you don't hear as much about them. For a streaming
>> architecture you also need to sink the data in real time, using something
>> like HBase or Cassandra as the real-time store, with HDFS or Hive as the
>> long-term store.
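>> The micro-batching idea can be illustrated without Spark itself:
>> discretise a stream of timestamped events into fixed-width batches, which
>> is conceptually what Spark Streaming's batch interval does (the event
>> shape and the interval below are made up for the example):

```python
def micro_batches(events, batch_interval):
    """Group timestamped events into fixed-width buckets, mimicking how
    Spark Streaming discretises a stream by its batch interval."""
    batches = {}
    for event in events:
        bucket = event["ts"] // batch_interval
        batches.setdefault(bucket, []).append(event)
    # Emit batches in time order, as the streaming engine would
    return [batches[b] for b in sorted(batches)]

events = [{"ts": t} for t in (0, 1, 2, 4, 5, 8)]
batches = micro_batches(events, batch_interval=3)
print([[e["ts"] for e in b] for b in batches])  # [[0, 1, 2], [4, 5], [8]]
```

>> Each micro-batch is then processed as a small batch job, which is why the
>> same Spark code largely carries over between the batch and streaming
>> patterns. Event-based engines like Storm or Flink process each event
>> individually instead of bucketing them.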
>>
>>             In general there is also the *Lambda Architecture*, which is
>> designed for streaming analytics. The streaming data ends up in both the
>> batch layer and the speed layer. The batch layer is used to answer batch
>> queries, while the speed layer is used to handle fast/real-time queries.
>> This model is really neat because Spark Streaming can feed both the
>> batch layer and the speed layer.
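>> A serving-layer query in the Lambda Architecture answers by merging the
>> precomputed batch view with the incremental speed view. A minimal sketch
>> in plain Python (the view contents and page keys are invented for
>> illustration):

```python
def lambda_query(batch_view, speed_view, key):
    """Combine the (complete but slightly stale) batch view with the
    (fresh but partial) speed view, as the serving layer does."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Batch layer: counts recomputed over the full history up to the last run
batch_view = {"page/home": 1000, "page/docs": 250}
# Speed layer: counts from events that arrived since that batch run
speed_view = {"page/home": 7, "page/pricing": 3}

print(lambda_query(batch_view, speed_view, "page/home"))     # 1007
print(lambda_query(batch_view, speed_view, "page/pricing"))  # 3
```

>> When the next batch run completes, the speed view is reset and its counts
>> are absorbed into the batch view, which is what keeps the two layers
>> consistent over time.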
>>
>> At a high level this looks like this, from http://lambda-architecture.net/
>>
>> [image: Inline images 2]
>>
>>
>>
>>
>>
>> My favourite would be something like below with Spark playing a major role
>>
>>
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>>
>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kumaran@me.com> wrote:
>>
Spark fits best for processing. But depending on the use case, you could
expand the scope of Spark to moving data as well, using the native
connectors. The one thing that Spark is not is storage; connectors are
available for most storage options, though.
>>
>> Regards,
>>
>> Sivakumaran S
>>
>>
>>
On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34668@yahoo.com.invalid> wrote:
>>
>> Hi,
>>
There are design patterns that use Spark extensively. I am new to this
area, so I would appreciate it if someone could explain where Spark fits
in, especially within faster or streaming use cases.

What are the best practices involving Spark? Is it always best to deploy
it as the processing engine?
>>
>> For example when we have a pattern
>>
>> Input Data -> Data in Motion -> Processing -> Storage
>>
Where does Spark best fit in?
>>
>> Thanking you
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>
