spark-user mailing list archives

From Chanh Le <giaosu...@gmail.com>
Subject Re: Design patterns involving Spark
Date Tue, 30 Aug 2016 02:19:28 GMT
Hi everyone,

It seems a lot of people are using Druid for real-time dashboards.
I'm wondering about using Druid as the main storage engine, because Druid can store the raw
data and can also integrate with Spark (in theory).
In that case, do we still need two separate stores: Druid (which keeps its segments in HDFS) and HDFS itself?
BTW, has anyone tried https://github.com/SparklineData/spark-druid-olap?


Regards,
Chanh


> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Thanks Bhaarat and everyone.
> 
> This is an updated version of the same diagram
> 
> <LambdaArchitecture.png>
>
> The frequency of the "Recent data" is defined by the window length in Spark Streaming. It
> can vary from 0.5 seconds to an hour. (I don't think we can push Spark's granularity below
> 0.5 seconds in anger.) For some applications, such as credit card transactions and fraud
> detection, data is stored in real time by Spark in HBase tables. The HBase tables will be on
> HDFS as well. The same Spark Streaming job will write asynchronously to Hive tables on HDFS.
> One school of thought is never to write to Hive from Spark, but to write straight to HBase
> and then read the HBase tables into Hive periodically?
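
A minimal sketch of the micro-batching idea behind that interval, in plain Python rather than the Spark Streaming API. The function name `micro_batches`, the millisecond timestamps and the sample events are all assumptions made for illustration; the point is only that every event lands in exactly one fixed-length interval, so the interval length sets how "recent" the freshest data can be.

```python
from collections import defaultdict

def micro_batches(events, batch_ms):
    """Group (timestamp_ms, payload) events into fixed-length intervals.

    Returns a dict mapping each interval's start time to the payloads
    that arrived in [start, start + batch_ms).
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Integer division snaps the timestamp to the start of its interval.
        batch_start = (ts // batch_ms) * batch_ms
        batches[batch_start].append(payload)
    return dict(sorted(batches.items()))

# Hypothetical events: (arrival time in milliseconds, payload).
events = [(100, "a"), (400, "b"), (600, "c"), (1200, "d")]

# With a 500 ms interval, "a" and "b" share the first batch.
print(micro_batches(events, 500))
```

A shorter interval gives fresher data at the cost of more, smaller batches to schedule.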
> 
> The third component in this design is the Serving Layer, which can combine current data
> (HBase) with historical data (Hive tables) to give the user visual analytics. Those visual
> analytics can be a real-time dashboard on top of the Serving Layer. The Serving Layer could
> be an in-memory NoSQL offering, or data from HBase (the red box) combined with Hive tables.
> 
> I am not aware of any industrial-strength real-time dashboard. The idea is that one
> uses such a dashboard in real time. A dashboard in this sense means a general-purpose API to
> a data store of some type, such as the Serving Layer, that provides visual analytics in real
> time and on demand, combining real-time data and aggregate views. As usual, the devil is in
> the detail.
> 
> 
> 
> Let me know your thoughts. Anyway, this is a first-cut pattern.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
or destruction of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
>  
> 
> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaarat.s@gmail.com> wrote:
> Hi Mich
> 
> This is really helpful. I'm trying to wrap my head around the last diagram you shared
> (the one with Kafka). In this diagram Spark Streaming is pushing data to HDFS and NoSQL. However,
> I'm confused by the "Real Time Queries, Dashboards" annotation. Based on this diagram, will
> real-time queries be running on Spark or HBase?
> 
> PS: My intention was not to steer the conversation away from what Ashok asked but I found
the diagrams shared by Mich very insightful. 
> 
> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> Hi,
> 
> In terms of positioning, Spark is really the first Big Data platform to integrate batch,
> streaming and interactive computations in a unified framework. What this boils down to is
> that, whichever way one looks at it, there is somewhere that Spark can make a contribution.
> In general, there are a few design patterns common to Big Data:
>  
> ETL & Batch
> The first one is the most common, with established tools like Sqoop and Talend for ETL
> and HDFS for storage of some kind. Spark can be used as the execution engine for Hive at the
> storage level, which actually makes it a truly vendor-independent processing engine (BTW,
> Impala, Tez and LLAP are offered by vendors). Personally, I use Spark at the ETL layer,
> extracting data from sources through plug-ins (JDBC and others) and storing it on HDFS in some kind
>  
> Batch, real time plus Analytics
> In this pattern you have data arriving in real time and you want to query it in real time
> through a real-time dashboard. HDFS is not ideal for updating data in real time, nor for
> random access to data. The sources could be all sorts of web servers, which would need a
> Flume agent. At the storage layer we are probably looking at something like HBase, the crucial
> point being that saved data needs to be ready for queries immediately. The dashboards require
> the HBase APIs. The analytics can be done through Hive, again running on the Spark engine.
> Again, note that we should ideally process batch and real time separately.
>  
> Real time / Streaming
> This is the most relevant to Spark as we move towards near real time, which is where Spark
> excels. We need to capture the incoming events (logs, sensor data, pricing, emails) through
> interfaces like Kafka, message queues etc., and process these events with minimum latency.
> Again, Spark is a very good candidate here with its Spark Streaming and micro-batching
> capabilities. There are others like Storm, Flink etc. that are event-based, but you don't hear
> much about them. Again, for a streaming architecture you need to sync data in real time using
> something like HBase, Cassandra (?) and others as the real-time store, or HDFS, Hive etc. as
> the forever storage.
>  
> In general there is also the Lambda Architecture, which is designed for streaming
> analytics. The streaming data ends up in both the batch layer and the speed layer. The batch
> layer is used to answer batch queries, while the speed layer is used to handle fast/real-time
> queries. This model is really cool, as Spark Streaming can feed both the batch layer and the
> speed layer.
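
As a rough illustration of how the batch and speed layers fit together, here is a minimal sketch in plain Python. It does not use Spark, HBase or Hive; the names `aggregate` and `serve` and the sample (card_id, amount) events are assumptions made for the example. The batch view is computed over the full history, the speed view covers only events the batch view has not absorbed yet, and a query is answered by merging the two.

```python
from collections import Counter

def aggregate(events):
    """Sum amounts per key. Both layers run this same logic,
    but over different slices of the data."""
    view = Counter()
    for key, amount in events:
        view[key] += amount
    return view

def serve(key, batch_view, speed_view):
    """Serving layer: merge the batch view with the speed view for one key."""
    # Counter returns 0 for keys it has never seen, so missing keys are safe.
    return batch_view[key] + speed_view[key]

# Hypothetical (card_id, amount) events.
history = [("card-1", 20), ("card-2", 5), ("card-1", 30)]  # already in the batch layer
recent = [("card-1", 7), ("card-3", 12)]                   # not yet batch-processed

batch_view = aggregate(history)  # recomputed periodically over all history
speed_view = aggregate(recent)   # updated continuously, discarded once batch catches up

print(serve("card-1", batch_view, speed_view))
```

The design choice worth noting is that the speed view only has to be correct for the short gap since the last batch run, which is why it can trade completeness for latency.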
>  
> At a high level it looks like this, from http://lambda-architecture.net/
> 
> <image.png>
> 
> 
> 
> 
> 
> My favourite would be something like below with Spark playing a major role
> 
>  
> <LambdaArchitecture.png>
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
>  
> 
> On 28 August 2016 at 19:43, Sivakumaran S <siva.kumaran@me.com> wrote:
> Spark fits best as the processing engine. But depending on the use case, you could expand
> its scope to moving data using the native connectors. The only thing Spark is not is storage.
> Connectors are available for most storage options, though.
> 
> Regards,
> 
> Sivakumaran S
> 
> 
> 
> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34668@yahoo.com.INVALID> wrote:
> 
>> Hi,
>> 
>> There are design patterns that use Spark extensively. I am new to this area, so I
>> would appreciate it if someone could explain where Spark fits in, especially in fast or
>> streaming use cases.
>>
>> What are the best practices involving Spark? Is it always best to deploy it as the
>> processing engine?
>> 
>> For example when we have a pattern 
>> 
>> Input Data -> Data in Motion -> Processing -> Storage 
>> 
>> Where does Spark best fit in?
>> 
>> Thanking you 
> 
> 
> 

