spark-user mailing list archives

From Ashok Kumar <>
Subject Re: Design patterns involving Spark
Date Mon, 29 Aug 2016 12:19:41 GMT
Thank you for your explanations. Very fruitful.

    On Monday, 29 August 2016, 0:18, Mich Talebzadeh <> wrote:

In terms of positioning, Spark is really the first Big Data platform to integrate batch, streaming
and interactive computations in a unified framework. What this boils down to is the fact that,
whichever way one looks at it, there is somewhere Spark can make a contribution. In general,
there are a few design patterns common to Big Data:
   - ETL & Batch
The first one is the most common, with established tools like Sqoop and Talend for ETL and HDFS
for storage of some kind. Spark can be used as the execution engine for Hive at the storage
level, which actually makes it a true vendor-independent processing engine (by contrast, Impala,
Tez and LLAP are offered by vendors). Personally I use Spark at the ETL layer by extracting data
from sources through plug-ins (JDBC and others) and storing it on HDFS in some format.
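The extract-and-store step above can be sketched in plain Python. This is only an illustrative stand-in, not the Spark API: sqlite3 plays the role of the JDBC source and a local directory plays the role of HDFS; in Spark itself this would be a `spark.read.format("jdbc")` load followed by a `df.write` to an HDFS path. The function and table names here are made up.

```python
import csv
import sqlite3
from pathlib import Path

# Stand-in for the Spark JDBC extract-and-store pattern: pull rows from a
# relational source and land them as files in a target directory (HDFS in
# the real deployment). Names are illustrative only.
def extract_and_store(db_path: str, table: str, out_dir: str) -> Path:
    """Extract all rows from a source table and write them as CSV in out_dir."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        out_file = Path(out_dir) / f"{table}.csv"
        with out_file.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(cols)   # header row, like a DataFrame schema
            writer.writerows(cur)   # stream rows from source to target
        return out_file
    finally:
        conn.close()
```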
   - Batch, real time plus Analytics
In this pattern you have data coming in in real time and you want to query it in real time through
a real-time dashboard. HDFS is not ideal for updating data in real time, nor for random
access of data. The source could be all sorts of web servers, which would need a Flume agent. At
the storage layer we are probably looking at something like HBase. The crucial point is
that saved data needs to be ready for queries immediately. The dashboards require the HBase APIs.
The analytics can be done through Hive, again running on the Spark engine. Note again that
we ideally should process batch and real time separately.
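The "saved data is immediately queryable" requirement above can be sketched with a minimal in-memory store. This is purely illustrative, with a dict standing in for an HBase table (row key mapping to column/value pairs); a real dashboard would go through the HBase client APIs, and all names here are made up.

```python
# Minimal sketch of a real-time store with HBase-like access semantics:
# keyed writes that are visible to readers as soon as they return, and
# random access by row key. Illustrative only; not an HBase client.
class RealtimeStore:
    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Write a single cell; visible to readers as soon as put() returns."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Random access by row key, as a dashboard query would do."""
        return self._rows.get(row_key, {})
```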
   - Real time / Streaming
This is most relevant to Spark as we move to near real time, where Spark excels. We need
to capture the incoming events (logs, sensor data, pricing, emails) through interfaces like
Kafka, message queues etc., and process these events with minimum latency. Again Spark
is a very good candidate here with its Spark Streaming and micro-batching capabilities. There
are others like Storm, Flink etc. that are event-based, but you don't hear much about them. For a
streaming architecture you also need to sink data in real time to something like HBase, Cassandra
(?) and others as a real-time store, or to HDFS or Hive etc. for long-term storage.
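The micro-batching idea mentioned above can be sketched in a few lines of plain Python: incoming events are grouped into fixed-size batches and each batch is handed to a processing function as a small job. This is a conceptual stand-in for Spark Streaming's batch interval, not the Spark API; the function names are invented for illustration.

```python
from typing import Callable, Iterable, List

# Conceptual sketch of micro-batching: instead of handling each event
# individually, events are grouped into small batches and each batch is
# processed as one unit, trading a little latency for throughput.
def micro_batch(events: Iterable, batch_size: int,
                process: Callable[[List], None]) -> int:
    """Group events into micro-batches, run `process` on each, return batch count."""
    batch, n_batches = [], 0
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            process(batch)
            n_batches += 1
            batch = []
    if batch:  # flush the final partial batch
        process(batch)
        n_batches += 1
    return n_batches
```

In Spark Streaming the grouping is driven by a time interval rather than a count, but the processing model per batch is the same.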
In general there is also the Lambda Architecture, which is designed for streaming analytics. The
streaming data ends up in both a batch layer and a speed layer. The batch layer is used to answer batch
queries; the speed layer, on the other hand, is used to handle fast/real-time queries. This model
is really cool as Spark Streaming can feed both the batch layer and the speed layer. At
a high level this looks like this, from
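The query path of the Lambda Architecture described above can be sketched simply: a precomputed batch view is merged with a speed-layer view covering events that arrived since the last batch run. This is a toy illustration with made-up names, using word/page counts as the metric.

```python
from collections import Counter

# Lambda Architecture serving sketch: a query is answered by combining the
# batch layer's precomputed view with the speed layer's view of recent,
# not-yet-batched events. Counters stand in for both views.
def merged_view(batch_view: Counter, speed_view: Counter) -> Counter:
    """Combine batch-layer and speed-layer results into one query answer."""
    return batch_view + speed_view
```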

My favourite would be something like below, with Spark playing a major role.
Dr Mich Talebzadeh LinkedIn
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or
destruction of data or any other property which may arise from relying on this email's technical content
is explicitly disclaimed. The author will in no case be liable for any monetary damages arising
from such loss, damage or destruction.
On 28 August 2016 at 19:43, Sivakumaran S <> wrote:

Spark best fits processing. But depending on the use case, you could expand the scope
of Spark to moving data using the native connectors. The only thing that Spark is not is storage.
Connectors are available for most storage options, though.
Sivakumaran S

On 28-Aug-2016, at 6:04 PM, Ashok Kumar <> wrote:

There are design patterns that use Spark extensively. I am new to this area, so I would appreciate
it if someone could explain where Spark fits in, especially within faster or streaming use cases.
What are the best practices involving Spark? Is it always best to deploy it for processing?
For example, when we have a pattern
Input Data -> Data in Motion -> Processing -> Storage
where does Spark best fit in?
Thanking you
