spark-user mailing list archives

From Alonso Isidoro Roman <alons...@gmail.com>
Subject Re: Design patterns involving Spark
Date Tue, 30 Aug 2016 07:33:22 GMT
HBase for real-time queries? HBase was designed with batch in mind.
Impala should be a better choice, but I do not know what Druid can do.


Cheers

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:

> Hi Chanh,
>
> Druid sounds like a good choice.
>
> But again, the question is what else Druid brings on top of HBase.
>
> Unless one decides to use Druid for both historical data and real-time
> data in place of HBase!
>
> Is it easier to write an API against Druid than against HBase? Would you
> still want a UI dashboard?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 03:19, Chanh Le <giaosudau@gmail.com> wrote:
>
>> Hi everyone,
>>
>> It seems a lot of people are using Druid for real-time dashboards.
>> I’m wondering about using Druid as the main storage engine, because Druid
>> can store the raw data and can also integrate with Spark (in theory).
>> In that case, do we need two separate stores, Druid (which stores its
>> segments in HDFS) and HDFS?
>> BTW, did anyone try this one:
>> https://github.com/SparklineData/spark-druid-olap ?
>>
>>
>> Regards,
>> Chanh
>>
>>
>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> Thanks Bhaarat and everyone.
>>
>> This is an updated version of the same diagram
>>
>> <LambdaArchitecture.png>
>>
>> The frequency of the Recent data is defined by the window length in Spark
>> Streaming. It can vary from 0.5 seconds to an hour. (I don't think we can
>> push Spark's granularity below 0.5 seconds in anger.) For some
>> applications, like credit card transactions and fraud detection, data is
>> stored in real time by Spark in HBase tables. The HBase tables will be on
>> HDFS as well. The same Spark Streaming job will write asynchronously to
>> Hive tables on HDFS.
>> One school of thought is to never write to Hive from Spark, but to write
>> straight to HBase and then read the HBase tables into Hive periodically?
>>
>> Now the third component here is the Serving Layer, which can combine data
>> from the current store (HBase) and the historical store (Hive tables) to
>> give the user visual analytics. That visual analytics can be a real-time
>> dashboard on top of the Serving Layer. The Serving Layer could be an
>> in-memory NoSQL offering, or data from HBase (Red Box) combined with Hive
>> tables.
>>
>> I am not aware of any industrial-strength real-time dashboard. The idea is
>> that one uses such a dashboard in real time. A dashboard in this sense
>> means a general-purpose API to a data store of some type, such as the one
>> on the Serving Layer, that provides visual analytics in real time and on
>> demand, combining real-time data and aggregate views. As usual, the devil
>> is in the detail.
>>
>>
>>
>> Let me know your thoughts. Anyway, this is a first-cut pattern.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaarat.s@gmail.com> wrote:
>>
>>> Hi Mich
>>>
>>> This is really helpful. I'm trying to wrap my head around the last
>>> diagram you shared (the one with Kafka). In this diagram Spark Streaming is
>>> pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time
>>> Queries, Dashboards" annotation. Based on this diagram, will real-time
>>> queries be running on Spark or HBase?
>>>
>>> PS: My intention was not to steer the conversation away from what Ashok
>>> asked but I found the diagrams shared by Mich very insightful.
>>>
>>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> In terms of positioning, Spark is really the first Big Data platform to
>>>> integrate batch, streaming and interactive computations in a unified
>>>> framework. What this boils down to is that, whichever way one looks at
>>>> it, there is somewhere that Spark can make a contribution. In general,
>>>> there are a few design patterns common to Big Data:
>>>>
>>>>
>>>>
>>>>    - *ETL & Batch*
>>>>
>>>> The first one is the most common, with established tools like Sqoop and
>>>> Talend for ETL and HDFS for storage of some kind. Spark can be used as the
>>>> execution engine for Hive at the storage level, which actually makes it a
>>>> truly vendor-independent processing engine (BTW, Impala, Tez and LLAP are
>>>> offered by vendors). Personally, I use Spark at the ETL layer, extracting
>>>> data from sources through plug-ins (JDBC and others) and storing it on
>>>> HDFS in some form.
>>>>
>>>>
>>>>
>>>>    - *Batch, real time plus Analytics*
>>>>
>>>> In this pattern you have data arriving in real time and you want to query
>>>> it in real time through a real-time dashboard. HDFS is not ideal for
>>>> updating data in real time, nor for random access to data. The sources
>>>> could be all sorts of web servers, which need Flume agents to collect the
>>>> data. At the storage layer we are probably looking at something like
>>>> HBase. The crucial point is that saved data needs to be ready for queries
>>>> immediately. The dashboards require HBase APIs. The analytics can be done
>>>> through Hive, again running on the Spark engine. Again, note that we
>>>> ideally should process batch and real time separately.
>>>>
>>>>
>>>>
>>>>    - *Real time / Streaming*
>>>>
>>>> This is the most relevant to Spark as we move to near real time, which
>>>> is where Spark excels. We need to capture the incoming events (logs, sensor
>>>> data, pricing, emails) through interfaces like Kafka, message queues etc.,
>>>> and process these events with minimum latency. Again, Spark is a very good
>>>> candidate here with its Spark Streaming and micro-batching capabilities.
>>>> There are others like Storm, Flink etc. that are event based, but you
>>>> don’t hear much about them. Again, for a streaming architecture you need
>>>> to sync data in real time using something like HBase, Cassandra (?) and
>>>> others as the real-time store, or HDFS, Hive etc. as the forever storage.
>>>>
>>>>
>>>> In general there is also the *Lambda Architecture*, which is designed for
>>>> streaming analytics. The streaming data ends up in both the batch layer
>>>> and the speed layer. The batch layer is used to answer batch queries. The
>>>> speed layer, on the other hand, is used to handle fast/real-time queries.
>>>> This model is really cool, as Spark Streaming can feed both the batch
>>>> layer and the speed layer.
>>>>
>>>>
>>>> At a high level this looks like this, from
>>>> http://lambda-architecture.net/
>>>>
>>>> <image.png>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> My favourite would be something like below with Spark playing a major
>>>> role
>>>>
>>>>
>>>> <LambdaArchitecture.png>
>>>>
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kumaran@me.com> wrote:
>>>>
>>>>> Spark fits best for processing. But depending on the use case, you
>>>>> could expand the scope of Spark to moving data using the native connectors.
>>>>> The only thing that Spark is not is storage. Connectors are available for
>>>>> most storage options, though.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sivakumaran S
>>>>>
>>>>>
>>>>>
>>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34668@yahoo.com.INVALID>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There are design patterns that use Spark extensively. I am new to this
>>>>> area, so I would appreciate it if someone could explain where Spark fits
>>>>> in, especially within fast or streaming use cases.
>>>>>
>>>>> What are the best practices involving Spark? Is it always best to
>>>>> deploy it as the processing engine?
>>>>>
>>>>> For example, when we have a pattern
>>>>>
>>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>>
>>>>> where does Spark best fit in?
>>>>>
>>>>> Thanking you
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
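To make the Lambda Architecture discussed in the quoted thread a bit more concrete, here is a minimal sketch in plain Python. It is not Spark, HBase or Hive code: the in-memory structures only stand in for those stores, and every name in it (LambdaPipeline, ingest, run_batch, query, the "card-1" keys) is hypothetical and chosen for illustration. It shows each event feeding both the batch layer and the speed layer, and a serving-layer query that combines the two views.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LambdaPipeline:
    """Toy in-memory stand-in for the batch, speed and serving layers."""

    # Append-only log of all events (plays the role of HDFS/Hive in the thread).
    master_log: list = field(default_factory=list)
    # Aggregates produced by the batch layer (historical view).
    batch_view: Counter = field(default_factory=Counter)
    # Aggregates for events the batch layer has not yet absorbed
    # (plays the role of the HBase real-time store in the thread).
    speed_view: Counter = field(default_factory=Counter)
    # Number of logged events already folded into batch_view.
    batch_offset: int = 0

    def ingest(self, key: str, amount: int) -> None:
        # Every incoming event is written to both layers.
        self.master_log.append((key, amount))
        self.speed_view[key] += amount

    def run_batch(self) -> None:
        # Simplification: fold only the new events into the batch view instead
        # of recomputing it from the whole log, then reset the speed view,
        # because the batch view now covers everything seen so far.
        for key, amount in self.master_log[self.batch_offset:]:
            self.batch_view[key] += amount
        self.batch_offset = len(self.master_log)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        # Serving layer: combine the historical and the recent aggregates.
        return self.batch_view[key] + self.speed_view[key]


pipeline = LambdaPipeline()
pipeline.ingest("card-1", 20)
pipeline.ingest("card-2", 5)
pipeline.run_batch()           # historical view now holds both events
pipeline.ingest("card-1", 7)   # arrives after the batch run
print(pipeline.query("card-1"))  # 27: 20 from the batch view + 7 from the speed view
```

In a real deployment the batch run and the ingest path would be separate jobs, and resetting the speed view would have to be coordinated with the batch job's cut-off point; the sketch ignores that on purpose.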
