spark-user mailing list archives

From Alonso Isidoro Roman <alons...@gmail.com>
Subject Re: Design patterns involving Spark
Date Tue, 30 Aug 2016 08:10:52 GMT
Thanks Mich, I will check it.

Cheers


Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-08-30 9:52 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:

> You can use HBase for building real-time dashboards.
>
> Check this link
> <https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 August 2016 at 08:33, Alonso Isidoro Roman <alonsoir@gmail.com>
> wrote:
>
>> HBase for real-time queries? HBase was designed with batch in mind.
>> Impala should be a better choice, but I do not know what Druid can do.
>>
>>
>> Cheers
>>
>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>>
>>> Hi Chanh,
>>>
>>> Druid sounds like a good choice.
>>>
>>> But again, the question is what else Druid brings on top of HBase,
>>> unless one decides to use Druid for both historical and real-time
>>> data in place of HBase.
>>>
>>> Is it easier to write an API against Druid than against HBase? Would
>>> you still want a UI dashboard?
>>>
>>> Cheers
>>>
>>> On 30 August 2016 at 03:19, Chanh Le <giaosudau@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> It seems a lot of people are using Druid for real-time dashboards.
>>>> I'm wondering about using Druid as the main storage engine, because
>>>> Druid can store the raw data and can also integrate with Spark (in
>>>> theory). In that case, do we need two separate stores, Druid (which
>>>> stores its segments in HDFS) and HDFS itself?
>>>> BTW, has anyone tried this one?
>>>> https://github.com/SparklineData/spark-druid-olap
>>>>
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>>
>>>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Bhaarat and everyone.
>>>>
>>>> This is an updated version of the same diagram
>>>>
>>>> <LambdaArchitecture.png>
>>>>
>>>> The frequency of recent data is defined by the window length in Spark
>>>> Streaming. It can vary from 0.5 seconds to an hour. (I don't think we
>>>> can push Spark granularity below 0.5 seconds in anger.) For some
>>>> applications, like credit card transactions and fraud detection, data
>>>> is stored in real time by Spark in HBase tables. HBase tables will be
>>>> on HDFS as well. The same Spark Streaming job will write
>>>> asynchronously to Hive tables on HDFS.
>>>> One school of thought is never to write to Hive from Spark, but to
>>>> write straight to HBase and then read the HBase tables into Hive
>>>> periodically.
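
As a rough illustration of how a window length turns a continuous stream into discrete units of work, here is a minimal pure-Python sketch. It does not use Spark; the function name `micro_batches`, the `(timestamp, payload)` event shape and the sample events are all assumptions made for the example, and the 0.5 second value is simply the lower bound mentioned above.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp_seconds, payload) events into fixed-length batches.

    Hypothetical helper: it only mimics the idea of micro-batching and is
    not part of any Spark API.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Index of the batch interval that this event falls into.
        batches[int(ts // batch_seconds)].append(payload)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


# Made-up events: (arrival time in seconds, payload).
events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = micro_batches(events, 0.5)
```

Each inner list is what one batch interval would hand to the processing step. A shorter interval lowers latency but produces more, smaller batches.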
>>>>
>>>> The third component in this design is the Serving Layer, which can
>>>> combine data from the current store (HBase) and the historical store
>>>> (Hive tables) to give the user visual analytics. Those visual
>>>> analytics can be a real-time dashboard on top of the Serving Layer.
>>>> The Serving Layer could be an in-memory NoSQL offering, or data from
>>>> HBase (red box) combined with Hive tables.
>>>>
>>>> I am not aware of any industrial-strength real-time dashboard. The
>>>> idea is that one uses such a dashboard in real time. Dashboard in this
>>>> sense means a general-purpose API to a data store of some type, such
>>>> as the Serving Layer, that provides visual analytics in real time and
>>>> on demand, combining real-time data and aggregate views. As usual, the
>>>> devil is in the detail.
>>>>
>>>>
>>>>
>>>> Let me know your thoughts. Anyway, this is a first-cut pattern.
>>>>
>>>> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaarat.s@gmail.com> wrote:
>>>>
>>>>> Hi Mich
>>>>>
>>>>> This is really helpful. I'm trying to wrap my head around the last
>>>>> diagram you shared (the one with Kafka). In this diagram Spark
>>>>> Streaming is pushing data to HDFS and NoSQL. However, I'm confused by
>>>>> the "Real Time Queries, Dashboards" annotation. Based on this
>>>>> diagram, will real-time queries be running on Spark or HBase?
>>>>>
>>>>> PS: My intention was not to steer the conversation away from what
>>>>> Ashok asked but I found the diagrams shared by Mich very insightful.
>>>>>
>>>>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In terms of positioning, Spark is really the first Big Data platform
>>>>>> to integrate batch, streaming and interactive computations in a
>>>>>> unified framework. What this boils down to is that, whichever way
>>>>>> one looks at it, there is somewhere Spark can make a contribution.
>>>>>> In general, there are a few design patterns common to Big Data:
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - *ETL & Batch*
>>>>>>
>>>>>> The first one is the most common, with established tools like Sqoop
>>>>>> and Talend for ETL and HDFS for storage of some kind. Spark can be
>>>>>> used as the execution engine for Hive at the storage level, which
>>>>>> actually makes it a truly vendor-independent processing engine
>>>>>> (BTW, Impala, Tez and LLAP are offered by vendors). Personally, I
>>>>>> use Spark at the ETL layer, extracting data from sources through
>>>>>> plug-ins (JDBC and others) and storing it on HDFS in some form.
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - *Batch, real time plus Analytics*
>>>>>>
>>>>>> In this pattern you have data arriving in real time and you want to
>>>>>> query it in real time through a real-time dashboard. HDFS is not
>>>>>> ideal for updating data in real time, nor for random access to data.
>>>>>> The sources could be all sorts of web servers, each needing a Flume
>>>>>> agent. At the storage layer we are probably looking at something
>>>>>> like HBase. The crucial point is that saved data needs to be ready
>>>>>> for queries immediately. The dashboards require HBase APIs. The
>>>>>> analytics can be done through Hive, again running on the Spark
>>>>>> engine. Note again that we should ideally process batch and real
>>>>>> time separately.
>>>>>>
>>>>>>
>>>>>>
>>>>>>    - *Real time / Streaming*
>>>>>>
>>>>>> This is the pattern most relevant to Spark, as we are moving to near
>>>>>> real time, where Spark excels. We need to capture the incoming
>>>>>> events (logs, sensor data, pricing, emails) through interfaces like
>>>>>> Kafka, message queues etc., and process these events with minimum
>>>>>> latency. Again, Spark is a very good candidate here with its Spark
>>>>>> Streaming and micro-batching capabilities. There are others like
>>>>>> Storm, Flink etc. that are event based, but you don't hear much
>>>>>> about them. Again, for a streaming architecture you need to sync
>>>>>> data in real time, using something like HBase, Cassandra (?) and
>>>>>> others as the real-time store, or HDFS, Hive etc. as the forever
>>>>>> storage.
>>>>>>
>>>>>>
>>>>>> In general there is also the *Lambda Architecture*, which is
>>>>>> designed for streaming analytics. The streaming data ends up in both
>>>>>> the batch layer and the speed layer. The batch layer is used to
>>>>>> answer batch queries. The speed layer, on the other hand, is used to
>>>>>> handle fast/real-time queries. This model is really cool, as Spark
>>>>>> Streaming can feed both the batch layer and the speed layer.
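
To make the two-layer flow concrete, here is a minimal in-memory sketch in plain Python. It is a toy model rather than anything from Spark, HBase or Hive: the class `LambdaPipeline`, its method names and the per-key counting are all hypothetical, chosen only to show one stream feeding both layers and a query combining them.

```python
class LambdaPipeline:
    """Toy Lambda Architecture: one stream feeds a batch and a speed layer."""

    def __init__(self):
        self.master = []      # batch layer: append-only raw events
        self.batch_view = {}  # counts recomputed from the master data
        self.speed_view = {}  # incremental counts since the last batch run

    def ingest(self, key):
        # Every incoming event goes to both layers.
        self.master.append(key)
        self.speed_view[key] = self.speed_view.get(key, 0) + 1

    def run_batch(self):
        # Recompute the batch view from scratch over all raw events,
        # then drop the speed view because the batch view now covers it.
        view = {}
        for key in self.master:
            view[key] = view.get(key, 0) + 1
        self.batch_view = view
        self.speed_view = {}

    def query(self, key):
        # Serving step: historical count plus not-yet-batched recent count.
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)


pipeline = LambdaPipeline()
for key in ["a", "a", "b"]:
    pipeline.ingest(key)
pipeline.run_batch()
pipeline.ingest("a")  # arrives after the batch run, so only the speed layer has it
```

After the last `ingest`, `pipeline.query("a")` combines the two events already in the batch view with the one event still held only in the speed view.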
>>>>>>
>>>>>>
>>>>>> At a high level this looks like this, from
>>>>>> http://lambda-architecture.net/
>>>>>>
>>>>>> <image.png>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> My favourite would be something like below with Spark playing a major
>>>>>> role
>>>>>>
>>>>>>
>>>>>> <LambdaArchitecture.png>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kumaran@me.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Spark fits best for processing. But depending on the use case, you
>>>>>>> could expand the scope of Spark to moving data, using the native
>>>>>>> connectors. The only thing Spark is not is storage. Connectors are
>>>>>>> available for most storage options, though.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Sivakumaran S
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar
>>>>>>> <ashok34668@yahoo.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> There are design patterns that use Spark extensively. I am new to
>>>>>>> this area, so I would appreciate it if someone could explain where
>>>>>>> Spark fits in, especially within fast or streaming use cases.
>>>>>>>
>>>>>>> What are the best practices involving Spark? Is it always best to
>>>>>>> deploy it as the processing engine?
>>>>>>>
>>>>>>> For example, when we have a pattern
>>>>>>>
>>>>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>>>>
>>>>>>> where does Spark best fit in?
>>>>>>>
>>>>>>> Thanking you
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
