spark-user mailing list archives

From Todd Nist <tsind...@gmail.com>
Subject Re: Design patterns involving Spark
Date Tue, 30 Aug 2016 12:16:53 GMT
Have not tried this, but looks quite useful if one is using Druid:

https://github.com/implydata/pivot  - An interactive data exploration UI
for Druid

On Tue, Aug 30, 2016 at 4:10 AM, Alonso Isidoro Roman <alonsoir@gmail.com>
wrote:

> Thanks Mich, I will check it.
>
> Cheers
>
>
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman
>
> 2016-08-30 9:52 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>
>> You can use HBase for building real-time dashboards.
>>
>> Check this link
>> <https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 30 August 2016 at 08:33, Alonso Isidoro Roman <alonsoir@gmail.com>
>> wrote:
>>
>>> HBase for real-time queries? HBase was designed with batch in mind.
>>> Impala should be a better choice, but I do not know what Druid can do.
>>>
>>>
>>> Cheers
>>>
>>> Alonso Isidoro Roman
>>>
>>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>>>
>>>> Hi Chanh,
>>>>
>>>> Druid sounds like a good choice.
>>>>
>>>> But again, the question is what else Druid brings on top of
>>>> HBase.
>>>>
>>>> Unless one decides to use Druid for both historical data and real-time
>>>> data in place of HBase!
>>>>
>>>> Is it easier to write an API against Druid than HBase? Do you still
>>>> want a UI dashboard?
>>>>
>>>> Cheers
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> On 30 August 2016 at 03:19, Chanh Le <giaosudau@gmail.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> It seems a lot of people are using Druid for real-time dashboards.
>>>>> I’m just wondering about using Druid as the main storage engine, because
>>>>> Druid can store the raw data and can also integrate with Spark
>>>>> (in theory).
>>>>> In that case, do we need two separate stores, Druid (with segments
>>>>> stored in HDFS) and HDFS?
>>>>> BTW, did anyone try this one:
>>>>> https://github.com/SparklineData/spark-druid-olap ?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Chanh
>>>>>
>>>>>
>>>>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> Thanks Bhaarat and everyone.
>>>>>
>>>>> This is an updated version of the same diagram
>>>>>
>>>>> <LambdaArchitecture.png>
>>>>> The frequency of Recent data is defined by the window length in Spark
>>>>> Streaming. It can vary from 0.5 seconds to an hour. (I don't think we
>>>>> can push Spark granularity below 0.5 seconds in anger.) For some
>>>>> applications, like credit card transactions and fraud detection, data
>>>>> is stored in real time by Spark in HBase tables. The HBase tables will
>>>>> be on HDFS as well. The same Spark Streaming job will write
>>>>> asynchronously to Hive tables on HDFS.
>>>>> One school of thought is to never write to Hive from Spark: write
>>>>> straight to HBase and then read the HBase tables into Hive periodically.
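
The windowing and dual-write flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use Spark, HBase or Hive, and `micro_batches`, `RealTimeStore`, `HistoryStore` and `process` are hypothetical names standing in for a Spark Streaming job, an HBase table and a Hive table. Writes here are sequential rather than asynchronous.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, key, value) events into fixed-length windows.

    Returns {window_start: [events...]} ordered by window start.
    """
    batches = defaultdict(list)
    for ts, key, value in events:
        window_start = int(ts // window_seconds) * window_seconds
        batches[window_start].append((ts, key, value))
    return dict(sorted(batches.items()))


class RealTimeStore:
    """Hypothetical stand-in for an HBase table: latest value per row key."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value


class HistoryStore:
    """Hypothetical stand-in for an append-only Hive table on HDFS."""

    def __init__(self):
        self.records = []

    def append(self, batch):
        self.records.extend(batch)


def process(events, window_seconds, realtime, history):
    """Write each micro-batch to both stores, one window at a time."""
    for batch in micro_batches(events, window_seconds).values():
        for _ts, key, value in batch:
            realtime.put(key, value)  # overwrite: last event seen wins
        history.append(batch)         # keep every event for later analysis
```

With a window of 0.5 seconds, events stamped 0.1 and 0.4 land in the same batch, while an event stamped 0.7 starts the next one.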
>>>>>
>>>>> Now the third component in this layer is the Serving Layer, which can
>>>>> combine data from the current (HBase) and the historical (Hive tables)
>>>>> stores to give the user visual analytics. Those visual analytics can be
>>>>> a real-time dashboard on top of the Serving Layer. The Serving Layer
>>>>> could be an in-memory NoSQL offering, or data from HBase (red box)
>>>>> combined with Hive tables.
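
A minimal sketch of what such a Serving Layer does, again in plain Python and not tied to any real HBase or Hive API. The function names, the `(timestamp, key, amount)` record shape and the `cutoff` parameter are all assumptions made for illustration: the historical view covers everything before the cutoff, the current view covers everything from the cutoff onward, and the serving query adds the two so that nothing is counted twice.

```python
def aggregate(records, keep):
    """Sum amounts per key over the records selected by keep(timestamp)."""
    totals = {}
    for ts, key, amount in records:
        if keep(ts):
            totals[key] = totals.get(key, 0) + amount
    return totals


def serve(history, recent, cutoff):
    """Combine a historical view and a current view into one answer.

    history: records from the batch store (stand-in for Hive tables)
    recent:  records from the real-time store (stand-in for HBase)
    cutoff:  timestamp that separates the two views
    """
    batch_view = aggregate(history, lambda ts: ts < cutoff)
    realtime_view = aggregate(recent, lambda ts: ts >= cutoff)
    merged = dict(batch_view)
    for key, amount in realtime_view.items():
        merged[key] = merged.get(key, 0) + amount
    return merged
```

A dashboard would call something like `serve(history, recent, cutoff)` on demand. The explicit cutoff matters because the same record can sit in both stores for a while.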
>>>>>
>>>>> I am not aware of any industrial-strength real-time dashboard. The
>>>>> idea is that one uses such a dashboard in real time. Dashboard in this
>>>>> sense means a general-purpose API to a data store of some type, such as
>>>>> the Serving Layer, that provides visual analytics in real time on
>>>>> demand, combining real-time data and aggregate views. As usual, the
>>>>> devil is in the detail.
>>>>>
>>>>>
>>>>>
>>>>> Let me know your thoughts. Anyway, this is a first-cut pattern.
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaarat.s@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mich
>>>>>>
>>>>>> This is really helpful. I'm trying to wrap my head around the last
>>>>>> diagram you shared (the one with Kafka). In this diagram Spark
>>>>>> Streaming is pushing data to HDFS and NoSQL. However, I'm confused by
>>>>>> the "Real Time Queries, Dashboards" annotation. Based on this diagram,
>>>>>> will real-time queries be running on Spark or HBase?
>>>>>>
>>>>>> PS: My intention was not to steer the conversation away from what
>>>>>> Ashok asked but I found the diagrams shared by Mich very insightful.
>>>>>>
>>>>>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> In terms of positioning, Spark is really the first Big Data platform
>>>>>>> to integrate batch, streaming and interactive computations in a
>>>>>>> unified framework. What this boils down to is that whichever way one
>>>>>>> looks at it, there is somewhere that Spark can make a contribution.
>>>>>>> In general, there are a few design patterns common to Big Data:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - *ETL & Batch*
>>>>>>>
>>>>>>> The first one is the most common, with established tools like
>>>>>>> Sqoop and Talend for ETL and HDFS for storage of some kind. Spark can
>>>>>>> be used as the execution engine for Hive at the storage level, which
>>>>>>> actually makes it a truly vendor-independent processing engine (BTW,
>>>>>>> Impala, Tez and LLAP are offered by vendors). Personally I use Spark
>>>>>>> at the ETL layer, extracting data from sources through plug-ins (JDBC
>>>>>>> and others) and storing it on HDFS in some form.
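
The extract-and-store step can be sketched as follows. This is not Spark code: an in-memory SQLite database stands in for the JDBC source and a temporary local directory stands in for HDFS, and the table name, column names and the `key=value` directory layout are assumptions chosen for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract(conn, query):
    """Pull rows from the source as a list of dicts (the extract step)."""
    cur = conn.execute(query)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def load_partitioned(rows, out_dir, key):
    """Write rows to one CSV file per partition value (the load step).

    Returns {file_path: number_of_rows_written}.
    """
    out_dir = Path(out_dir)
    written = {}
    for row in rows:
        part = out_dir / f"{key}={row[key]}"
        part.mkdir(parents=True, exist_ok=True)
        path = part / "part-0000.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if is_new:
                writer.writeheader()
            writer.writerow(row)
        written[path] = written.get(path, 0) + 1
    return written


def demo():
    """Run the sketch end to end on three hypothetical sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 5.5)],
    )
    rows = extract(conn, "SELECT id, region, amount FROM sales")
    with tempfile.TemporaryDirectory() as tmp:
        written = load_partitioned(rows, tmp, "region")
        return sorted((p.parent.name, n) for p, n in written.items())
```

Partitioning by a column at write time is one common way to lay data out so that later batch queries can skip directories they do not need.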
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - *Batch, real time plus Analytics*
>>>>>>>
>>>>>>> In this pattern you have data coming in in real time and you want to
>>>>>>> query it in real time through a real-time dashboard. HDFS is not
>>>>>>> ideal for updating data in real time, nor for random access to data.
>>>>>>> The source could be all sorts of web servers, which need a Flume
>>>>>>> agent. At the storage layer we are probably looking at something like
>>>>>>> HBase. The crucial point is that saved data needs to be ready for
>>>>>>> queries immediately. The dashboard requires HBase APIs. The analytics
>>>>>>> can be done through Hive, again running on the Spark engine. Again,
>>>>>>> note here that we should ideally process batch and real time
>>>>>>> separately.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - *Real time / Streaming*
>>>>>>>
>>>>>>> This is most relevant to Spark as we are moving to near real time,
>>>>>>> where Spark excels. We need to capture the incoming events (logs,
>>>>>>> sensor data, pricing, emails) through interfaces like Kafka, message
>>>>>>> queues etc., and we need to process these events with minimum
>>>>>>> latency. Again Spark is a very good candidate here with its Spark
>>>>>>> Streaming and micro-batching capabilities. There are others like
>>>>>>> Storm, Flink etc. that are event based, but you don’t hear much about
>>>>>>> them. Again, for a streaming architecture you need to sync data in
>>>>>>> real time using something like HBase, Cassandra (?) and others as the
>>>>>>> real-time store, or HDFS, Hive etc. as forever storage.
>>>>>>>
>>>>>>>
>>>>>>> In general there is also the *Lambda Architecture*, which is
>>>>>>> designed for streaming analytics. The streaming data ends up in both
>>>>>>> the batch layer and the speed layer. The batch layer is used to
>>>>>>> answer batch queries. On the other hand, the speed layer is used to
>>>>>>> handle fast/real-time queries. This model is really cool, as Spark
>>>>>>> Streaming can feed both the batch layer and the speed layer.
>>>>>>>
>>>>>>>
>>>>>>> At a high level this looks like this, from
>>>>>>> http://lambda-architecture.net/
>>>>>>>
>>>>>>> <image.png>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My favourite would be something like below, with Spark playing a
>>>>>>> major role
>>>>>>>
>>>>>>>
>>>>>>> <LambdaArchitecture.png>
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kumaran@me.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Spark fits best for processing. But depending on the use case, you
>>>>>>>> could expand the scope of Spark to moving data using the native
>>>>>>>> connectors. The only thing that Spark is not is storage. Connectors
>>>>>>>> are available for most storage options though.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Sivakumaran S
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar
>>>>>>>> <ashok34668@yahoo.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> There are design patterns that use Spark extensively. I am new to
>>>>>>>> this area, so I would appreciate it if someone could explain where
>>>>>>>> Spark fits in, especially within faster or streaming use cases.
>>>>>>>>
>>>>>>>> What are the best practices involving Spark? Is it always best to
>>>>>>>> deploy it as the processing engine?
>>>>>>>>
>>>>>>>> For example, when we have a pattern
>>>>>>>>
>>>>>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>>>>>
>>>>>>>> where does Spark best fit in?
>>>>>>>>
>>>>>>>> Thanking you
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
