spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: using spark to load a data warehouse in real time
Date Wed, 01 Mar 2017 07:25:27 GMT
I am not sure Spark Streaming is what you want. It is built for streaming analytics, not
for loading a DWH.

You also need to define what real time means and what is required there; that will differ
significantly from client to client.

In my experience, SQL alone will not be enough for users in the future. Large data volumes
require much more than simple aggregations, which become less useful at that scale. Users
will have to learn new ways of dealing with the data from a business perspective, such as
proper sampling of large datasets and machine-learning approaches. These methods are
business-driven, not technically driven. I think it is wrong to assume that users learning
new skills is a bad thing; in the future it may be a necessity.

> On 28 Feb 2017, at 23:18, Adaryl Wakefield <adaryl.wakefield@hotmail.com> wrote:
> 
> I’m actually trying to come up with a generalized use case that I can take from client
to client. We have structured data coming from some application. Instead of dropping it into
Hadoop and then using yet another technology to query that data, I just want to dump it into
a relational MPP DW so nobody has to learn new skills or new tech just to do some analysis.
Everybody and their mom can write SQL. Designing relational databases is a rare skill but
not as rare as what is necessary for designing some NoSQL solutions.
>  
> I’m looking for the fastest path to move a company from batch to real time analytical
processing.
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: Mohammad Tariq [mailto:dontariq@gmail.com] 
> Sent: Tuesday, February 28, 2017 12:57 PM
> To: Adaryl Wakefield <adaryl.wakefield@hotmail.com>
> Cc: user@spark.apache.org
> Subject: Re: using spark to load a data warehouse in real time
>  
> Hi Adaryl,
>  
> You could definitely load data into a warehouse using Spark's JDBC support with
DataFrames. Could you please explain your use case in a bit more detail? That will help us
answer your question better.
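A minimal sketch of that JDBC route in PySpark (the hostnames, credentials, and table names below are placeholders; since Greenplum speaks the Postgres wire protocol, the stock Postgres JDBC driver is assumed to work):

```python
def jdbc_options(host, port, db, table, user, password):
    """Option map for Spark's JDBC data source. Greenplum is assumed to
    accept the standard Postgres driver (it speaks the Postgres protocol)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{db}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

def load_to_warehouse(json_path):
    """Read structured application data and append it to a staging table.
    Requires pyspark plus the Postgres JDBC jar on the Spark classpath."""
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("dwh-load").getOrCreate()
    df = spark.read.json(json_path)
    (df.write.format("jdbc")
       .options(**jdbc_options("gp-master", 5432, "warehouse",
                               "staging.events", "etl_user", "secret"))
       .mode("append")
       .save())
```

This lands rows in a staging table; the merge into the warehouse proper is a separate step, discussed further down the thread.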
>  
> Tariq, Mohammad
> about.me/mti
>  
> On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <adaryl.wakefield@hotmail.com>
wrote:
> I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would, of course,
have to be part of any architecture, but it looks like they are suggesting that Kafka is
all you need.
>  
> My primary concern is the complexity of loading warehouses. I have a web development
background so I have somewhat of an idea on how to insert data into a database from an application.
I’ve since moved on to straight database programming and don’t work with anything that
reads from an app anymore.
>  
> Loading a warehouse requires a lot of data cleansing, plus looking up and assigning keys
to maintain referential integrity. Usually that’s done in a batch process. Now I have to
do it record by record (or a few records at a time). I have some ideas, but I’m not quite
there yet.
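The batch pattern of resolving surrogate keys before inserting facts carries over to micro-batches. A toy, database-free sketch of the lookup-or-create step (all names here are illustrative; a real pipeline would back this with the warehouse's dimension tables or a key-value cache):

```python
class DimensionKeyCache:
    """Surrogate-key lookup for one dimension: return the existing key for a
    natural key, or mint a new one for a late-arriving dimension member."""
    def __init__(self):
        self._keys = {}      # natural key -> surrogate key
        self._next_key = 1

    def get_or_create(self, natural_key):
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next_key
            self._next_key += 1
        return self._keys[natural_key]

def prepare_fact_rows(records, customer_dim):
    """Resolve each incoming record's customer to a surrogate key, so the
    fact insert never violates referential integrity."""
    rows = []
    for rec in records:
        rows.append({
            "customer_key": customer_dim.get_or_create(rec["customer_id"]),
            "amount": rec["amount"],
        })
    return rows
```

Run per micro-batch, this keeps fact rows pointing at valid dimension keys even when a dimension member shows up for the first time mid-stream.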
>  
> I thought Spark SQL would be the way to get this done, but so far all the examples I’ve
seen are just SELECT statements, no INSERTs or MERGE statements.
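Spark SQL's own INSERT syntax targets Spark-managed tables; getting INSERTs or a merge into an external RDBMS usually means issuing plain SQL per micro-batch over a JDBC or DB-API connection. One portable sketch is the classic two-statement merge from a staging table (a hypothetical helper; whether a given Greenplum version supports a native MERGE or ON CONFLICT is worth checking separately):

```python
def merge_from_staging(target, staging, key_col, cols):
    """Emit the classic two-step merge: update target rows that match the
    staging table on the key, then insert the staging rows with no match."""
    set_clause = ", ".join(f"{c} = s.{c}" for c in cols)
    update = (f"UPDATE {target} t SET {set_clause} "
              f"FROM {staging} s WHERE t.{key_col} = s.{key_col};")
    col_list = ", ".join([key_col] + cols)
    insert = (f"INSERT INTO {target} ({col_list}) "
              f"SELECT {col_list} FROM {staging} s "
              f"WHERE NOT EXISTS (SELECT 1 FROM {target} t "
              f"WHERE t.{key_col} = s.{key_col});")
    return update, insert
```

The idea is that each micro-batch lands in the staging table (e.g. via the JDBC write shown earlier in the thread) and then these two statements run in one transaction.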
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: Femi Anthony [mailto:femibyte@gmail.com] 
> Sent: Tuesday, February 28, 2017 4:13 AM
> To: Adaryl Wakefield <adaryl.wakefield@hotmail.com>
> Cc: user@spark.apache.org
> Subject: Re: using spark to load a data warehouse in real time
>  
> Have you checked to see if there are any drivers that would enable you to write to
Greenplum directly from Spark?
>  
> You can also take a look at this link:
>  
> https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q
>  
> Apparently GPDB is based on Postgres, so that approach may work.
> Another approach would be for Spark Streaming to write to Kafka, and then have another
process read from Kafka and write to Greenplum.
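The first leg of that pipeline, Spark Structured Streaming writing to Kafka, might look roughly like this (a sketch, not a tested pipeline: it assumes the spark-sql-kafka connector package is on the classpath, and the broker, topic, and checkpoint arguments are placeholders):

```python
def stream_to_kafka(events_df, broker, topic, checkpoint_dir):
    """Forward a streaming DataFrame of parsed application events to a Kafka
    topic as JSON; a downstream consumer (or Kafka Connect) then loads
    Greenplum. Requires the spark-sql-kafka connector package."""
    return (events_df.selectExpr("to_json(struct(*)) AS value")
            .writeStream.format("kafka")
            .option("kafka.bootstrap.servers", broker)
            .option("topic", topic)
            .option("checkpointLocation", checkpoint_dir)
            .start())
```

Decoupling through Kafka this way also means the warehouse loader can be restarted or rewound independently of the Spark job.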
>  
> Kafka Connect may be useful in this case -
>  
> https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
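For the Kafka-to-Greenplum leg, Confluent's JDBC sink connector takes a config along these lines (a sketch: the connection details and topic name are placeholders, and Greenplum's compatibility with the connector's Postgres dialect should be verified):

```json
{
  "name": "greenplum-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "app-events",
    "connection.url": "jdbc:postgresql://gp-master:5432/warehouse",
    "connection.user": "etl_user",
    "connection.password": "secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "false"
  }
}
```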
>  
> Femi Anthony
>  
>  
> 
> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <adaryl.wakefield@hotmail.com> wrote:
> 
> Is anybody using Spark Streaming/SQL to load a relational data warehouse in real time?
There isn’t a lot of information on this use case out there. When I google real-time data
warehouse loading, nothing I find is up to date; it’s all turn-of-the-century stuff that
doesn’t take into account advancements in database technology. Additionally, whenever I
try to learn Spark, it’s always the same thing: playing with Twitter data, never
structured data. All the CEP use cases are about data science.
>  
> I’d like to use Spark to load Greenplum in real time. Intuitively, this should be
possible. I was thinking Spark Streaming with Spark SQL along with an ORM should do it. Am
I off base with this? Is the reason there are no examples that there is a better way to do
what I want?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> 
>  
