spark-dev mailing list archives

From Vikram Kone <vikramk...@gmail.com>
Subject Re: Need advice for Spark newbie
Date Thu, 26 Feb 2015 21:23:30 GMT
Dean,
Thanks for the info. Are you saying that we can create star/snowflake data
models using Spark so that they can be queried from Tableau?
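
For concreteness, the kind of star-schema query in question might look like
the sketch below. It uses Python's built-in sqlite3 module as a stand-in
rather than Spark, and the table and column names (dim_region, fact_visits,
user_id) are hypothetical. It only illustrates joining a fact table to a
dimension and computing a COUNT DISTINCT measure; it says nothing about how
Spark or Tableau would execute it at scale.

```python
import sqlite3

# Hypothetical star schema: one fact table plus one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (
        region_id   INTEGER PRIMARY KEY,
        region_name TEXT
    );
    CREATE TABLE fact_visits (
        visit_id  INTEGER PRIMARY KEY,
        region_id INTEGER,
        user_id   INTEGER
    );
""")
conn.executemany(
    "INSERT INTO dim_region VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)
conn.executemany(
    "INSERT INTO fact_visits (region_id, user_id) VALUES (?, ?)",
    [(1, 100), (1, 100), (1, 101), (2, 200), (2, 201), (2, 201), (2, 202)],
)


def distinct_users_by_region(connection):
    """Join the fact table to its dimension and compute a COUNT DISTINCT measure."""
    rows = connection.execute("""
        SELECT d.region_name, COUNT(DISTINCT f.user_id) AS distinct_users
        FROM fact_visits AS f
        JOIN dim_region AS d ON d.region_id = f.region_id
        GROUP BY d.region_name
        ORDER BY d.region_name
    """).fetchall()
    return dict(rows)


result = distinct_users_by_region(conn)
print(result)
```

Spark SQL accepts joins and COUNT(DISTINCT ...) of this general shape, but
whether such a query meets a sub-second target on billions of rows is a
separate question that this sketch does not address.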

On Thursday, February 26, 2015, Dean Wampler <deanwampler@gmail.com> wrote:

> Historically, many orgs. have replaced data warehouses with Hadoop
> clusters and used Hive along with Impala (on Cloudera deployments) or Drill
> (on MapR deployments) for SQL. Hive is older and slower, while Impala and
> Drill are newer and faster, but you typically need both for their
> complementary features, at least today.
>
> Spark and Spark SQL are not yet complete replacements for them, but
> they'll get there over time. The good news is, you can mix and match these
> tools, as appropriate, because they can all work with the same datasets.
>
> The challenge is all the tribal knowledge required to set up and manage
> Hadoop clusters, to organize your data for the best performance for
> your needs, and to use all these tools effectively, along with additional
> Hadoop ETL tools, etc. Fortunately, tools like Tableau are already
> integrated here.
>
> However, none of this will be as polished and integrated as what you're
> used to. You're trading that polish for greater scalability and flexibility.
>
> HTH.
>
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone <vikramkone@gmail.com> wrote:
>
>> Hi,
>> I'm a newbie when it comes to Spark and the Hadoop ecosystem in general.
>> Our team has been predominantly a Microsoft shop that uses the MS stack
>> for most of its BI needs. So we are talking SQL Server for storing
>> relational data and SQL Server Analysis Services for building MOLAP cubes
>> for sub-second query analysis.
>> Lately, we have been hitting degradation in our cube query response times
>> as our data sizes grew considerably over the past year. We are talking
>> fact tables in the range of 10-100 billion rows and a few dimensions in
>> the range of tens to hundreds of millions of rows. We tried vertically
>> scaling up our SSAS server, but queries are still taking a few minutes.
>> In light of this, I was entrusted with the task of figuring out an open
>> source solution that would scale to our current and future needs for data
>> analysis.
>> I looked at a bunch of open source tools like Apache Drill, Druid,
>> AtScale, Spark, Storm, Kylin, etc. and settled on exploring Spark as the
>> first step given its recent rise in popularity and the growing ecosystem
>> around it. Since we are also interested in doing deep data analysis like
>> machine learning and graph algorithms on top of our data, Spark seems to
>> be a good solution.
>> I would like to build out a POC for our MOLAP cubes using Spark with
>> HDFS/Hive as the data source and see how it scales for our
>> queries/measures in real time with real data.
>> Roughly, these are the requirements for our team:
>> 1. Should be able to create facts, dimensions and measures from our data
>> sets easily.
>> 2. Cubes should be queryable from Excel and Tableau.
>> 3. Easily scale out by adding new nodes when data grows.
>> 4. Very low maintenance and highly stable for production-level workloads.
>> 5. Sub-second query latencies for COUNT DISTINCT measures (since the
>> majority of our expensive measures are of this type). We are OK with
>> approximate distinct counts for better performance.
>>
>> So given these requirements, is Spark the right solution to replace our
>> on-premise MOLAP cubes?
>> Are there any tutorials or documentation on how to build cubes using
>> Spark? Is that even possible? Or even necessary? As long as our users can
>> pivot/slice and dice the measures quickly from client tools by dragging
>> and dropping dimensions into rows/columns without the need to join to the
>> fact table, we are OK with however the data is laid out. It doesn't have
>> to be a cube. It can be a flat file in HDFS for all we care. I would love
>> to chat with someone who has successfully done this kind of migration
>> from OLAP cubes to Spark in their team or company.
>>
>> This is it for now. Looking forward to a great discussion.
>>
>> P.S. We have decided on using Azure HDInsight as our managed Hadoop system
>> in the cloud.
>>
>
>
