spark-user mailing list archives

From Jörn Franke
Subject Re: Spark based Data Warehouse
Date Sun, 12 Nov 2017 09:14:34 GMT
What do you mean by "all possible workloads"?
You cannot prepare any system to handle every possible kind of processing.

We do not know the requirements of your data scientists now or in the future so it is difficult
to say. How do they work currently without the new solution? Do they all work on the same
data? I bet you will receive a lot of private emails from people trying to sell you a
solution that solves everything - with the information you provided this is impossible to

Then, as with every system: have incremental releases, and have them in short time frames - do
not engineer a big system that you will deliver in 2 years. In the cloud you have the perfect
opportunity to scale feature-wise as well as infrastructure-wise.

The challenge with concurrent queries is defining the scheduler correctly (e.g. the fair scheduler)
so that no single query takes all the resources and long-running queries do not starve the others.
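
As a rough illustration of the scheduler point above, here is a minimal sketch of what a fair scheduler setup could look like. It only generates the pool allocation file and lists the related Spark properties; the pool names, weights, minimum shares and the file path are assumptions made up for the example, not recommendations for any particular workload.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Return the XML text of a fair scheduler allocation file.

    pools maps a pool name to its schedulingMode, weight and minShare.
    """
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key in ("schedulingMode", "weight", "minShare"):
            ET.SubElement(pool, key).text = str(settings[key])
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: short interactive queries get a larger share than
# long-running batch jobs, but the batch pool keeps a minimum share so
# that it is not starved completely.
POOLS = {
    "interactive": {"schedulingMode": "FAIR", "weight": 3, "minShare": 4},
    "batch": {"schedulingMode": "FIFO", "weight": 1, "minShare": 1},
}

allocation_xml = build_allocations(POOLS)

# Spark properties that switch on fair scheduling and point to the file.
# The path is a placeholder; write allocation_xml to wherever suits you.
spark_conf = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}

print(allocation_xml)
```

Each thread that submits queries would then pick its pool with something like sc.setLocalProperty("spark.scheduler.pool", "interactive"). Note that these pools only share resources between jobs inside one Spark application; sharing between separate applications is up to the cluster manager.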

User interfaces: what could help are notebooks (Jupyter etc) but you may need to train your
data scientists. Some may know or prefer other tools.

> On 12. Nov 2017, at 08:32, Deepak Sharma wrote:
> I am looking for a similar solution, more aligned to the data scientist group.
> The concern I have is about supporting complex aggregations at runtime.
> Thanks
> Deepak
>> On Nov 12, 2017 12:51, "ashish rawat" wrote:
>> Hello Everyone,
>> I was trying to understand if anyone here has tried a data warehouse solution using
>> S3 and Spark SQL. Out of multiple possible options (redshift, presto, hive etc), we were planning
>> to go with Spark SQL, for our aggregates and processing requirements.
>> If anyone has tried it out, would like to understand the following:
>> Are Spark SQL and UDFs able to handle all the workloads?
>> What user interface did you provide for data scientists, data engineers and analysts?
>> What are the challenges in running concurrent queries, by many users, over Spark
>> SQL? Considering Spark still does not provide spill to disk in many scenarios, are there
>> frequent query failures when executing concurrent queries?
>> Are there any open source implementations, which provide something similar?
>> Regards,
>> Ashish
