spark-user mailing list archives

From Gourav Sengupta <>
Subject Re: Spark based Data Warehouse
Date Sun, 12 Nov 2017 18:47:28 GMT
Dear Ashish,
what you are asking for requires at least a few weeks of dedicated effort to
understand your use case, and then it takes at least 3 to 4 months to
even propose a solution. You could even build a fantastic data warehouse just
using C++. It all depends on a lot of conditions. I just think that your
approach and question need a lot of refinement.


On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <>

> Hi, Ashish.
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk but I am not sure how this pertains to a "concurrent user
> scenario". Each executor will run in its own JVM and is therefore isolated
> from others. That is, if the JVM of one user dies, this should not affect
> another user who is running their own jobs in their own JVMs. The amount of
> resources used by a user can be controlled by the resource manager.
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster a certain user or group is allowed to
> use for their job. This is obviously quite a coarse-grained approach as (to
> my knowledge) IO is not throttled. I believe people generally use something
> like Apache Ambari to keep an eye on network and disk usage to mitigate
> problems in a shared cluster.
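As a rough sketch of the coarse-grained limits described above, a YARN CapacityScheduler queue for one team might be configured like this (the queue name "analytics" and the percentages are made-up placeholders, not recommendations):

```xml
<!-- capacity-scheduler.xml (sketch): a hypothetical "analytics" queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,analytics</value>
</property>
<!-- the analytics team gets 30% of cluster resources by default -->
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>30</value>
</property>
<!-- hard cap: never grow beyond 50%, even when the cluster is idle -->
<property>
  <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
  <value>50</value>
</property>
```

Note that these limits cover CPU and memory only; as mentioned above, IO is not throttled this way.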
> If a user has badly designed their query, it may very well fail with
> OOMEs, but this can happen regardless of whether one user or many are using
> the cluster at a given moment.
> Does this help?
> Regards,
> Phillip
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <> wrote:
>> Thanks Jorn and Phillip. My question was specifically for anyone who has
>> tried building a data warehouse on Spark SQL. I was trying to
>> find out whether someone has tried it and can share the kinds of workloads
>> that worked and the ones that had problems.
>> Regarding spill to disk, I might be wrong, but not all functionality of
>> Spark spills to disk, so it still doesn't provide DB-like reliability in
>> execution. With databases, queries get slow but they don't fail or go out
>> of memory, specifically in concurrent-user scenarios.
>> Regards,
>> Ashish
>> On Nov 12, 2017 3:02 PM, "Phillip Henry" <> wrote:
>> Agree with Jorn. The answer is: it depends.
>> In the past, I've worked with data scientists who are happy to use the
>> Spark CLI. Again, the answer is "it depends" (in this case, on the skills
>> of your customers).
>> Regarding sharing resources, different teams were limited to their own
>> queue so they could not hog all the resources. However, people within a
>> team had to do some horse trading if they had a particularly intensive job
>> to run. I did feel that this was an area that could be improved. It may be
>> better by now; I've just not looked into it for a while.
>> BTW I'm not sure what you mean by "Spark still does not provide spill to
>> disk", as the FAQ says "Spark's operators spill data to disk if it does not
>> fit in memory". So, your data will not normally cause OutOfMemoryErrors
>> (certain terms and conditions may apply).
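The "operators spill data to disk" behaviour the FAQ describes is easy to picture with a toy external sort: when an in-memory buffer exceeds a budget, sorted runs are written to temporary files and merged afterwards. The sketch below only illustrates the general technique under a made-up memory budget; it is not Spark's implementation:

```python
# Toy illustration of the "spill to disk" idea: sort a dataset that does
# not fit in a (simulated) memory budget by writing sorted runs to
# temporary files and merge-sorting the runs. NOT Spark's implementation.
import heapq
import os
import tempfile

def external_sort(values, max_in_memory=4):
    """Sort `values` while holding at most `max_in_memory` items in RAM."""
    run_files = []
    buffer = []
    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:              # memory budget hit:
            run_files.append(_spill(sorted(buffer)))  # spill a sorted run
            buffer = []
    if buffer:
        run_files.append(_spill(sorted(buffer)))
    # k-way merge of the on-disk runs
    iterators = [_read_run(path) for path in run_files]
    merged = list(heapq.merge(*iterators))
    for path in run_files:
        os.unlink(path)
    return merged

def _spill(sorted_chunk):
    # write one sorted run to a temporary file, one value per line
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        for v in sorted_chunk:
            f.write(f"{v}\n")
    return path

def _read_run(path):
    # stream a run back from disk without loading it all into memory
    with open(path) as f:
        for line in f:
            yield int(line)

print(external_sort([9, 1, 7, 3, 8, 2, 6, 4, 5], max_in_memory=3))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The "terms and conditions" remark still applies: some structures (e.g. large single records or driver-side collects) cannot be spilled this way, which is where OOMEs can still occur.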
>> My 2 cents.
>> Phillip
>> On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <>
>> wrote:
>>> What do you mean by all possible workloads?
>>> You cannot prepare any system for all possible processing.
>>> We do not know the requirements of your data scientists now or in the
>>> future, so it is difficult to say. How do they work currently, without the
>>> new solution? Do they all work on the same data? I bet you will receive
>>> a lot of private messages trying to sell a solution that
>>> solves everything - with the information you provided, this is impossible
>>> to say.
>>> Then, as with every system: have incremental releases, but keep them in
>>> short time frames - do not engineer a big system that you will deliver in
>>> 2 years. In the cloud you have the perfect opportunity to scale
>>> feature-wise but also infrastructure-wise.
>>> The challenge with concurrent queries is the right configuration of the
>>> scheduler (e.g. the fair scheduler), so that no single query takes all the
>>> resources and long-running queries do not starve.
>>> User interfaces: notebooks (Jupyter etc.) could help, but you may
>>> need to train your data scientists. Some may know or prefer other tools.
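For illustration, the fair-scheduler point above can be expressed in Spark's own allocation file once `spark.scheduler.mode` is set to `FAIR`; the pool names, weights and shares below are made-up placeholders:

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml (sketch): hypothetical pools, placeholder values -->
<allocations>
  <!-- interactive ad-hoc queries: fair sharing within the pool -->
  <pool name="adhoc">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <!-- long-running batch jobs: higher weight, FIFO within the pool -->
  <pool name="batch">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
```

Point `spark.scheduler.allocation.file` at this file and have each job select its pool with `sc.setLocalProperty("spark.scheduler.pool", "adhoc")`, so that concurrent queries share executors instead of one query monopolizing them.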
>>> On 12. Nov 2017, at 08:32, Deepak Sharma <> wrote:
>>> I am looking for a similar solution, more aligned to a data scientist
>>> group. The concern I have is about supporting complex aggregations at
>>> runtime.
>>> Thanks
>>> Deepak
>>> On Nov 12, 2017 12:51, "ashish rawat" <> wrote:
>>>> Hello Everyone,
>>>> I was trying to understand whether anyone here has tried a data warehouse
>>>> solution using S3 and Spark SQL. Out of the multiple possible options
>>>> (Redshift, Presto, Hive etc.), we were planning to go with Spark SQL for
>>>> our aggregation and processing requirements.
>>>> If anyone has tried it out, would like to understand the following:
>>>>    1. Are Spark SQL and UDFs able to handle all the workloads?
>>>>    2. What user interface did you provide for data scientists, data
>>>>    engineers and analysts?
>>>>    3. What are the challenges in running concurrent queries, by many
>>>>    users, over Spark SQL? Considering Spark still does not provide spill
>>>>    to disk in many scenarios, are there frequent query failures when
>>>>    executing concurrent queries?
>>>>    4. Are there any open source implementations which provide
>>>>    something similar?
>>>> Regards,
>>>> Ashish
