spark-user mailing list archives

From Andrés Ivaldi <iaiva...@gmail.com>
Subject Re: Can we use spark inside a web service?
Date Fri, 11 Mar 2016 13:02:16 GMT
Nice discussion. I have a question about web services with Spark.

What could be the problem with using akka-http as the web service layer
(like Play does), with one SparkContext created up front, and all queries
over akka-http using only that single SparkContext instance?

Also, about analytics: we are working on real-time analytics, and as Hemant
said, Spark is not a solution for low-latency queries. What about using
Apache Ignite for that?
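
For context, here is a minimal sketch of what I mean (illustrative only; it
assumes akka-http 2.4-style scaladsl APIs, and the route, port, and job are
made up by me):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import org.apache.spark.{SparkConf, SparkContext}

object SparkWebService extends App {
  // one long-lived SparkContext shared by every HTTP request
  val sc = new SparkContext(
    new SparkConf().setAppName("spark-web-service").setMaster("local[4]"))

  implicit val system = ActorSystem("spark-web-service")
  implicit val materializer = ActorMaterializer()

  // each request submits a lightweight job on the shared context
  val route = path("sum") {
    get {
      complete(sc.parallelize(1 to 1000).sum().toString)
    }
  }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}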


On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <hemant9379@gmail.com>
wrote:

> Spark-jobserver is an elegant product that builds concurrency on top of
> Spark. But the current design of the DAGScheduler prevents Spark from
> becoming a truly concurrent solution for low-latency queries. The
> DAGScheduler will turn out to be a bottleneck for low-latency queries. The
> Sparrow project was an effort to make Spark more suitable for such
> scenarios, but it never made it into the Spark codebase. If Spark is to
> become a highly concurrent solution, scheduling has to be distributed.
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <chris@fregly.com> wrote:
>
>> great discussion, indeed.
>>
>> Mark Hamstra and i spoke offline just now.
>>
>> Below is a quick recap of our discussion on how they've achieved
>> acceptable performance from Spark on the user request/response path
>> (@mark - feel free to correct/comment).
>>
>> 1) there is a big difference in request/response latency between
>> submitting a full Spark Application (heavyweight) and having a
>> long-running Spark Application (like Spark Job Server) that submits
>> lighter-weight Jobs using a shared SparkContext.  mark is obviously using
>> the latter - a long-running Spark App.
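>>
>> as a rough sketch of that long-running pattern (illustrative only - this
>> is not Mark's actual code, and the FAIR scheduler setting and the dataset
>> path are assumptions on my part):
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import scala.concurrent.{ExecutionContext, Future}
>>
>> val conf = new SparkConf()
>>   .setAppName("long-running-app")
>>   .set("spark.scheduler.mode", "FAIR")  // let concurrent Jobs share cores
>> val sc = new SparkContext(conf)
>>
>> // cache a shared dataset once, up front (path is hypothetical)
>> val events = sc.textFile("hdfs:///data/events").cache()
>>
>> // each incoming request becomes a lightweight Job on the shared context
>> implicit val ec = ExecutionContext.global
>> def handle(term: String): Future[Long] = Future {
>>   events.filter(_.contains(term)).count()
>> }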
>>
>> 2) there are some enhancements to Spark that are required to achieve
>> acceptable user request/response times.  some links that Mark provided are
>> as follows:
>>
>>    - https://issues.apache.org/jira/browse/SPARK-11838
>>    - https://github.com/apache/spark/pull/11036
>>    - https://github.com/apache/spark/pull/11403
>>    - https://issues.apache.org/jira/browse/SPARK-13523
>>    - https://issues.apache.org/jira/browse/SPARK-13756
>>
>> Essentially, these add a deeper level of caching at the shuffle file
>> layer to reduce compute and memory usage between queries.
>>
>> Note that Mark is running a slightly-modified version of stock Spark.
>>  (He's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to work
>> around outstanding PRs and JIRAs.
>>
>> this may not be what people want to hear, but it's a trend that i'm
>> seeing lately as more and more teams customize Spark to their specific
>> use cases.
>>
>> Anyway, thanks for the good discussion, everyone!  This is why we have
>> these lists, right!  :)
>>
>>
>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.github@gmail.com>
>> wrote:
>>
>>> One of the premises here is that if you can restrict your workload to
>>> fewer cores - which is easier with FiloDB and careful data modeling -
>>> you can make this work for much higher concurrency and lower latency
>>> than most typical Spark use cases.
>>>
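>>> (for example, something like the following in the app's conf - the
>>> values here are illustrative, not a recommendation:)
>>>
>>> import org.apache.spark.SparkConf
>>>
>>> val conf = new SparkConf()
>>>   .set("spark.cores.max", "2")       // cap total cores for this app
>>>   .set("spark.executor.cores", "1")  // cores per executor JVM
>>>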
>>> The reason why it typically does not work in production is that most
>>> people are using HDFS and files.  These data sources are designed for
>>> running queries and workloads on all your cores across many workers,
>>> and not for filtering your workload down to only one or two cores.
>>>
>>> There is actually nothing inherent in Spark that prevents people from
>>> using it as an app server.   However, the insistence on using it with
>>> HDFS is what kills concurrency.   This is why FiloDB is important.
>>>
>>> I agree there are more optimized stacks for running app servers, but
>>> the choices that you mentioned:  ES is targeted at text search;  Cass
>>> and HBase by themselves are not fast enough for analytical queries
>>> that the OP wants;  and MySQL is great but not scalable.   Probably
>>> something like VectorWise, HANA, Vertica would work well, but those
>>> are mostly not free solutions.   Druid could work too if the use case
>>> is right.
>>>
>>> Anyways, great discussion!
>>>
>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <chris@fregly.com> wrote:
>>> > you are correct, mark.  i misspoke.  apologies for the confusion.
>>> >
>>> > so the problem is even worse given that a typical job requires multiple
>>> > tasks/cores.
>>> >
>>> > i have yet to see this particular architecture work in production.  i
>>> > would love for someone to prove otherwise.
>>> >
>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra <mark@clearstorydata.com>
>>> > wrote:
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests, this is 1000 concurrent Spark jobs.  This would require a
>>> >>> cluster with 1000 cores.
>>> >>
>>> >>
>>> >> This doesn't make sense.  A Spark Job is a driver/DAGScheduler concept
>>> >> without any 1:1 correspondence between Worker cores and Jobs.  Cores
>>> >> are used to run Tasks, not Jobs.  So, yes, a 1000 core cluster can run
>>> >> at most 1000 simultaneous Tasks, but that doesn't really tell you
>>> >> anything about how many Jobs are or can be concurrently tracked by the
>>> >> DAGScheduler, which will be apportioning the Tasks from those
>>> >> concurrent Jobs across the available Executor cores.
>>> >>
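>>> >> To make the distinction concrete, here's a toy sketch (not production
>>> >> code; assume sc is the app's shared SparkContext): 1000 concurrent
>>> >> Jobs submitted from one driver, whose Tasks are multiplexed over
>>> >> however many Executor cores the app actually has.
>>> >>
>>> >> import scala.concurrent.{ExecutionContext, Future}
>>> >> implicit val ec = ExecutionContext.global
>>> >>
>>> >> // each Future triggers one Job; the DAGScheduler tracks all of them
>>> >> val results = (1 to 1000).map { i =>
>>> >>   Future { sc.parallelize(1 to 100).map(_ * i).sum() }
>>> >> }
>>> >>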
>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <chris@fregly.com>
>>> >> wrote:
>>> >>>
>>> >>> Good stuff, Evan.  Looks like this is utilizing the in-memory
>>> >>> capabilities of FiloDB, which is pretty cool.  looking forward to the
>>> >>> webcast as I don't know much about FiloDB.
>>> >>>
>>> >>> My personal thoughts here are to remove Spark from the user
>>> >>> request/response hot path.
>>> >>>
>>> >>> I can't tell you how many times i've had to unroll that architecture
>>> >>> at clients - and replace it with a real database like Cassandra,
>>> >>> ElasticSearch, HBase, or MySQL.
>>> >>>
>>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads you to
>>> >>> believe that Spark could be used as an application server.  This is
>>> >>> not a good use case for Spark.
>>> >>>
>>> >>> Remember that every job that is launched by Spark requires 1 CPU
>>> >>> core, some memory, and an available Executor JVM to provide the CPU
>>> >>> and memory.
>>> >>>
>>> >>> Yes, you can horizontally scale this because of the distributed
>>> >>> nature of Spark; however, it is not an efficient scaling strategy.
>>> >>>
>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>> >>> requests, this is 1000 concurrent Spark jobs.  This would require a
>>> >>> cluster with 1000 cores.  This is just not cost-effective.
>>> >>>
>>> >>> Use Spark for what it's good for - ad-hoc, interactive, and iterative
>>> >>> (machine learning, graph) analytics.  Use an application server for
>>> >>> what it's good for - managing a large number of concurrent requests.
>>> >>> And use a database for what it's good for - storing/retrieving data.
>>> >>>
>>> >>> And any serious production deployment will need failover, throttling,
>>> >>> back pressure, auto-scaling, and service discovery.
>>> >>>
>>> >>> While Spark supports these to varying levels of production-readiness,
>>> >>> Spark is a batch-oriented system and not meant to be put on the user
>>> >>> request/response hot path.
>>> >>>
>>> >>> For the failover, throttling, back pressure, and auto-scaling that i
>>> >>> mentioned above, it's worth checking out the Netflix OSS suite -
>>> >>> particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>> >>> http://netflix.github.io/
>>> >>>
>>> >>> Here's my github project that incorporates a lot of these:
>>> >>> https://github.com/cfregly/fluxcapacitor
>>> >>>
>>> >>> Here's a Netflix Skunkworks GitHub project that packages these up in
>>> >>> Docker images:  https://github.com/Netflix-Skunkworks/zerotodocker
>>> >>>
>>> >>>
>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github <velvia.github@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> I just wrote a blog post which might be really useful to you -- I
>>> >>>> have just benchmarked being able to achieve 700 queries per second
>>> >>>> in Spark.  So, yes, web speed SQL queries are definitely possible.
>>> >>>> Read my new blog post:
>>> >>>>
>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>> >>>>
>>> >>>> and feel free to email me (at velvia@gmail.com) if you would like
>>> >>>> to follow up.
>>> >>>>
>>> >>>> -Evan
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>


-- 
Ing. Ivaldi Andres
