spark-dev mailing list archives

From: Cody Koeninger <c...@koeninger.org>
Subject: Re: Spark Improvement Proposals
Date: Tue, 03 Jan 2017 23:16:03 GMT
I don't have a concern about voting vs consensus.

My concern is that whatever the decision-making process is, it should be
explicitly announced on the ticket for the given proposal, with an explicit
deadline and an explicit outcome.


On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <irashid@cloudera.com> wrote:

> I'm also in favor of this.  Thanks for your persistence, Cody.
>
> My take on the specific issues Joseph mentioned:
>
> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
> earlier for consensus:
>
> > Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC, depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.
>
> 2) Design doc template -- agree this would be useful, but also seems
> totally orthogonal to moving forward on the SIP proposal.
>
> 3) agree w/ Joseph's proposal for updating the template.
>
> One small addition:
>
> 4) Deciding on a name -- minor, but I think it's worth disambiguating from
> Scala's SIPs, and the best proposal I've heard is "SPIP".  At least, no
> one has objected.  (I don't care enough that I'd object to anything else,
> though.)
>
>
> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <joseph@databricks.com>
> wrote:
>
>> Hi Cody,
>>
>> Thanks for being persistent about this.  I too would like to see this
>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>> * Decide about a few issues
>> * Finalize the doc(s)
>> * Vote on this proposal
>>
>> Issues & TODOs:
>>
>> (1) The main issue I see above is voting vs. consensus.  I have little
>> preference here.  It sounds like something which could be tailored based on
>> whether we see too many or too few SIPs being approved.
>>
>> (2) Design doc template  (This would be great to have for Spark
>> regardless of this SIP discussion.)
>> * Reynold, are you still putting this together?
>>
>> (3) Template cleanups.  Listing some items mentioned above + a new one
>> w.r.t. Reynold's draft
>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>> :
>> * Reinstate the "Where" section with links to current and past SIPs
>> * Add field for stating explicit deadlines for approval
>> * Add field for stating Author & Committer shepherd
>>
>> Thanks all!
>> Joseph
>>
>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>>
>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>
>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>
>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rblue@netflix.com> wrote:
>>> > On lazy consensus as opposed to voting:
>>> >
>>> > First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
>>> >
>>> > Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?
>>> >
>>> > rb
>>> >
>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <cody@koeninger.org> wrote:
>>> >>
>>> >> So there are some minor things (the Where section heading appears to
>>> >> be dropped; wherever this document is posted it needs to actually link
>>> >> to a jira filter showing current / past SIPs) but it doesn't look like
>>> >> I can comment on the google doc.
>>> >>
>>> >> The major substantive issue that I have is that this version is
>>> >> significantly less clear as to the outcome of an SIP.
>>> >>
>>> >> The apache example of lazy consensus at
>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>> >> explicit announcement of an explicit deadline, which I think are
>>> >> necessary for clarity.
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rxin@databricks.com> wrote:
>>> >> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>>> >> >
>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rxin@databricks.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Oops. Let me try to figure that out.
>>> >> >>
>>> >> >>
>>> >> >> On Monday, November 7, 2016, Cody Koeninger <cody@koeninger.org> wrote:
>>> >> >>>
>>> >> >>> Thanks for picking up on this.
>>> >> >>>
>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the document you linked.
>>> >> >>>
>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue with that, sure.  As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>>> >> >>>
>>> >> >>> The other points are hard to comment on without being able to see the text in question.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rxin@databricks.com>
>>> >> >>> wrote:
>>> >> >>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>>> >> >>> >
>>> >> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>>> >> >>> >
>>> >> >>> > To that end, the two biggest areas for improvement in my opinion are:
>>> >> >>> >
>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people that don't follow closely, it is difficult to know what the important initiatives are. Even for people that do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>>> >> >>> >
>>> >> >>> > 2. Solicit user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > I've taken Cody's doc and edited it:
>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> >> >>> > (I've made all my modifications trackable)
>>> >> >>> >
>>> >> >>> > There are a couple of high-level changes I made:
>>> >> >>> >
>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being that in voting there can easily be a "loser" that gets outvoted.
>>> >> >>> >
>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototypes implementations in PRs is not something that has not worked so far".
>>> >> >>> >
>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rxin@databricks.com> wrote:
>>> >> >>> >>
>>> >> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>> >> >>> >> <vanzin@cloudera.com>
>>> >> >>> >> wrote:
>>> >> >>> >>>
>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>>> >> >>> >>>
>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>>> >> >>> >>>
>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>> >> >>> >>> <cody@koeninger.org>
>>> >> >>> >>> wrote:
>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>>> >> >>> >>> >
>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >> >>> >>> >
>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>> >> >>> >>> >
>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaweda@outlook.com> wrote:
>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>> >> >>> >>> >>
>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with benchmarks was to show two things:
>>> >> >>> >>> >>
>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>> >> >>> >>> >>
>>> >> >>> >>> >> - how, in an easy way, we can change that and show that Spark is still on top
>>> >> >>> >>> >>
>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the chart "Spark vs Hadoop". It is important to show that the framework is not the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
>>> >> >>> >>> >>
>>> >> >>> >>> >> About real-time streaming, I think it would just be good to see it in Spark. I really like the current Spark model, but many voices say "we need more"; the community should also listen to them and try to help. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>>> >> >>> >>> >>
>>> >> >>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
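>>> >> >>> >>> >>
>>> >> >>> >>> >> As a minimal sketch of that unification (assuming Spark 2.x; the /data/events path and the two-field schema are made up for illustration), the same Dataset transformation serves batch and streaming, with only the read edge changing:
>>> >> >>> >>> >>
>>> >> >>> >>> >> import org.apache.spark.sql.SparkSession
>>> >> >>> >>> >> import org.apache.spark.sql.types._
>>> >> >>> >>> >>
>>> >> >>> >>> >> val spark = SparkSession.builder.appName("unified-sketch").getOrCreate()
>>> >> >>> >>> >> import spark.implicits._
>>> >> >>> >>> >>
>>> >> >>> >>> >> // Hypothetical event schema; streaming file sources need it up front.
>>> >> >>> >>> >> val schema = new StructType().add("userId", StringType).add("ts", TimestampType)
>>> >> >>> >>> >>
>>> >> >>> >>> >> // Batch: aggregate a static directory of JSON events.
>>> >> >>> >>> >> val batchCounts = spark.read.schema(schema).json("/data/events")
>>> >> >>> >>> >>   .groupBy($"userId").count()
>>> >> >>> >>> >>
>>> >> >>> >>> >> // Streaming: the identical transformation, reading the same source as a stream.
>>> >> >>> >>> >> val streamCounts = spark.readStream.schema(schema).json("/data/events")
>>> >> >>> >>> >>   .groupBy($"userId").count()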
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, so from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>>> >> >>> >>> >>
>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>> >> >>> >>> >>
>>> >> >>> >>> >> Tomasz
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> ________________________________
>>> >> >>> >>> >> From: Cody Koeninger <cody@koeninger.org>
>>> >> >>> >>> >> Sent: 17 October 2016 16:46
>>> >> >>> >>> >> To: Debasish Das
>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>> >> >>> >>> >>
>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>> >> >>> >>> >>
>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
>>> >> >>> >>> >>
>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.das83@gmail.com> wrote:
>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now, as we went deeper with Spark and the real-time streaming use-case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at....akka-streams close integration with Spark micro-batching APIs looks like a great direction to stay in the game with Apache Flink...Spark 2.0 integrated streaming with batch with the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
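>>> >> >>> >>> >>>
>>> >> >>> >>> >>> As a rough illustration of that question (a minimal sketch assuming Spark 2.x Structured Streaming; the localhost:9999 socket source is made up), a SQL-style aggregation can run in micro-batches triggered about once per second:
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> import org.apache.spark.sql.SparkSession
>>> >> >>> >>> >>> import org.apache.spark.sql.streaming.ProcessingTime
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> val spark = SparkSession.builder.appName("latency-sketch").getOrCreate()
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> // Lines arriving on a socket (hypothetical host/port).
>>> >> >>> >>> >>> val lines = spark.readStream.format("socket")
>>> >> >>> >>> >>>   .option("host", "localhost").option("port", 9999).load()
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> // A SQL-style aggregation, recomputed on each micro-batch.
>>> >> >>> >>> >>> val counts = lines.groupBy("value").count()
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> // Trigger a micro-batch roughly every second and print to the console.
>>> >> >>> >>> >>> val query = counts.writeStream.outputMode("complete").format("console")
>>> >> >>> >>> >>>   .trigger(ProcessingTime("1 second")).start()
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> query.awaitTermination()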
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
>>> >> >>> >>> >>>
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>> >> >>> >>> >>>
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch and batch...We (and I am sure many others) are pushing Spark as an engine for stream and query processing.....we need to make it a state-of-the-art engine for high speed streaming data and user queries as well!
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaweda@outlook.com> wrote:
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Hi everyone,
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on these negative posts about Spark and about "haters".
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to "fight" on the "framework market" to stay number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to join in contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at the StackOverflow "Flink vs ...." posts: almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMCs) just post the same information about real-time streaming, about delta iterations, etc. It looks smart, and very often it is marked as the answer, even if - in my opinion - the whole truth wasn't told.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying, you're the most visible in the community :) ) could perform a performance test of the following (a rough timing sketch follows the list):
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - streaming engine - probably Spark will lose because of the mini-batch model, however currently the difference should be much lower than in previous versions
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - Machine Learning models
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - batch jobs
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - Graph jobs
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - SQL queries
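>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> The rough timing sketch mentioned above (assuming Spark 2.x; the /data/lineitem Parquet table and its column name are made up, and a real benchmark would add warm-up runs and repeated measurements):
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> import org.apache.spark.sql.SparkSession
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> val spark = SparkSession.builder.appName("bench-sketch").getOrCreate()
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> // Hypothetical input; any Parquet table would do.
>>> >> >>> >>> >>>> spark.read.parquet("/data/lineitem").createOrReplaceTempView("lineitem")
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> // Crude wall-clock timer; fine for a sketch, not a rigorous benchmark.
>>> >> >>> >>> >>>> def time[T](label: String)(body: => T): T = {
>>> >> >>> >>> >>>>   val start = System.nanoTime()
>>> >> >>> >>> >>>>   val result = body
>>> >> >>> >>> >>>>   println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
>>> >> >>> >>> >>>>   result
>>> >> >>> >>> >>>> }
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> // count() forces full evaluation so the timing covers real work.
>>> >> >>> >>> >>>> time("sql aggregate") {
>>> >> >>> >>> >>>>   spark.sql("SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag").count()
>>> >> >>> >>> >>>> }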
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease-of-use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes and for every Spark developer.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too big latency". Spark Streaming is doing a very good job with micro-batches; however, I think it is possible to also add more real-time processing.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users, but that they really want to make Spark better for every user.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> What do you think about these two topics? Especially I'm looking at Cody (who started this topic) and the PMC members :)
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Tomasz
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>>
>>> >> >>> >>
>>> >> >>> >
>>> >> >>> >
>>> >> >
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> http://databricks.com/
>>
>
>
