spark-dev mailing list archives

From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: Revisiting Online serving of Spark models?
Date Tue, 05 Jun 2018 22:06:59 GMT
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, 3 Jun 2018 at 00:24 Holden Karau <holden@pigscanfly.ca> wrote:

> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
>> Hi!
>>
>> We're already in San Francisco waiting for the summit. We even think that
>> we spotted @holdenk this afternoon.
>>
> Unless you happened to be walking by my garage probably not super likely,
> spent the day working on scooters/motorcycles (my style is a little less
> unique in SF :)). Also if you see me feel free to say hi unless I look like
> I haven't had my first coffee of the day, love chatting with folks IRL :)
>
>>
>> @chris, we're really interested in the Meetup you're hosting. My team
>> will probably join it from the beginning if you have room for us, and I'll
>> join it later after discussing the topics on this thread. I'll send you an
>> email regarding this request.
>>
>> Thanks
>>
>> On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1969@hotmail.com>
>> wrote:
>>
>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>> folks
>>>
>>> @Felix I work in downtown Seattle, and am wondering if we should have a
>>> tech meetup around model serving in Spark at my work or elsewhere close,
>>> thoughts?  I’m actually in the midst of building microservices to manage
>>> models, and when I say models I mean much more than machine learning models
>>> (think OR, process models as well)
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 31, 2018, at 10:32 PM, Chris Fregly <chris@fregly.com> wrote:
>>>
>>> Hey everyone!
>>>
>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>> calendar event - mostly for me, so i don’t forget!  :)
>>>
>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>> TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>> @5:30pm on June 6th (same night) here in SF!
>>>
>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>> includes the signup link:
>>> https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/
>>>
>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>> ground.
>>>
>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>> posting the recording afterward.
>>>
>>> All details are in the meetup link above…
>>>
>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>> welcome to give a talk. I can move things around to make room.
>>>
>>> @joseph:  I’d personally like an update on the direction of the
>>> Databricks proprietary ML Serving export format which is similar to PMML
>>> but not a standard in any way.
>>>
>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>> customers.  This seems in conflict with the community efforts described
>>> here.  Can you comment on behalf of Databricks?
>>>
>>> Look forward to your response, joseph.
>>>
>>> See you all soon!
>>>
>>> —
>>>
>>>
>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>> Users)
>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>> Global Members)
>>>
>>>
>>>
>>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf
>>> *
>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>> <http://community.pipeline.ai/>*
>>>
>>>
>>> On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheung_m@hotmail.com>
>>> wrote:
>>>
>>> Hi!
>>>
>>> Thank you! Let’s meet then
>>>
>>> June 6 4pm
>>>
>>> Moscone West Convention Center
>>> 800 Howard Street, San Francisco, CA 94103
>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>>
>>> Ground floor (outside of conference area - should be available for all)
>>> - we will meet and decide where to go
>>>
>>> (Would not send invite because that would be too much noise for dev@)
>>>
>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>> post notes after and follow up online. As for Seattle, I would be very
>>> interested to meet in person later and discuss ;)
>>>
>>>
>>> _____________________________
>>> From: Saikat Kanjilal <sxk1969@hotmail.com>
>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Maximiliano Felice <maximilianofelice@gmail.com>
>>> Cc: Felix Cheung <felixcheung_m@hotmail.com>, Holden Karau <
>>> holden@pigscanfly.ca>, Joseph Bradley <joseph@databricks.com>, Leif
>>> Walsh <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>>>
>>>
>>> Would love to join but am in Seattle, thoughts on how to make this work?
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>> maximilianofelice@gmail.com> wrote:
>>>
>>> Big +1 to a meeting with fresh air.
>>>
>>> Could anyone send the invites? I don't really know which is the place
>>> Holden is talking about.
>>>
>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@hotmail.com>:
>>>
>>>> You had me at blue bottle!
>>>>
>>>> _____________________________
>>>> From: Holden Karau <holden@pigscanfly.ca>
>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Felix Cheung <felixcheung_m@hotmail.com>
>>>> Cc: Saikat Kanjilal <sxk1969@hotmail.com>, Maximiliano Felice <
>>>> maximilianofelice@gmail.com>, Joseph Bradley <joseph@databricks.com>,
>>>> Leif Walsh <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>>>>
>>>>
>>>>
>>>> I'm down for that, we could all go for a walk, maybe to the Mint Plaza
>>>> Blue Bottle, and grab coffee (if the weather holds, have our design
>>>> meeting outside :p)?
>>>>
>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
>>>> felixcheung_m@hotmail.com> wrote:
>>>>
>>>>> Bump.
>>>>>
>>>>> ------------------------------
>>>>> *From:* Felix Cheung <felixcheung_m@hotmail.com>
>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
>>>>> (near) the Summit?
>>>>>
>>>>> (I propose we meet at the venue entrance so we could accommodate
>>>>> people who might not be at the conference)
>>>>>
>>>>> ------------------------------
>>>>> *From:* Saikat Kanjilal <sxk1969@hotmail.com>
>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>> *To:* Maximiliano Felice
>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> I’m in the same exact boat as Maximiliano and have use cases as well
>>>>> for model serving and would love to join this discussion.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>>> maximilianofelice@gmail.com> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> I don't usually write a lot on this list but I keep up to date with
>>>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>>>> attention, as we're currently facing this issue at work. I'm attending
>>>>> the summit and was wondering if it would be possible for me to join that
>>>>> meeting. I might be able to share some helpful use cases and ideas.
>>>>>
>>>>> Thanks,
>>>>> Maximiliano Felice
>>>>>
>>>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I’m with you on JSON being more readable than Parquet, but we’ve had
>>>>>> success using pyarrow’s Parquet reader and have been quite happy with it
>>>>>> so far. If your target is Python (and probably if not now, then soon, R),
>>>>>> you should look into it.
>>>>>>
>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>>> * Our approach was to use the same format but write JSON instead of
>>>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>>>> format simplifies architecture.  Plus, some people want to check files
>>>>>>> into version control, and JSON is nice for that.
>>>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe,
>>>>>>> eventually, handle Parquet in the online serving setting).
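
The format-parameter idea above can be sketched roughly as follows. This is a minimal Python illustration and not Spark's API: ToyModelWriter, ToyModelReader, SUPPORTED_FORMATS, roundtrip, and the model.json file name are all hypothetical names made up for the example. Only the format(...).save(...) / format(...).load(...) call shape is borrowed from the DataFrame reader/writers mentioned in the thread.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Hypothetical registry of supported formats; Parquet could be added later.
SUPPORTED_FORMATS = {"json"}


def _check_format(source: str) -> str:
    """Reject formats this sketch does not know how to handle."""
    if source not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported model format: {source!r}")
    return source


class ToyModelWriter:
    """Hypothetical writer that takes a format parameter before saving."""

    def __init__(self, model_data: Dict[str, Any]) -> None:
        self._model_data = model_data
        self._format = "json"

    def format(self, source: str) -> "ToyModelWriter":
        self._format = _check_format(source)
        return self

    def save(self, path: str) -> None:
        target = Path(path)
        target.mkdir(parents=True, exist_ok=True)
        # Plain JSON: parseable without Spark and easy to diff in version control.
        with open(target / "model.json", "w", encoding="utf-8") as fh:
            json.dump(self._model_data, fh, indent=2, sort_keys=True)


class ToyModelReader:
    """Hypothetical reader mirroring the writer's format parameter."""

    def __init__(self) -> None:
        self._format = "json"

    def format(self, source: str) -> "ToyModelReader":
        self._format = _check_format(source)
        return self

    def load(self, path: str) -> Dict[str, Any]:
        with open(Path(path) / "model.json", encoding="utf-8") as fh:
            return json.load(fh)


def roundtrip(model_data: Dict[str, Any]) -> Dict[str, Any]:
    """Write model_data to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        ToyModelWriter(model_data).format("json").save(tmp)
        return ToyModelReader().format("json").load(tmp)
```

Calling roundtrip on a small dict of weights and an intercept returns an equal dict, and asking either class for an unregistered format such as "parquet" raises ValueError, which is where a later Parquet handler would plug in.
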
>>>>>>>
>>>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>>>> people are around at the Spark Summit, that could be a good time to
>>>>>>> meet up & then post notes back to the dev list.
>>>>>>>
>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>>
>>>>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>>>>> PipelineModel, and various ModelReader and SharedReadWrite
>>>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>>>
>>>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>>>> to discuss/get some feedback?
>>>>>>>>
>>>>>>>>
>>>>>>>> _____________________________
>>>>>>>> From: Felix Cheung <felixcheung_m@hotmail.com>
>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>> To: Holden Karau <holden@pigscanfly.ca>, Joseph Bradley <
>>>>>>>> joseph@databricks.com>
>>>>>>>> Cc: dev <dev@spark.apache.org>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Huge +1 on this!
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>> *From:* holden.karau@gmail.com <holden.karau@gmail.com> on behalf
>>>>>>>> of Holden Karau <holden@pigscanfly.ca>
>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>> *To:* Joseph Bradley
>>>>>>>> *Cc:* dev
>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>>>> Spark.
>>>>>>>>
>>>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>>>> versions of MLlib models which could be deployed without the big
>>>>>>>>> Spark JARs and without a SparkContext or SparkSession.  There are
>>>>>>>>> related commercial offerings like this : ) but the overhead of
>>>>>>>>> maintaining those offerings is pretty high.  Building good APIs
>>>>>>>>> within MLlib to avoid copying logic across libraries will be well
>>>>>>>>> worth it.
>>>>>>>>>
>>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>>> methods taking a Row to the current Models.  Instead, it would be
>>>>>>>>> ideal to have local, lightweight versions of models in mllib-local,
>>>>>>>>> outside of the main mllib package (for easier deployment with
>>>>>>>>> smaller & fewer dependencies).
>>>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal
>>>>>>>>> to utilize elements of Spark SQL, particularly Rows and Types, which
>>>>>>>>> could be moved into a local sql package.
>>>>>>>>> * This architecture may require some awkward APIs currently to
>>>>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in
>>>>>>>>> mllib.  We might find it helpful to break some DeveloperApis in
>>>>>>>>> Spark 3.0 to facilitate this architecture while making it feasible
>>>>>>>>> for 3rd party developers to extend MLlib APIs (especially in Java).
>>>>>>>>>
>>>>>>>> I agree this could be interesting, and feed into the other
>>>>>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>>>> mix in to avoid breaking the current APIs but I could be wrong on
>>>>>>>> that point.
>>>>>>>>
>>>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>>>> not be as important as per-Row transformations, but they would be
>>>>>>>>> helpful for batching for higher throughput.
>>>>>>>>>
>>>>>>>> That could be interesting as well.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>
>>>>>>>>> Joseph
>>>>>>>>>
>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@pigscanfly.ca>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi y'all,
>>>>>>>>>>
>>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>>>> good a time as any to revisit the online serving situation in Spark
>>>>>>>>>> ML. DB & others have done some excellent work moving a lot of the
>>>>>>>>>> necessary tools into a local linear algebra package that doesn't
>>>>>>>>>> depend on having a SparkContext.
>>>>>>>>>>
>>>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>>>> around this, but currently our individual transform/predict methods
>>>>>>>>>> are private so they either need to copy or re-implement (or put
>>>>>>>>>> themselves in org.apache.spark) to access them. How would folks feel
>>>>>>>>>> about adding a new trait for ML pipeline stages to expose to do
>>>>>>>>>> transformation of single element inputs (or local collections) that
>>>>>>>>>> could be optionally implemented by stages which support this? That
>>>>>>>>>> way we can have less copy and paste code possibly getting out of
>>>>>>>>>> sync with our model training.
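
As a rough illustration of the optional-trait idea above, here is a minimal Python sketch. It is not taken from Spark or from any proposal in this thread: LocalTransformer, transform_one, transform_local, ToyLinearModel, and supports_local are hypothetical names, and a plain dict stands in for a Row. The point is only that a stage opts in by mixing in the trait, and a serving tool can check for it without needing a SparkContext.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Sequence

# A plain dict stands in for a Row in this sketch (an assumption, not Spark's Row).
Record = Dict[str, Any]


class LocalTransformer(ABC):
    """Hypothetical opt-in trait for stages that can transform single elements."""

    @abstractmethod
    def transform_one(self, record: Record) -> Record:
        """Transform one input element without any cluster context."""

    def transform_local(self, records: Iterable[Record]) -> List[Record]:
        """Transform a local collection by applying transform_one to each element."""
        return [self.transform_one(record) for record in records]


class ToyLinearModel(LocalTransformer):
    """Toy stage that opts in; the prediction logic lives in one place."""

    def __init__(self, weights: Sequence[float], intercept: float) -> None:
        self.weights = list(weights)
        self.intercept = intercept

    def transform_one(self, record: Record) -> Record:
        features = record["features"]
        prediction = sum(w * x for w, x in zip(self.weights, features))
        # Return a new record so the caller's input is left untouched.
        return {**record, "prediction": prediction + self.intercept}


def supports_local(stage: object) -> bool:
    """A serving tool could use this check to decide whether a stage opted in."""
    return isinstance(stage, LocalTransformer)
```

Because the trait is optional, stages that cannot support single-element transformation simply do not mix it in, and existing APIs stay unchanged.
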
>>>>>>>>>>
>>>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>>>> projects is probably the right path forward (folks have different
>>>>>>>>>> needs), but I'd love to see us make it simpler for other projects to
>>>>>>>>>> build reliable serving tools.
>>>>>>>>>>
>>>>>>>>>> I realize this maybe puts some of the folks in an awkward
>>>>>>>>>> position with their own commercial offerings, but hopefully if we
>>>>>>>>>> make it easier for everyone the commercial vendors can benefit as
>>>>>>>>>> well.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Holden :)
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Joseph Bradley
>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>> Databricks, Inc.
>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Joseph Bradley
>>>>>>> Software Engineer - Machine Learning
>>>>>>> Databricks, Inc.
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> Leif
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
