spark-dev mailing list archives

From Denny Lee <denny.g....@gmail.com>
Subject Re: Revisiting Online serving of Spark models?
Date Wed, 30 May 2018 17:30:14 GMT
I most likely will not be able to join SF next week but definitely up for a
session after Summit in Seattle to dive further into this, eh?!

On Wed, May 30, 2018 at 9:32 AM Felix Cheung <felixcheung_m@hotmail.com>
wrote:

> Hi!
>
> Thank you! Let’s meet then
>
> June 6 4pm
>
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>
> Ground floor (outside of conference area - should be available for all) -
> we will meet and decide where to go
>
> (Would not send invite because that would be too much noise for dev@)
>
> To paraphrase Joseph, we will use this to kick off the discussion and
> post notes after and follow up online. As for Seattle, I would be very
> interested to meet in person later and discuss ;)
>
>
> _____________________________
> From: Saikat Kanjilal <sxk1969@hotmail.com>
> Sent: Tuesday, May 29, 2018 11:46 AM
>
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice <maximilianofelice@gmail.com>
> Cc: Felix Cheung <felixcheung_m@hotmail.com>, Holden Karau <
> holden@pigscanfly.ca>, Joseph Bradley <joseph@databricks.com>, Leif Walsh
> <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>
>
>
> Would love to join but am in Seattle, thoughts on how to make this work?
>
> Regards
>
> Sent from my iPhone
>
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
> Big +1 to a meeting with fresh air.
>
> Could anyone send the invites? I don't really know which is the place
> Holden is talking about.
>
> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@hotmail.com>:
>
>> You had me at blue bottle!
>>
>> _____________________________
>> From: Holden Karau <holden@pigscanfly.ca>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung <felixcheung_m@hotmail.com>
>> Cc: Saikat Kanjilal <sxk1969@hotmail.com>, Maximiliano Felice <
>> maximilianofelice@gmail.com>, Joseph Bradley <joseph@databricks.com>,
>> Leif Walsh <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>>
>>
>>
>> I'm down for that, we could all go for a walk maybe to the Mint Plaza
>> Blue Bottle and grab coffee (if the weather holds have our design meeting
>> outside :p)?
>>
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@hotmail.com>
>> wrote:
>>
>>> Bump.
>>>
>>> ------------------------------
>>> *From:* Felix Cheung <felixcheung_m@hotmail.com>
>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>>> the Summit?
>>>
>>> (I propose we meet at the venue entrance so we could accommodate people
>>> who might not be at the conference)
>>>
>>> ------------------------------
>>> *From:* Saikat Kanjilal <sxk1969@hotmail.com>
>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>> *To:* Maximiliano Felice
>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> I’m in the same exact boat as Maximiliano and have use cases as well for
>>> model serving and would love to join this discussion.
>>>
>>> Sent from my iPhone
>>>
>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>> maximilianofelice@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> I don't usually write a lot on this list but I keep up to date with
>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>> attention, as we're currently facing this issue at work. I'm attending
>>> the summit and was wondering if it would be possible for me to join that
>>> meeting. I might be able to share some helpful use cases and ideas.
>>>
>>> Thanks,
>>> Maximiliano Felice
>>>
>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@gmail.com>
>>> wrote:
>>>
>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>> far. If your target is python (and probably if not now, then soon, R), you
>>>> should look into it.
>>>>
>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@databricks.com>
>>>> wrote:
>>>>
>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>> * Our approach was to use the same format but write JSON instead of
>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>> format simplifies architecture.  Plus, some people want to check files into
>>>>> version control, and JSON is nice for that.
>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>>> handle Parquet in the online serving setting).
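
A minimal sketch of the JSON idea above, not taken from the thread: it assumes a hypothetical model file holding a class label, coefficients, and an intercept, and shows that such a file can be written, diffed in version control, and loaded for prediction with only the Python standard library. The layout and the function names (save_model, load_model, predict_probability) are illustrative assumptions, not Spark's actual persistence format or API.

```python
import json
import math


def save_model(path, coefficients, intercept):
    """Write model parameters as JSON (hypothetical layout, not Spark's)."""
    doc = {
        "class": "LogisticRegressionModel",  # illustrative label only
        "coefficients": list(coefficients),
        "intercept": intercept,
    }
    with open(path, "w", encoding="utf-8") as f:
        # Sorted keys and indentation keep diffs stable under version control.
        json.dump(doc, f, indent=2, sort_keys=True)


def load_model(path):
    """Read the parameters back; no Spark runtime is involved."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def predict_probability(model, features):
    """Score one feature vector with a plain logistic function."""
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["coefficients"], features)
    )
    return 1.0 / (1.0 + math.exp(-margin))
```

A serving process would call load_model once at startup and predict_probability per request; a format parameter on the reader/writer, as suggested above, would decide whether this JSON path or Parquet is used.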
>>>>>
>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>> people are around at the Spark Summit, that could be a good time to meet up
>>>>> & then post notes back to the dev list.
>>>>>
>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>
>>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>>>>>> that rely on SparkContext. This is a big blocker on reusing trained models
>>>>>> outside of Spark for online serving.
>>>>>>
>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>> to discuss/get some feedback?
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Felix Cheung <felixcheung_m@hotmail.com>
>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Holden Karau <holden@pigscanfly.ca>, Joseph Bradley <
>>>>>> joseph@databricks.com>
>>>>>> Cc: dev <dev@spark.apache.org>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Huge +1 on this!
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* holden.karau@gmail.com <holden.karau@gmail.com> on behalf of
>>>>>> Holden Karau <holden@pigscanfly.ca>
>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>> *To:* Joseph Bradley
>>>>>> *Cc:* dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>> joseph@databricks.com> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>>
>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>> Spark.
>>>>>>
>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>>>>> libraries will be well worth it.
>>>>>>>
>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>>>>> moved into a local sql package.
>>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>>>> MLlib APIs (especially in Java).
>>>>>>>
>>>>>> I agree this could be interesting, and feed into the other discussion
>>>>>> around when (or if) we should be considering Spark 3.0.
>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>>> point.
>>>>>>
>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>> not be as important as per-Row transformations, but they would be helpful
>>>>>>> for batching for higher throughput.
>>>>>>>
>>>>>> That could be interesting as well.
>>>>>>
>>>>>>>
>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi y'all,
>>>>>>>>
>>>>>>>> With the renewed interest in ML in Apache Spark now seems like as
>>>>>>>> good a time as any to revisit the online serving situation in Spark ML. DB
>>>>>>>> & others have done some excellent work moving a lot of the necessary
>>>>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>>>>> SparkContext.
>>>>>>>>
>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>>> private so they either need to copy or re-implement (or put themselves in
>>>>>>>> org.apache.spark) to access them. How would folks feel about adding a new
>>>>>>>> trait for ML pipeline stages to expose to do transformation of single
>>>>>>>> element inputs (or local collections) that could be optionally implemented
>>>>>>>> by stages which support this? That way we can have less copy and paste code
>>>>>>>> possibly getting out of sync with our model training.
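
A rough sketch of the optional-trait idea above, not taken from the thread and written in Python rather than Scala: stages that can transform a single element opt in by implementing one extra method, and a local serving loop refuses any stage that has not. Every name here (LocalTransformer, transform_one, Scaler, ClusterOnlyStage, serve_locally) is hypothetical and does not correspond to an existing Spark API.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable

Record = Dict[str, Any]


@runtime_checkable
class LocalTransformer(Protocol):
    """Optional capability: transform one record without a cluster."""

    def transform_one(self, record: Record) -> Record:
        ...


class Scaler:
    """Example stage that opts in by defining transform_one."""

    def __init__(self, column: str, factor: float) -> None:
        self.column = column
        self.factor = factor

    def transform_one(self, record: Record) -> Record:
        out = dict(record)  # copy so the caller's record is not mutated
        out[self.column] = record[self.column] * self.factor
        return out


class ClusterOnlyStage:
    """Example stage that does not support single-element transformation."""


def serve_locally(stages: List[Any], record: Record) -> Record:
    """Run a record through every stage, failing fast on unsupported ones."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(f"{type(stage).__name__} has no transform_one")
        record = stage.transform_one(record)
    return record
```

Because the capability is opt-in, existing stages are left unchanged, and a serving project can check up front whether a whole pipeline is servable instead of copying private prediction logic.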
>>>>>>>>
>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>> projects is probably the right path forward (folks have different needs),
>>>>>>>> but I'd love to see us make it simpler for other projects to build reliable
>>>>>>>> serving tools.
>>>>>>>>
>>>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>>>> with their own commercial offerings, but hopefully if we make it easier for
>>>>>>>> everyone the commercial vendors can benefit as well.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden :)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Joseph Bradley
>>>>>>>
>>>>>>> Software Engineer - Machine Learning
>>>>>>>
>>>>>>> Databricks, Inc.
>>>>>>>
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>> --
>>>> --
>>>> Cheers,
>>>> Leif
>>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
>
