spark-user mailing list archives

From Brammert Ottens <brammert.ott...@booking.com>
Subject Standard scaler on multiple columns without a vector
Date Thu, 26 Apr 2018 08:05:08 GMT
Hi,

We are heavy users of Spark, but only for data munging, not for training
our models. We are now planning to use MLlib transformers for our offline
transformations, and want something that can easily perform the same
transformations online.

We currently have the following plan:

- build online transformations, and an online pipeline that can run
them
- allow MLlib pipelines to be converted into these optimized online
pipelines
- because we are not using MLlib models, we want to adjust some of the
transformations (like the standard scaler) to take a list of columns to
transform as input, rather than a vector
- to reduce discrepancy between the online and the offline world, use the
same code for online and offline transformations (the transform step only,
not the fitting)
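
To make the last two points concrete, here is a minimal sketch in plain
Python. It is not MLlib code: the names `ColumnStats`, `fit_column_scaler`
and `scale_row` are hypothetical, rows are assumed to be plain dicts, and
the choices of sample standard deviation (n - 1) and of mapping
zero-variance columns to 0.0 are assumptions made for the illustration.
Fitting runs once offline over a batch; the per-row transform is the part
that could be shared between the online and offline paths.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Any, Dict, Iterable, List, Mapping


@dataclass(frozen=True)
class ColumnStats:
    """Fitted parameters for one column (hypothetical helper type)."""
    mean: float
    std: float


def fit_column_scaler(rows: Iterable[Mapping[str, Any]],
                      columns: List[str]) -> Dict[str, ColumnStats]:
    """Offline step: compute mean and sample std for each named column."""
    data = list(rows)
    n = len(data)
    if n == 0:
        raise ValueError("cannot fit a scaler on an empty dataset")
    stats: Dict[str, ColumnStats] = {}
    for col in columns:
        values = [float(r[col]) for r in data]
        mean = sum(values) / n
        # Assumption: sample variance (divide by n - 1); 0.0 for a single row.
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
        stats[col] = ColumnStats(mean=mean, std=sqrt(var))
    return stats


def scale_row(row: Mapping[str, Any],
              stats: Mapping[str, ColumnStats],
              with_mean: bool = True,
              with_std: bool = True) -> Dict[str, Any]:
    """Shared step: scale the fitted columns of one row, leave others as-is."""
    out = dict(row)
    for col, s in stats.items():
        x = float(row[col])
        if with_mean:
            x -= s.mean
        if with_std:
            # Assumption: a zero-variance column is mapped to 0.0.
            x = x / s.std if s.std != 0.0 else 0.0
        out[col] = x
    return out


# Fit offline on a small batch, then apply the same function to one record.
training = [
    {"price": 10, "nights": 1, "city": "AMS"},
    {"price": 20, "nights": 3, "city": "AMS"},
    {"price": 30, "nights": 5, "city": "AMS"},
]
fitted = fit_column_scaler(training, ["price", "nights"])
scaled = scale_row({"price": 30, "nights": 1, "city": "AMS"}, fitted)
```

Because `scale_row` only needs the fitted statistics and a single record,
the same function can be mapped over a batch offline or called per request
online, and columns outside the list pass through unchanged.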

Are there any plans in the pipeline already to tackle any of the above
ideas? And if not, would there be any interest in having this added to
MLlib?

Brammert


Brammert Ottens
Data Scientist

Booking.com Customer Service Holding B.V.
Herengracht 597 Amsterdam 1017 CE Netherlands
Direct +31207094515
