We are heavy users of Spark, but only for data munging, not for training our models. We are now planning to use MLlib transformers for our offline transformations, and we want something that can easily apply the same transformations online.

We currently have the following plan:

- build online transformations, and an online pipeline that can perform them
- allow MLlib pipelines to be converted to these optimized online pipelines
- because we are not using MLlib models, adjust some of the transformers (like the StandardScaler) so that they take a list of columns to transform instead of a vector as input
- to reduce discrepancy between the online and the offline world, share the same code for online and offline transformations (but not for the fitting)
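
To make the idea concrete, here is a rough sketch in plain Python (deliberately without Spark) of what a column-based scaler and an online pipeline could look like. All names here (`ColumnScalerModel`, `fit_column_scaler`, `OnlinePipeline`) are hypothetical and not part of MLlib, and the use of the sample standard deviation is an assumption made for the example. The point is only that fitting happens offline, while the same `transform_row` code serves both the online and the offline path.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, float]


@dataclass(frozen=True)
class ColumnScalerModel:
    """Fitted parameters of a hypothetical column-based standard scaler."""
    means: Dict[str, float]
    stds: Dict[str, float]

    def transform_row(self, row: Row) -> Row:
        # Scale only the fitted columns; all other columns pass through untouched.
        out = dict(row)
        for col, mean in self.means.items():
            std = self.stds[col]
            # Assumption: constant columns (std == 0) are mapped to 0.0.
            out[col] = (row[col] - mean) / std if std > 0 else 0.0
        return out


def fit_column_scaler(rows: Sequence[Row], columns: List[str]) -> ColumnScalerModel:
    """Offline-only step: estimate mean and sample standard deviation per column."""
    n = len(rows)
    means = {c: sum(r[c] for r in rows) / n for c in columns}
    stds = {}
    for c in columns:
        if n > 1:
            var = sum((r[c] - means[c]) ** 2 for r in rows) / (n - 1)
            stds[c] = math.sqrt(var)
        else:
            stds[c] = 0.0
    return ColumnScalerModel(means=means, stds=stds)


class OnlinePipeline:
    """Chains fitted stages; each stage only needs a transform_row method."""

    def __init__(self, stages: Iterable[ColumnScalerModel]) -> None:
        self.stages = list(stages)

    def transform_row(self, row: Row) -> Row:
        # Online path: one request, one row.
        for stage in self.stages:
            row = stage.transform_row(row)
        return row

    def transform_batch(self, rows: Iterable[Row]) -> List[Row]:
        # Offline path: reuses exactly the same per-row code.
        return [self.transform_row(r) for r in rows]


training = [
    {"price": 10.0, "nights": 1.0},
    {"price": 20.0, "nights": 2.0},
    {"price": 30.0, "nights": 3.0},
]
model = fit_column_scaler(training, columns=["price"])
pipeline = OnlinePipeline([model])
scored = pipeline.transform_row({"price": 30.0, "nights": 3.0})
```

With this toy data the fitted mean of `price` is 20.0 and its sample standard deviation is 10.0, so `scored["price"]` is 1.0 while `nights` is left unchanged. In a real setup the fitting would be done by Spark and only the fitted parameters would be exported to the online pipeline.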

Are there already any plans in the pipeline to address any of the above ideas? And if not, would there be any interest in having this added to MLlib?


Brammert Ottens
Data Scientist

Booking.com Customer Service Holding B.V.
Herengracht 597 Amsterdam 1017 CE Netherlands
Direct +31207094515