Transformers in Spark ML typically operate row by row, implemented via callUDF. For a new transformer that I'm developing, I need to transform an entire partition with a single function, rather than transforming each row separately. The reason is that, in my case, rows must be transformed in batches in order to amortize some per-invocation overhead. How can I accomplish this?
One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD that is then converted back to a DataFrame. I'm unsure about the viability or consequences of that approach.
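To make the question concrete, here is a minimal sketch of the approach I'm considering. The doubling logic and the output schema are placeholders standing in for my real batch transform; the comment marks where the amortized setup would go:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(1.0, 2.0, 3.0).toDF("x")

// Process each partition as one batch: materialize the rows, perform the
// (hypothetical) expensive setup once, then emit the transformed rows.
val transformedRdd = df.rdd.mapPartitions { rows =>
  val batch = rows.toSeq        // whole partition held in memory
  // ... one-time setup, amortized over the batch, would go here ...
  batch.iterator.map(r => Row(r.getDouble(0) * 2.0))  // placeholder work
}

// Convert the RDD[Row] back to a DataFrame with an explicit schema.
val outSchema = StructType(Seq(StructField("y", DoubleType)))
val result: DataFrame = spark.createDataFrame(transformedRdd, outSchema)
```

Two things give me pause: materializing a whole partition assumes each partition fits in executor memory, and dropping from the DataFrame API to the RDD API presumably forfeits Catalyst optimizations across that boundary. Is this still the recommended pattern inside a custom Transformer's `transform` method, or is there a better-supported alternative?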