spark-dev mailing list archives

From Reynold Xin <>
Subject Re: (Spark SQL) partition-scoped UDF
Date Sat, 05 Sep 2015 00:19:57 GMT
Can you say more about your transformer?

This is a good idea, and we are in fact already doing it for R (the latest
way to run UDFs in R is to pass each entire partition to the user's
function as a local R data frame). However, what works for R's simple data
processing might not work for your high-performance transformer, etc.

On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright <> wrote:

> Transformers in Spark ML typically operate on a per-row basis via
> callUDF. For a new transformer that I'm developing, I need to transform
> an entire partition with a single function, as opposed to transforming
> each row separately. The reason is that, in my case, rows must be
> transformed in batches to amortize some per-call overhead. How may I
> accomplish this?
> One option appears to be to invoke DataFrame::mapPartitions, yielding an
> RDD that is then converted back to a DataFrame. I'm unsure about the
> viability or consequences of that approach.
> Thanks!
> Eron Wright
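
The batching idea behind the question can be sketched outside Spark in plain Python: a per-partition function receives an iterator of rows, groups them into fixed-size batches, and pays any expensive setup once per batch rather than once per row. Everything here is illustrative — the batch size, the doubling stand-in for the real transformation, and the `transform_partition` name are assumptions, not part of any Spark API.

```python
from itertools import islice

def batched(rows, batch_size):
    """Group an iterator of rows into lists of at most batch_size items."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def transform_partition(rows, batch_size=3):
    """Hypothetical per-partition transform: amortizes a fixed
    per-call overhead over each batch instead of paying it per row."""
    for batch in batched(rows, batch_size):
        # ...expensive per-batch setup would happen once here...
        for row in batch:
            yield row * 2  # stand-in for the real transformation

if __name__ == "__main__":
    partition = range(7)  # stand-in for one partition's rows
    print(list(transform_partition(partition)))  # [0, 2, 4, 6, 8, 10, 12]
```

In Spark terms, a function of this shape is what one would hand to `mapPartitions` on the DataFrame's underlying RDD, converting the result back to a DataFrame afterwards — the route Eron describes, with the caveats about its viability still open.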
