spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eron Wright <>
Subject (Spark SQL) partition-scoped UDF
Date Fri, 04 Sep 2015 17:08:48 GMT
Transformers in Spark ML typically operate on a per-row basis, based on callUDF. For a new
transformer that I'm developing, I have a need to transform an entire partition with a function,
as opposed to transforming each row separately.   The reason is that, in my case, rows must
be transformed in batch for efficiency to amortize some overhead.   How may I accomplish this?
One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD that is then
converted back to a DataFrame.   Unsure about the viability or consequences of that.
Thanks!Eron Wright 		 	   		  
View raw message