spark-dev mailing list archives

From Eron Wright <>
Subject RE: (Spark SQL) partition-scoped UDF
Date Wed, 09 Sep 2015 16:30:29 GMT
Follow-up: I solved this problem by overriding the model's `transform` method and using `mapPartitions` to produce a new DataFrame, rather than using `udf`.
Source code:
Thanks Reynold for your time.
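For reference, the batching pattern at the heart of that `mapPartitions` override can be sketched in plain Scala. Inside `DataFrame.mapPartitions` the partition arrives as an `Iterator[Row]`; the same idea applies there. Spark itself is omitted from this sketch, and `scoreBatch` is a hypothetical stand-in for the model's batch inference call:

```scala
object BatchTransform {
  // Hypothetical batch scorer: classifies a whole batch in one call,
  // standing in for the neural net's batch inference. Here it just
  // doubles each value so the sketch is self-contained.
  def scoreBatch(batch: Seq[Double]): Seq[Double] = batch.map(_ * 2.0)

  // Lazily group the partition iterator into fixed-size batches and
  // flatten the scored results, so only one batch is materialized at a
  // time. This amortizes per-call overhead across batchSize rows.
  def transformPartition(rows: Iterator[Double], batchSize: Int): Iterator[Double] =
    rows.grouped(batchSize).flatMap(batch => scoreBatch(batch))

  def main(args: Array[String]): Unit = {
    val out = transformPartition(Iterator(1.0, 2.0, 3.0), 2).toList
    println(out) // List(2.0, 4.0, 6.0)
  }
}
```

Because `grouped` and `flatMap` are both lazy on iterators, the partition is never fully materialized; rows stream through in batches of `batchSize`.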
Date: Sat, 5 Sep 2015 13:55:34 -0700
Subject: Re: (Spark SQL) partition-scoped UDF

The transformer is a classification model produced by the NeuralNetClassification estimator
of dl4j-spark-ml.  Source code here.  The neural net operates most efficiently when many examples
are classified in batch.  I imagine overriding `transform` rather than `predictRaw`.   Does
anyone know of a solution compatible with Spark 1.4 or 1.5?
Thanks again!
From:  Reynold Xin
Date:  Friday, September 4, 2015 at 5:19 PM
To:  Eron Wright
Cc:  ""
Subject:  Re: (Spark SQL) partition-scoped UDF

Can you say more about your transformer?
This is a good idea, and indeed we are doing it for R already (the latest way to run UDFs in R is to pass the entire partition as a local R data frame for users to operate on). However, what works for R for simple data processing might not work for your high-performance transformer.

On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright <> wrote:
Transformers in Spark ML typically operate on a per-row basis, based on callUDF. For a new
transformer that I'm developing, I have a need to transform an entire partition with a function,
as opposed to transforming each row separately.   The reason is that, in my case, rows must be transformed in batches to amortize some per-call overhead.   How may I accomplish this?
One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD that is then
converted back to a DataFrame.   Unsure about the viability or consequences of that.
Thanks!
Eron Wright