spark-dev mailing list archives

From Eron Wright <ewri...@live.com>
Subject Re: (Spark SQL) partition-scoped UDF
Date Sat, 05 Sep 2015 20:55:34 GMT
The transformer is a classification model produced by the NeuralNetClassification estimator
of dl4j-spark-ml.  Source code here.  The neural net operates most efficiently when many examples
are classified in a single batch, so I imagine overriding `transform` rather than `predictRaw`.
Does anyone know of a solution compatible with Spark 1.4 or 1.5?
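
For concreteness, here is roughly what I have in mind, as a sketch only: `predictBatch` is a
hypothetical stand-in for a batched call into the dl4j network (assume a signature like
predictBatch(batch: Seq[Row]): Seq[Double]), and the batch size of 128 is arbitrary.

    import org.apache.spark.sql.{DataFrame, Row}

    // Sketch: batch rows inside transform instead of going through the
    // per-row predictRaw path. Assumes this lives in the model class,
    // where transformSchema appends the prediction column.
    override def transform(dataset: DataFrame): DataFrame = {
      val outputSchema = transformSchema(dataset.schema)
      val scored = dataset.mapPartitions { rows =>
        rows.grouped(128).flatMap { batch =>
          val scores = predictBatch(batch)  // hypothetical batched model call
          batch.zip(scores).map { case (row, s) => Row.fromSeq(row.toSeq :+ s) }
        }
      }
      dataset.sqlContext.createDataFrame(scored, outputSchema)
    }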

Thanks again!

From:  Reynold Xin
Date:  Friday, September 4, 2015 at 5:19 PM
To:  Eron Wright
Cc:  "dev@spark.apache.org"
Subject:  Re: (Spark SQL) partition-scoped UDF

Can you say more about your transformer?

This is a good idea, and indeed we are already doing it for R (the latest way to run UDFs
in R is to pass the entire partition to the user's function as a local R data frame). However,
what works for R's simple data processing might not work for a high-performance transformer
like yours.


On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright <ewright@live.com> wrote:
Transformers in Spark ML typically operate on a per-row basis, implemented with callUDF. For a new
transformer that I'm developing, I need to transform an entire partition with a function,
as opposed to transforming each row separately.   The reason is that, in my case, rows must
be transformed in batches so that some fixed per-call overhead can be amortized.   How may I accomplish this?

One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD that is then
converted back to a DataFrame.   I'm unsure about the viability or consequences of that approach.
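
Concretely, something like this minimal round trip under the 1.4/1.5 API (sketch only; `df` is
the input DataFrame, and the per-partition work here is just an identity pass):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}

    // DataFrame -> RDD[Row] via mapPartitions, then back to a DataFrame.
    // The schema does not survive the detour and must be re-attached.
    val rdd: RDD[Row] = df.mapPartitions { rows =>
      rows  // batched, per-partition transformation would go here
    }
    val df2: DataFrame = df.sqlContext.createDataFrame(rdd, df.schema)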

Thanks!
Eron Wright