From Koert Kuipers <>
Subject Re: benefits of code gen
Date Fri, 10 Feb 2017 21:32:55 GMT
based on that i take it that math functions would be primary beneficiaries
since they work on primitives.

so if i take UnaryMathExpression as an example, would i not get the same
benefit if i change it to this?

abstract class UnaryMathExpression(val f: Double => Double, name: String)
  extends UnaryExpression with Serializable with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType
  override def nullable: Boolean = true
  override def toString: String = s"$name($child)"
  override def prettyName: String = name

  protected override def nullSafeEval(input: Any): Any = {
    f(input.asInstanceOf[Double])
  }

  // name of function in java.lang.Math
  def funcName: String = name.toLowerCase

  def function(d: Double): Double = f(d)

  // the generated java calls back into this instance instead of java.lang.Math
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val self = ctx.addReferenceObj(name, this, getClass.getName)
    defineCodeGen(ctx, ev, c => s"$self.function($c)")
  }
}

admittedly in this case the benefit in terms of removing complex codegen is
not there (the codegen was only one line), but if i can remove codegen here
i could also remove it in lots of other places where it does get very
unwieldy simply by replacing it with calls to methods.

Function1 is specialized, so i think (or hope) that my version does no
extra boxing/unboxing.
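
a minimal sketch of that specialization check, with no spark dependency. MathFn and SpecializationCheck are hypothetical names made up for illustration; MathFn only mirrors the function(d: Double) wrapper above. it confirms that scala.Function1 declares the primitive entry point apply$mcDD$sp for Double => Double, which is the method a statically typed call f(d) should compile to. it does not prove that nothing is boxed on the real codegen path; for that the bytecode would have to be inspected, e.g. with javap -c.

```scala
// hypothetical stand-in for the proposed UnaryMathExpression (assumption: not spark code)
object SpecializationCheck {

  // mirrors `def function(d: Double): Double = f(d)` from the class above
  final class MathFn(val f: Double => Double) {
    // primitive in, primitive out: generated java can call this without boxing
    def function(d: Double): Double = f(d)
  }

  // true if scala.Function1 exposes the Double => Double specialized method
  def hasSpecializedApply: Boolean =
    classOf[Function1[_, _]].getMethods.exists(_.getName == "apply$mcDD$sp")

  def main(args: Array[String]): Unit = {
    val sqrt = new MathFn((d: Double) => math.sqrt(d))
    println(sqrt.function(16.0))
    println(hasSpecializedApply)
  }
}
```

the wrapper method matters because the generated java then only sees a plain double-to-double method; whether the call inside it stays unboxed is up to scalac, not to the codegen.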

On Fri, Feb 10, 2017 at 2:24 PM, Reynold Xin <> wrote:

> With complex types it doesn't work as well, but for primitive types the
> biggest benefit of whole stage codegen is that we don't even need to put
> the intermediate data into rows or columns anymore. They are just variables
> (stored in CPU registers).
>
> On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers <> wrote:
>
>> so i have been looking for a while now at all the catalyst expressions,
>> and all the relatively complex codegen going on.
>>
>> so first off i get the benefit of codegen to turn a bunch of chained
>> iterator transformations into a single codegen stage for spark. that makes
>> sense to me, because it avoids a bunch of overhead.
>>
>> but what i am not so sure about is what the benefit is of converting the
>> actual stuff that happens inside the iterator transformations into codegen.
>>
>> say if we have an expression that has 2 children and creates a struct for
>> them. why would this be faster in codegen by re-creating the code to do
>> this in a string (which is complex and error prone) compared to simply having
>> the codegen call the normal method for this in my class?
>>
>> i see so much trivial code being re-created in codegen. stuff like this:
>>
>>   private[this] def castToDateCode(
>>       from: DataType,
>>       ctx: CodegenContext): CastFunction = from match {
>>     case StringType =>
>>       val intOpt = ctx.freshName("intOpt")
>>       (c, evPrim, evNull) => s"""
>>         scala.Option<Integer> $intOpt =
>>           org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
>>         if ($intOpt.isDefined()) {
>>           $evPrim = ((Integer) $intOpt.get()).intValue();
>>         } else {
>>           $evNull = true;
>>         }
>>        """
>>
>> is this really faster than simply calling an equivalent function from
>> the codegen, and keeping the codegen logic restricted to the "unrolling" of
>> chained iterators?

