spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: unsure how to create 2 outputs from spark-sql udf expression
Date Fri, 27 May 2016 03:30:15 GMT
Couldn't you do it like this?

--
val func = udf((i: Int) => Tuple2(i, i))
val df = Seq((1, ..., 0), (2, ..., 5)).toDF("input", "c0", "c1", ....other needed columns...., "cX")
// apply the udf to the "input" column, keep the other columns, then unpack the tuple
df.select(func($"input").as("r"), $"c0", $"c1", ....$"cX")
  .select($"r._1", $"r._2", $"c0", $"c1", ....$"cX")
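
If listing every column by hand is impractical, the same idea can be sketched by building the column list from df.columns. This is only a rough sketch under assumptions: it uses SparkSession (Spark 2.x or later), and the names TwoOutputsFromUdf, addOutputs, "x" and "y" are made up for illustration. I have not checked whether the optimizer evaluates the udf only once here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper object, not part of any Spark API.
object TwoOutputsFromUdf {
  // Example udf returning a tuple; Spark exposes it as a struct with fields _1 and _2.
  val func = udf((i: Int) => (i, i * 2))

  // Adds columns "x" and "y" (illustrative names) computed from inputCol,
  // keeping every column that df already has.
  def addOutputs(df: DataFrame, inputCol: String): DataFrame = {
    val original = df.columns.map(col)
    df.select((original :+ func(col(inputCol)).as("r")): _*)
      .select((original ++ Seq(col("r._1").as("x"), col("r._2").as("y"))): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("two-outputs").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 0), (2, 5)).toDF("a", "b")
    addOutputs(df, "a").show()

    spark.stop()
  }
}
```

The intermediate struct column "r" never appears in the result, so no explicit drop is needed.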

// maropu


On Fri, May 27, 2016 at 12:15 PM, Koert Kuipers <koert@tresata.com> wrote:

> yes, but i also need all the columns (plus of course the 2 new ones) in my
> output. your select operation drops all the input columns.
> best, koert
>
> On Thu, May 26, 2016 at 11:02 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
> wrote:
>
>> Couldn't you include all the needed columns in your input dataframe?
>>
>> // maropu
>>
>> On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> that is nice and compact, but it does not add the columns to an existing
>>> dataframe
>>>
>>> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro <
>>> linguin.m.s@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> How about this?
>>>> --
>>>> val func = udf((i: Int) => Tuple2(i, i))
>>>> val df = Seq((1, 0), (2, 5)).toDF("a", "b")
>>>> df.select(func($"a").as("r")).select($"r._1", $"r._2")
>>>>
>>>> // maropu
>>>>
>>>>
>>>> On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers <koert@tresata.com>
>>>> wrote:
>>>>
>>>>> hello all,
>>>>>
>>>>> i have a single udf that creates 2 outputs (so a tuple 2). i would
>>>>> like to add these 2 columns to my dataframe.
>>>>>
>>>>> my current solution is along these lines:
>>>>> df
>>>>>   .withColumn("_temp_", udf(inputColumns))
>>>>>   .withColumn("x", col("_temp_")("_1"))
>>>>>   .withColumn("y", col("_temp_")("_2"))
>>>>>   .drop("_temp_")
>>>>>
>>>>> this works, but it's not pretty with the temporary field stuff.
>>>>>
>>>>> i also tried this:
>>>>> val tmp = udf(inputColumns)
>>>>> df
>>>>>   .withColumn("x", tmp("_1"))
>>>>>   .withColumn("y", tmp("_2"))
>>>>>
>>>>> this also works, but unfortunately the udf is evaluated twice
>>>>>
>>>>> is there a better way to do this?
>>>>>
>>>>> thanks! koert
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro
