spark-user mailing list archives

From Subhash Sriram <subhash.sri...@gmail.com>
Subject Re: question on transforms for spark 2.0 dataset
Date Wed, 01 Mar 2017 17:20:06 GMT
If I am understanding your problem correctly, I think you can just create a
new DataFrame that is a transformation of sample_data by first registering
sample_data as a temp table.

//Register temp table
sample_data.createOrReplaceTempView("sql_sample_data")

//Create new DataFrame with transformed values
val transformed = spark.sql("select trim(field1) as field1, trim(field2) as field2, ... from sql_sample_data")

//Test
transformed.show(10)
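
Equivalently, if you would rather stay in the DataFrame API than go through
SQL, something along these lines should work too (just a sketch, assuming
field1 through field4 are string columns; trim and col here are the built-in
functions from org.apache.spark.sql.functions):

//Sketch: the same transformation via the DataFrame API
import org.apache.spark.sql.functions.{col, trim}

val transformedDf = Seq("field1", "field2", "field3", "field4")
  .foldLeft(sample_data)((df, c) => df.withColumn(c, trim(col(c))))

//Test
transformedDf.show(10)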

I hope that helps!
Subhash


On Wed, Mar 1, 2017 at 12:04 PM, Marco Mistroni <mmistroni@gmail.com> wrote:

> Hi, I think you need a UDF if you want to transform a column....
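> For example, a rough UDF sketch (just an illustration, assuming the
> column is a string; for a plain trim, the built-in trim function in
> org.apache.spark.sql.functions would also work without a UDF):
>
> //Hypothetical UDF that trims a string column, guarding against nulls
> import org.apache.spark.sql.functions.{col, udf}
> val trimUdf = udf((s: String) => if (s == null) null else s.trim)
> val trimmed = sample_data.withColumn("field1", trimUdf(col("field1")))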
> Hth
>
> On 1 Mar 2017 4:22 pm, "Bill Schwanitz" <bilsch@bilsch.org> wrote:
>
>> Hi all,
>>
>> I'm fairly new to Spark and Scala, so bear with me.
>>
>> I'm working with a dataset containing a set of columns/fields. The data
>> is stored in HDFS as Parquet and is sourced from a Postgres box, so the
>> fields and values are reasonably well formed. We are in the process of
>> trying out a switch from Pentaho and various SQL databases to pulling data
>> into HDFS and applying transforms / building new datasets, with the
>> processing done in Spark (and other tools, under evaluation).
>>
>> A rough version of the code I'm running so far:
>>
>> val sample_data = spark.read.parquet("my_data_input")
>>
>> val example_row = spark.sql("select * from parquet.my_data_input where id
>> = 123").head
>>
>> I want to apply a trim operation to a set of fields - let's call them
>> field1, field2, field3 and field4.
>>
>> What is the best way to go about applying those trims and creating a new
>> dataset? Can I apply the trim to all fields in a single map? Or do I need
>> to apply multiple map functions?
>>
>> When I try the map (even with a single field):
>>
>> scala> val transformed_data = sample_data.map(
>>      |   _.trim(col("field1"))
>>      |   .trim(col("field2"))
>>      |   .trim(col("field3"))
>>      |   .trim(col("field4"))
>>      | )
>>
>> I end up with the following error:
>>
>> <console>:26: error: value trim is not a member of
>> org.apache.spark.sql.Row
>>          _.trim(col("field1"))
>>            ^
>>
>> Any ideas / guidance would be appreciated!
>>
>
