spark-user mailing list archives

From "Mendelson, Assaf" <Assaf.Mendel...@rsa.com>
Subject RE: DataFrame select non-existing column
Date Sun, 20 Nov 2016 09:55:43 GMT
The issue is that you already have a struct called pass. What you did was add a new top-level
column named "pass.mobile" instead of adding the field to pass; the schema of the pass struct is
the same as before.
When you do select pass.mobile, it finds the pass structure and checks for mobile in it.

You can do it the other way around: name the column pass_mobile. Add it as before with
lit(0) for those dataframes that do not have the mobile field, and do something like
withColumn("pass_mobile", df["pass.mobile"]) for those that do.
Another option is to do something like df.select("pass.*") to flatten the pass structure
and work on that (then you can do withColumn("mobile",...) instead of "pass.mobile"), but this
would change the schema.
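
To decide which dataframes need the lit(0) default, you could check the schema first. Below is a minimal sketch in plain Python, assuming the schema is available as the nested dict that PySpark's df.schema.jsonValue() returns. The names has_nested_field, field and struct are hypothetical helpers made up for this illustration, not Spark API, and the sample schema simply mirrors the printSchema() output quoted further down.

```python
def has_nested_field(schema, path):
    """Return True if the dotted path resolves through nested structs.

    `schema` is assumed to look like df.schema.jsonValue():
    {"type": "struct", "fields": [{"name": ..., "type": ...}, ...]}
    """
    node = schema
    for part in path.split("."):
        # Only struct nodes have child fields to descend into.
        if not isinstance(node, dict) or node.get("type") != "struct":
            return False
        match = next((f for f in node["fields"] if f["name"] == part), None)
        if match is None:
            return False
        node = match["type"]
    return True


# Hypothetical helpers to build a schema dict by hand for the example.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def struct(*fields):
    return {"type": "struct", "fields": list(fields)}


# Same shape as the schema printed in the thread: "pass.mobile" is a
# top-level column whose name contains a dot, not a field inside pass.
schema = struct(
    field("pass", struct(
        field("auction", struct(field("id", "integer"))),
        field("geo", struct(field("postalCode", "string"))),
    )),
    field("pass.mobile", "long"),
)

print(has_nested_field(schema, "pass.mobile"))          # False
print(has_nested_field(schema, "pass.geo.postalCode"))  # True
```

With a real dataframe the call would presumably be has_nested_field(df.schema.jsonValue(), "pass.mobile"), and the result picks between lit(0) and df["pass.mobile"] when building pass_mobile.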


-----Original Message-----
From: Kristoffer Sjögren [mailto:stoffe@gmail.com] 
Sent: Saturday, November 19, 2016 4:57 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: DataFrame select non-existing column

Thanks. Here's my code example [1] and the printSchema() output [2].

This code still fails with the following message: "No such struct field mobile in auction,
geo"

By looking at the schema, it seems that pass.mobile did not get nested, which is the way it
needs to be for my use case. Are nested columns not supported by withColumn()?

[1]

DataFrame df = ctx.read().parquet(localPath).withColumn("pass.mobile", lit(0L));
df.printSchema();
df.select("pass.mobile"); // throws the AnalysisException quoted above

[2]

root
 |-- pass: struct (nullable = true)
 |    |-- auction: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |-- geo: struct (nullable = true)
 |    |    |-- postalCode: string (nullable = true)
 |-- pass.mobile: long (nullable = false)

On Sat, Nov 19, 2016 at 7:45 AM, Mendelson, Assaf <Assaf.Mendelson@rsa.com> wrote:
> In pyspark for example you would do something like:
>
> df.withColumn("newColName",pyspark.sql.functions.lit(None))
>
> Assaf.
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Friday, November 18, 2016 9:19 PM
> To: Mendelson, Assaf
> Cc: user
> Subject: Re: DataFrame select non-existing column
>
> Thanks for your answer. I have been searching the API for doing that but I could not
> find how to do it?
>
> Could you give me a code snippet?
>
> On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf <Assaf.Mendelson@rsa.com> wrote:
>> You can always add the columns to old dataframes, giving them null (or some literal),
>> as a preprocessing step.
>>
>> -----Original Message-----
>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>> Sent: Friday, November 18, 2016 4:32 PM
>> To: user
>> Subject: DataFrame select non-existing column
>>
>> Hi
>>
>> We have evolved a DataFrame by adding a few columns but cannot write select statements
>> on these columns for older data that doesn't have them, since they fail with an
>> AnalysisException with message "No such struct field".
>>
>> We also tried dropping columns but this doesn't work for nested columns.
>>
>> Any non-hacky ways to get around this?
>>
>> Cheers,
>> -Kristoffer
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>