spark-user mailing list archives

From bis_g <>
Subject Pyspark Join and then column select is showing unexpected output
Date Thu, 07 Jun 2018 01:58:04 GMT
I am not sure if the long hours of work are doing this to me, but I am seeing
some unexpected behavior in Spark 2.2.0.

I have created a toy example as below

toy_df = spark.createDataFrame([
I then create another DataFrame:

mdf = toy_df.filter(toy_df.drug == 'c')
As you would expect, mdf contains:
|     p1|   c|
Now if I do this

Surprisingly, I get

| P1| D1|patient|drug|
| p2|  a|     p2|   a|
| p2|  b|     p2|   b|
| p2|  d|     p2|   d|
| p1|  a|     p1|   a|
| p1|  b|     p1|   b|
| p1|  c|     p1|   c|
But if I use

I do see the expected behavior:

|     p2|   a|null|
|     p2|   b|null|
|     p2|   d|null|
|     p1|   a|   c|
|     p1|   b|   c|
|     p1|   c|   c|
And if I use an alias expression on one of the DataFrames, I also get the
expected behavior:


| P1| D1|drug|
| p2|  a|null|
| p2|  b|null|
| p2|  d|null|
| p1|  a|   c|
| p1|  b|   c|
| p1|  c|   c|
So my question is: what is the best way to select columns after a join, and is
this behavior normal?
