spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ofer Mendelevitch (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-10892) Join with Data Frame returns wrong results
Date Thu, 01 Oct 2015 00:19:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ofer Mendelevitch updated SPARK-10892:
--------------------------------------
    Attachment: data.json

Data file to run with the code, for reproducing.

> Join with Data Frame returns wrong results
> ------------------------------------------
>
>                 Key: SPARK-10892
>                 URL: https://issues.apache.org/jira/browse/SPARK-10892
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Ofer Mendelevitch
>            Priority: Critical
>         Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with the same
name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but the two other
columns seem to have a copy of the values from the first column.
> Here's the sample code:
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>           .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
>             prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
>             tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> The output is:
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  PRCP|   10|USW00023272|    0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|    0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|    0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|    0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|    0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|    0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|    0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|    0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|    0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|    0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|    3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|    0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|    0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|    0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|    0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|    0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|    0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|    0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|    0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|    0|2012|
> +--------+---+------+-----+-----------+-----+——+
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> +--------+---+------+-----+-----------+-----+——+
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|   10|USW00023272|  200|2012|
> |20121007|  7|  TMAX|   10|USW00023272|  167|2012|
> |20121008|  8|  TMAX|   10|USW00023272|  183|2012|
> |20121009|  9|  TMAX|   10|USW00023272|  194|2012|
> |20121010| 10|  TMAX|   10|USW00023272|  183|2012|
> |20121011| 11|  TMAX|   10|USW00023272|  139|2012|
> |20121012| 12|  TMAX|   10|USW00023272|  161|2012|
> |20121013| 13|  TMAX|   10|USW00023272|  211|2012|
> |20121014| 14|  TMAX|   10|USW00023272|  189|2012|
> |20121015| 15|  TMAX|   10|USW00023272|  233|2012|
> |20121016| 16|  TMAX|   10|USW00023272|  211|2012|
> |20121017| 17|  TMAX|   10|USW00023272|  278|2012|
> |20121018| 18|  TMAX|   10|USW00023272|  294|2012|
> |20121019| 19|  TMAX|   10|USW00023272|  194|2012|
> |20121020| 20|  TMAX|   10|USW00023272|  183|2012|
> +--------+---+------+-----+-----------+-----+——+
> And the join output is:
> +----+-----+---+--------+----+----+----+
> |year|month|day|date_str|PRCP|TMIN|TMAX|
> +----+-----+---+--------+----+----+----+
> |2012|   10|  1|20121001|   0|   0|   0|
> |2012|   10|  2|20121002|   0|   0|   0|
> |2012|   10|  3|20121003|   0|   0|   0|
> |2012|   10|  4|20121004|   0|   0|   0|
> |2012|   10|  5|20121005|   0|   0|   0|
> |2012|   10|  6|20121006|   0|   0|   0|
> |2012|   10|  7|20121007|   0|   0|   0|
> |2012|   10|  8|20121008|   0|   0|   0|
> |2012|   10|  9|20121009|   0|   0|   0|
> |2012|   10| 10|20121010|   0|   0|   0|
> |2012|   10| 11|20121011|   3|   3|   3|
> |2012|   10| 12|20121012|   0|   0|   0|
> |2012|   10| 13|20121013|   0|   0|   0|
> |2012|   10| 14|20121014|   0|   0|   0|
> |2012|   10| 15|20121015|   0|   0|   0|
> |2012|   10| 16|20121016|   0|   0|   0|
> |2012|   10| 17|20121017|   0|   0|   0|
> |2012|   10| 18|20121018|   0|   0|   0|
> |2012|   10| 19|20121019|   0|   0|   0|
> |2012|   10| 20|20121020|   0|   0|   0|
> +----+-----+---+--------+----+----+——+
> Attachment:
> - data.json file that is read from HDFS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message