spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashish Shrowty (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
Date Thu, 13 Oct 2016 23:21:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573512#comment-15573512
] 

Ashish Shrowty commented on SPARK-17709:
----------------------------------------

[~smilegator] I compiled with the added debug information and here is the output -

{code:java}
scala> val d1 = spark.sql("select * from testext2")
d1: org.apache.spark.sql.DataFrame = [productid: int, price: float ... 2 more fields]

scala> val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price"))
df1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 more field]

scala> val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df2: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 more field]

scala> df1.join(df2, Seq("companyid", "productid")).show
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] can not be resolved
given input columns: [companyid, productid, price, count] ;;
'Join UsingJoin(Inner,List('companyid, 'productid))
:- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, sum(cast(price#123
as double)) AS price#166]
:  +- Project [productid#122, price#123, count#124, companyid#121]
:     +- SubqueryAlias testext2
:        +- Relation[productid#122,price#123,count#124,companyid#121] parquet
+- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, sum(cast(count#124
as bigint)) AS count#177L]
   +- Project [productid#122, price#123, count#124, companyid#121]
      +- SubqueryAlias testext2
         +- Relation[productid#122,price#123,count#124,companyid#121] parquet

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided

{code}

> spark 2.0 join - column resolution error
> ----------------------------------------
>
>                 Key: SPARK-17709
>                 URL: https://issues.apache.org/jira/browse/SPARK-17709
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ashish Shrowty
>              Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial dataframe
that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from <hivetable>")  
> val df1 = d1.groupBy("key1","key2")
>           .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>           .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code works.
This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message