spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bernard Jesop <bernard.je...@gmail.com>
Subject Re: Analysis Exception after join
Date Tue, 04 Jul 2017 15:29:02 GMT
It seems to be because of this issues:
https://issues.apache.org/jira/browse/SPARK-10925

I added a checkpoint, as suggested, to break the lineage and it worked.

Best regards,

2017-07-04 17:26 GMT+02:00 Bernard Jesop <bernard.jesop@gmail.com>:

> Thank Didac,
>
> My bad, actually this code is incomplete, it should have been : - dfAgg =
> df.groupBy("S_ID").agg(...).
>
> I want to access the aggregated values (of dfAgg) for each row of 'df',
> that is why I do a left outer join.
>
>
> Also, regarding the second parameter, I am using this signature of join :
> (Dataset, Seq[String], String) => Dataset.
>
> It is a sequence of the column names to use as keys for the join, some
> kind of syntactic sugar for (df1("key1") === df2("key1") && df1("key2") ===
> df2("key2") && ....),
> except it will not duplicate the columns used as keys.
>
> 2017-07-03 12:39 GMT+02:00 Didac Gil <didacgil9dev@gmail.com>:
>
>> With the left join, you are joining two tables.
>>
>> In your case, df is the left table, dfAgg is the right table.
>> The second parameter should be the joining condition, right?
>> For instance
>>
>>  dfRes = df.join(dfAgg, $”userName”===$”name", "left_outer”)
>>
>> having a field in df called userName, and another in dfAgg called “name”
>>
>> However, what’s the kind of query you want to make? dfAgg is already the
>> df table that has been grouped by S_ID.
>>
>> I guess that you are looking for something more like the following example
>> dfAgg = df.groupBy("S_ID”)
>>    .agg(org.apache.spark.sql.functions.count(*“userName"*).as(
>> *“usersCount**”*),
>>  .agg(org.apache.spark.sql.functions.collect_set(“city")
>> .as("ListofCities”)),
>>  .agg(*org.apache.spark.sql.functions.max(“age").as(“oldest”))*
>> )
>>
>> On 3 Jul 2017, at 11:55, Bernard Jesop <bernard.jesop@gmail.com> wrote:
>>
>> Hello, I don't understand my error message.
>>
>> Basically, all I am doing is :
>> - dfAgg = df.groupBy("S_ID")
>> - dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")
>>
>> However I get this AnalysisException: "
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> resolved attribute(s) S_ID#1903L missing from Dummy_ID#740,sex#37L,PERSONAL_
>> STATUS#726L,W_DEP_CODE#736,W_SIZE#739L,
>> POSTAL_CODE#735,COUNTRY_CODE#730,
>> ID#724L,Dummy_ID_1#741,DEP_CODE#729,HOUSEHOLD_TYPE#733L,
>> HOUSEHOLD_SIZE#734L,AGE#727L,W_ID#738L,H_ID#732L,AGE_TYPE#728,
>> S_ID#57L,NATIONALITY#731
>> in operator !Project [ID#724L, sex#37L, PERSON\
>>  AL_STATUS#726L, AGE#727L, AGE_TYPE#728, DEP_CODE#729, COUNTRY_CODE#730,
>> NATIONALITY#731 AS Nationality#77, H_ID#732L, HOUSEHOLD_TYPE#733L,
>> HOUSEHOLD_SIZE#734L, POSTAL_CODE#735, W_DEP_CODE#736, S_ID#1903L,
>> W_ID#738L, W_SIZE#739L, Dummy_ID#740, Dummy_ID_1#741];; "
>>
>> What I don't understand is it says S_ID#1903L is missing
>> but everything seems fine on the Logical Plan.
>> +- Join LeftOuter, (S_ID#57L = S_ID#1903L)
>>
>>    :- Project [W_ID#14L, H_ID#8L, ID#0L, sex#37L, category#97L, AGE#3L,
>> AGE_TYPE#4, DEP_CODE#5, COUNTRY_CODE#6, Nationality#77, HOUSEHOLD_TYPE#9L,
>> familySize#117L, POSTAL_CODE#11, W_DEP_CODE#12, S_ID#57\
>>  L, workplaceSize#137L, Dummy_ID#16, Dummy_ID_1#17, Inc_period#157,
>> Time_inf#1064, Time_inc#200, Health#1014, Inf_period#1039,
>> infectedFamily#1355L, infectedWorker#1385L]
>>
>>     +- Aggregate [S_ID#1903L], [S_ID#1903L, count(1) AS
>> infectedStreet#1415L]
>>
>> Does someone have a clue about it?
>> Thanks,
>>
>>
>>
>>
>> Didac Gil de la Iglesia
>> PhD in Computer Science
>> didacgil9@gmail.com
>> Spain:     +34 696 285 544 <+34%20696%2028%2055%2044>
>> Sweden: +46 (0)730229737 <+46%2073%20022%2097%2037>
>> Skype: didac.gil.de.la.iglesia
>>
>>
>

Mime
View raw message