spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Distinct on Map data type -- SPARK-19893
Date Sat, 13 Jan 2018 08:14:52 GMT
A very simple example is:
sql("select create_map(1, 'a', 2, 'b')")
  .union(sql("select create_map(2, 'b', 1, 'a')"))
  .distinct

By definition, a map should not care about the order of its entries, so the
above query should return one record. However, it returns two records before
SPARK-19893.
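
To make the failure mode concrete without a Spark session, here is a minimal
plain-Scala sketch. It assumes (this is an illustration, not Spark's actual
internals) that each map value is held as an ordered sequence of entries and
that distinct compares those sequences as stored. The object and method names
are hypothetical.

```scala
// Hypothetical stand-in for map values stored as ordered entry sequences.
// These are not Spark classes; they only illustrate the comparison problem.
object MapDistinctSketch {
  type Entries = Seq[(Int, String)]

  // Order-sensitive distinct: compares the entry sequences exactly as stored.
  def distinctByStoredOrder(rows: Seq[Entries]): Seq[Entries] =
    rows.distinct

  // Order-insensitive distinct: converts each row to a Map first,
  // so equality ignores the order in which entries were written.
  def distinctAsMaps(rows: Seq[Entries]): Seq[Map[Int, String]] =
    rows.map(_.toMap).distinct

  def main(args: Array[String]): Unit = {
    // Same two maps as in the query above, written in different entry order.
    val rows: Seq[Entries] = Seq(
      Seq(1 -> "a", 2 -> "b"),
      Seq(2 -> "b", 1 -> "a")
    )
    println(distinctByStoredOrder(rows).size) // 2: treated as different records
    println(distinctAsMaps(rows).size)        // 1: treated as the same map
  }
}
```

The first count corresponds to the two records returned by the query above, the
second to the single record a map-aware comparison would return.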

On Sat, Jan 13, 2018 at 11:51 AM, HariKrishnan CK <ckhari4u@gmail.com>
wrote:

> Hi Wan, could you please be more specific about the scenarios where it will
> give wrong results? I checked the distinct and intersect operators in many
> use cases I have and could not find a scenario that gives wrong
> results.
>
> Thanks
>
>
> On Jan 12, 2018 7:36 PM, "Wenchen Fan" <cloud0fan@gmail.com> wrote:
>
> Actually, Spark 2.1.0 doesn't work for your case; it may give you wrong
> results.
> We are still working on adding this feature, but until then we should
> fail early instead of returning wrong results.
>
> On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u <ckhari4u@gmail.com> wrote:
>
>> I see SPARK-19893 is backported to Spark 2.1 and 2.0.1 as well. I do not see
>> a clear justification for why SPARK-19893 is important and needed. I have a
>> sample table which works fine with an earlier build of Spark 2.1.0. Now that
>> the latest build has the backport of SPARK-19893, it's failing with the
>> error:
>>
>> Error in query: Cannot have map type columns in DataFrame which calls set
>> operations(intersect, except, etc.), but the type of column metrics is
>> map<string,int>;;
>> Distinct
>>
>>
>> *In Old Build of Spark 2.1.0, I tried the below:*
>>
>>
>> create TABLE map_demo2
>> (
>> country_id BIGINT,
>> metrics MAP <STRING, int>
>> );
>>
>> insert into table map_demo2 select 2,map("chaka",102) ;
>> insert into table map_demo2 select 3,map("chaka",102) ;
>> insert into table map_demo2 select 4,map("mangaa",103) ;
>>
>>
>> spark-sql> select distinct metrics from map_demo2;
>> [Stage 0:>                                                          (0 + 4) / 5]18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8501 milliseconds to create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8503 milliseconds to create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8497 milliseconds to create the Initialization Vector used by CryptoStream
>> 18/01/12 21:55:41 WARN CryptoStreamUtils: It costs 8496 milliseconds to create the Initialization Vector used by CryptoStream
>> [Stage 1:======================================================>(1
>> {"mangaa":103}
>> {"chaka":102}
>> {"chaka":103}
>> Time taken: 15.331 seconds, Fetched 3 row(s)
>>
>> Here the simple distinct query works fine in Spark. Any thoughts on why the
>> DISTINCT/EXCEPT/INTERSECT operators are not supported on map data types?
>> The PR says:
>> // TODO: although map type is not orderable, technically map type should be able to be
>> // used in equality comparison, remove this type check once we support it.
>>
>> I could not figure out what issue is caused by using the aforementioned
>> operators.
