spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ravisankar Mani <rrav...@gmail.com>
Subject Hive Query(Top N)
Date Mon, 20 Jul 2015 06:52:06 GMT
Hi everyone

I have tried to  to achieve hierarchical based (index mode) top n creation
using spark query. it taken more time when i execute following query

Select SUM(`adventurepersoncontacts`.`contactid`)  AS
`adventurepersoncontacts_contactid` ,
`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM  `default`.`adventurepersoncontacts` AS adventurepersoncontacts
 JOIN (
 SELECT
  `F_0`.`fullname_0_0` AS `fullname_0_0`,
  ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC)  AS `R_N_0`
  FROM(
   SELECT
    `adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
    SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
   )`F_0`
) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` )
AND (`T_0`.`R_N_0` <= 5)) GROUP BY `adventurepersoncontacts`.`fullname

In mentioned query, I have set row index in every group according to the
aggregation type. Row_number calculation according to the aggregation like
Row_Number() Over( order by sum( column name) order by DESC) not directly
supported in spark . So i have using sub query like

 SELECT
  `F_0`.`fullname_0_0` AS `fullname_0_0`,
  ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC)  AS `R_N_0`
  FROM(
   SELECT
    `adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
    SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
   )`F_0`
) `T_0`

The execution time was getting slowed when using following way. In case i
removed  main group by( remove  aggregation) method , it getting very fast
execution.
Refer following query.  I mentioned this query only for slow execution when
i include main group by


 Select  `adventurepersoncontacts`.`contactid`  AS
`adventurepersoncontacts_contactid` ,
`adventurepersoncontacts`.`fullname` AS `adventurepersoncontacts_fullname`
FROM  `default`.`adventurepersoncontacts` AS adventurepersoncontacts
 JOIN (
 SELECT
  `F_0`.`fullname_0_0` AS `fullname_0_0`,
  ROW_NUMBER() OVER( ORDER BY `F_0`.`Measure_0` DESC)  AS `R_N_0`
  FROM(
   SELECT
    `adventurepersoncontacts`.`fullname` AS `fullname_0_0`,
    SUM(adventurepersoncontacts.contactid) AS `Measure_0`
    FROM `default`.`adventurepersoncontacts` AS adventurepersoncontacts
    GROUP BY `adventurepersoncontacts`.`fullname`
   )`F_0`
) `T_0` on ((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` )
AND (`T_0`.`R_N_0` <= 5))


I have found other way for getting this hierarchical  query.

I have created a temp table ( The table refers  inner sub query(refer
previous query) )

 "CREATE external table IF NOT EXISTS  temp_table  AS SELECT
SUM(adventurepersoncontacts.contactid) as contactid,
adventurepersoncontacts.fullname as fullname FROM
`default`.`adventurepersoncontacts` AS adventurepersoncontacts GROUP BY
`adventurepersoncontacts`.`fullname`";



"SELECT  SUM(`adventurepersoncontacts`.`contactid`) AS
`adventurepersoncontacts_contactid` ,`adventurepersoncontacts`.`fullname`
AS `adventurepersoncontacts_fullname` FROM
`default`.`adventurepersoncontacts` AS adventurepersoncontacts JOIN (SELECT
`fullname` AS `fullname_0_0`,ROW_NUMBER() over( order by `contactid` desc)
AS `R_N_0` FROM default.temp_table) `T_0` on
((`adventurepersoncontacts`.`fullname` = `T_0`.`fullname_0_0` ) AND
(`T_0`.`R_N_0` <= 2)) GROUP BY `adventurepersoncontacts`.`fullname`";

This is other way. But this execution time was delayed when using create
table using select statement


Please help any other  optimized way to achieve my requirement.


Regards,
Ravi

Mime
View raw message