spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: How to filter based on a constant value
Date Sun, 31 Jul 2016 09:56:11 GMT
The point is, window functions are designed to be faster by doing the
calculation in one pass, instead of the two passes needed with max().

DataFrames support window functions (via sql.Window), so instead of writing
SQL you can use the DataFrame API as well.
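The one-pass idea can be illustrated outside Spark with plain Python (the data below is made up for illustration; in Spark itself you would use sql.Window with rank()):

```python
# Two approaches to "latest matching row" over the same data.
rows = [
    ("2015-11-02", "XYZ LTD CD 4636", 12.50),
    ("2015-12-15", "XYZ LTD CD 4636", 10.95),
]

# Two-pass approach (like the max() version): scan once for the max date,
# then scan again to pick the matching rows.
maxdate = max(r[0] for r in rows)
two_pass = [r for r in rows if r[0] == maxdate]

# One-pass approach (what rank() over (order by ... desc) effectively does):
# keep the current best rows while scanning once.
best = []
for r in rows:
    if not best or r[0] > best[0][0]:
        best = [r]          # strictly later date: new rank-1 group
    elif r[0] == best[0][0]:
        best.append(r)      # tie on date: same rank-1 group

print(best)  # same result as two_pass, but a single scan
```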

Best
Ayan

On Sun, Jul 31, 2016 at 7:48 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> yes reserved word issue thanks
>
> hive> select *
>     > from (select transactiondate, transactiondescription, debitamount
>     > , rank() over (order by transactiondate desc) r
>     > from accounts.ll_18740868 where transactiondescription like
> '%HARRODS%'
>     >  ) RS
>     > where r=1
>     > ;
> Query ID = hduser_20160731104724_f8e5f426-770a-49fc-a4a5-f0f645c06e8c
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Spark Job = 7727d5df-ccf9-4f98-8563-1cdec2634d99
> Query Hive on Spark job[0] stages:
> 0
> 1
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-31 10:48:28,726 Stage-0_0: 0/1  Stage-1_0: 0/1
> 2016-07-31 10:48:31,750 Stage-0_0: 0/1  Stage-1_0: 0/1
> 2016-07-31 10:48:32,758 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
> 2016-07-31 10:48:34,772 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
> 2016-07-31 10:48:35,780 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
> Status: Finished successfully in 10.10 seconds
> OK
> 2015-12-15      HARRODS LTD CD 4636     10.95   1
> Time taken: 46.546 seconds, Fetched: 1 row(s)
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 31 July 2016 at 10:36, ayan guha <guha.ayan@gmail.com> wrote:
>
>> I think the word "INNER" is reserved in Hive. Please change the alias to
>> something else.
>>
>> Not sure about scala, but essentially it is string replacement.
>>
>> On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Thanks. How about Scala?
>>>
>>> BTW the same analytic code fails in Hive itself:(
>>>
>>> hive> select *
>>>     > from (select transactiondate, transactiondescription, debitamount
>>>     > from (select transactiondate, transactiondescription, debitamount
>>>     > , rank() over (order by transactiondate desc) r
>>>     > from ll_18740868 where transactiondescription like '%XYZ%'
>>>     >      ) inner
>>>     > where r=1
>>>     > ;
>>>
>>> FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11833)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:47987)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5520)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:5492)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource0(HiveParser_FromClauseParser.java:3918)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3818)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1909)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1546)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:48001)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:42252)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:42138)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:41154)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:41024)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1653)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1137)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:446)
>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:319)
>>>         at
>>> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1255)
>>>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1301)
>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
>>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
>>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>> FAILED: ParseException line 6:7 Failed to recognize predicate 'inner'.
>>> Failed rule: 'identifier' in subquery source
>>>
>>>
>>>
>>>
>>>
>>> On 31 July 2016 at 10:21, ayan guha <guha.ayan@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> This is because Spark does not provide a way to "bind" variables the
>>>> way Oracle does.
>>>>
>>>> So you can build the SQL string yourself, like below (in Python):
>>>>
>>>> val = 'XYZ'
>>>> sqlbase = "select ..... where col = '<val>'".replace('<val>', val)
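A complete, runnable version of that substitution pattern (plain Python, no Spark; the template just builds the SQL text, so quoting of the substituted value is up to you):

```python
# Spark SQL has no bind variables, so substitute client-side before
# handing the finished string to sqlContext.sql(...).
val = "XYZ"
template = ("select * from ll_18740868 "
            "where transactiondescription like '%<val>%'")
sql = template.replace("<val>", val)
print(sql)
```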
>>>>
>>>>
>>>>
>>>> On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Thanks Ayan.
>>>>>
>>>>> This is the one I used
>>>>>
>>>>> scala> sqltext = """
>>>>>      |  select *
>>>>>      | from (select transactiondate, transactiondescription,
>>>>> debitamount
>>>>>      | , rank() over (order by transactiondate desc) r
>>>>>      | from ll_18740868 where transactiondescription like '%XYZ%'
>>>>>      |       ) inner
>>>>>      |  where r=1
>>>>>      |    """
>>>>>
>>>>> scala> HiveContext.sql(sqltext).show
>>>>> +---------------+----------------------+-----------+---+
>>>>> |transactiondate|transactiondescription|debitamount|  r|
>>>>> +---------------+----------------------+-----------+---+
>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|  1|
>>>>> +---------------+----------------------+-----------+---+
>>>>>
>>>>> The issue I see is that here I cannot pass HASHTAG as a variable into
>>>>> the SQL string. For example, in an RDBMS I can do this:
>>>>>
>>>>> 1> declare @pattern varchar(50)
>>>>> 2> set @pattern = 'Direct'
>>>>> 3> select CHANNEL_DESC from CHANNELS where CHANNEL_DESC like
>>>>> '%'||@pattern||'%'
>>>>> 4> go
>>>>> (1 row affected)
>>>>>  CHANNEL_DESC
>>>>>  --------------------
>>>>>  Direct Sales
>>>>>
>>>>> but not in Hive or Spark SQL,
>>>>>
>>>>> whereas with FP (the DataFrame API) this is implicit:
>>>>>
>>>>> col("CHANNEL_DESC").contains(HASHTAG)
>>>>>
>>>>> Unless there is a way of doing it?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 31 July 2016 at 01:20, ayan guha <guha.ayan@gmail.com> wrote:
>>>>>
>>>>>> select *
>>>>>> from (select *,
>>>>>>              rank() over (order by transactiondate) r
>>>>>>        from ll_18740868 where transactiondescription='XYZ'
>>>>>>       ) inner
>>>>>> where r=1
>>>>>>
>>>>>> Hi Mitch,
>>>>>>
>>>>>> If using SQL is fine, you can try the code above. You need to
>>>>>> register ll_18740868  as temp table.
>>>>>>
>>>>>> On Sun, Jul 31, 2016 at 6:49 AM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I would like to find out when it was the last time I paid a company
>>>>>>> with Debit Card
>>>>>>>
>>>>>>>
>>>>>>> This is the way I do it.
>>>>>>>
>>>>>>> 1) Find the date when I paid last
>>>>>>> 2) Find the rest of details from the row(s)
>>>>>>>
>>>>>>> So
>>>>>>>
>>>>>>> var HASHTAG = "XYZ"
>>>>>>> scala> var maxdate =
>>>>>>> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0)
>>>>>>> maxdate: org.apache.spark.sql.Row = [2015-12-15]
>>>>>>>
>>>>>>> OK so it was 2015-12-15
>>>>>>>
>>>>>>>
>>>>>>> Now I want to get the rest of the columns. This one works when I
>>>>>>> hard code the maxdate!
>>>>>>>
>>>>>>>
>>>>>>> scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG)
>>>>>>> && col("transactiondate") === "2015-12-15").select("transactiondate",
>>>>>>> "transactiondescription", "debitamount").show
>>>>>>> +---------------+----------------------+-----------+
>>>>>>> |transactiondate|transactiondescription|debitamount|
>>>>>>> +---------------+----------------------+-----------+
>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |      10.95|
>>>>>>> +---------------+----------------------+-----------+
>>>>>>>
>>>>>>> Now if I want to use the var maxdate in place of "2015-12-15", how
>>>>>>> would I do that?
>>>>>>>
>>>>>>> I tried lit(maxdate) etc. but they all give me this error:
>>>>>>>
>>>>>>> java.lang.RuntimeException: Unsupported literal type class
>>>>>>> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>>>>>>> [2015-12-15]
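For what it's worth, the error above arises because collect.apply(0) returns a whole Row rather than the date inside it, so lit(maxdate) is asked to build a literal out of a Row. The shape of the fix is to unwrap element 0 first (row(0) or getAs in Scala) and only then compare. A plain-Python sketch of that unwrap-then-compare pattern, with made-up data and no Spark involved:

```python
# Rows as collect() would hand them back (tuples stand in for Spark Rows).
rows = [
    ("2015-11-02", "XYZ LTD CD 4636", 12.50),
    ("2015-12-15", "XYZ LTD CD 4636", 10.95),
]

# Step 1: the aggregate comes back wrapped, like Row([2015-12-15]).
maxdate_row = (max(r[0] for r in rows),)   # a 1-tuple standing in for Row

# Wrong: comparing against the wrapper never matches a plain value.
assert all(r[0] != maxdate_row for r in rows)

# Right: unwrap element 0 first (row(0) / getAs[...](0) in Spark Scala),
# then compare the bare value.
maxdate = maxdate_row[0]
latest = [r for r in rows if r[0] == maxdate]
print(latest)
```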
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha
