spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Extracting key word from a textual column
Date Wed, 03 Aug 2016 05:32:58 GMT
I agree with you.

> On 03 Aug 2016, at 01:20, ayan guha <guha.ayan@gmail.com> wrote:
> 
> I would stay away from transaction tables until they are fully baked. I do not see why you need to update rather than keep inserting with a timestamp and deriving the latest value on the fly while joining.
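> 
> For example, a minimal sketch of deriving the latest value on the fly, assuming an insert-only catalog DataFrame (here called catalogDF) with a load_ts timestamp column (both names hypothetical):
> 
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.row_number
> 
> // keep only the most recent row per transactiondescription instead of updating in place
> val latest = Window.partitionBy("transactiondescription").orderBy($"load_ts".desc)
> val currentCatalog = catalogDF.withColumn("rn", row_number().over(latest)).where($"rn" === 1).drop("rn")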
> 
> But I guess it has become a religious question now :) and I am not unbiased.
> 
>> On 3 Aug 2016 08:51, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:
>> There are many ways of addressing this issue.
>> 
>> Using HBase with Phoenix adds another layer to the stack which is not necessary for a handful of tables and will add to cost (someone else has to know about HBase, Phoenix etc.). BTW, I would rather work directly on the HBase table; it is faster.
>> 
>> There may be, say, 100 new entries into this catalog table with multiple updates (not a single DML) to get the hashtag right. Sometimes it is an iterative process which results in many deltas.
>> 
>> If that needs to be done once a day or on demand, an alternative would be to INSERT OVERWRITE the content of the transactional Hive table (deltas included) into a text table in Hive and present that one to Spark. This allows Spark to see the data.
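>> 
>> As a minimal sketch, assuming the flattening is done in Hive itself (e.g. via beeline) into a hypothetical non-transactional text table accounts.catalog_text, Spark can then read that copy directly:
>> 
>> // run in Hive, not Spark, since Spark cannot read the un-compacted ACID table:
>> //   INSERT OVERWRITE TABLE accounts.catalog_text SELECT * FROM accounts.catalog;
>> // then, from the Spark shell (assuming a Hive-enabled sqlContext):
>> val catalogText = sqlContext.table("accounts.catalog_text")
>> catalogText.count   // Spark now sees the data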
>> 
>> Remember, if I use Hive to do the analytics/windowing, there is no issue. The issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
>> 
>> Sounds like an issue for Spark to resolve later.
>> 
>> Another alternative is to leave the transactional table in an RDBMS for this purpose and load it into a DF through the JDBC interface. It works fine and is pretty fast.
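>> 
>> A sketch of the JDBC route on a 1.x-style sqlContext, with a hypothetical MySQL URL, credentials and table name:
>> 
>> // needs the relevant JDBC driver on the classpath (e.g. via --jars)
>> val props = new java.util.Properties()
>> props.setProperty("user", "dbuser")
>> props.setProperty("password", "dbpass")
>> // read the catalog table straight from the RDBMS into a DataFrame
>> val catalogDF = sqlContext.read.jdbc("jdbc:mysql://dbhost:3306/accounts", "catalog", props)
>> catalogDF.registerTempTable("catalog")   // optional: expose it to Spark SQL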
>> 
>> Again these are all workarounds. I discussed this in the Hive forum. There should be a way "to manually compact a transactional table in Hive" (not possible now), and the second point is: if Hive can see the data in the Hive table, why not Spark?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 23:10, Ted Yu <yuzhihong@gmail.com> wrote:
>>> +1
>>> 
>>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> 
>>>> If you need to use single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive / warehouse offering.
>>>> 
>>>>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I decided to create a catalog table in Hive, ORC and transactional. That table has two columns of value:
>>>>> 
>>>>> transactiondescription === account_table.transactiondescription
>>>>> hashtag, a String column created from a semi-automated process of deriving it from account_table.transactiondescription
>>>>> 
>>>>> Once the process of populating the catalog table is complete, we just need to create a new DF based on a join between the catalog table and the account_table. The join will use the hashtag in the catalog table to loop over the debit column in account_table for a given hashtag. That is pretty fast, since doing the pattern matching itself in real time is expensive in any application and database.
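>>>>> 
>>>>> A minimal sketch of that join, assuming the catalog DF is called catalogDF and carries (transactiondescription, hashtag):
>>>>> 
>>>>> import org.apache.spark.sql.functions.sum
>>>>> 
>>>>> // equi-join on transactiondescription, then aggregate debits per hashtag
>>>>> val spentByTag = ll_18740868.as("a").
>>>>>   join(catalogDF.as("c"), $"a.transactiondescription" === $"c.transactiondescription").
>>>>>   where($"a.transactiontype" === "DEB").
>>>>>   groupBy($"c.hashtag").agg(sum($"a.debitamount").as("spent")).
>>>>>   orderBy($"spent".desc)
>>>>> spentByTag.show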
>>>>> 
>>>>> So one can build up the catalog table over time as a reference table. I am sure such tables exist in the commercial world.
>>>>> 
>>>>> Anyway, after getting the results out I know how I am wasting my money on different things, especially on clothing etc. :)
>>>>> 
>>>>> 
>>>>> HTH
>>>>> 
>>>>> P.S. Also there is an issue with Spark not being able to read data through Hive transactional tables that have not been compacted yet; Spark just crashes. If such tables need to be updated regularly, say a catalog table, and they are pretty small, one might maintain them in an RDBMS and read them once through JDBC into a DataFrame in Spark before doing analytics.
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>  
>>>>> 
>>>>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>>>>>> Hi Mich,
>>>>>> 
>>>>>> It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case - and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc. you can strip off to get the base retailer entity?
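>>>>>> 
>>>>>> A rough sketch of that stripping, with a noise-token list that is only a guess and would need extending against the real data:
>>>>>> 
>>>>>> import org.apache.spark.sql.functions.{regexp_replace, trim, upper, sum}
>>>>>> 
>>>>>> // drop known noise tokens, digits and apostrophes, leaving the base retailer string
>>>>>> val noise = "\\b(SACAT|SMKT|S/MKT|CD)\\b|\\d+|'"
>>>>>> val withRetailer = ll_18740868.withColumn("retailer", trim(regexp_replace(upper($"transactiondescription"), noise, "")))
>>>>>> withRetailer.groupBy("retailer").agg(sum("debitamount").as("spent")).show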
>>>>>> 
>>>>>> Best Regards,
>>>>>> Sonal
>>>>>> Founder, Nube Technologies 
>>>>>> Reifier at Strata Hadoop World
>>>>>> Reifier at Spark Summit 2015
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> I believe there is some catalog of companies that I can get, store in a table and match the company name to the transactiondescription column.
>>>>>>> 
>>>>>>> That catalog should have sectors in it. For example, company XYZ is under Grocers etc., which will make search and grouping much easier.
>>>>>>> 
>>>>>>> I believe Spark can do it, though I am generally interested in alternative ideas.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> 
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>  
>>>>>>> 
>>>>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8964@hotmail.com> wrote:
>>>>>>>> Well, if you still want to use window functions for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Maybe you can use a regex for deriving this new column. The implementation depends on your data in "transactiondescription", and a regex gives you the most powerful way to handle it.
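>>>>>>>> 
>>>>>>>> For instance, a sketch of deriving such a "catalog" column and grouping on it; the mapping rule here is only illustrative and would need tuning against the real data:
>>>>>>>> 
>>>>>>>> import org.apache.spark.sql.expressions.Window
>>>>>>>> import org.apache.spark.sql.functions.{upper, when, regexp_extract, sum}
>>>>>>>> 
>>>>>>>> // collapse all SAINSBURY variants to one value, otherwise fall back to the first word
>>>>>>>> val withCatalog = ll_18740868.withColumn("catalog",
>>>>>>>>   when(upper($"transactiondescription").contains("SAINSBURY"), "SAINSBURYS").
>>>>>>>>     otherwise(regexp_extract(upper($"transactiondescription"), "^\\s*([A-Z]+)", 1)))
>>>>>>>> val wCat = Window.partitionBy("catalog")
>>>>>>>> withCatalog.select($"catalog", sum($"debitamount").over(wCat).as("spent")).distinct.show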
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This is really not a Spark question, but a question of how you process your logic based on the data given.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yong
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>>>>> To: user @spark
>>>>>>>> Subject: Extracting key word from a textual column
>>>>>>>>  
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Need some ideas.
>>>>>>>> 
>>>>>>>> Summary:
>>>>>>>> 
>>>>>>>> I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) on a given retailer, so I have a better idea of where I am wasting the money.
>>>>>>>> 
>>>>>>>> Approach
>>>>>>>> 
>>>>>>>> I downloaded my bank statements for a given account in CSV format, from inception till the end of July, read the data and stored it in an ORC table.
>>>>>>>> 
>>>>>>>> I am interested in all bills that I paid using my debit card (transactiontype = "DEB") that come out of the account directly. Transactiontype is the three-character code lookup that I downloaded as well.
>>>>>>>> 
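>>>>>>>> A rough sketch of that load, assuming the spark-csv package on a 1.x shell; the file path and database/table names are placeholders, and some columns (e.g. the date) may need explicit casting:
>>>>>>>> 
>>>>>>>> val raw = sqlContext.read.
>>>>>>>>   format("com.databricks.spark.csv").
>>>>>>>>   option("header", "true").
>>>>>>>>   option("inferSchema", "true").
>>>>>>>>   load("/data/statements/ll_18740868.csv")
>>>>>>>> // persist as an ORC table in the Hive metastore
>>>>>>>> raw.write.format("orc").saveAsTable("accounts.ll_18740868")
>>>>>>>> 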
>>>>>>>> scala> ll_18740868.printSchema
>>>>>>>> root
>>>>>>>>  |-- transactiondate: date (nullable = true)
>>>>>>>>  |-- transactiontype: string (nullable = true)
>>>>>>>>  |-- sortcode: string (nullable = true)
>>>>>>>>  |-- accountnumber: string (nullable = true)
>>>>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>>>>  |-- debitamount: double (nullable = true)
>>>>>>>>  |-- creditamount: double (nullable = true)
>>>>>>>>  |-- balance: double (nullable = true)
>>>>>>>> 
>>>>>>>> The important fields are transactiondate, transactiontype, transactiondescription and debitamount.
>>>>>>>> 
>>>>>>>> So using analytics/windowing I can do all sorts of things. For example, this one gives me the last time I spent money on retailer XYZ and the amount:
>>>>>>>> 
>>>>>>>> SELECT *
>>>>>>>> FROM (
>>>>>>>>       select transactiondate, transactiondescription, debitamount
>>>>>>>>       , rank() over (order by transactiondate desc) AS rank
>>>>>>>>       from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>>      ) tmp
>>>>>>>> where rank <= 1
>>>>>>>> 
>>>>>>>> And its equivalent using Windowing in FP
>>>>>>>> 
>>>>>>>> import org.apache.spark.sql.expressions.Window
>>>>>>>> import org.apache.spark.sql.functions.{col, desc, rank}
>>>>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).filter($"rank" === 1).show
>>>>>>>> 
>>>>>>>> 
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> |transactiondate|transactiondescription|rank|
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |   1|
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> 
>>>>>>>> So far so good. But if I want to find all I spent on each retailer, then it gets trickier, as a retailer appears like below in the column transactiondescription:
>>>>>>>> 
>>>>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>>>>> +----------------------+
>>>>>>>> |transactiondescription|
>>>>>>>> +----------------------+
>>>>>>>> |  SAINSBURYS SMKT C...|
>>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>>> |  SAINSBURY'S SMKT ...|
>>>>>>>> |  SAINSBURYS S/MKT ...|
>>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>>> +----------------------+
>>>>>>>> 
>>>>>>>> If I look at them I know they all belong to SAINSBURYS (the food retailer). I have done some crude grouping and it works, somehow:
>>>>>>>> 
>>>>>>>> // define a UDF to handle the substring
>>>>>>>> import org.apache.spark.sql.functions.{udf, lit, instr, sum}
>>>>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>>>>> var cutoff = "CD"  // marker currently used in the statement text
>>>>>>>> // group by everything in transactiondescription that precedes the "CD" marker
>>>>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>>>>> ll_18740868.where($"transactiontype" === "DEB" && ($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"), sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)
>>>>>>>> 
>>>>>>>> However, I really need to extract the "keyword" retailer name from the transactiondescription column, and I need some ideas about the best way of doing it. Is this possible in Spark?
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>  
>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>  
>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>> 
>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
