spark-user mailing list archives

From Suraj Shetiya <surajshet...@gmail.com>
Subject Re: Spark Dataframe 1.4 (GroupBy partial match)
Date Fri, 03 Jul 2015 06:12:13 GMT
Hi Salih,

Thanks for the links :) This seems very promising to me.

When do you think this would be available in the Spark codeline?

Thanks,
Suraj

On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop <soztop@yahoo.com> wrote:

> Hi Suraj,
> It seems your requirement is Record Linkage/Entity Resolution.
> https://en.wikipedia.org/wiki/Record_linkage
> http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
>
> A presentation from Spark Summit using GraphX
>
> https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark
>
>
> Kind Regards
> Salih Oztop
> 07856128843
> http://www.linkedin.com/in/salihoztop
>
>   ------------------------------
>  *From:* Suraj Shetiya <surajshetiya@gmail.com>
> *To:* Michael Armbrust <michael@databricks.com>
> *Cc:* Salih Oztop <soztop@yahoo.com>; "user@spark.apache.org" <
> user@spark.apache.org>; megha.sridhara@cynepia.com
> *Sent:* Thursday, July 2, 2015 10:47 AM
>
> *Subject:* Re: Spark Dataframe 1.4 (GroupBy partial match)
>
> Hi Michael,
>
> Thanks for the quick response. This sounds like something that would work.
> However, rethinking the problem statement and the growing number of
> similar use cases, where columns hold structured or unstructured data
> (JSON, XML, or other kinds of collections), it may make sense to allow
> probabilistic groupBy operations, so that the user gets the same
> functionality in one step instead of two.
>
> What are your thoughts on whether that makes sense?
>
> -Suraj
>
>
>
>
> ---------- Forwarded message ----------
> From: "Michael Armbrust" <michael@databricks.com>
> Date: Jul 2, 2015 12:49 AM
> Subject: Re: Spark Dataframe 1.4 (GroupBy partial match)
> To: "Suraj Shetiya" <surajshetiya@gmail.com>
> Cc: "Salih Oztop" <soztop@yahoo.com>, "user@spark.apache.org" <
> user@spark.apache.org>
>
> You should probably write a UDF that uses a regular expression or other
> string munging to canonicalize the subject and then group on that derived
> column.
>
> On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya <surajshetiya@gmail.com>
> wrote:
>
> Thanks Salih. :)
>
>
> The output of the groupby is as below.
>
> 2015-01-14      "SEC Inquiry"
> 2015-01-16      "Re: SEC Inquiry"
> 2015-01-18      "Fwd: Re: SEC Inquiry"
>
>
> Subsequently, we would like to aggregate all messages that share a
> particular reference subject.
> For instance, the question we are trying to answer could be: get the count
> of messages with a particular subject.
>
> Looking forward to any suggestion from you.
>
>
> On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soztop@yahoo.com> wrote:
>
> Hi Suraj,
> What will your output be after the group by? GroupBy is meant for
> aggregations like sum, count, etc.
> If you want to count the 2015 records, then that is possible.
>
> Kind Regards
> Salih Oztop
>
>
>   ------------------------------
>  *From:* Suraj Shetiya <surajshetiya@gmail.com>
> *To:* user@spark.apache.org
> *Sent:* Tuesday, June 30, 2015 3:05 PM
> *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
>
> I have a dataset (trimmed and simplified) with 2 columns as below.
>
> Date            Subject
> 2015-01-14      "SEC Inquiry"
> 2014-02-12      "Happy birthday"
> 2014-02-13      "Re: Happy birthday"
> 2015-01-16      "Re: SEC Inquiry"
> 2015-01-18      "Fwd: Re: SEC Inquiry"
>
> I have imported this into a Spark DataFrame. What I would like to do is
> group by the subject field (however, I need a partial match to identify the
> discussion topic).
>
> For example, in the above case I would like to group all messages whose
> subject contains "SEC Inquiry", which returns the following grouped
> frame:
>
> 2015-01-14      "SEC Inquiry"
> 2015-01-16      "Re: SEC Inquiry"
> 2015-01-18      "Fwd: Re: SEC Inquiry"
>
> Another use case for a similar problem could be grouping by year (in the
> above example). That would mean a partial match on the date field, i.e.
> grouping by Date where the year matches "2014" or "2015".
>
> Keenly looking forward to a reply/solution to the above.
>
> - Suraj
>
>
>
>
>
>
>
>
>
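For reference, Michael's suggestion in the quoted thread (canonicalize the
subject, then group on the derived value) could look roughly like the sketch
below. It is plain Python rather than Spark code, so it runs without a
cluster. The function name canonical_subject and the prefix pattern (Re:,
Fw:, Fwd:) are assumptions made for illustration, not something given in
the thread; the sample rows are the ones from the original question.

```python
import re
from collections import Counter

# Assumed prefix list: only "Re:", "Fw:" and "Fwd:" are handled here.
# Extend the pattern for other mail clients or languages.
_REPLY_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def canonical_subject(subject):
    """Strip leading reply/forward markers, then normalize spacing and case."""
    stripped = _REPLY_PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


# Sample rows taken from the original question in the thread.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Group on the derived (canonical) subject and count messages per topic.
counts = Counter(canonical_subject(subject) for _, subject in rows)

for topic, n in sorted(counts.items()):
    print(topic, n)
```

In PySpark the same function could presumably be wrapped with
pyspark.sql.functions.udf(canonical_subject, StringType()), attached with
withColumn, and followed by groupBy(...).count(). That wiring is not
tested here.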

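The second use case in the quoted question (grouping by year) follows the
same pattern: derive a key from the date, then group on it. This is a minimal
plain-Python sketch that assumes the dates are strings in YYYY-MM-DD form, as
in the sample data; the helper name year_of is hypothetical.

```python
from collections import defaultdict

# Sample rows taken from the original question in the thread.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]


def year_of(date_string):
    """Return the year part of a 'YYYY-MM-DD' string (assumed format)."""
    return date_string[:4]


# Group subjects under the derived year key.
by_year = defaultdict(list)
for date, subject in rows:
    by_year[year_of(date)].append(subject)

for year in sorted(by_year):
    print(year, len(by_year[year]))
```

On a DataFrame, the derived key could likely be produced with a substring
of the Date column (for example Column.substr(1, 4) in PySpark) before
calling groupBy, rather than with a Python-side loop.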

-- 
Regards,
Suraj
