spark-user mailing list archives

From Suraj Shetiya <surajshet...@gmail.com>
Subject Re: Spark Dataframe 1.4 (GroupBy partial match)
Date Thu, 02 Jul 2015 09:47:44 GMT
Hi Michael,

Thanks for the quick response. This sounds like something that would work.
However, rethinking the problem statement and the growing number of use
cases, there are more such scenarios where columns hold structured or
unstructured data embedded in them (JSON, XML, or other kinds of
collections). It may make sense to allow probabilistic groupBy operations,
so that the user gets the same functionality in one step instead of two.

What are your thoughts on whether that makes sense?

-Suraj


---------- Forwarded message ----------
From: "Michael Armbrust" <michael@databricks.com>
Date: Jul 2, 2015 12:49 AM
Subject: Re: Spark Dataframe 1.4 (GroupBy partial match)
To: "Suraj Shetiya" <surajshetiya@gmail.com>
Cc: "Salih Oztop" <soztop@yahoo.com>, "user@spark.apache.org" <
user@spark.apache.org>

You should probably write a UDF that uses a regular expression or other
string munging to canonicalize the subject, and then group on that derived
column.
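
A minimal sketch of that idea, in plain Python so that it runs without a
cluster. The function names and the prefix pattern are assumptions made for
illustration, not something given in the thread: the pattern only strips
leading "Re:", "Fw:" and "Fwd:" markers, and the rows are the sample data
from the original question. The second helper applies the same derived-key
approach to the group-by-year case.

```python
import re
from collections import Counter

# Hypothetical pattern: a run of leading "Re:" / "Fw:" / "Fwd:" markers.
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def canonical_subject(subject):
    """Return the subject with leading reply/forward markers removed."""
    return _PREFIX.sub("", subject).strip()


def year_of(date):
    """Return the year part of a 'YYYY-MM-DD' date string."""
    return date[:4]


# Sample rows from the original question: (Date, Subject).
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Group on the derived key instead of the raw column, then count.
by_subject = Counter(canonical_subject(subject) for _, subject in rows)
by_year = Counter(year_of(date) for date, _ in rows)

print(by_subject)
print(by_year)
```

In PySpark, a function like canonical_subject could presumably be wrapped
with pyspark.sql.functions.udf (with a StringType return type), added as a
new column via withColumn, and then used in groupBy(...).count().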

On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya <surajshetiya@gmail.com>
wrote:

> Thanks Salih. :)
>
>
> The output of the groupby is as below.
>
> 2015-01-14      "SEC Inquiry"
> 2015-01-16       "Re: SEC Inquiry"
> 2015-01-18       "Fwd: Re: SEC Inquiry"
>
>
> Subsequently, we would like to aggregate all messages with a particular
> reference subject.
> For instance, the question we are trying to answer could be: get the count
> of messages with a particular subject.
>
> Looking forward to any suggestion from you.
>
>
> On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soztop@yahoo.com> wrote:
>
>> Hi Suraj
>> What will your output be after the group by? GroupBy is meant for
>> aggregations like sum, count, etc.
>> If you want to count the 2015 records, then that is possible.
>>
>> Kind Regards
>> Salih Oztop
>>
>>
>>   ------------------------------
>>  *From:* Suraj Shetiya <surajshetiya@gmail.com>
>> *To:* user@spark.apache.org
>> *Sent:* Tuesday, June 30, 2015 3:05 PM
>> *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
>>
>> I have a dataset (trimmed and simplified) with 2 columns as below.
>>
>> Date            Subject
>> 2015-01-14      "SEC Inquiry"
>> 2014-02-12      "Happy birthday"
>> 2014-02-13      "Re: Happy birthday"
>> 2015-01-16      "Re: SEC Inquiry"
>> 2015-01-18      "Fwd: Re: SEC Inquiry"
>>
>> I have imported it into a Spark DataFrame. What I am looking for is a
>> groupBy on the Subject field (however, I need a partial match to identify
>> the discussion topic).
>>
>> For example, in the above case, I would like to group all messages whose
>> subject contains "SEC Inquiry", which would return the following grouped
>> frame:
>>
>> 2015-01-14      "SEC Inquiry"
>> 2015-01-16       "Re: SEC Inquiry"
>> 2015-01-18       "Fwd: Re: SEC Inquiry"
>>
>> Another use case for a similar problem could be grouping by year (in the
>> above example). That would mean a partial match on the Date field, i.e. a
>> groupBy on Date that matches the year as "2014" or "2015".
>>
>> Keenly looking forward to a reply/solution to the above.
>>
>> - Suraj
>>
>>
>>
>>
>>
>
