spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Spark Dataframe 1.4 (GroupBy partial match)
Date Wed, 01 Jul 2015 19:19:25 GMT
You should probably write a UDF that uses a regular expression or other
string munging to canonicalize the subject, and then group on that derived
column.
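
A minimal sketch of that idea, using only the Python standard library so the canonicalization logic can be checked on its own. The function name `canonical_subject` and the prefix pattern (leading "Re:", "Fwd:", "Fw:" markers) are assumptions for illustration and are not from the thread; real mail subjects may need a broader pattern.

```python
import re
from collections import Counter

# Hypothetical pattern: matches one or more leading "Re:" / "Fwd:" / "Fw:"
# markers, case-insensitively, including the whitespace around them.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def canonical_subject(subject):
    """Return the subject with reply/forward prefixes and outer whitespace removed."""
    return _PREFIX.sub("", subject).strip()


# Sample rows taken from the original question.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Group on the derived key rather than on the raw subject.
counts = Counter(canonical_subject(subject) for _, subject in rows)
```

In PySpark the same function could presumably be wrapped with `pyspark.sql.functions.udf`, added as a derived column with `withColumn`, and then passed to `groupBy(...).count()`; the column name would be whatever the DataFrame actually uses for the subject.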

On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya <surajshetiya@gmail.com>
wrote:

> Thanks Salih. :)
>
>
> The output of the groupby is as below.
>
> 2015-01-14      "SEC Inquiry"
> 2015-01-16      "Re: SEC Inquiry"
> 2015-01-18      "Fwd: Re: SEC Inquiry"
>
>
> And subsequently, we would like to aggregate all messages with a
> particular reference subject.
> For instance, the question we are trying to answer could be: get the count
> of messages with a particular subject.
>
> Looking forward to any suggestion from you.
>
>
> On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <soztop@yahoo.com> wrote:
>
>> Hi Suraj
>> What output do you expect after the group by? GroupBy is meant for
>> aggregations like sum, count, etc.
>> If you want to count the 2015 records, then that is possible.
>>
>> Kind Regards
>> Salih Oztop
>>
>>
>>   ------------------------------
>>  *From:* Suraj Shetiya <surajshetiya@gmail.com>
>> *To:* user@spark.apache.org
>> *Sent:* Tuesday, June 30, 2015 3:05 PM
>> *Subject:* Spark Dataframe 1.4 (GroupBy partial match)
>>
>> I have a dataset (trimmed and simplified) with 2 columns as below.
>>
>> Date            Subject
>> 2015-01-14      "SEC Inquiry"
>> 2014-02-12      "Happy birthday"
>> 2014-02-13      "Re: Happy birthday"
>> 2015-01-16      "Re: SEC Inquiry"
>> 2015-01-18      "Fwd: Re: SEC Inquiry"
>>
>> I have imported this into a Spark DataFrame. What I want is to
>> groupBy the subject field (however, I need a partial match to identify the
>> discussion topic).
>>
>> For example, in the above case I would like to group all messages whose
>> subject contains "SEC Inquiry", which returns the following grouped
>> frame:
>>
>> 2015-01-14      "SEC Inquiry"
>> 2015-01-16      "Re: SEC Inquiry"
>> 2015-01-18      "Fwd: Re: SEC Inquiry"
>>
>> Another use case for a similar problem could be grouping by year (in the
>> above example). That would mean a partial match on the date field, i.e.
>> groupBy Date by matching the year as "2014" or "2015".
>>
>> Keenly looking forward to a reply/solution to the above.
>>
>> - Suraj
>>
>
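
The group-by-year use case in the original question fits the same derived-column approach: compute a key from the date and group on it. A small sketch, assuming the dates are strings in `YYYY-MM-DD` form as in the sample data; the helper name `year_of` is hypothetical.

```python
from collections import Counter


def year_of(date_string):
    """Return the four-character year prefix of a 'YYYY-MM-DD' string."""
    return date_string[:4]


# Dates taken from the sample data in the original question.
dates = ["2015-01-14", "2014-02-12", "2014-02-13", "2015-01-16", "2015-01-18"]

# Group on the derived year rather than on the full date.
per_year = Counter(year_of(d) for d in dates)
```

On a DataFrame, a string column's `substr(1, 4)` could likely serve as the grouping expression without a UDF, again assuming the date is stored as a string.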
