kafka-users mailing list archives

From "claude.warren@wipro.com.INVALID" <claude.war...@wipro.com.INVALID>
Subject Re: Unique users per calendar month using kafka streams
Date Thu, 21 Nov 2019 11:51:58 GMT
A different approach would be to integrate Apache DataSketches (https://datasketches.apache.org/),
which have mathematical proofs behind them. Using a sketch you can capture the unique members
for any given time period in a very small data object and still aggregate them (even
though unique counts are not in and of themselves aggregatable). For example, you could take
the monthly sketches and calculate the unique users per quarter or for the entire year very
quickly, generally orders of magnitude faster than an exact recount.
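
To illustrate why a sketch can be merged while a plain count cannot, here is a minimal
toy sketch in plain Java. It is NOT the DataSketches API: the class KmvSketch, its
methods, and the choice of a K-Minimum-Values estimator with an FNV-1a based hash are
all assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeSet;

// Toy K-Minimum-Values sketch (hypothetical class, not part of Apache DataSketches).
class KmvSketch {
    private final int k;
    // The k smallest non-negative 63-bit hash values seen so far.
    private final TreeSet<Long> smallest = new TreeSet<>();

    KmvSketch(int k) { this.k = k; }

    void add(String item) { offer(hash(item)); }

    private void offer(long h) {
        smallest.add(h);                        // duplicates collapse in the set
        if (smallest.size() > k) smallest.pollLast();
    }

    // Merging keeps the k smallest hashes of the union, so the result is
    // the same sketch you would get from seeing both inputs directly.
    static KmvSketch merge(KmvSketch a, KmvSketch b) {
        KmvSketch out = new KmvSketch(Math.min(a.k, b.k));
        for (long h : a.smallest) out.offer(h);
        for (long h : b.smallest) out.offer(h);
        return out;
    }

    double estimate() {
        if (smallest.size() < k) return smallest.size();   // exact while under k
        double kth = smallest.last() / (double) Long.MAX_VALUE;
        return (k - 1) / kth;                               // standard KMV estimator
    }

    // FNV-1a over UTF-8 bytes, followed by a 64-bit finalizer mix.
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h >>> 1;                                     // drop the sign bit
    }

    public static void main(String[] args) {
        KmvSketch jan = new KmvSketch(1024);
        KmvSketch feb = new KmvSketch(1024);
        for (String u : new String[] {"alice", "bob", "carol"}) jan.add(u);
        for (String u : new String[] {"carol", "dave"}) feb.add(u);
        // Adding the monthly counts double-counts carol; merging does not.
        System.out.println("sum of counts: " + (jan.estimate() + feb.estimate()));
        System.out.println("merged:        " + merge(jan, feb).estimate());
    }
}
```

Each monthly sketch holds at most k hash values regardless of how many events it has
seen, which is what makes the quarterly or yearly roll-up cheap. A production system
would use the sketches from the DataSketches library rather than this toy.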

________________________________
From: Bruno Cadonna <bruno@confluent.io>
Sent: Thursday, November 21, 2019 11:37
To: Users <users@kafka.apache.org>
Subject: Re: Unique users per calendar month using kafka streams

Hi Chintan,

You cannot specify time windows based on a calendar object like months.

In the following, I assume the keys of your records are user IDs. You
could extract the month from the timestamp of each event and add
it to the key of the record. Then you can group the records by key
and count them. Be aware that the state that stores the counts will
grow indefinitely, so you need to take care to remove counts you no
longer need from your local state.
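
The keying idea above can be sketched in plain Java, outside of Kafka Streams. This is
only an illustration under assumptions: the class and method names are hypothetical,
timestamps are taken to be epoch milliseconds, and the month is computed in UTC (pick
the time zone your business calendar actually uses).

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of the Kafka Streams API.
class MonthlyUniqueUsers {

    // Calendar month of an epoch-millisecond timestamp (UTC is an assumption).
    static YearMonth monthOf(long timestampMs) {
        return YearMonth.from(Instant.ofEpochMilli(timestampMs).atZone(ZoneOffset.UTC));
    }

    // Composite key of the form "2019-11|userId": the month is added to the user ID.
    static String monthUserKey(String userId, long timestampMs) {
        return monthOf(timestampMs) + "|" + userId;
    }

    // Each distinct composite key contributes exactly one to its month's total,
    // so repeated events from the same user in the same month are not recounted.
    static Map<YearMonth, Long> uniqueUsersPerMonth(String[] userIds, long[] timestampsMs) {
        Set<String> seenKeys = new HashSet<>();
        Map<YearMonth, Long> counts = new TreeMap<>();
        for (int i = 0; i < userIds.length; i++) {
            if (seenKeys.add(monthUserKey(userIds[i], timestampsMs[i]))) {
                counts.merge(monthOf(timestampsMs[i]), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long t1 = Instant.parse("2019-01-05T10:00:00Z").toEpochMilli();
        long t2 = Instant.parse("2019-01-20T10:00:00Z").toEpochMilli();
        long t3 = Instant.parse("2019-02-03T10:00:00Z").toEpochMilli();
        System.out.println(uniqueUsersPerMonth(
                new String[] {"alice", "alice", "bob", "alice"},
                new long[] {t1, t2, t2, t3}));
    }
}
```

In a Streams topology the same composite key would be produced while re-keying the
stream; note that seenKeys here grows without bound, which is exactly the state-growth
concern raised above.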

Take a look at the following example of how to deduplicate records:

https://github.com/confluentinc/kafka-streams-examples/blob/5.3.1-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java

It shows how to avoid unbounded growth of the local store in such cases.
Try to adapt it to solve your problem by extending the key with the
month and computing the count instead of looking for duplicates.
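
As a rough picture of what bounding the state could look like, here is a plain-Java
sketch that keeps per-month user sets and drops months that fall outside a retention
horizon. It is not the implementation used in the linked example; the class
BoundedMonthlyUniques, the monthsToKeep parameter, and the UTC month boundaries are
assumptions for illustration.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a state store with month-based retention.
class BoundedMonthlyUniques {
    private final int monthsToKeep;
    private final TreeMap<YearMonth, Set<String>> usersByMonth = new TreeMap<>();

    BoundedMonthlyUniques(int monthsToKeep) {
        if (monthsToKeep < 1) throw new IllegalArgumentException("monthsToKeep must be >= 1");
        this.monthsToKeep = monthsToKeep;
    }

    void record(String userId, long timestampMs) {
        YearMonth month = YearMonth.from(
                Instant.ofEpochMilli(timestampMs).atZone(ZoneOffset.UTC));
        usersByMonth.computeIfAbsent(month, m -> new HashSet<>()).add(userId);

        // Retention is measured back from the newest month seen so far;
        // every month strictly older than the horizon is removed.
        YearMonth oldestKept = usersByMonth.lastKey().minusMonths(monthsToKeep - 1);
        usersByMonth.headMap(oldestKept, false).clear();
    }

    // Unique users recorded for a month, or 0 if the month is unknown or evicted.
    int uniqueUsers(YearMonth month) {
        Set<String> users = usersByMonth.get(month);
        return users == null ? 0 : users.size();
    }

    int monthsHeld() { return usersByMonth.size(); }
}
```

With monthsToKeep set to 2, recording an event in March evicts everything from January,
so the final January count would need to be emitted downstream before that happens. An
event that arrives for an already evicted month is dropped by this sketch.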

Best,
Bruno

On Thu, Nov 21, 2019 at 10:28 AM chintan mavawala
<chintan25487@gmail.com> wrote:
>
> Hi,
>
> We have a use case to capture the number of unique users per month. We planned
> to use the windowing concept for this.
>
> For example, group events from the input topic by user name and later subgroup
> them based on a time window. However, I don't see how I can subgroup the
> results based on a particular month, say January. The only way seems to be to
> subgroup based on time.
>
> Any pointers would be appreciated.
>
> Regards,
> Chintan