spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Pentreath <nick.pentre...@gmail.com>
Subject HyperLogLogUDT
Date Tue, 23 Jun 2015 09:19:13 GMT
Hey Spark devs

I've been looking at DF UDFs and UDAFs. The approx distinct is using
hyperloglog,
but there is only an option to return the count as a Long.

It can be useful to be able to return and store the actual data structure
(ie serialized HLL). This effectively allows one to do aggregation /
rollups over columns while still preserving the ability to get distinct
counts.

For example, one can store daily aggregates of events, grouped by various
columns, while storing for each grouping the HLL of say unique users. So
you can get the uniques per day directly but could also very easily do
arbitrary aggregates (say monthly, annually) and still be able to get a
unique count for that period by merging the daily HLLS.

I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf)
which returns a Struct field containing a "cardinality" field and a
"binary" field containing the serialized HLL.

I was wondering if there would be interest in something like this? I am not
so clear on how UDTs work with regards to SerDe - so could one adapt the
HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as
count as a field? Then I assume this would automatically play nicely with
DataFrame I/O etc. The gotcha is one needs to then call
"approx_count_field.count" (or is there a concept of a "default field" for
a Struct?).

Also, being able to provide the bitsize parameter may be useful...

The same thinking would apply potentially to other approximate (and
mergeable) data structures like T-Digest and maybe CMS.

Nick

Mime
View raw message