flink-user mailing list archives

From Stefano Bortoli <s.bort...@gmail.com>
Subject PartitionByHash and usage of KeySelector
Date Thu, 06 Nov 2014 08:19:28 GMT
Hi all,

I am taking my first steps as an Apache Flink user! I have configured
and run some simple jobs on a small cluster, and everything has worked
fine so far.

What I am trying to do right now is to run a duplicate detection task on a
dataset of about 9.5M records. The records are well structured, so we can
exploit the semantics of the attributes to narrow down the number of
expensive match executions.

My idea is the following:
1. Partition the dataset according to a macro-parameter written in the
record. This yields 7 partitions of different sizes that are guaranteed
to be disjoint. I do that by filtering on a specific type.
2. Split each partition created in step 1 into sub-partitions based on
some simple similarity, to reduce the number of expensive function calls.
I would like to do that using partitionByHash and a KeySelector.
3. Compute the cross product within each of the partitions defined in step 2.
4. Filter each pair of the cross product by applying an expensive boolean
matching function. Only the pairs that match as duplicates are retained.
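
To make the four steps concrete, here is a minimal sketch in plain Java (it does not use the Flink API). The class name BlockingDedup, the blockKey function and the expensiveMatch predicate are hypothetical names chosen for illustration. It also assumes the matching function is symmetric, so it compares each unordered pair once instead of building the full cross product.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;
import java.util.function.Function;

final class BlockingDedup {

    // Groups items by a blocking key, then tests only pairs inside the same block.
    static <T> List<List<T>> findDuplicates(List<T> items,
                                            Function<T, String> blockKey,
                                            BiPredicate<T, T> expensiveMatch) {
        // Steps 1-2: bucket the records so that only plausible candidates meet.
        Map<String, List<T>> blocks = new TreeMap<>();
        for (T item : items) {
            blocks.computeIfAbsent(blockKey.apply(item), k -> new ArrayList<>()).add(item);
        }

        List<List<T>> matches = new ArrayList<>();
        for (List<T> block : blocks.values()) {
            // Step 3: enumerate candidate pairs within one block only.
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    // Step 4: keep the pair only if the expensive match succeeds.
                    if (expensiveMatch.test(block.get(i), block.get(j))) {
                        List<T> pair = new ArrayList<>();
                        pair.add(block.get(i));
                        pair.add(block.get(j));
                        matches.add(pair);
                    }
                }
            }
        }
        return matches;
    }
}
```

The cost of the expensive function is then bounded by the sum of the squared block sizes rather than by the square of the whole dataset, which is why the choice of blocking key matters so much.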

Currently I am working on step 2, and I have some problems understanding
how to use the partitionByHash function. The main problem is that I need a
'rich key' to partition on. I discovered ExpressionKeys, which would let me
define hash keys from sets of Strings that I collect from the record.
However, partitionByHash does not accept these objects, because the key
must implement Comparable.

So, here is my question: how can I partition on a hash key composed of more
than one String?
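
One possible workaround, sketched here in plain Java and not tested against Flink, is to fold the Strings into a single String key, since String already implements Comparable. The class CompositeKey, the method of and the '|' separator are hypothetical; the sketch assumes the separator never occurs inside an attribute value. The result of such a method could presumably be returned from a KeySelector's getKey, but that wiring is an assumption.

```java
import java.util.Locale;

final class CompositeKey {

    // Assumed never to appear inside an attribute value (hypothetical choice).
    private static final char SEPARATOR = '|';

    // Joins several attribute values into one normalized String key.
    static String of(String... parts) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                key.append(SEPARATOR);
            }
            // Null attributes become empty so that the position of each part is preserved.
            String part = parts[i] == null ? "" : parts[i].trim().toLowerCase(Locale.ROOT);
            key.append(part);
        }
        return key.toString();
    }
}
```

The separator keeps ("ab", "c") and ("a", "bc") from collapsing into the same key, and the normalization means that records differing only in case or surrounding whitespace land in the same partition.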

Is there a better strategy to implement a de-duplication using Flink?

Thanks a lot for your support.

Kind regards,

Stefano Bortoli, PhD

ENS Technical Director
OKKAM Srl - www.okkam.it

Email: bortoli@okkam.it

Phone nr: +39 0461 1823912

Headquarters: Trento (Italy), Via Trener 8
Registered office: Trento (Italy), via Segantini 23

