spark-user mailing list archives

From Mich Talebzadeh
Subject Near Real time analytics with Spark and tokenization
Date Sun, 15 Oct 2017 07:15:30 GMT

When doing micro-batch streaming of trade data, we need to tokenize
certain columns before the data lands in HBase in a Lambda architecture.

There are two ways of tokenizing data: vault-based, and vault-less using
something like Protegrity tokenization.

Vault-based tokenization requires the clear-text and token values to be
stored in a vault, say HBase, and crucially the vault cannot be on the same
Hadoop cluster that we use for real-time processing. It could be on another
Hadoop cluster dedicated to tokenization.

This adds latency to real-time analytics, because token values have to be
calculated and then stored in the remote HBase vault.
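
To make the difference concrete, here is a minimal Python sketch of the two
styles, under several assumptions. The class VaultTokenizer and the function
vaultless_tokenize are hypothetical names, in-memory dicts stand in for the
remote HBase vault, and a keyed HMAC stands in for a vault-less scheme. It is
not Protegrity's algorithm, and unlike a real vault-less product the HMAC
output cannot be reversed back to clear text. The point is only that the
vault-based path needs a lookup and a write per new value, while the
vault-less path is a local computation.

```python
import hashlib
import hmac
import secrets


class VaultTokenizer:
    """Vault-based style: one random token per clear value, kept in a store."""

    def __init__(self):
        # Assumption: in-memory dicts stand in for the remote HBase vault.
        # In the real setup each get/put would be a network round trip.
        self._token_by_clear = {}
        self._clear_by_token = {}

    def tokenize(self, clear: str) -> str:
        token = self._token_by_clear.get(clear)      # lookup in the vault
        if token is None:
            token = secrets.token_hex(8)             # random, unrelated to input
            self._token_by_clear[clear] = token      # write back to the vault
            self._clear_by_token[token] = clear
        return token

    def detokenize(self, token: str) -> str:
        return self._clear_by_token[token]


def vaultless_tokenize(clear: str, key: bytes) -> str:
    """Vault-less stand-in: keyed HMAC-SHA256 computed locally, no lookup."""
    digest = hmac.new(key, clear.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; this is an illustration only
```

Because the vault-less function depends only on the input and the key, it
could be applied per column inside each micro-batch without leaving the
processing cluster, which is where the latency saving would come from.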

What is the general approach to this type of issue? It seems best to
use vault-less tokenization?


Dr Mich Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.