spark-user mailing list archives

From Jörn Franke <>
Subject Re: Is there such thing as cache fusion with the underlying tables/files on HDFS
Date Sun, 18 Sep 2016 06:37:25 GMT
In Tableau you can use the in-memory facilities of the Tableau server.

As said, Apache Ignite could be one way. You can also use it to hold Hive tables in memory.
While reducing IO can make sense, I do not think you will see that much difference in
production systems (at least not 20x). If the data is processed in parallel, then the IO is
also done in parallel thanks to the architecture of HDFS. Oracle Exadata exploits similar concepts.
The advantage of Ignite compared to e.g. Exadata would be that you also have the indexes of
ORC and Parquet in memory, which avoids reading data into memory that is not needed for the query.
That being said, even if you use in-memory storage, it still makes sense to pre-aggregate or
pre-calculate some data for the users based on their needs.
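
To illustrate the pre-aggregation point, here is a minimal plain-Python sketch. It is not Spark, Hive or Ignite code; the record layout and the names raw_events, build_daily_totals and dashboard_query are made up for illustration. The idea is that the batch layer computes a small summary once, and the dashboard answers from that summary instead of scanning the detail data on every request:

```python
from collections import defaultdict

# Hypothetical detail rows: (day, key, value). Stand-ins for rows in the Hive table.
raw_events = [
    ("2016-09-17", "A", 154.0),
    ("2016-09-17", "A", 156.0),
    ("2016-09-17", "B", 39.0),
    ("2016-09-18", "A", 155.0),
]


def build_daily_totals(events):
    """Batch step: aggregate detail rows once into (day, key) -> (count, sum)."""
    totals = defaultdict(lambda: [0, 0.0])
    for day, key, value in events:
        entry = totals[(day, key)]
        entry[0] += 1
        entry[1] += value
    return {group: (count, total) for group, (count, total) in totals.items()}


def dashboard_query(summary, day, key):
    """Serving step: read the small summary; return the average value or None."""
    if (day, key) not in summary:
        return None
    count, total = summary[(day, key)]
    return total / count


# Built once per batch run, then reused by every dashboard request.
summary = build_daily_totals(raw_events)
print(dashboard_query(summary, "2016-09-17", "A"))
```

In a real setup the summary would be a table maintained by the batch job rather than a Python dict, but the trade-off is the same: the aggregation cost is paid once, not per user query.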

> On 17 Sep 2016, at 18:53, Mich Talebzadeh <> wrote:
> Hi,
> I saw similar issues when I was working on Oracle with Tableau as the dashboard.
> Currently I have a batch layer that gets streaming data from
> source -> Kafka -> Flume -> HDFS
> It is stored on HDFS as text files, and a cron process syncs a Hive table with the external
> table built on the directory. I tried both ORC and Parquet, but I don't think the query itself
> is the issue.
> Meaning that no matter how clever your execution engine is, the considerable amount of
> Physical IO (PIO), as opposed to Logical IO (LIO), that you still have to do to get the
> data to Zeppelin is on the critical path.
> One option is to limit the amount of data in Zeppelin to a certain number of rows or something
> similar. However, you cannot tell a user he/she cannot see the full data.
> We resolved this with Oracle by using Oracle TimesTen IMDB to cache certain tables in
> memory and have them refreshed (depending on refresh frequency) from the underlying table in
> Oracle when data is updated. That is done through cache fusion.
> I was looking around and came across Alluxio. Ideally I would like to use a concept
> like TimesTen. Can one distribute Hive table data (or any table data) cached across the nodes?
> In that case we would be doing Logical IO, which is about 20 times or more lighter than
> Physical IO.
> Anyway this is the concept.
> Thanks
> Dr Mich Talebzadeh
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
> or destruction of data or any other property which may arise from relying on this email's
> technical content is explicitly disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
