spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Is there such thing as cache fusion with the underlying tables/files on HDFS
Date Sun, 18 Sep 2016 10:08:30 GMT
Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs, etc. So you
are right, in this sense it is very different.

Besides caching, what I see from data scientists is that for interactive queries and model
evaluation they do not browse the complete data anyway. Even with in-memory solutions this is
painfully slow if you receive several TB of data per hour.

What they do is sampling, i.e. select a relevant small subset of the data, evaluate several
different models on the sampled data in "real time", and then calculate the winning model as a
batch job later.
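
A minimal sketch of that pattern with the Spark DataFrame API (table and column names are just
placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("sampling-sketch").getOrCreate()

  // take roughly 1% of the full table; interactive model evaluation runs on the sample only
  val sample = spark.table("events").sample(withReplacement = false, fraction = 0.01, seed = 42L)
  sample.cache()

  // the winning model is later recomputed over the full spark.table("events") in a batch job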


Additionally, probabilistic data structures are employed in some cases. For example, if you
want to count the number of unique viewers of a web site, it does not make sense to browse
through the logs for user ids all the time; instead you can employ a HyperLogLog structure,
which needs very little memory and can be queried in real time.
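
Spark SQL ships such a structure as a built-in aggregate (approx_count_distinct, based on
HyperLogLog++; older releases call it approxCountDistinct). A minimal sketch, again with
placeholder table and column names:

  import org.apache.spark.sql.functions.{approx_count_distinct, col}

  // approximate distinct count with ~1% relative standard deviation,
  // instead of an exact distinct over all user ids in the raw logs
  val uniqueViewers = spark.table("web_logs")
    .agg(approx_count_distinct(col("user_id"), 0.01).as("approx_unique_viewers"))

  uniqueViewers.show()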

For visualizations, I think in the area of big data it also makes a lot of sense to visualize
aggregations based on sampling. If you really need the last 0.0001% of precision, then you can
click on the visualization and the system takes some time to calculate it exactly.
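
Something along these lines, just a sketch with placeholder names: the chart is fed from a
scaled-up sampled aggregate, and only an explicit drill-down triggers the exact computation
over the full table.

  import org.apache.spark.sql.functions.{col, count, lit}

  // aggregate a 1% sample and scale the counts up for the visualization
  val approxPerDay = spark.table("events")
    .sample(withReplacement = false, fraction = 0.01, seed = 42L)
    .groupBy(col("event_date"))
    .agg((count(lit(1)) * 100).as("approx_events"))

  // exact version, computed only when the user drills down in the chart
  val exactPerDay = spark.table("events")
    .groupBy(col("event_date"))
    .agg(count(lit(1)).as("events"))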

> On 18 Sep 2016, at 10:54, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Thanks everyone for ideas.
> 
> Sounds like Ignite has been taken over by GridGain, so it becomes similar to HazelCast,
> open source in name only. However, an in-memory Java cache may or may not help.
> 
> The other options, like faster databases, are on the table depending on who wants what (these
> are normally decisions that involve more than technical criteria). For example, if the customer
> already has Tableau, persuading them to go for QlikView may not work.
> 
> So my view is to build the batch layer foundation and leave these finer choices to the
> customer. We will offer Zeppelin with Parquet and ORC with a certain refresh interval for
> these tables and let the customer decide. I stand corrected otherwise.
> 
> BTW I did this simple test using Zeppelin (running in Spark standalone mode); a sketch of the
> three reads is included after the timings below.
> 
> 1) Read data using Spark SQL from Flume text files on HDFS (real time)
> 2) Read data using Spark SQL from an ORC table in Hive (lagging by 15 min)
> 3) Read data using Spark SQL from a Parquet table in Hive (lagging by 15 min)
> 
> Timings
> 
> 1)            2 min, 16 sec
> 2)            1 min, 1 sec 
> 3)            1 min, 6 sec
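> 
> A minimal sketch of the three reads (the HDFS path and Hive table names here are illustrative):
> 
>   // 1) raw Flume text files on HDFS
>   val fromFlume = spark.read.text("hdfs:///flume/prices/*")
>   // 2) ORC table in Hive
>   val fromOrc = spark.sql("SELECT * FROM test.prices_orc")
>   // 3) Parquet table in Hive
>   val fromParquet = spark.sql("SELECT * FROM test.prices_parquet")
> 
>   // force a full scan of each source so the timings are comparable
>   Seq(fromFlume.count(), fromOrc.count(), fromParquet.count())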
> 
> So unless one splits the atom, ORC and Parquet on Hive show similar performance.
> 
> In all probability the customer has a data warehouse and uses Tableau or QlikView or similar.
> Their BAs will carry on using these tools. If they have data scientists, they will either use
> R, which has a built-in UI, or Spark SQL with Zeppelin. Also, one can fire up Zeppelin on each
> node of the Spark cluster, or even on the same node on a different port. Then of course one
> has to think about adequate response times in a concurrent environment.
> 
> Cheers
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
> or destruction of data or any other property which may arise from relying on this email's
> technical content is explicitly disclaimed. The author will in no case be liable for any
> monetary damages arising from such loss, damage or destruction.
>  
> 
>> On 18 September 2016 at 08:52, Sean Owen <sowen@cloudera.com> wrote:
>> Alluxio isn't a database though; it's storage. I may still be harping
>> on the wrong solution for you, but as we discussed offline, that's
>> also what Impala, Drill et al. are for.
>> 
>> Sorry if this was mentioned before but Ignite is what GridGain became,
>> if that helps.
>> 
>> On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
>> <mich.talebzadeh@gmail.com> wrote:
>> > Thanks Todd
>> >
>> > As I thought Apache Ignite is a data fabric much like Oracle Coherence cache
>> > or HazelCast.
>> >
>> > The use case is different between an in-memory database (IMDB) and a data
>> > fabric. The build that I am dealing with has a 'database centric' view of
>> > its data (i.e. it accesses its data using Spark SQL and JDBC), so an
>> > in-memory database will be a better fit. On the other hand, if the
>> > application deals solely with Java objects and does not have any notion of a
>> > 'database', does not need SQL style queries and really just wants a
>> > distributed, high performance object storage grid, then I think Ignite would
>> > likely be the preferred choice.
>> >
>> > So, if needed, I will likely go for an in-memory database like Alluxio. I have
>> > seen a rather debatable comparison between Spark and Ignite that reads like a
>> > one-sided rant.
>> >
>> > HTH
>> >
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may arise
>> > from relying on this email's technical content is explicitly disclaimed. The
>> > author will in no case be liable for any monetary damages arising from such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
> 
