BTW, a tool that I have been using to preaggregate data with HyperLogLog in combination with Spark is AtScale. It builds the aggregations and makes use of the speed of Spark SQL, all within the context of a model that is accessible from Tableau or Qlik.
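
To illustrate why HyperLogLog fits preaggregation: distinct counts cannot be summed across partitions, but HyperLogLog sketches can be merged. The following is a minimal pure-Python sketch of that idea, not a description of how AtScale or Spark implement it. The class name `HyperLogLog`, the precision parameter `p`, and the sample user IDs are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not production-grade)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                  # first p bits select a register
        rest = h & ((1 << width) - 1)     # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is a register-wise maximum, so sketches built per day
        # (or per partition) can be combined without rescanning raw data.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # constant valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical data: one preaggregated sketch per day, with overlapping users.
day1 = HyperLogLog()
for user in range(0, 30000):
    day1.add(f"user-{user}")

day2 = HyperLogLog()
for user in range(20000, 50000):
    day2.add(f"user-{user}")

# The true number of distinct users over both days is 50,000.
both_days = day1.merge(day2)
estimate = both_days.count()
naive_sum = day1.count() + day2.count()   # double-counts the overlap
```

With 4096 registers the expected relative error is roughly 1.6%, so `estimate` should land close to 50,000 while `naive_sum` overshoots because users seen on both days are counted twice.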

On Thu, Mar 26, 2015 at 8:55 AM Jörn Franke wrote:

As I wrote previously, indexing is not your only choice. You can preaggregate data during load or, depending on your needs, consider other data structures such as graphs, HyperLogLog, Bloom filters, etc. (the challenge is integrating them into standard BI tools).
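
As one example of such a structure, a Bloom filter kept per data block can tell a query that a filtered value is definitely absent from that block, so the block can be skipped without an index. The following is a minimal pure-Python sketch under assumed names: `BloomFilter`, `candidate_blocks`, the bit-array size, and the sample blocks are hypothetical and not taken from any particular engine.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical column values stored in two data blocks.
blocks = {
    "block_0": ["alice", "bob"],
    "block_1": ["carol", "dave"],
}

# Build one filter per block at load time.
filters = {}
for name, values in blocks.items():
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    filters[name] = bloom


def candidate_blocks(value):
    """Return the blocks that might contain the value; all others can be skipped."""
    return [name for name, bloom in filters.items() if bloom.might_contain(value)]
```

A query filtering on `"carol"` would then only need to scan the blocks returned by `candidate_blocks("carol")`. Any block left out is guaranteed not to contain the value, while a returned block may occasionally be a false positive.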

On Mar 26, 2015 at 13:34, "kundan kumar" wrote:

I was looking for some options and came across JethroData.

It stores the data with indexes maintained over all columns, which seems good, and it claims better performance than Impala.

Earlier I had tried Apache Phoenix because of its secondary indexing feature. The major challenge I faced there was that secondary indexing was not supported by the bulk loading process.
Only the sequential loading process supported secondary indexes, and it took longer.

Any comments on this?

On Thu, Mar 26, 2015 at 5:59 PM, kundan kumar wrote:
I was looking for some options and came across

On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke wrote:

You can also preaggregate results for the users' queries. Depending on which queries they run, this might be necessary with any underlying technology.
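
A rough illustration of the preaggregation idea, independent of the underlying technology: roll the raw rows up once by the dimensions users filter on, then answer queries from the much smaller rollup. This is a minimal pure-Python sketch. The function names `preaggregate` and `revenue_for_country`, the column names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw fact rows as they might arrive during load.
raw_rows = [
    {"day": "2015-03-25", "country": "FR", "product": "A", "revenue": 10.0},
    {"day": "2015-03-25", "country": "FR", "product": "B", "revenue": 5.0},
    {"day": "2015-03-25", "country": "IN", "product": "A", "revenue": 7.5},
    {"day": "2015-03-26", "country": "FR", "product": "A", "revenue": 2.5},
    {"day": "2015-03-26", "country": "IN", "product": "B", "revenue": 4.0},
]


def preaggregate(rows, dimensions, measure):
    """Group rows by the given dimensions, keeping the sum and row count per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += row[measure]
        totals[key][1] += 1
    return {key: {"sum": s, "count": c} for key, (s, c) in totals.items()}


# Built once at load time; one entry per (day, country) instead of one per raw row.
rollup = preaggregate(raw_rows, ("day", "country"), "revenue")


def revenue_for_country(country):
    """Answer a filter query from the rollup without touching the raw rows."""
    return sum(agg["sum"] for (day, c), agg in rollup.items() if c == country)
```

The trade-off is that the rollup only serves queries on the dimensions it was built with. A filter on `product` here would still need the raw rows or a second rollup, which is why the choice depends on the queries users actually run.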

On Mar 26, 2015 at 11:27, "kundan kumar" wrote:


I need to store terabytes of data that will be used by BI tools like QlikView.

The queries can filter on any column.

Currently, we are using Redshift for this purpose.

I am trying to explore alternatives to Redshift.

Is it possible to get better performance with Spark than with Redshift?

If so, please suggest the best way to achieve this.