hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Using SPARQL against HBase
Date Mon, 05 Apr 2010 19:58:59 GMT
Just some ideas, possibly half-baked:

> From: Amandeep Khurana
> Subject: Re: Using SPARQL against HBase
> To: hbase-user@hadoop.apache.org
> 1. We want to have a SPARQL query engine over it that can return
> results to queries in real time, comparable to other systems out
> there. And since we will have HBase as the storage layer, we want
> to scale well.

Generally, I wonder if HBase may be able to trade disk space for query processing time for
expected common queries. 

So part of the story here could be using coprocessors (HBASE-2000) as a mapping layer between
the clients and the plain/simple BigTable store. For example, an RDF- and graph-relation-aware
coprocessor could produce and cache projections on the fly and use structure-aware data placement
strategies for sharding -- so the table or tables exposed to the client for enabling queries
may be only a logical construct backed by one or more real tables with far different structure,
and there would be intelligence for managing the construct running within the regionservers.
Projections could be built lazily (via interprocess BSP?), triggered by a new query or an
admin action. (And possibly the results could be cached with TTLs for automatic garbage collection
for managing the total size of the store.)
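
To make the lazy projection idea a bit more concrete, here is a minimal sketch in plain Python.
It is not coprocessor code and uses no HBase API. ProjectionCache, build_predicate_projection
and the sample triples are all hypothetical names invented for illustration. The assumption is
that a projection is built the first time a query asks for it, then reused until its TTL runs out.

```python
import time


class ProjectionCache:
    """Builds projections on first use and expires them after a TTL (hypothetical)."""

    def __init__(self, build_fn, ttl_seconds, clock=time.monotonic):
        self._build_fn = build_fn      # called on a cache miss to build a projection
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # key -> (expires_at, projection)

    def get(self, key):
        """Return a cached projection, building it lazily if absent or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        projection = self._build_fn(key)
        self._entries[key] = (now + self._ttl, projection)
        return projection

    def evict_expired(self):
        """Drop expired projections to bound total size; returns how many were dropped."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._entries.items() if expires_at <= now]
        for k in expired:
            del self._entries[k]
        return len(expired)


# Stand-in for the "real" backing table: a flat list of (subject, predicate, object).
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]


def build_predicate_projection(predicate):
    """Hypothetical projection: all (subject, object) pairs for one predicate."""
    return sorted((s, o) for s, p, o in TRIPLES if p == predicate)
```

A query for the "knows" predicate would call cache.get("knows"): the first call pays the build
cost, later calls within the TTL are served from the cache, and evict_expired() plays the role
of the automatic garbage collection.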

This opens up a range of implementation options that the basic BigTable architecture would
not support. This is like installing a purpose-built RDF store within an existing HBase+Hadoop

> 2. We want to enable large scale processing as well,
> leveraging Hadoop (maybe? read about this on Cloudera's blog),
> and maybe something like Pregel.

Edward, didn't you do some work implementing graph operations using BSP message passing within
the Hadoop framework? What were your findings? 

I think a coprocessor could implement a Pregel-like distributed graph processing model internally
to the region servers, using ZooKeeper primitives for rendezvous. 
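
As a rough illustration of the superstep model, here is a single-process Python sketch that
labels connected components by passing minimum-label messages between vertices. Everything in it
is an assumption for illustration: run_supersteps is a made-up name, the graph lives in memory
rather than in regions, and the point where the inbox is swapped stands in for the barrier that
would be a ZooKeeper rendezvous in a distributed version.

```python
from collections import defaultdict


def run_supersteps(edges, max_supersteps=50):
    """Label each vertex with the smallest vertex id in its connected component."""
    neighbors = defaultdict(set)
    for a, b in edges:                  # treat edges as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in neighbors}   # every vertex starts with its own id

    # Superstep 0: every vertex announces its label to its neighbors.
    inbox = defaultdict(list)
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    for _ in range(max_supersteps):
        if not inbox:                   # no messages in flight: all vertices halted
            break
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:     # only vertices that changed send messages
                label[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox                  # barrier: messages are delivered next superstep
    return label
```

For example, run_supersteps([("a", "b"), ("b", "c"), ("x", "y")]) labels a, b and c with "a",
and x and y with "x".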

> These things are fluid and the first step would be to spec
> out features that we want to build in

In my opinion as a potential user of such a service, the design priorities should be something like:

1) Scale.

2) Real time queries.

3) Support a reasonable subset of possible queries over the data.

Obviously both #1 and #2 are in tension with #3, so some expressiveness could be sacrificed.

#1 and #2 are in tension as well. It would not be desirable to make every possible query
return in real time, because the cost of doing so is an unsupportable space explosion.

My rationale for the above is that a BigTable-hosted RDF store could be less expressive than
the alternatives, and that would be acceptable if the reason for considering the solution is the
'Big' in BigTable. But this is not the only consideration. If it can also be fast for the
common cases even with moderately sized data, it is a good alternative, and it may already be
installed as part of a larger strategy employing the Hadoop stack.

We should consider a motivating use case, or a few of them. 

For me, I'd like a canonical source of provenance. We have a patchwork of tracking systems.
I'd like to be able to link the provenance for all of our workflows and data, inputs and outputs
at each stage. It should support fast queries for weighting inputs to predictive models. It should
also support bulk queries, so that as we assess or reassess the reliability and trustworthiness
of a source or service we would be able to trace all data and all conclusions contributed
by the entity and all that builds upon it -- the whole cascade of it -- by following the linkage.
We would be able to invalidate any conclusions based on data or process we deem (at some arbitrary
time) flawed or untrustworthy. This "provenance store" would be a new metaindex over several
workflows and data islands. 
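
The cascade part is essentially a transitive closure over "derived from" links. A minimal
Python sketch, assuming the provenance is available as a mapping from each output to its direct
inputs (downstream_of and the sample names in the usage note are hypothetical, not part of any
existing system):

```python
from collections import defaultdict, deque


def downstream_of(derivations, flawed):
    """Return everything that directly or transitively builds on `flawed`.

    `derivations` maps each output to the list of inputs it was derived from.
    """
    # Invert the mapping: input -> set of outputs that used it.
    used_by = defaultdict(set)
    for output, inputs in derivations.items():
        for source in inputs:
            used_by[source].add(output)

    tainted = set()
    queue = deque([flawed])
    while queue:                        # breadth-first walk along the linkage
        current = queue.popleft()
        for output in used_by[current]:
            if output not in tainted:   # visit each item once, so cycles terminate
                tainted.add(output)
                queue.append(output)
    return tainted
```

With derivations such as {"clean_feed": ["sensor_a"], "model_v1": ["clean_feed", "survey_b"],
"report": ["model_v1"]}, declaring "sensor_a" untrustworthy would flag clean_feed, model_v1 and
report for invalidation.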



Deletions would be rare, if they happen at all.

   - Andy

