hbase-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: Using SPARQL against HBase
Date Tue, 06 Apr 2010 05:44:59 GMT
Scaling up is not going to be an issue unless you demand performance (in
terms of low latency of queries) too. Here's a paper that has some ideas on
how we can get good performance on queries in a large scale triple store:

We can use indexing ideas from this paper, combined with coprocessors (which
I'm not yet sure how to leverage), for fast query performance.

For storing a large number of RDF triples, we might not need to add much to
HBase's data model. I'm still thinking of this idea: We could have a few
column families (<10) and hash the predicate value to a column family.

So, predicate1 can go to fam1, making fam1:predicate1, and so on.
We could use ideas from the CRUSH paper [1] for this.
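
To make the predicate-to-family idea concrete, here is a minimal sketch. It
uses plain modular hashing rather than CRUSH, and the family count (8), the
fam1..famN naming, and the function names are assumptions for illustration
only, not an existing HBase API.

```python
import hashlib

# Assumption: "a few column families (<10)"; 8 is an arbitrary choice.
NUM_FAMILIES = 8


def family_for(predicate: str, num_families: int = NUM_FAMILIES) -> str:
    """Map a predicate to a column family name using a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process for strings, and placement must be repeatable.
    """
    digest = hashlib.md5(predicate.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_families
    return f"fam{bucket + 1}"


def column_for(predicate: str) -> str:
    """Build the full column name, e.g. 'famN:<predicate>'."""
    return f"{family_for(predicate)}:{predicate}"


if __name__ == "__main__":
    for p in ("foaf:knows", "rdf:type", "dc:title"):
        print(p, "->", column_for(p))
```

One thing to keep in mind with plain modular hashing: changing the number of
families remaps most predicates, which is part of why a CRUSH-style placement
function might be worth looking at.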

Similarly, if a table is getting too big, we can have multiple tables as
well and hash the subject value to decide the table it should be placed in.
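
Putting the two hashing steps together, the placement of a single triple
could look roughly like the sketch below. The table naming (triples_N), the
table and family counts, and the choice of the subject as the row key are all
assumptions of mine for illustration; the sketch also ignores predicates with
multiple objects for the same subject.

```python
import hashlib


def stable_bucket(value: str, buckets: int) -> int:
    """Return a repeatable bucket index in [0, buckets) for a string."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def place_triple(subject: str, predicate: str, obj: str,
                 num_tables: int = 4, num_families: int = 8) -> dict:
    """Decide where one (subject, predicate, object) triple would be stored.

    The subject hash picks the table, the predicate hash picks the column
    family, and the subject doubles as the row key (an assumption).
    """
    table = f"triples_{stable_bucket(subject, num_tables)}"
    family = f"fam{stable_bucket(predicate, num_families) + 1}"
    return {
        "table": table,
        "row": subject,
        "column": f"{family}:{predicate}",
        "value": obj,
    }


if __name__ == "__main__":
    print(place_triple("ex:alice", "foaf:knows", "ex:bob"))
    print(place_triple("ex:alice", "rdf:type", "foaf:Person"))
```

With this layout, every triple for a given subject lands in the same table
and row, so a lookup by subject touches a single table.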


This gives us scale as well as the ability to do fast querying. Of course,
as Andy mentioned, we'll have to find a subset of queries that we will

[1] http://ceph.newdream.net/papers/weil-crush-sc06.pdf

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Mon, Apr 5, 2010 at 7:55 PM, <victor.hong@nokia.com> wrote:

> The priorities 1), 2) and 3) are pretty well stated. - Victor
> On 4/5/10 3:58 PM, "ext Andrew Purtell" <apurtell@apache.org> wrote:
> Just some ideas, possibly half-baked:
> > From: Amandeep Khurana
> > Subject: Re: Using SPARQL against HBase
> > To: hbase-user@hadoop.apache.org
> > 1. We want to have a SPARQL query engine over it that can return
> > results to queries in real time, comparable to other systems out
> > there. And since we will have HBase as the storage layer, we want
> > to scale well.
> Generally, I wonder if HBase may be able to trade disk space for query
> processing time for expected common queries.
> So part of the story here could be using coprocessors (HBASE-2000) as a
> mapping layer between the clients and the plain/simple BigTable store. For
> example, an RDF and graph relation aware coprocessor could produce and cache
> projections on the fly and use structure aware data placement strategies for
> sharding -- so the table or tables exposed to the client for enabling
> queries may be only a logical construct backed by one or more real tables
> with far different structure, and there would be intelligence for managing
> the construct running within the regionservers. Projections could be built
> lazily (via interprocess BSP?), triggered by a new query or an admin action.
> (And possibly the results could be cached with TTLs for automatic garbage
> collection for managing the total size of the store.)
> This opens up a range of implementation options that the basic BigTable
> architecture would not support. This is like installing a purpose-built RDF
> store within an existing HBase+Hadoop deployment.
> > 2. We want to enable large scale processing as well,
> > leveraging Hadoop (maybe? read about this on Cloudera's blog),
> > and maybe something like Pregel.
> Edward, didn't you do some work implementing graph operations using BSP
> message passing within the Hadoop framework? What were your findings?
> I think a coprocessor could implement a Pregel-like distributed graph
> processing model internally to the region servers, using ZooKeeper
> primitives for rendezvous.
> > These things are fluid and the first step would be to spec
> > out features that we want to build in
> In my opinion as a potential user of such a service, the design priorities
> should be something like:
> 1) Scale.
> 2) Real time queries.
> 3) Support a reasonable subset of possible queries over the data.
> Obviously both #1 and #2 are in tension with #3, so some expressiveness
> could be sacrificed.
> #1 and #2 are in tension as well. It would not be desirable to provide for
> all possible queries to be returned in real time given the cost of that is
> an unsupportable space explosion.
> My rationale for the above is a BigTable hosted RDF store could have less
> expressiveness than alternatives but that would be acceptable if the reason
> for considering the solution is the 'Big' in BigTable. But this is not the
> only consideration. Also if it can be fast for the common cases even with
> moderately sized data, it is a good alternative and may be already installed
> as part of a larger strategy employing the Hadoop stack.
> We should consider a motivating use case, or a few of them.
> For me, I'd like a canonical source of provenance. We have a patchwork of
> tracking systems. I'd like to be able to link the provenance for all of our
> workflows and data, inputs and outputs at each stage. Should support fast
> queries for weighting inputs to predictive models. Should support bulk
> queries also, so as we assess or reassess the reliability and
> trustworthiness of a source or service we would be able to trace all data
> and all conclusions contributed by the entity and all that build upon it --
> the whole cascade of it -- by following the linkage. We would be able to
> invalidate any conclusions based on data or process we deem (at some
> arbitrary time) flawed or untrustworthy. This "provenance store" would be a
> new metaindex over several workflows and data islands.
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=
> Deletions would be rare, if ever.
>   - Andy
