hbase-user mailing list archives

From "Ning Li" <ning.li...@gmail.com>
Subject Re: Multi get/put
Date Tue, 05 Aug 2008 14:58:36 GMT
We have been working on supporting Lucene-based indexes in HBase.
In a nutshell, we extend the region to support indexing on column(s).

We have a working implementation of our design. An overview of the
design and a preliminary performance evaluation are provided below.
We welcome feedback, and we would be happy to contribute the code
to HBase once the major performance issue is resolved.

An index can be created for a column, a column family or all the
columns. In the implementation, we extend the HRegion class so that
it manages not only the store files, which store the column values of
a region, but also the Lucene instances used to support indexing on
columns.

The following assumes a per-column index. At the end, we briefly
describe how per-column-family and all-column indexes work.

Upon receiving a column update request, a region not only adds the
column to the cache part of the store, but also analyzes the column
and adds it to the cache part of the index. Like the store files,
the Lucene index files are written to HDFS.
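
To make the update path concrete, here is a minimal sketch. It is not
our implementation: a plain in-memory inverted index stands in for
Lucene, a lowercase word split stands in for the analyzer, and the
class and method names (IndexedRegionSketch, put, search, analyze) are
hypothetical. It also assumes only the latest value of a column is
indexed, which is just one possible versioning choice.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for a region with one indexed column. */
final class IndexedRegionSketch {
    // Stand-in for the cache part of the store: row key -> latest column value.
    private final Map<String, String> storeCache = new TreeMap<>();
    // Stand-in for the cache part of the index: term -> row keys containing it.
    private final Map<String, Set<String>> indexCache = new HashMap<>();

    /** Adds the value to the store cache, then analyzes it and adds it to the index cache. */
    void put(String rowKey, String value) {
        String previous = storeCache.put(rowKey, value);
        if (previous != null) {
            // Assumption: only the latest value is searchable, so drop old postings.
            for (String term : analyze(previous)) {
                Set<String> rows = indexCache.get(term);
                if (rows != null) {
                    rows.remove(rowKey);
                }
            }
        }
        for (String term : analyze(value)) {
            indexCache.computeIfAbsent(term, t -> new TreeSet<String>()).add(rowKey);
        }
    }

    /** Returns the row keys whose current value contains the given term. */
    Set<String> search(String term) {
        Set<String> rows = indexCache.get(term.toLowerCase(Locale.ROOT));
        if (rows == null) {
            return Collections.<String>emptySet();
        }
        return new TreeSet<String>(rows);
    }

    /** Toy analyzer: lowercases and splits on non-word characters. */
    static Set<String> analyze(String value) {
        Set<String> terms = new TreeSet<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

A flush would then write both caches out, the store cache as a store
file and the index cache as index files, which the sketch omits.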

Following the HBase design, to avoid resource contention, a region
server globally schedules the cache flush and the compaction of both
the store files and the index files of all the regions on the server.

We add to HTable the following method to enable querying an index.
    Results search(range, column, query, max_num_hits);
Depending on the specified key range, a client sends a search request
to one or more region servers, which call the search method of the
queried regions. The client then merges the results from all the
queried regions.
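
The client-side merge can be sketched as follows. This is only an
illustration under assumptions: the Hit type, its fields and the
SearchMerge class are hypothetical, and we assume each region returns
hits with comparable relevance scores and that max_num_hits means
"keep the highest-scoring hits overall".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical hit type: a row key plus a relevance score. */
final class Hit {
    final String rowKey;
    final float score;

    Hit(String rowKey, float score) {
        this.rowKey = rowKey;
        this.score = score;
    }
}

final class SearchMerge {
    /** Merges per-region hit lists and keeps the maxNumHits highest-scoring hits. */
    static List<Hit> merge(List<List<Hit>> perRegionHits, int maxNumHits) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perRegionHits) {
            all.addAll(hits);
        }
        // Highest score first; ties are broken by row key so the order is deterministic.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparing((Hit h) -> h.rowKey));
        int limit = Math.max(0, Math.min(maxNumHits, all.size()));
        return new ArrayList<Hit>(all.subList(0, limit));
    }
}
```

Since each region returns at most max_num_hits hits, the client sorts
at most max_num_hits times the number of queried regions entries.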

In the current implementation, queries are conducted on the index files
stored in HDFS.

The region split works the same way as before: in addition to the
reference files for store files, the child regions also get reference
files for index files. The old parent region will be deleted once
all the reference files are deleted.
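
The parent-deletion rule can be sketched as simple bookkeeping over
outstanding references. The class and method names below are
hypothetical and the sketch ignores how and when a child actually
drops a reference; it only shows that store-file and index-file
references are tracked together.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker for the references child regions hold on a split parent. */
final class SplitParentTracker {
    // One entry per (child region, parent file) reference, for store and index files alike.
    private final Set<String> outstandingReferences = new HashSet<>();

    void addReference(String childRegion, String parentFile) {
        outstandingReferences.add(childRegion + "/" + parentFile);
    }

    void removeReference(String childRegion, String parentFile) {
        outstandingReferences.remove(childRegion + "/" + parentFile);
    }

    /** The parent may be deleted only when no child references any of its files. */
    boolean parentCanBeDeleted() {
        return outstandingReferences.isEmpty();
    }
}
```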

Our preliminary performance experiments show that the performance
of building an index is quite reasonable. However, random reads in
HDFS are so slow that search performance is dramatically worse than
on local file systems.

We are exploring different ways to solve this problem. One possibility
is to store a copy on the local file system. On the other hand, most
likely HDFS already stores a local copy...

As we mentioned earlier, an index can also be created for a column
family or for all the columns. If an index is created for a column family,
whenever a column is updated, the rest of the column family needs to
be retrieved to re-index the column family. This adds some overhead
to the indexing process. Also, it is an open question what the best
versioning semantics are.
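
The extra read for a per-column-family index can be sketched as
follows. The names are hypothetical, and the sketch assumes that a
row's document is simply the latest values of all columns in the
family joined together, which sidesteps the versioning question.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of re-indexing a whole column family on a single-column update. */
final class FamilyIndexSketch {
    // Row key -> (column qualifier -> latest value) for one column family.
    private final Map<String, Map<String, String>> family = new TreeMap<>();

    /** Applies the update, then returns the text to re-index for the whole row. */
    String updateAndBuildDocument(String rowKey, String qualifier, String value) {
        Map<String, String> columns =
                family.computeIfAbsent(rowKey, k -> new TreeMap<String, String>());
        columns.put(qualifier, value);
        // The overhead: every column of the family is read, not just the updated one.
        return String.join(" ", columns.values());
    }
}
```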
