I've been the one building out this Spark functionality in HBase, so maybe I can help clarify.
The hbase-spark module is focused on making Spark integration with HBase easy and out of the box for both Spark and Spark Streaming.
Neither I nor, I believe, the HBase team has any desire to build a SQL engine in HBase. This JIRA comes the closest to that line. The main thing here is filter pushdown logic for basic SQL operations like =, >, and <. User-defined functions and secondary indexes are not in my scope.
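To make the pushdown idea concrete, here is a minimal Python sketch. It is not the hbase-spark implementation. It assumes the predicate applies to the row key, that keys compare as raw byte strings, and that a scan takes an inclusive start row and an exclusive stop row. The names `RowKeyPredicate`, `to_scan_range`, and `matches` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical representation of a simple SQL predicate on the row key.
@dataclass(frozen=True)
class RowKeyPredicate:
    op: str       # one of "=", ">", "<"
    value: bytes  # literal the row key is compared against


ScanRange = Tuple[Optional[bytes], Optional[bytes]]


def to_scan_range(pred: RowKeyPredicate) -> ScanRange:
    """Translate a predicate into a (start_row, stop_row) pair.

    start_row is inclusive and stop_row is exclusive (assumed convention).
    None means "unbounded" on that side.
    """
    if pred.op == "=":
        # value + b"\x00" is the smallest key strictly greater than value,
        # so this range contains exactly one possible key.
        return pred.value, pred.value + b"\x00"
    if pred.op == ">":
        # Start just past the literal so the literal itself is excluded.
        return pred.value + b"\x00", None
    if pred.op == "<":
        return None, pred.value
    raise ValueError(f"unsupported operator: {pred.op}")


def matches(row_key: bytes, scan_range: ScanRange) -> bool:
    """Check whether a row key falls inside a scan range (illustration only)."""
    start, stop = scan_range
    return (start is None or row_key >= start) and (stop is None or row_key < stop)
```

The point of the sketch is that a predicate on the key can narrow the range of rows read at the storage layer, instead of reading every row into Spark and filtering there.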
Another main goal of the hbase-spark module is to let a user do anything they did with MR/HBase now with Spark/HBase, such as bulk load.
Let me know if you have any questions.
We have not “formally” published any numbers yet. A good reference is a slide deck we posted for the meetup in March, or better yet, interested parties can run performance comparisons by themselves for now.
As for the status quo of Astro, we have been focusing on fixing bugs (a UDF-related bug in some coprocessor/custom filter combos) and adding support for querying string columns in HBase as integers from Astro.
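The string-as-integer case matters because HBase stores cell values as raw bytes, and byte-wise comparison of decimal strings does not follow numeric order. The Python sketch below illustrates the difference. It is not Astro's code; `decode_int_cell`, `greater_than`, and the sample cells are hypothetical, and it assumes the cells hold UTF-8 decimal strings.

```python
from typing import List


def decode_int_cell(cell: bytes) -> int:
    """Interpret a cell holding a UTF-8 decimal string as an integer."""
    return int(cell.decode("utf-8"))


def greater_than(cells: List[bytes], threshold: int) -> List[bytes]:
    """Keep the cells whose decoded integer value exceeds the threshold."""
    return [c for c in cells if decode_int_cell(c) > threshold]


# Sample cell values as they might be stored: strings, not binary integers.
cells = [b"9", b"10", b"100", b"25"]

# Byte-wise order compares character by character, so b"10" sorts before b"9".
bytewise_order = sorted(cells)

# Numeric order requires decoding each cell first.
numeric_order = sorted(cells, key=decode_int_cell)

# A byte-wise "> 9" filter finds nothing, while the decoded filter finds three.
bytewise_gt_9 = [c for c in cells if c > b"9"]
numeric_gt_9 = greater_than(cells, 9)
```

A query engine that treats such a column as an integer has to decode before comparing, or it will return wrong results for range predicates.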
Where can I find performance numbers for Astro (it's close to the middle of August)?
On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc <Yan.Zhou.firstname.lastname@example.org> wrote:
Finally I can take a look at HBASE-14181 now. Unfortunately, there is no design doc mentioned. Superficially it is very similar to Astro, the difference being that
it is part of the HBase client library, while Astro works as a Spark package and so will evolve and function more closely with Spark SQL/DataFrame than with HBase.
In terms of architecture, my take is that it comes down to loosely coupled query engines on top of a KV store vs. an array of query engines supported by, and packaged as part of, a KV store.
Functionality-wise the two could be close but Astro also supports Python as a result of tight integration with Spark.
It will be interesting to see performance comparisons when HBASE-14181 is ready.
HBase will not have a query engine.
It will provide better support to query engines.
On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <Yan.Zhou.email@example.com> wrote:
I’m in China now, and seem to have difficulty accessing Apache JIRA. Anyway, it appears to me that HBASE-14181 attempts to support Spark DataFrame inside HBase.
If true, one question for me is whether HBase is intended to have a built-in query engine, or whether it will stay as it is today:
a k-v store with some built-in processing capabilities in the form of coprocessors, custom filters, …, etc., which allows for loosely coupled query engines
built on top of it.
Yan / Bing:
Mind taking a look at HBASE-14181 'Add Spark DataFrame DataSource to HBase-Spark Module'?
On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <firstname.lastname@example.org> wrote:
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release. http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed “Astro”, include:
· HBase pushdown capabilities
· SQL, Data Frame support
· More SQL capabilities made possible (Secondary index, bloom filter, Primary Key, Bulk load, Update)
· Joins with data from other sources
· Python/Java/Scala support
· Support for the latest Spark 1.4.0 release
The tests by the Huawei team and community contributors covered the following areas: bulk load; projection pruning; partition pruning; partial evaluation; code generation; coprocessor; custom filtering; DML; complex filtering on keys and non-keys; join/union with non-HBase data; Data Frame; multi-column family test. We will post the test results, including performance tests, in the middle of August.
You are very welcome to try out or deploy the package, and to help improve the integration tests with various combinations of the settings, extensive Data Frame tests, complex join/union tests and extensive performance tests. Please use the “Issues” and “Pull Requests” links on the package homepage if you want to report bugs or request improvements or features.
Special thanks to project owner and technical leader Yan Zhou, the Huawei global team, community contributors and Databricks. Databricks has provided great assistance from the design to the release.
“Astro”, the Spark SQL on HBase package, will be useful for ultra low latency queries. We will continue to work with the community to develop new features and improve the code base. Your comments and suggestions are greatly appreciated.
Yan Zhou / Bing Xiao
Huawei Big Data team