Our use case is to analyze images using Spark. The images are typically ~1MB each, so in order to prevent the small files problem in HDFS, we went with HBase and Phoenix. For 20+ million images and metadata, this has been working pretty well so far. Since this is pretty new to us, we didn't create a robust design:

CREATE TABLE IF NOT EXISTS mytable
(
    id VARCHAR(36) NOT NULL PRIMARY KEY,
    title VARCHAR,
    ...
    image.dtype VARCHAR(12),
    image.width UNSIGNED_INT,
    image.height UNSIGNED_INT,
    image.data VARBINARY
)


Most queries are on the metadata, so all of that is kept in the default column family. Only the image data is stored in a secondary column family. Additional indexes are created anyways, so the main table isn't usually touched.

We first run a Phoenix query to check if there are any matches. If so, then we start a Spark job on the images. The primary keys are sent to the PySpark job, which then grabs the images based on the primary keys:

df = sqlContext.read \
    .format('org.apache.phoenix.spark') \
    .option('table', 'mytable') \
    .option('zkUrl', 'localhost:2181:/hbase-unsecure') \
    .load()
df.registerTempTable('mytable')

query =
df_imgs = sqlContext.sql(
   
'SELECT IMAGE FROM mytable WHERE ID = 1 OR ID = 2 ...'
)


When this was first designed, we thought since the lookup was by primary key, it would be smart enough to do a skip scan, but it appears to be doing a full scan. The df_imgs.rdd.getNumPartitions() ends up being 450+, which matches up with the number of split files in HDFS.

Would it be better to use a foreign key and split the tables :

CREATE TABLE IF NOT EXISTS mytable
(
    id VARCHAR(36) NOT NULL PRIMARY KEY,
    title VARCHAR,
    image_id VARCHAR(36)
)
CREATE TABLE IF NOT EXISTS images
(
    image_id VARCHAR(36) NOT NULL PRIMARY KEY,
    dtype VARCHAR(12),
    width UNSIGNED_INT,
    height UNSIGNED_INT,
    data VARBINARY
)


If the first query grabs the image_ids and send them to Spark, would Spark be able to handle the query more efficiently?

If this is a better design, is there any way of moving the "image" column family from "mytable" to the default column family of the new "images" table? Is it possible to create the new table with the "image_id"s, make the foreign keys, then move the column family into the new table?


--
Mark Heppner