From user-return-51340-apmail-hbase-user-archive=hbase.apache.org@hbase.apache.org Wed May 11 23:11:24 2016 Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 747B3190BF for ; Wed, 11 May 2016 23:11:24 +0000 (UTC) Received: (qmail 40464 invoked by uid 500); 11 May 2016 23:11:22 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 40387 invoked by uid 500); 11 May 2016 23:11:22 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 40371 invoked by uid 99); 11 May 2016 23:11:22 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 11 May 2016 23:11:22 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id EC7E51802E7 for ; Wed, 11 May 2016 23:11:21 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.12 X-Spam-Level: X-Spam-Status: No, score=-0.12 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (1024-bit key) header.d=esgyn.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 2eeFQKMszwia for ; Wed, 11 May 2016 23:11:20 +0000 (UTC) Received: from mail-yw0-f170.google.com (mail-yw0-f170.google.com [209.85.161.170]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 3AFF55F56D for ; Wed, 11 May 2016 23:11:19 +0000 (UTC) Received: by mail-yw0-f170.google.com with SMTP id j74so57719103ywg.1 for ; Wed, 11 May 2016 16:11:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=esgyn.com; s=google; h=from:references:in-reply-to:mime-version:thread-index:date :message-id:subject:to; bh=yvmw/4cLWv3Jb8a+scV+FanhQykFmCfP50Z5gPzTzCI=; b=PIFFdhgFzx5xuqvjUZHu9jz5pE7hH+nu/DC+hRf8XwnLb7DZUUpBX/mRboNAPwRrsc IM53KTVwr0Rfycnjb/mmmLBVJ6uga7Pj8yqbFKG7hR+F4nUkQtBVIcVJ/6MDNUIjkNxA wwDkm2E/BrddtQ112uVMZf6Ua27oLkup9TKF8= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:references:in-reply-to:mime-version :thread-index:date:message-id:subject:to; bh=yvmw/4cLWv3Jb8a+scV+FanhQykFmCfP50Z5gPzTzCI=; b=N4oMKiuo46mop9ewVllx19rxPoyZDYQyT/ZrSSBORuqUSBqWmnUM0/u3Kfoim9p6cB 1JcmIQwkNvwVhtKvZ06oGv8jXDEGZ7RgBjl4uObOlAWP/pCsmwBgxgTnWYhhLqi7Ask4 zMrnEKNF4FcG8KMgoq1XyuTBbCz59AV/oQD6wBiTAvi7h4C30LwHyB0m/vgyvSxn9dUo 4GuS/2t6HMnzM0V/lyd0gc8RWHQFOije0/TYjTC5toE0V/dpq6zYYta1lPUpjgFDBtQx iVi3QfNww6MXpIpWrRkHU7GghpNhnQ9qJa5dz6j+9uLbzXk6LK0NLe9UcXJKzvvf9fnE Shmg== X-Gm-Message-State: AOPr4FW4KKKttVg4FvYHJs19edaH1VO9vQqPPY4bte5bThViQLT7qEOr35tWYapnF2xSoCbgvJmajPISkDCEWg== X-Received: by 10.13.221.212 with SMTP id g203mr3069919ywe.68.1463008278325; Wed, 11 May 2016 16:11:18 -0700 (PDT) From: Dave Birdsall References: In-Reply-To: MIME-Version: 1.0 X-Mailer: Microsoft Outlook 15.0 Thread-Index: AQKVtJzA7sqBtySvZjRwUWlZ6qRM554sSrAA Date: Wed, 11 May 2016 16:10:15 -0700 Message-ID: Subject: RE: Able to search by all the columns and faster than impala To: user@hbase.apache.org Content-Type: text/plain; charset=UTF-8 Hi, If your SQL-on-Hadoop solution supports secondary indexes, you can simply create those on the popular columns to speed up query time. Dave -----Original Message----- From: Bin Wang [mailto:binwang.cu@gmail.com] Sent: Wednesday, May 11, 2016 2:22 PM To: user@hbase.apache.org Subject: Able to search by all the columns and faster than impala Hi there, I have a use case here where I have a table that have low billions of rows and less than 50 columns. This is a very popular data sources where there is a huge demand internally people want to query the table. Nothing more complex than "select * from where .. and .." However, not everyone is able to write SQL and we are thinking about building this into a web application. However, we usually use the tool Hive and Impala and neither of them is fast enough to be the backend of a web application. (think about what happens if you click a button and it returned the result back to you in minutes... annoying!) We tried to dump the data into Solrcloud and it worked really well if the number of returned results is small.. but the performance is really back even when the row count is a few thousand... We also tried to use HBase where it is really hard to design a schema since a full table scan takes lots of time anyway. Then I have an idea that we can write a map reduce to turn the data into a melted format. And then we group them by column_name and cell_value, and build a list of row-id where this value appear, sort of the inverted index of Lucene. so we have two tables in HBase. So when user issues a query maybe where 3 where conditions, it will first quickly go the the inverted index to search for what are the row ids that meet each condition, then we find the intersection for those three lists. Now we have a list of all the row id where meet the three criteria. In the end, we look up those row id in another HBase table to retrieve the rows. Can anyone shed some lights on this? Best regards, Bin