From dev-return-5150-apmail-hawq-dev-archive=hawq.apache.org@hawq.incubator.apache.org Thu Sep 22 15:25:47 2016 Return-Path: X-Original-To: apmail-hawq-dev-archive@minotaur.apache.org Delivered-To: apmail-hawq-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 29BF519208 for ; Thu, 22 Sep 2016 15:25:47 +0000 (UTC) Received: (qmail 52646 invoked by uid 500); 22 Sep 2016 15:25:47 -0000 Delivered-To: apmail-hawq-dev-archive@hawq.apache.org Received: (qmail 52578 invoked by uid 500); 22 Sep 2016 15:25:47 -0000 Mailing-List: contact dev-help@hawq.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hawq.incubator.apache.org Delivered-To: mailing list dev@hawq.incubator.apache.org Received: (qmail 52565 invoked by uid 99); 22 Sep 2016 15:25:46 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Sep 2016 15:25:46 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 575ADC145A for ; Thu, 22 Sep 2016 15:25:46 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.377 X-Spam-Level: **** X-Spam-Status: No, score=4.377 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=2.397] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=pivotal-io.20150623.gappssmtp.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 7iLi1tlOHifT for ; Thu, 22 Sep 2016 15:25:44 +0000 (UTC) Received: from mail-yw0-f182.google.com (mail-yw0-f182.google.com [209.85.161.182]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id ACB335FB4A for ; Thu, 22 Sep 2016 15:25:43 +0000 (UTC) Received: by mail-yw0-f182.google.com with SMTP id u82so93239856ywc.2 for ; Thu, 22 Sep 2016 08:25:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pivotal-io.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=VKF8YQ9sGdOquJqklX0saCuea+LRV+56/VWyflnzZwM=; b=HB9Ba1lOy7hMWhEn8+BVqw3EbD/nJociyxLU5v8mctzQ2u/p/k2pN+Km0BUemHaCtU 8hyrNijoty/xAl9n5+r9jhMlcI3wZeNmLLy92+D6mCHAdyxbZGWXe/j/Ru3QqO/QuYz7 mKqP8LNG/T12jX8rQIEVzSCaQ0MkLRpTyzFzIethsU8SLq2o/AYXpGK0Ka6TdScNj+4D YMdmM8xbx/Vv8PmX+cu64ZS8BDKvDn48J0ql3fZlSKaJU4hHM+ps2NlSYqJlSN5umhaG 2Qr2hxJtSXSIyDtOup/i2rPgD5N/C/BWWAEBBYTAVtsoP+hEI6SMsgXjll12lx9r+AF2 1xLg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=VKF8YQ9sGdOquJqklX0saCuea+LRV+56/VWyflnzZwM=; b=EamjEj5iflYT40jsG9CBSm2pMMH63wvkM7Oz4zm17ImZBa06+1AvODabLX9ajNMuWC rg8Mhk7tXyR+ZdQTVPpNzCytoTkeXSgWe/aJCbQm3zaE2MOojk5c3YgIvQ1dDY2p1Ycr 6xB595Geqzce55BDHcMjzAcc7Z4o1LFvWTJooPI4REpAXCLEx3vduOv0pcIjYFIkIEU3 xv8GSpHLQ6zY3fKZIPl1/F4sjY63Gxh+99IA5I9gU4sj4MJFyFcAVX/xFUgJQq5JvPSR MpCjXg2jpRVdBLa8mkjqSiUHUARfqr2QCbyA+ACa+1sI+gfcGY6UletJBFIXPhxpB206 ASQw== X-Gm-Message-State: AE9vXwONk6sjA2YKUf+21i1qzqdP4QrDpn5rcY0d0u5Meilo3++xJDlMa3ow1FCNbMOFKlhoQcTwELnNLqHpEP43 X-Received: by 10.13.206.134 with SMTP id q128mr1864409ywd.270.1474557937479; Thu, 22 Sep 2016 08:25:37 -0700 (PDT) MIME-Version: 1.0 Received: by 10.37.108.4 with HTTP; Thu, 22 Sep 2016 08:25:37 -0700 (PDT) In-Reply-To: References: From: Hubert Zhang Date: Thu, 22 Sep 2016 23:25:37 +0800 Message-ID: Subject: Re: Co-located Joins & Data Locality in HAWQ To: dev@hawq.incubator.apache.org Cc: "user@hawq.incubator.apache.org" Content-Type: multipart/alternative; boundary=001a114e4e8610a695053d1a46e9 --001a114e4e8610a695053d1a46e9 Content-Type: text/plain; charset=UTF-8 Randomly distributed tables make Hawq2.x more elastic: big queries use more resources, while small queries will use less resources. But hash distributed table need to use the same number of resource as the bucket number of table, no matter the query cost is large or small. As a result, a scan on a very small table will also need a large number(default 6 * #node) of resources. Moreover, HDFS blocks are distributed on different nodes. To achieve better local disk read ratio, the data locality algorithm assign each block to the appropriate virtual segment for randomly distributed table. But for hash distributed table, all the blocks in one file must be processed by one virtual segment(property of hash table), so we have to assign hash table at file granularity. If blocks in a file are located on quite different nodes, data locality ratio would become low, which often happens after cluster expanding or shrinking. So even if hash distributed table will reduce some data motion, we still recommend to use randomly distributed table as default in Hawq2.x. Thanks Hubert On Wed, Sep 21, 2016 at 2:14 PM, Vineet Goel wrote: > Hi all, > > I have received a fair number of questions on the topic of handling data > locality and co-located joins in HAWQ 2. Most of the questions are coming > from the background where HAWQ 1.x defaulted to HASH distributed tables > distributed by a key and hence resulted in local joins in most cases for > better performance. > > With the new architecture and RANDOM distribution policy as default, I > thought it would be good to crowd-source some useful info here from the > community on how performance is achieved with the new architecture and data > distribution policy? Questions around how data movement is minimized, > how/when dynamic redistribution is utilized, how joins are co-located etc. > Can someone start by providing insights on this topic? > > Regards, > Vineet > -- Thanks Hubert Zhang --001a114e4e8610a695053d1a46e9--