From: Kunal Khatua
To: "user@drill.apache.org"
Subject: Re: Increasing store.parquet.block-size
Date: Fri, 9 Jun 2017 18:36:01 +0000

The ideal size depends on which engine is consuming the Parquet files (Drill, I'm guessing) and on the storage layer. For HDFS, where the block size is usually 128-256 MB, we recommend bumping it to about 512 MB (with the underlying HDFS block size set to match).

You'll probably need to experiment a little with different block sizes for files stored on S3 to see which works best.

________________________________
From: Shuporno Choudhury
Sent: Friday, June 9, 2017 11:23:37 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Thanks for the information Kunal.
After the conversion, the file size scales down to half if I use gzip compression.
For a 10 GB gzipped CSV source file, the output is 5 GB of Parquet (2+2+1), also using gzip compression.
So, if I have to make multiple Parquet files, what block size would be optimal, given that I have to read the files later?

On 09-Jun-2017 11:28 PM, "Kunal Khatua" wrote:

> If you're storing this in S3... you might want to selectively read the
> files as well.
>
> I'm only speculating, but if you want to download the data, downloading as
> a queue of files might be more reliable than one massive file. Similarly,
> within AWS, it *might* be faster to have an EC2 instance access a couple of
> large Parquet files versus one massive Parquet file.
>
> Remember that when you create a large block size, Drill tries to write
> everything within a single row group for each file. So there is no chance of
> parallelizing the read (i.e. reading parts in parallel). The defaults
> should work well for S3 as well, and with compression (e.g. Snappy),
> you should get a reasonably smaller file size.
>
> With the current default settings... have you seen what Parquet file sizes
> you get with Drill when converting your 10GB CSV source files?
>
> ________________________________
> From: Shuporno Choudhury
> Sent: Friday, June 9, 2017 10:50:06 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks Kunal for your insight.
> I am actually converting some .csv files and storing them in Parquet format
> in S3, not in HDFS.
> The size of the individual .csv source files can be quite huge (around
> 10GB).
> So, is there a way to overcome this and create one Parquet file, or do I
> have to go ahead with multiple Parquet files?
>
> On 09-Jun-2017 11:04 PM, "Kunal Khatua" wrote:
>
> > Shuporno
> >
> > There are some interesting problems when using Parquet files > 2GB on HDFS.
> >
> > If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly
> > enough) return an int value. A large Parquet block size also means you'll
> > end up having the file span multiple HDFS blocks, and that would make
> > reading of row groups inefficient.
> >
> > Is there a reason you want to create such a large Parquet file?
> >
> > ~ Kunal
> >
> > ________________________________
> > From: Vitalii Diravka
> > Sent: Friday, June 9, 2017 4:49:02 AM
> > To: user@drill.apache.org
> > Subject: Re: Increasing store.parquet.block-size
> >
> > Khurram,
> >
> > DRILL-2478 is a good placeholder for the LongValidator issue; it really
> > does work incorrectly.
> >
> > But there is another issue: it is impossible to use long values for the
> > Parquet block-size.
> > That issue could be an independent task or a sub-task of updating the
> > Drill project to the latest Parquet library.
> >
> > Kind regards
> > Vitalii
> >
> > On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz wrote:
> >
> > > 1. DRILL-2478 is open for this issue.
> > > 2. I have added more details in the comments.
> > >
> > > Thanks,
> > > Khurram
> > >
> > > ________________________________
> > > From: Shuporno Choudhury
> > > Sent: Friday, June 9, 2017 12:48:41 PM
> > > To: user@drill.apache.org
> > > Subject: Increasing store.parquet.block-size
> > >
> > > The max value that can be assigned to *store.parquet.block-size* is
> > > *2147483647*, as the value kind of this configuration parameter is LONG.
> > > This basically translates to 2GB of block size.
> > > How do I increase it to 3/4/5 GB?
> > > Trying to set this parameter to a higher value using the following
> > > command actually succeeds:
> > > ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
> > > But when I try to run a query that uses this config, it throws the
> > > following error:
> > > Error: SYSTEM ERROR: NumberFormatException: For input string: "4294967296"
> > > So, is it possible to assign a higher value to this parameter?
> > > --
> > > Regards,
> > > Shuporno Choudhury
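
A note on the error reported at the start of the thread: 2147483647 is the largest value a 32-bit Java int can hold, and 4294967296 (4 GB in bytes) is larger than that. The sketch below is only an illustration, not Drill's source code. It assumes, as the replies above suggest, that the option is stored as a long but is later parsed as an int somewhere on the write path. The class and method names (BlockSizeParseDemo, fitsInInt, mebibytesToBytes) are hypothetical and exist only for this example.

```java
// Illustration only: this is NOT Drill's source code. It assumes that the
// block-size value is eventually parsed as a 32-bit int somewhere between
// the option store and the Parquet writer, as the thread suggests.
final class BlockSizeParseDemo {

    // Returns true if the decimal string can be parsed as a Java int.
    static boolean fitsInInt(String value) {
        try {
            Integer.parseInt(value);
            return true;
        } catch (NumberFormatException e) {
            // Same exception type as the one reported in the thread.
            return false;
        }
    }

    // Converts mebibytes to bytes using long arithmetic to avoid overflow.
    static long mebibytesToBytes(long mib) {
        return mib * 1024L * 1024L;
    }

    public static void main(String[] args) {
        String[] candidates = {
            String.valueOf(mebibytesToBytes(512)), // the 512 MB size suggested for HDFS
            String.valueOf(Integer.MAX_VALUE),     // 2147483647, the reported ceiling
            "4294967296"                           // 4 GB, the value from the thread
        };
        for (String c : candidates) {
            // Every candidate parses as a long; only some also fit in an int.
            System.out.println(c + " parses as long: " + Long.parseLong(c)
                    + ", fits in int: " + fitsInInt(c));
        }
    }
}
```

If that assumption holds, it would explain why ALTER SYSTEM accepts the value while a later query fails: the option itself can hold a long, but a downstream int parse cannot. A size such as 512 MB stays well inside the int range.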