From user-return-6875-apmail-drill-user-archive=drill.apache.org@drill.apache.org Wed Sep 21 22:16:00 2016 Return-Path: X-Original-To: apmail-drill-user-archive@www.apache.org Delivered-To: apmail-drill-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2453919764 for ; Wed, 21 Sep 2016 22:16:00 +0000 (UTC) Received: (qmail 3628 invoked by uid 500); 21 Sep 2016 22:15:59 -0000 Delivered-To: apmail-drill-user-archive@drill.apache.org Received: (qmail 3563 invoked by uid 500); 21 Sep 2016 22:15:59 -0000 Mailing-List: contact user-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@drill.apache.org Delivered-To: mailing list user@drill.apache.org Received: (qmail 3550 invoked by uid 99); 21 Sep 2016 22:15:59 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Sep 2016 22:15:59 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id EBB0DC0586 for ; Wed, 21 Sep 2016 22:15:58 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.698 X-Spam-Level: X-Spam-Status: No, score=0.698 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, KAM_ASCII_DIVIDERS=0.8, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (1024-bit key) header.d=maprtech.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id yWPyfleUJIwE for ; Wed, 21 Sep 2016 22:15:56 +0000 (UTC) Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id B2B035F478 for ; Wed, 21 Sep 2016 22:15:55 +0000 (UTC) Received: by mail-pa0-f47.google.com with SMTP id hm5so22364705pac.0 for ; Wed, 21 Sep 2016 15:15:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=maprtech.com; s=google; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=r/Y2cY/UeuxS0eicYFh1MRf24AeMEAWqlbDRTAtZa24=; b=ap2z+q071U5XeTnd2JI9stn/eGJAtd6NchYi4Xb/HyW7l+7ceaoHG8o2CiSzBfLjOE ykF91NdfXlvZ8Yrbz9m5v29SRhnuA8LA2ni0Wqrt77LSeq3irL7h9+Kpu+015ToMKiv2 lj7EXB8UR0xZSfYc2r/m2y+KcAxKey/GisHSc= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=r/Y2cY/UeuxS0eicYFh1MRf24AeMEAWqlbDRTAtZa24=; b=ixY5tQ1iH4XWYBCPBo8vQogSi/fyxT40SnuvN9fS8KoawwT6Ms9axhbobhPsGLvpYL a2RdaSUzePPocAss/kgk61MbS5+yrTiRFjHBBUN9QUbXfCtY7gMUpDA5dV1HwqrHioGV SGZcUrxXNlOUG9Mb290wp7Ln5TDtsbC/444ucWP7PQVFopH5NPlQigWbVP7qVt8mCfV0 QT+yOxlVBhUIDMxlyLayWNyqLBa4jBXX8DfADGB6gGtTOKrvlV6C3hR9EYuD3ZjzfgHQ ueJj3WtlfucNA+NYyV6sKBWKVfmiiSjBVZ6kE1Z1TuY0gZiO5POw8KS4wJ8HY+eNxWob 0qvw== X-Gm-Message-State: AE9vXwPFMW7/rhZwU1MSyfRDZjYKRXttrnWm6Vvbqf8KduxOgLPxcaRohlzFwYbkgCSX4ECB X-Received: by 10.66.123.79 with SMTP id ly15mr69765185pab.157.1474496147564; Wed, 21 Sep 2016 15:15:47 -0700 (PDT) Received: from [192.168.7.25] (astound-66-234-202-236.ca.astound.net. [66.234.202.236]) by smtp.gmail.com with ESMTPSA id k77sm184914pfj.65.2016.09.21.15.15.46 for (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 21 Sep 2016 15:15:46 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Excluding HDFS .tmp file from multi-file query? From: Andries Engelbrecht In-Reply-To: Date: Wed, 21 Sep 2016 15:15:45 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <9158BEC1-938B-4499-807E-796BF9DD5673@maprtech.com> References: To: user@drill.apache.org X-Mailer: Apple Mail (2.3124) Add a . prefix to the Flume temp files. Drill will ignore the hidden = files when you query the directory structure. --Andries > On Sep 21, 2016, at 2:36 PM, Robin Moffatt = wrote: >=20 > Hi, > I have a stream of data from Flume landing in HDFS in files of a set = size. > I can query these files individually just fine, and across multiple = ones > too - except if the wildcard encompasses the *currently open HDFS file = that > Flume is writing to*. When this happens, Drill understandably barfs. >=20 > 0: jdbc:drill:drillbit=3Dlocalhost> show files in > `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`; > = +------------------------------+--------------+---------+----------+------= --+-------------+--------------+--------------------------+---------------= -----------+ > | name | isDirectory | isFile | length | = owner > | group | permissions | accessTime | > modificationTime | > = +------------------------------+--------------+---------+----------+------= --+-------------+--------------+--------------------------+---------------= -----------+ > [...] > | FlumeData.1474467815652 | false | true | 1055490 | = flume > | supergroup | rw-r--r-- | 2016-09-21 21:52:07.219 | 2016-09-21 > 21:58:58.28 | > | FlumeData.1474467815653 | false | true | 1050470 | = flume > | supergroup | rw-r--r-- | 2016-09-21 21:58:58.556 | 2016-09-21 > 22:06:28.636 | > | FlumeData.1474467815654 | false | true | 1051043 | = flume > | supergroup | rw-r--r-- | 2016-09-21 22:06:29.564 | 2016-09-21 > 22:13:40.808 | > | FlumeData.1474467815655 | false | true | 1052657 | = flume > | supergroup | rw-r--r-- | 2016-09-21 22:13:40.978 | 2016-09-21 > 22:23:00.409 | > | FlumeData.1474467815656.tmp | false | true | 9447 | = flume > | supergroup | rw-r--r-- | 2016-09-21 22:23:00.788 | 2016-09-21 > 22:23:00.788 | > = +------------------------------+--------------+---------+----------+------= --+-------------+--------------+--------------------------+---------------= -----------+ > 59 rows selected (0.265 seconds) >=20 > Note the .tmp file as the last one in the folder >=20 > Querying a single file works : >=20 > 0: jdbc:drill:drillbit=3Dlocalhost> SELECT count(*) FROM > = table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815= 655`(type > =3D> 'json')); > +---------+ > | EXPR$0 | > +---------+ > | 221 | > +---------+ > 1 row selected (0.685 seconds) >=20 >=20 > As does across multiple files where the wildcard pattern would exclude = the > .tmp file: >=20 > 0: jdbc:drill:drillbit=3Dlocalhost> SELECT count(*) FROM > = table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815= 64*`(type > =3D> 'json')); > +---------+ > | EXPR$0 | > +---------+ > | 2178 | > +---------+ > 1 row selected (1.24 seconds) >=20 >=20 > But if I try to query all the files, Drill includes the .tmp file and > errors: >=20 > 0: jdbc:drill:drillbit=3Dlocalhost> SELECT count(*) FROM > table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type =3D> = 'json')); > Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain = block > length for > = LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_2643= 43; > getBlockSize()=3D9447; corrupt=3Dfalse; offset=3D0; = locs=3D[DatanodeInfoWithStorage[ > 192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK], > = DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a1= 40ad1f55ab,DISK], > DatanodeInfoWithStorage[192.168.10.115:50010 > ,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]} >=20 > File = /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp > Record 1 > Fragment 0:0 >=20 > [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on > cdh57-01-node-01.moffatt.me:31010] (state=3D,code=3D0) > 0: jdbc:drill:drillbit=3Dlocalhost> >=20 >=20 > Is there a way around this with Drill? For example, can I use a regex = in > the path? I've tried, but just hit > Error: VALIDATION ERROR: null >=20 > thanks, Robin.