drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Excluding HDFS .tmp file from multi-file query?
Date Thu, 22 Sep 2016 14:27:56 GMT
I noticed that if you specifically use * for file matching, Drill will still read hidden
files. However, if you only point Drill at a directory, it will read the directory and its
substructure without reading any hidden files.

select * from `/dir1/*`  - will read hidden files
select * from `/dir1`    - will not read hidden files

So it depends on whether you need to use file name pattern matching or not. Most of the
time it is a good idea not to mix different data in the same directory structure, but rather
to use the directory structure to separate different data sets and types, as that makes for
easier matching and management down the road.
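
For example (paths here are purely illustrative, not from an actual cluster), keeping each
feed under its own directory lets you query it with a plain directory path, so hidden and
in-progress files are skipped and no filename pattern is needed:

select count(*) from `hdfs`.`/data/twitter/json/2016/09/22`;
select count(*) from `hdfs`.`/data/weblogs/csv/2016/09/22`;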

At any rate, you found a solution for your needs. I have not looked into whether there are
exclusion parameters for pattern matching.

--Andries

> On Sep 22, 2016, at 4:11 AM, Robin Moffatt <robin.moffatt@rittmanmead.com> wrote:
> 
> Hi,
> 
> It still tried to read it, even with a . prefix:
> 
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/*`(type => 'json'));
> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
> length for
> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074005711_265071;
> getBlockSize()=39945; corrupt=false; offset=0;
> locs=[DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK],
> DatanodeInfoWithStorage[192.168.10.114:50010,DS-6c2cd5a6-22c5-4445-9018-ca0f2549a6cf,DISK],
> DatanodeInfoWithStorage[192.168.10.117:50010,DS-70946f9d-95d9-4f35-b19d-97b8dc01cb88,DISK]]}
> 
> File  /user/flume/incoming/twitter/2016/09/22/.FlumeData.1474530954642.tmp
> Record  1
> Fragment 0:0
> 
> [Error Id: 04d44b5d-6d02-4062-9f4a-ebf3831d9ba1 on
> cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> 
> 
> However - good call on the Flume hdfs.inUsePrefix - by using that, I can
> set my Drill file pattern so that it doesn't match the in-use file:
> 
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/Flume*`(type => 'json'));
> +---------+
> | EXPR$0  |
> +---------+
> | 12652   |
> +---------+
> 1 row selected (6.34 seconds)
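> 
> For reference, here's a minimal sketch of the relevant Flume HDFS sink
> settings (agent and sink names are placeholders, not my actual config):
> 
> # hide in-flight files with a "." prefix so they don't match Flume*;
> # Flume renames the file when it rolls/closes it
> agent1.sinks.hdfs1.type = hdfs
> agent1.sinks.hdfs1.hdfs.path = /user/flume/incoming/twitter/%Y/%m/%d
> agent1.sinks.hdfs1.hdfs.inUsePrefix = .
> agent1.sinks.hdfs1.hdfs.inUseSuffix = .tmp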
> 
> 
> So the problem is sidestepped for now - but if I couldn't modify the
> producer, is there a way to get Drill to omit certain patterns from its file
> matching?
> 
> thanks, Robin.
> 
> 
> On 21 September 2016 at 23:15, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
> 
>> Add a . prefix to the Flume temp files. Drill will ignore the hidden files
>> when you query the directory structure.
>> 
>> --Andries
>> 
>>> On Sep 21, 2016, at 2:36 PM, Robin Moffatt <
>> robin.moffatt@rittmanmead.com> wrote:
>>> 
>>> Hi,
>>> I have a stream of data from Flume landing in HDFS in files of a set
>> size.
>>> I can query these files individually just fine, and across multiple ones
>>> too - except if the wildcard encompasses the *currently open HDFS file
>> that
>>> Flume is writing to*. When this happens, Drill understandably barfs.
>>> 
>>> 0: jdbc:drill:drillbit=localhost> show files in
>>> `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
>>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
>>> |             name             | isDirectory  | isFile  |  length  | owner  |    group    | permissions  |        accessTime        |     modificationTime     |
>>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
>>> [...]
>>> | FlumeData.1474467815652      | false        | true    | 1055490  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21 21:58:58.28   |
>>> | FlumeData.1474467815653      | false        | true    | 1050470  | flume  | supergroup  | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21 22:06:28.636  |
>>> | FlumeData.1474467815654      | false        | true    | 1051043  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21 22:13:40.808  |
>>> | FlumeData.1474467815655      | false        | true    | 1052657  | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21 22:23:00.409  |
>>> | FlumeData.1474467815656.tmp  | false        | true    | 9447     | flume  | supergroup  | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21 22:23:00.788  |
>>> +------------------------------+--------------+---------+----------+--------+-------------+--------------+--------------------------+--------------------------+
>>> 59 rows selected (0.265 seconds)
>>> 
>>> Note the .tmp file as the last one in the folder
>>> 
>>> Querying a single file works:
>>> 
>>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
>>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
>>> +---------+
>>> | EXPR$0  |
>>> +---------+
>>> | 221     |
>>> +---------+
>>> 1 row selected (0.685 seconds)
>>> 
>>> 
>>> As does across multiple files where the wildcard pattern would exclude
>> the
>>> .tmp file:
>>> 
>>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
>>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
>>> +---------+
>>> | EXPR$0  |
>>> +---------+
>>> | 2178    |
>>> +---------+
>>> 1 row selected (1.24 seconds)
>>> 
>>> 
>>> But if I try to query all the files, Drill includes the .tmp file and
>>> errors:
>>> 
>>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
>>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
>>> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
>>> length for
>>> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
>>> getBlockSize()=9447; corrupt=false; offset=0;
>>> locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
>>> DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
>>> DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
>>> 
>>> File  /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
>>> Record  1
>>> Fragment 0:0
>>> 
>>> [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on
>>> cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
>>> 0: jdbc:drill:drillbit=localhost>
>>> 
>>> 
>>> Is there a way around this with Drill? For example, can I use a regex in
>>> the path? I've tried, but just hit
>>> Error: VALIDATION ERROR: null
>>> 
>>> thanks, Robin.
>> 
>> 

