From: Andries Engelbrecht
Subject: Re: Query local files on cluster? [Beginner]
Date: Wed, 27 May 2015 08:44:09 -0700
To: user@drill.apache.org

OK, that is the simplest way to get going, and you can see how the solution works for you. It can be a little confusing moving between the local FS and working with a cluster.

I have found that when dealing with large data volumes it is much easier to use NFS on the MapR cluster to move data directly to the DFS, bypassing the local-FS-then-DFS step. It skips a step, and it is also a lot more robust and faster to get the data directly onto the DFS.

On May 27, 2015, at 8:37 AM, Matt wrote:

>> Drill can process a lot of data quickly, and for best performance and consistency you will likely find that the sooner you get the data to the DFS the better.
> 
> Already most of the way there. Initial confusion came from the features to query the local / native filesystem, and how that does not fit a distributed Drill cluster well. In other words, it's really an embedded / single-node Drill feature.
> 
> Currently using the approach of doing a put from the local filesystem into HDFS, then CTAS into Parquet, if only for simplicity in testing (not performance).
> 
> Thanks!
> 
>> On 27 May 2015, at 11:29, Andries Engelbrecht wrote:
>> 
>> You will be better off using the Drill cluster as a whole vs trying to play with local vs DFS storage.
>> 
>> A couple of ideas:
>> As previously mentioned, you can use the robust NFS on MapR to easily place the CSV files on the DFS, and then use Drill with CTAS to convert the files to Parquet on the DFS.
>> 
>> You can set up a remote NFS server and map the local FS on each node to the same NFS mount point on the remote NFS server; the files will then be consistently available to the Drillbits in the cluster, and you can do CTAS to create Parquet files on the DFS. This will likely be a lot slower than the first option, however, as the NFS server bandwidth will become a bottleneck if you have a number of Drillbits in the cluster.
>> 
>> Or just copy the files to one node in the cluster, use hadoop fs to put the files in the DFS, and then do the CTAS from DFS to Parquet on the DFS.
>> 
>> You can even place the data on S3 and then query and CTAS from there; however, security and bandwidth may be a concern for large data volumes, pending the use case.
>> 
>> I really think you will find the first option the most robust and fastest in the long run. You can point Drill at any FS source as long as it is consistent to all nodes in the cluster, but keep in mind that Drill can process a lot of data quickly, and for best performance and consistency you will likely find that the sooner you get the data to the DFS the better.
>> 
>> 
>>> On May 26, 2015, at 5:58 PM, Matt wrote:
>>> 
>>> Thanks, I was incorrectly conflating the file system with data storage.
>>> 
>>> Looking to experiment with the Parquet format, and was looking at CTAS queries as an import approach.
>>> 
>>> Are direct queries over local files meant for an embedded Drill, where on a cluster files should be moved into HDFS first?
>>> 
>>> That would make sense, as files on one node would be query-bound to that local filesystem.
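[Archive note: a minimal sketch of the put-then-CTAS workflow Matt describes above, as run from sqlline. The file name, column positions, and aliases are hypothetical; it assumes a dfs.root workspace pointing at the staging directory and a writable dfs.tmp workspace, as in Drill's shipped defaults.]

~~~
-- Convert a staged CSV to Parquet with CTAS (Parquet is Drill's default
-- output format; the ALTER SESSION just makes that explicit).
ALTER SESSION SET `store.format` = 'parquet';

-- Drill's text reader exposes one COLUMNS array per row;
-- alias the positions you need in the SELECT.
CREATE TABLE dfs.tmp.`customer_reviews_1998` AS
SELECT COLUMNS[0] AS customer_id,
       COLUMNS[1] AS review_date,
       COLUMNS[2] AS rating
FROM dfs.root.`customer_reviews_1998.csv`;
~~~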
>>> 
>>>> On May 26, 2015, at 8:28 PM, Andries Engelbrecht wrote:
>>>> 
>>>> You can use the HDFS shell:
>>>> hadoop fs -put
>>>> 
>>>> to copy from the local file system to HDFS.
>>>> 
>>>> For more robust mechanisms from remote systems you can look at using NFS; MapR has a really robust NFS integration, and you can use it with the community edition.
>>>> 
>>>>> On May 26, 2015, at 5:11 PM, Matt wrote:
>>>>> 
>>>>> That might be the end goal, but currently I don't have an HDFS ingest mechanism.
>>>>> 
>>>>> We are not currently a Hadoop shop - can you suggest simple approaches for bulk loading data from delimited files into HDFS?
>>>>> 
>>>>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht wrote:
>>>>>> 
>>>>>> Perhaps I'm missing something here.
>>>>>> 
>>>>>> Why not create a DFS plugin for HDFS and put the file in HDFS?
>>>>>> 
>>>>>>> On May 26, 2015, at 4:54 PM, Matt wrote:
>>>>>>> 
>>>>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes; it appears text files need to be on all nodes in a cluster?
>>>>>>> 
>>>>>>> Using the dfs config below, I am only able to query if a csv file is on all 4 nodes. If the file is only on the local node and not the others, I get errors in the form of:
>>>>>>> 
>>>>>>> ~~~
>>>>>>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>>>>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'root.customer_reviews_1998.csv' not found
>>>>>>> ~~~
>>>>>>> 
>>>>>>> ~~~
>>>>>>> {
>>>>>>>   "type": "file",
>>>>>>>   "enabled": true,
>>>>>>>   "connection": "file:///",
>>>>>>>   "workspaces": {
>>>>>>>     "root": {
>>>>>>>       "location": "/localdata/hadoop/stage",
>>>>>>>       "writable": false,
>>>>>>>       "defaultInputFormat": null
>>>>>>>     },
>>>>>>> ~~~
>>>>>>> 
>>>>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>>>>> 
>>>>>>>> The storage plugin "location" needs to be the full path to the localdata directory. This partial storage plugin definition works for the user named mapr:
>>>>>>>> 
>>>>>>>> {
>>>>>>>>   "type": "file",
>>>>>>>>   "enabled": true,
>>>>>>>>   "connection": "file:///",
>>>>>>>>   "workspaces": {
>>>>>>>>     "root": {
>>>>>>>>       "location": "/home/mapr/localdata",
>>>>>>>>       "writable": false,
>>>>>>>>       "defaultInputFormat": null
>>>>>>>>     },
>>>>>>>> . . .
>>>>>>>> 
>>>>>>>> Here's a working query for the data in localdata:
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>>>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>>>>> 
>>>>>>>> A complete example, not yet published on the Drill site, shows the steps involved in detail:
>>>>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>>>>> 
>>>>>>>> Kristine Hahn
>>>>>>>> Sr. Technical Writer
>>>>>>>> 415-497-8107 @krishahn
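[Archive note: a full plugin definition of the shape Kristine sketches might look like the following. The tmp workspace and the formats block are assumptions modeled on the dfs plugin Drill ships with, not part of the original message.]

~~~
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/home/mapr/localdata",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "parquet": {
      "type": "parquet"
    }
  }
}
~~~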
>>>>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt wrote:
>>>>>>>>> 
>>>>>>>>> I have used a single-node install (unzip and run) to query local text / csv files, but on a 3-node cluster (installed via MapR CE), a query with local files results in:
>>>>>>>>> 
>>>>>>>>> ~~~
>>>>>>>>> sqlline version 1.1.6
>>>>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>> ~~~
>>>>>>>>> 
>>>>>>>>> Is there a special config for local file querying? An initial doc search did not point me to a solution, but I may simply not have found the relevant sections.
>>>>>>>>> 
>>>>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>>>>> 
>>>>>>>>> ~~~
>>>>>>>>> "type": "file",
>>>>>>>>> "enabled": true,
>>>>>>>>> "connection": "file:///",
>>>>>>>>> "workspaces": {
>>>>>>>>>   "root": {
>>>>>>>>>     "location": "/localdata",
>>>>>>>>>     "writable": false,
>>>>>>>>>     "defaultInputFormat": null
>>>>>>>>>   }
>>>>>>>>> ~~~
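[Archive note: the ingest step the thread converges on, spelled out as a shell sketch. Paths are hypothetical; it assumes a node with the Hadoop client configured and an HDFS-backed storage plugin whose root workspace points at /drill/stage.]

~~~
# Stage a delimited file in HDFS so every Drillbit sees the same data.
hadoop fs -mkdir -p /drill/stage
hadoop fs -put /localdata/testdata.csv /drill/stage/

# Confirm the file lives in the DFS rather than one node's local FS.
hadoop fs -ls /drill/stage
~~~

With the plugin's "connection" set to the HDFS namenode (for example hdfs://namenode:8020, also hypothetical) instead of file:///, a query such as select * from dfs.root.`testdata.csv` then resolves on any node in the cluster.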