drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Query local files on cluster? [Beginner]
Date Wed, 27 May 2015 15:29:12 GMT
You will be better off using the Drill cluster as a whole rather than trying to juggle local vs. DFS storage.

A couple of ideas:
As previously mentioned, you can use the robust NFS support on MapR to easily place the CSV files on the DFS, and then use Drill CTAS to convert the files to Parquet on the DFS.
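A CTAS along these lines would do the CSV-to-Parquet conversion. This is just a minimal sketch; the dfs.stage and dfs.parquet workspace names and the column names/types are assumptions, not something from your setup (the target workspace must be writable):

~~~
-- Sketch only: the workspaces and columns below are assumed, adjust to your plugin config.
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.parquet.`customer_reviews_1998` AS
SELECT columns[0]              AS review_date,
       columns[1]              AS customer_id,
       CAST(columns[2] AS INT) AS rating
FROM dfs.stage.`customer_reviews_1998.csv`;
~~~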

You can also set up a remote NFS server and map the local FS on each node to the same NFS mount point on that server. That way the files will be consistently available to the Drillbits in the cluster, and you can use CTAS to create Parquet files on the DFS. This, however, will likely be a lot slower than the first option, as the NFS server bandwidth will become a bottleneck if you have a number of Drillbits in the cluster.

Alternatively, just copy the files to one node in the cluster, use hadoop fs -put to place them in the DFS, and then run the same CTAS from the CSV files on the DFS to Parquet on the DFS.

You can even place the data on S3 and then query and run CTAS from there, although security and bandwidth may be a concern for large data volumes, depending on the use case.
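As a rough sketch of that route (the s3 plugin name and its root workspace here are assumptions; you would register them in an S3 storage plugin first):

~~~
-- Sketch only: assumes an S3 storage plugin registered as `s3` with a root workspace,
-- and a writable dfs.parquet workspace on the cluster DFS.
SELECT columns[0], columns[1]
FROM s3.root.`testdata.csv`
LIMIT 10;

CREATE TABLE dfs.parquet.`testdata` AS
SELECT columns[0] AS col0,
       columns[1] AS col1
FROM s3.root.`testdata.csv`;
~~~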

I really think you will find the first option the most robust and fastest in the long run. You can point Drill at any FS source as long as it is consistently visible to all nodes in the cluster, but keep in mind that Drill can process a lot of data quickly, and for best performance and consistency you will likely find that the sooner you get the data onto the DFS the better.




> On May 26, 2015, at 5:58 PM, Matt <bsg075@gmail.com> wrote:
> 
> Thanks, I am incorrectly conflating the file system with data storage. 
> 
> Looking to experiment with the Parquet format, and was looking at CTAS queries as an import approach.
> 
> Are direct queries over local files meant for an embedded Drill, where on a cluster files should be moved into HDFS first?
> 
> That would make sense as files on one node would be query bound to that local filesystem.

> 
>> On May 26, 2015, at 8:28 PM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
>> 
>> You can use the HDFS shell
>> hadoop fs -put
>> 
>> To copy from local file system to HDFS
>> 
>> 
>> For more robust mechanisms from remote systems you can look at using NFS. MapR has a really robust NFS integration, and you can use it with the community edition.
>> 
>> 
>> 
>> 
>>> On May 26, 2015, at 5:11 PM, Matt <bsg075@gmail.com> wrote:
>>> 
>>> 
>>> That might be the end goal, but currently I don't have an HDFS ingest mechanism.

>>> 
>>> We are not currently a Hadoop shop - can you suggest simple approaches for bulk loading data from delimited files into HDFS?
>>> 
>>> 
>>> 
>>> 
>>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:
>>>> 
>>>> Perhaps I'm missing something here.
>>>> 
>>>> Why not create a DFS plugin for HDFS and put the file in HDFS?
>>>> 
>>>> 
>>>> 
>>>>> On May 26, 2015, at 4:54 PM, Matt <bsg075@gmail.com> wrote:
>>>>> 
>>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes, it appears text files need to be on all nodes in a cluster?
>>>>> 
>>>>> Using the dfs config below, I am only able to query if a csv file is on all 4 nodes. If the file is only on the local node and not others, I get errors in the form of:
>>>>> 
>>>>> ~~~
>>>>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'root.customer_reviews_1998.csv' not found
>>>>> ~~~
>>>>> 
>>>>> ~~~
>>>>> {
>>>>>   "type": "file",
>>>>>   "enabled": true,
>>>>>   "connection": "file:///",
>>>>>   "workspaces": {
>>>>>     "root": {
>>>>>       "location": "/localdata/hadoop/stage",
>>>>>       "writable": false,
>>>>>       "defaultInputFormat": null
>>>>>     },
>>>>> ~~~
>>>>> 
>>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>>> 
>>>>>> The storage plugin "location" needs to be the full path to the localdata
>>>>>> directory. This partial storage plugin definition works for the user named mapr:
>>>>>> 
>>>>>> {
>>>>>>   "type": "file",
>>>>>>   "enabled": true,
>>>>>>   "connection": "file:///",
>>>>>>   "workspaces": {
>>>>>>     "root": {
>>>>>>       "location": "/home/mapr/localdata",
>>>>>>       "writable": false,
>>>>>>       "defaultInputFormat": null
>>>>>>     },
>>>>>> . . .
>>>>>> 
>>>>>> Here's a working query for the data in localdata:
>>>>>> 
>>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>>> 
>>>>>> A complete example, not yet published on the Drill site, shows the steps involved in detail:
>>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>>> 
>>>>>> 
>>>>>> Kristine Hahn
>>>>>> Sr. Technical Writer
>>>>>> 415-497-8107 @krishahn
>>>>>> 
>>>>>> 
>>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt <bsg075@gmail.com> wrote:
>>>>>>> 
>>>>>>> I have used a single node install (unzip and run) to query local text / csv files,
>>>>>>> but on a 3 node cluster (installed via MapR CE), a query with local files results in:
>>>>>>> 
>>>>>>> ~~~
>>>>>>> sqlline version 1.1.6
>>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>>> 
>>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>>> ~~~
>>>>>>> 
>>>>>>> Is there a special config for local file querying? An initial doc search
>>>>>>> did not point me to a solution, but I may simply not have found the relevant sections.
>>>>>>> 
>>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>>> 
>>>>>>> ~~~
>>>>>>> "type": "file",
>>>>>>> "enabled": true,
>>>>>>> "connection": "file:///",
>>>>>>> "workspaces": {
>>>>>>>   "root": {
>>>>>>>     "location": "/localdata",
>>>>>>>     "writable": false,
>>>>>>>     "defaultInputFormat": null
>>>>>>>   }
>>>>>>> ~~~
>> 
