drill-user mailing list archives

From Matt <bsg...@gmail.com>
Subject Re: Query local files on cluster? [Beginner]
Date Wed, 27 May 2015 15:37:07 GMT
> Drill can process a lot of data quickly, and for best performance and 
> consistency you will likely find that the sooner you get the data to 
> the DFS the better.

Already most of the way there. My initial confusion came from the 
feature for querying the local / native filesystem, and how that does 
not fit a distributed Drill cluster well. In other words, it's really an 
embedded / single-node Drill feature.

Currently using the approach of doing a put from the local filesystem 
into HDFS, then CTAS into Parquet, if only for simplicity in testing 
(not performance).
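
Roughly this sequence, where the plugin name (hdfs), the workspace 
names, the paths, and the column aliases are just placeholders for my 
setup:

~~~
hadoop fs -put /localdata/stage/customer_reviews_1998.csv /stage/

CREATE TABLE hdfs.parquet.`customer_reviews_1998` AS
SELECT columns[0] AS review_id,
       columns[1] AS review_text
FROM hdfs.stage.`customer_reviews_1998.csv`;
~~~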

Thanks!

On 27 May 2015, at 11:29, Andries Engelbrecht wrote:

> You will be better off using the Drill cluster as a whole rather than 
> trying to juggle local vs DFS storage.
>
> A couple of ideas:
> As previously mentioned you can use the robust NFS on MapR to easily 
> place the CSV/files on the DFS, and then use Drill with CTAS to 
> convert the files to Parquet on the DFS.
>
> You can set up a remote NFS server and mount it at the same mount 
> point on the local FS of each node; that way the files will be 
> consistently available to the Drillbits in the cluster and you can do 
> CTAS to create Parquet files on the DFS. This, however, will likely be 
> a lot slower than the first option, as the NFS server bandwidth will 
> become a bottleneck if you have a number of Drillbits in the 
> cluster.
>
> Alternatively, just copy the files to one node in the cluster, use 
> hadoop fs to put the files in the DFS, and then do the CTAS from the 
> DFS files to Parquet on the DFS.
>
> You can even place the data on S3 and then query and CTAS from there, 
> however security and bandwidth may be a concern for large data 
> volumes, depending on the use case.
>
> I really think you will find the first option the most robust and 
> fastest in the long run. You can point Drill at any FS source as long 
> as it is consistent to all nodes in the cluster, but keep in mind that 
> Drill can process a lot of data quickly, and for best performance and 
> consistency you will likely find that the sooner you get the data to 
> the DFS the better.
>
>
>
>
>> On May 26, 2015, at 5:58 PM, Matt <bsg075@gmail.com> wrote:
>>
>> Thanks, I am incorrectly conflating the file system with data 
>> storage.
>>
>> Looking to experiment with the Parquet format, and was looking at 
>> CTAS queries as an import approach.
>>
>> Are direct queries over local files meant for an embedded drill, 
>> where on a cluster files should be moved into HDFS first?
>>
>> That would make sense as files on one node would be query bound to 
>> that local filesystem.
>>
>>> On May 26, 2015, at 8:28 PM, Andries Engelbrecht 
>>> <aengelbrecht@maprtech.com> wrote:
>>>
>>> You can use the HDFS shell:
>>>
>>> hadoop fs -put
>>>
>>> to copy from the local file system to HDFS.
>>>
>>>
>>> For more robust mechanisms from remote systems, you can look at using 
>>> NFS; MapR has a really solid NFS integration, and you can use it 
>>> with the community edition.
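>>>
>>> For example (the target HDFS directory here is just illustrative):
>>>
>>> hadoop fs -mkdir -p /stage
>>> hadoop fs -put /localdata/testdata.csv /stage/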
>>>
>>>
>>>
>>>
>>>> On May 26, 2015, at 5:11 PM, Matt <bsg075@gmail.com> wrote:
>>>>
>>>>
>>>> That might be the end goal, but currently I don't have an HDFS 
>>>> ingest mechanism.
>>>>
>>>> We are not currently a Hadoop shop - can you suggest simple 
>>>> approaches for bulk loading data from delimited files into HDFS?
>>>>
>>>>
>>>>
>>>>
>>>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht 
>>>>> <aengelbrecht@maprtech.com> wrote:
>>>>>
>>>>> Perhaps I’m missing something here.
>>>>>
>>>>> Why not create a DFS plug-in for HDFS and put the file in HDFS?
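>>>>>
>>>>> A minimal sketch of what such a plugin could look like (the 
>>>>> namenode host/port and the workspace path are placeholders for 
>>>>> your cluster; writable so it can also be a CTAS target):
>>>>>
>>>>> ~~~
>>>>> {
>>>>>   "type": "file",
>>>>>   "enabled": true,
>>>>>   "connection": "hdfs://namenode:8020",
>>>>>   "workspaces": {
>>>>>     "root": {
>>>>>       "location": "/stage",
>>>>>       "writable": true,
>>>>>       "defaultInputFormat": null
>>>>>     }
>>>>>   }
>>>>> }
>>>>> ~~~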
>>>>>
>>>>>
>>>>>
>>>>>> On May 26, 2015, at 4:54 PM, Matt <bsg075@gmail.com> wrote:
>>>>>>
>>>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes; it 
>>>>>> appears text files need to be on all nodes in the cluster?
>>>>>>
>>>>>> Using the dfs config below, I am only able to query if a csv file 
>>>>>> is on all 4 nodes. If the file is only on the local node and not 
>>>>>> others, I get errors in the form of:
>>>>>>
>>>>>> ~~~
>>>>>> 0: jdbc:drill:zk=es05:2181> select * from 
>>>>>> root.`customer_reviews_1998.csv`;
>>>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18:
>>>>>> Table 'root.customer_reviews_1998.csv' not found
>>>>>> ~~~
>>>>>>
>>>>>> ~~~
>>>>>> {
>>>>>>   "type": "file",
>>>>>>   "enabled": true,
>>>>>>   "connection": "file:///",
>>>>>>   "workspaces": {
>>>>>>     "root": {
>>>>>>       "location": "/localdata/hadoop/stage",
>>>>>>       "writable": false,
>>>>>>       "defaultInputFormat": null
>>>>>>     },
>>>>>> ~~~
>>>>>>
>>>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>>>>
>>>>>>> The storage plugin "location" needs to be the full path to the
>>>>>>> localdata directory. This partial storage plugin definition works
>>>>>>> for the user named mapr:
>>>>>>>
>>>>>>> {
>>>>>>>   "type": "file",
>>>>>>>   "enabled": true,
>>>>>>>   "connection": "file:///",
>>>>>>>   "workspaces": {
>>>>>>>     "root": {
>>>>>>>       "location": "/home/mapr/localdata",
>>>>>>>       "writable": false,
>>>>>>>       "defaultInputFormat": null
>>>>>>>     },
>>>>>>> . . .
>>>>>>>
>>>>>>> Here's a working query for the data in localdata:
>>>>>>>
>>>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>>>>
>>>>>>> A complete example, not yet published on the Drill site, shows in
>>>>>>> detail the steps involved:
>>>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>>>>
>>>>>>>
>>>>>>> Kristine Hahn
>>>>>>> Sr. Technical Writer
>>>>>>> 415-497-8107 @krishahn
>>>>>>>
>>>>>>>
>>>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt <bsg075@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I have used a single node install (unzip and run) to query local
>>>>>>>> text / csv files, but on a 3 node cluster (installed via MapR CE),
>>>>>>>> a query with local files results in:
>>>>>>>>
>>>>>>>> ~~~
>>>>>>>> sqlline version 1.1.6
>>>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>
>>>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17:
>>>>>>>> Table 'dfs./localdata/testdata.csv' not found
>>>>>>>> ~~~
>>>>>>>>
>>>>>>>> Is there a special config for local file querying? An initial doc
>>>>>>>> search did not point me to a solution, but I may simply not have
>>>>>>>> found the relevant sections.
>>>>>>>>
>>>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>>>>
>>>>>>>> ~~~
>>>>>>>> "type": "file",
>>>>>>>> "enabled": true,
>>>>>>>> "connection": "file:///",
>>>>>>>> "workspaces": {
>>>>>>>> "root": {
>>>>>>>> "location": "/localdata",
>>>>>>>> "writable": false,
>>>>>>>> "defaultInputFormat": null
>>>>>>>> }
>>>>>>>> ~~~
>>>
