drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Query local files on cluster? [Beginner]
Date Wed, 27 May 2015 15:44:09 GMT
OK, that is the simplest way to get going, and you can see how the solution works for you. It can be a little confusing moving between the local FS and working with a cluster.

I have found that when dealing with large data volumes it is much easier to use the NFS interface on the MapR cluster to move data directly to the DFS, rather than copying to the local FS and then to the DFS. It skips a step, and it is also a lot more robust and faster to get the data onto the DFS directly.
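
For example, with the cluster NFS-mounted on the machine that holds the files, moving them onto the DFS is just a copy. A minimal sketch, assuming the standard /mapr mount point (the cluster name and target volume path are placeholders):

~~~
# Copy the staged CSV files straight onto the DFS via the NFS mount
cp /localdata/hadoop/stage/*.csv /mapr/<cluster-name>/stage/
~~~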



On May 27, 2015, at 8:37 AM, Matt <bsg075@gmail.com> wrote:

>> Drill can process a lot of data quickly, and for best performance and consistency
you will likely find that the sooner you get the data to the DFS the better.
> 
> Already most of the way there. The initial confusion came from the features for querying the local / native filesystem, and how that does not fit a distributed Drill cluster well. In other words, it's really an embedded / single-node Drill feature.
> 
> Currently using the approach of doing a put from the local filesystem into HDFS, then CTAS into Parquet, if only for simplicity in testing (not performance).
> 
> Thanks!
> 
>> On 27 May 2015, at 11:29, Andries Engelbrecht wrote:
>> 
>> You will be better off using the Drill cluster as a whole rather than trying to juggle local vs. DFS storage.
>> 
>> A couple of ideas:
>> As previously mentioned, you can use the robust NFS on MapR to easily place the CSV files on the DFS, and then use Drill with CTAS to convert the files to Parquet on the DFS.
>> 
>> You can set up a remote NFS server and mount it at the same mount point on the local FS of each node. That way the files will be consistently available to the Drillbits in the cluster, and you can do CTAS to create Parquet files on the DFS. This will likely be a lot slower than the first option, however, as the NFS server's bandwidth will become a bottleneck if you have a number of Drillbits in the cluster.
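>> 
>> A minimal sketch of that setup, run on each node (the server name and paths are only illustrative):
>> 
>> ~~~
>> # Mount the same remote export at the same local path on every node
>> mount -t nfs nfs-server:/export/stage /data/stage
>> ~~~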
>> 
>> Or just copy the files to one node in the cluster, use hadoop fs -put to place them on the DFS, and then do the CTAS from the DFS files to Parquet on the DFS.
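>> 
>> For instance, once a CSV file is on the DFS, the conversion is a single CTAS. A sketch, assuming a dfs storage plugin that points at the distributed file system and a writable target workspace (the workspace and table name here are only examples):
>> 
>> ~~~
>> ALTER SESSION SET `store.format` = 'parquet';
>> CREATE TABLE dfs.tmp.`customer_reviews_parquet` AS
>> SELECT * FROM dfs.root.`customer_reviews_1998.csv`;
>> ~~~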
>> 
>> You can even place the data on S3 and then query and CTAS from there; however, security and bandwidth may be a concern for large data volumes, depending on the use case.
>> 
>> I really think you will find the first option the most robust and fastest in the long run. You can point Drill at any FS source as long as it is consistent across all nodes in the cluster, but keep in mind that Drill can process a lot of data quickly, and for best performance and consistency you will likely find that the sooner you get the data to the DFS the better.
>> 
>> 
>> 
>> 
>>> On May 26, 2015, at 5:58 PM, Matt <bsg075@gmail.com> wrote:
>>> 
>>> Thanks, I am incorrectly conflating the file system with data storage.
>>> 
>>> Looking to experiment with the Parquet format, and considering CTAS queries as an import approach.
>>> 
>>> Are direct queries over local files meant for an embedded Drill, while on a cluster files should be moved into HDFS first?
>>> 
>>> That would make sense, as files on one node would only be queryable through that node's local filesystem.
>>> 
>>>> On May 26, 2015, at 8:28 PM, Andries Engelbrecht <aengelbrecht@maprtech.com>
wrote:
>>>> 
>>>> You can use the HDFS shell
>>>> hadoop fs -put
>>>> 
>>>> to copy from the local file system to HDFS.
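>>>> 
>>>> For example, using the staging path from your storage plugin config (the DFS target directory here is just an example):
>>>> 
>>>> ~~~
>>>> hadoop fs -mkdir -p /stage
>>>> hadoop fs -put /localdata/hadoop/stage/customer_reviews_1998.csv /stage/
>>>> ~~~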
>>>> 
>>>> 
>>>> For more robust mechanisms for ingesting from remote systems you can look at using NFS; MapR has a really robust NFS integration, and you can use it with the Community Edition.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 26, 2015, at 5:11 PM, Matt <bsg075@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> That might be the end goal, but currently I don't have an HDFS ingest
mechanism.
>>>>> 
>>>>> We are not currently a Hadoop shop - can you suggest simple approaches
for bulk loading data from delimited files into HDFS?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 26, 2015, at 8:04 PM, Andries Engelbrecht <aengelbrecht@maprtech.com>
wrote:
>>>>>> 
>>>>>> Perhaps I’m missing something here.
>>>>>> 
>>>>>> Why not create a DFS storage plugin for HDFS and put the file in HDFS?
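>>>>>> 
>>>>>> A minimal sketch of such a storage plugin, assuming a NameNode at namenode-host:8020 (the host, port, and workspace path are illustrative; the workspace is marked writable so CTAS can target it):
>>>>>> 
>>>>>> ~~~
>>>>>> {
>>>>>>   "type": "file",
>>>>>>   "enabled": true,
>>>>>>   "connection": "hdfs://namenode-host:8020",
>>>>>>   "workspaces": {
>>>>>>     "root": {
>>>>>>       "location": "/stage",
>>>>>>       "writable": true,
>>>>>>       "defaultInputFormat": null
>>>>>>     }
>>>>>>   },
>>>>>>   "formats": {
>>>>>>     "csv": {
>>>>>>       "type": "text",
>>>>>>       "extensions": ["csv"],
>>>>>>       "delimiter": ","
>>>>>>     }
>>>>>>   }
>>>>>> }
>>>>>> ~~~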
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 26, 2015, at 4:54 PM, Matt <bsg075@gmail.com> wrote:
>>>>>>> 
>>>>>>> New installation with Hadoop 2.7 and Drill 1.0 on 4 nodes; it appears text files need to be on all nodes in the cluster?
>>>>>>> 
>>>>>>> Using the dfs config below, I am only able to query if a CSV file is on all 4 nodes. If the file is only on the local node and not the others, I get errors in the form of:
>>>>>>> 
>>>>>>> ~~~
>>>>>>> 0: jdbc:drill:zk=es05:2181> select * from root.`customer_reviews_1998.csv`;
>>>>>>> Error: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'root.customer_reviews_1998.csv' not found
>>>>>>> ~~~
>>>>>>> 
>>>>>>> ~~~
>>>>>>> {
>>>>>>>   "type": "file",
>>>>>>>   "enabled": true,
>>>>>>>   "connection": "file:///",
>>>>>>>   "workspaces": {
>>>>>>>     "root": {
>>>>>>>       "location": "/localdata/hadoop/stage",
>>>>>>>       "writable": false,
>>>>>>>       "defaultInputFormat": null
>>>>>>>     },
>>>>>>> ~~~
>>>>>>> 
>>>>>>>> On 25 May 2015, at 20:39, Kristine Hahn wrote:
>>>>>>>> 
>>>>>>>> The storage plugin "location" needs to be the full path to
the localdata
>>>>>>>> directory. This partial storage plugin definition works for
the user named
>>>>>>>> mapr:
>>>>>>>> 
>>>>>>>> {
>>>>>>>>   "type": "file",
>>>>>>>>   "enabled": true,
>>>>>>>>   "connection": "file:///",
>>>>>>>>   "workspaces": {
>>>>>>>>     "root": {
>>>>>>>>       "location": "/home/mapr/localdata",
>>>>>>>>       "writable": false,
>>>>>>>>       "defaultInputFormat": null
>>>>>>>>     },
>>>>>>>> . . .
>>>>>>>> 
>>>>>>>> Here's a working query for the data in localdata:
>>>>>>>> 
>>>>>>>> 0: jdbc:drill:> SELECT COLUMNS[0] AS Ngram,
>>>>>>>> . . . . . . . > COLUMNS[1] AS Publication_Date,
>>>>>>>> . . . . . . . > COLUMNS[2] AS Frequency
>>>>>>>> . . . . . . . > FROM dfs.root.`mydata.csv`
>>>>>>>> . . . . . . . > WHERE ((columns[0] = 'Zoological Journal of the Linnean')
>>>>>>>> . . . . . . . > AND (columns[2] > 250)) LIMIT 10;
>>>>>>>> 
>>>>>>>> A complete example, not yet published on the Drill site, shows the steps involved in detail:
>>>>>>>> http://tshiran.github.io/drill/docs/querying-plain-text-files/#example-of-querying-a-tsv-file
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Kristine Hahn
>>>>>>>> Sr. Technical Writer
>>>>>>>> 415-497-8107 @krishahn
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sun, May 24, 2015 at 1:56 PM, Matt <bsg075@gmail.com>
wrote:
>>>>>>>>> 
>>>>>>>>> I have used a single-node install (unzip and run) to query local text / CSV files, but on a 3-node cluster (installed via MapR CE), a query against local files results in:
>>>>>>>>> 
>>>>>>>>> ~~~
>>>>>>>>> sqlline version 1.1.6
>>>>>>>>> 0: jdbc:drill:> select * from dfs.`testdata.csv`;
>>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>> 
>>>>>>>>> 0: jdbc:drill:> select * from dfs.`/localdata/testdata.csv`;
>>>>>>>>> Query failed: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs./localdata/testdata.csv' not found
>>>>>>>>> ~~~
>>>>>>>>> 
>>>>>>>>> Is there a special config for local file querying? An
initial doc search
>>>>>>>>> did not point me to a solution, but I may simply not
have found the
>>>>>>>>> relevant sections.
>>>>>>>>> 
>>>>>>>>> I have tried modifying the default dfs config to no avail:
>>>>>>>>> 
>>>>>>>>> ~~~
>>>>>>>>> "type": "file",
>>>>>>>>> "enabled": true,
>>>>>>>>> "connection": "file:///",
>>>>>>>>> "workspaces": {
>>>>>>>>> "root": {
>>>>>>>>> "location": "/localdata",
>>>>>>>>> "writable": false,
>>>>>>>>> "defaultInputFormat": null
>>>>>>>>> }
>>>>>>>>> ~~~
>>>> 
