drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aman Sinha <amansi...@apache.org>
Subject Re: Drill data and database locality
Date Sun, 11 Jun 2017 06:32:02 GMT
Ivan,
yes, the scans for the various data sources are expected to use locality
information to perform the table scan.

If you only run the query against mongodb (the right side of union-all) and
the foreman is on server B, does it do the table scan on server A  which is
hosting the mongodb table ?  It is supposed to use locality information, so
if it is not then most likely there is a bug.    Does the problem only
occur when the query is accessing multiple data sources including mongo ?

You could  enable DEBUG logging level in logback.xml and check the output
in drillbit.log file.. look for entries for MongoGroupScan that should
provide insight into whether the affinity based assignment is working
correctly.  Each data source has its own implementation of determining
'affinity'.

You may have already seen this, but just in case..here's the doc page for
the mongo plugin: https://drill.apache.org/blog/2014/11/19/sql-on-mongodb/
I am not familiar with this plugin but hopefully the original authors
(listed on that page) can chime in.

-Aman


On Sat, Jun 10, 2017 at 10:02 AM, Ivan Kovacevic <ivankov.ikl@gmail.com>
wrote:

> Greetings,
>
> I have two servers. The first server (A) contains the zookeeper, a mongodb
> database and a drillbit. The second server (B) contains a hadoop
> distribution with several hive tables, a postgresql database and the other
> drillbit. Both drillbits can see eachother on the drill homepages, since
> they are both connected to the zookeeper on server A. When a query like the
> one on the bottom is run (the view dfs.lineorder consists of the hive and
> postgresql data on server B), drillbit B becomes the foreman and executes
> everything. It does not delegate the parts dealing with mongodb tables to
> the drillbit on server A, so 90% of the time required to run the query
> (approx. 30 minutes) is spent on sending the mongodb collection from server
> A to server B. Is there a way to force drillbit B to delegate the mongodb
> part of the query to drillbit B, or could this be a misconfiguration
> problem?
> Also, can drillbits be configured to access different databases (e.g. when
> each drillbit has access to a distinct subnet, so that not all drillbits
> can access the same databases)?
> Thank you.
>
> Sample query:
> SELECT SUM(revenue) AS revenue
> FROM (
>     SELECT SUM(lo_extendedprice*lo_discount) AS revenue
>     FROM dfs.tmp.lineorder, dfs.tmp.`date`
>     WHERE lo_orderdate = d_datekey
>     AND d_year = 1993
>     AND lo_discount BETWEEN 1 AND 3
>     AND lo_quantity < 25
> UNION ALL
>     SELECT SUM(lo_extendedprice * lo_discount) AS revenue
>     FROM mongo.test.ssb_europe ssb
>     WHERE ssb.orderdate.d_year = 1993
>     AND lo_discount BETWEEN 1 AND 3
>     AND lo_quantity < 25
> );
>
> Configuration (drill-override.conf):
>
> On drillbit A (Windows Server 2008):
>
> drill.exec: {
>   cluster-id: "drillbits1",
>   zk.connect: "serverA:2181",
> impersonation: {
>     enabled: true,
>     max_chained_user_hops: 3
>   }
> }
>
> On drillbit B (Cloudera CDH 5.8.0):
> drill.exec: {
>   cluster-id: "drillbits1",
>   zk.connect: "serverA:2181",
>   impersonation: {
>     enabled: true,
>     max_chained_user_hops: 3
>   }
> }
>
>
> Best regards,
> Ivan
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message