drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ron Cecchini <roncecch...@comcast.net>
Subject Re: Drill + Mongo
Date Fri, 06 Mar 2020 13:54:40 GMT
I forgot to mention that Drill had one sweet trick that Presto didn't re: Mongo -

I assume (w/o digging into to the code) that you must've "aliased" an un-filtered SELECT COUNT(*)
to use Mongo's db.<collection>.count() (or something...) because it would return immediately
with the correct answer, whereas Presto was again doing some kind of big scan or whatever
and never returning.

> On March 6, 2020 at 12:25 AM Ron Cecchini <roncecchini@comcast.net> wrote:
> 
> 
> Thanks as always, Paul.
> 
> Well, technically Drill performs *better* than the (official) Presto, because the prestosql/presto
image's mongodb connector is broken at the current time...
> 
> I've been using the starburstdata/presto image because it's apparently more stable. 
I'm still a little confused how they achieve that because I *thought* they were building from
the same (prestosql) code base, with just some  configuration changes, but I think they must've
modified code as well.
> 
> Anyway, as for what kind of query, it's just a fairly vanilla query with a few range
tests.
> 
> The (cleaned up and sanitized) Mongo query looks something like:
> 
> { "$and" :
>   [ 
>     { "id" : { "$lt"  : 1000 } },
>     { "d"  : { "$gte" : { "$date" : 1578000000000 } } },
>     { "d"  : { "$lte" : { "$date" : 1578200000000 } } },
>     { "x"  : { "$gte" : 1.0 } },
>     { "x"  : { "$lte" : 4.0 } },
>     { "y"  : { "$gte" : 5.0 } },
>     { "y"  : { "$lte" : 8.0 } }
>   ]
> }
> 
> and then the SQL becomes something simple like:
> 
> SELECT * FROM mongodb.<schema>."<table.name>"
> WHERE ( id < 1000 )
>   AND ( d BETWEEN '2020-01-01T12:00:00Z' AND '2020-01-02T12:00:00Z' )
>   AND ( x BETWEEN 1.0 AND 4.0 )
>   AND ( y BETWEEN 5.0 AND 8.0 )
> OFFSET 0 LIMIT 5000
> 
> > On March 5, 2020 at 2:31 AM Paul Rogers <par0328@yahoo.com.INVALID> wrote:
> > 
> > 
> > Hi Ron,
> > 
> > Sounds like the good news is that Drill is about as good as Presto when querying
Mongo. Sounds like the bad news is that both are equally deficient. On the other hand, the
other good news is that better performance is just a matter of adding additional planning
rules (with perhaps some Mongo metadata.)
> > 
> > 
> > The Wikipedia page for Mongo [1] suggests several features that Mongo (Simba) is
probably using in their own JDBC driver, but which Drill probably does not use:
> > 
> > * Primary and secondary indices
> > * Field, range query, and regular-expression searches
> > * User-defined JavaScript functions
> > * Three ways to perform aggregation: the aggregation pipeline, the map-reduce function,
and single-purpose aggregation methods.
> > 
> > My guess is that the Mongo JDBC driver does thorough planning to exploit each of
the above functions, while Drill may use only a few. We already noted other weaknesses in
the filter push-down code for the Drill Mongo plugin. Seems fixable if we can put in the effort.
> > 
> > 
> > Seems Mongo provides a Simba JDBC driver, which is proprietary, so no source code
is available we could use as a "cheat sheet" to see what's what.
> > 
> > 
> > Just out of curiosity, what is the query that works well with the Mongo JDBC driver,
but poorly with Drill?
> > 
> > Anybody know more about how Mongo works and what Drill might be missing?
> > 
> > 
> > Thanks,
> > - Paul
> > 
> > [1] https://en.wikipedia.org/wiki/MongoDB
> > 
> > 
> > 
> >  
> > 
> >     On Wednesday, March 4, 2020, 9:28:44 PM PST, Ron Cecchini <roncecchini@comcast.net>
wrote:  
> >  
> >  Hi, guys.
> > 
> > This is actually more of a Mongo question than a Drill-specific question as it also
applies to Presto + Mongo, and the vanilla Mongo shell as well.
> > 
> > I'm asking here, though, because, well, I'm curious, and because you're the database
geniuses...
> > 
> > So, I essentially get why a NoSQL database, in general, wouldn't be as performant
as a SQL one at "relational" things.  From what I gather, there are denormalization and optimization
techniques and tricks you can use to speed up a Mongo query and so forth, but my question
is:
> > 
> > Why is it that any Drill/Presto + Mongo CLI or JDBC query against a large collection
(100-200 million documents) that includes even a single WHERE clause, or the Mongo equivalent
query made via Mongo shell, basically never returns and has to be killed, whereas the same
(Mongo equivalent) query against the same collection made via *Mongo's* JDBC driver takes
only a second or two?
> > 
> > Is the Mongo JDBC using some indexing that the others aren't?  (But how would that
explain Mongo shell's non-performance...  Why doesn't Mongo shell just make a JDBC call to
the db...)
> > 
> > Thank you in advance for educating me.
> > 
> > Ron
> >

Mime
View raw message