drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] cgivre commented on pull request #2112: DRILL-7534: Convert HTTPD Format Plugin to EVF
Date Fri, 20 Nov 2020 14:31:51 GMT

cgivre commented on pull request #2112:
URL: https://github.com/apache/drill/pull/2112#issuecomment-731204034

   > I did some testing and found something worth discussing regarding the wildcards.
   > _Note about all of these points; I'm fine with just putting a bit of documentation
in place that describes these as known limitations._
   > When I do a "select *" from a table backed by this format and print the result set,
the "wildcard" scenarios, such as the query parameters and the cookies, come out like this:
   > ```
   > `response_cookies_$` STRUCT<`apache` VARCHAR>,
   > `request_firstline_uri_query_$` STRUCT<`aap` VARCHAR, `res` VARCHAR>,
   > ```
   That is the intended behavior. Drill creates a map of the parsed cookies and another of
the URI query parameters. If you don't think this is the most effective way of doing
this, I'm definitely open to refactoring it.
   Just as an FYI, I only chose to do it this way because that's how it was done in the original
Drill/HTTPD integration. It might be better to flatten these maps and produce actual columns
with the values.
   > The first thing I noticed is that the actual values in the data are reflected in the
header. I assume this is just the way the RowSet::print() works. Do note that if you have
a large variety of query parameters in your dataset this may become a big list.
   That is correct.
   > What I find is that these wildcards do not work as I expected, compared with what
the underlying parser does.
   > Assuming the URI `/icons/powered_by_rh.png?aap=noot&res=1024x768`
   > When I ask for `request_firstline_uri_query_$` I see in the output something that
looks like what I expect `{"noot", "1024x768"}`
   > However when I directly try to query a deeper entry like `request_firstline_uri_query_aap`
I consistently see a `null` value.
   > This "explicit" way of asking for a value is there so that the system does not
need to URL-decode the "unwanted" fields (i.e. there is a bit of a performance impact if there
are a lot of unwanted fields, such as query parameters or cookies, in the line at hand).
   The way Drill works is that it creates a vector for every column it finds. So if you
have a URL with params `field1` and `field2`, you'll get vectors (regardless of whether they
are in a map or not) for `field1` and `field2`.
   Now, if the next record has `field2` and `field3`, the result will be that `field1`
is `null` for row 2, while `field2` and `field3` are populated.
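
To make the null behavior concrete, here is a minimal sketch in plain Python. It is not Drill code and does not use Drill's vector classes; `build_columns` is a hypothetical helper that only models the idea of one equal-length column per field name, with `None` standing in for `null`.

```python
from typing import Dict, List, Optional


def build_columns(rows: List[Dict[str, str]]) -> Dict[str, List[Optional[str]]]:
    """Turn sparse per-row key/value pairs into equal-length columns."""
    columns: Dict[str, List[Optional[str]]] = {}
    for row_index, row in enumerate(rows):
        for name in row:
            # A field seen for the first time gets a column that is
            # back-filled with None for all earlier rows.
            if name not in columns:
                columns[name] = [None] * row_index
        for name, values in columns.items():
            # Fields missing from this row are recorded as None.
            values.append(row.get(name))
    return columns


# Two records, mirroring the field1/field2/field3 example above.
rows = [
    {"field1": "a", "field2": "b"},
    {"field2": "c", "field3": "d"},
]
columns = build_columns(rows)
```

Every field name that appears anywhere in the data ends up as its own column, which is why a wide variety of query parameters leads to many sparsely populated columns.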
   > Note that the underlying parser does support this; the example for Apache Pig makes
this the most clear:
   > https://github.com/nielsbasjes/logparser/blob/master/examples/apache-pig/src/main/pig/demo.pig#L34
   > Now the response cookies are special because they have limited support for a wildcard
in the middle:
   > ```
   > `response_cookies_$_comment` VARCHAR,
   > `response_cookies_$_domain` VARCHAR,
   > `response_cookies_$_expires` TIMESTAMP,
   > `response_cookies_$_path` VARCHAR,
   > `response_cookies_$_value` VARCHAR,
   > ```
   > See https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/test/java/nl/basjes/parse/httpdlog/ApacheHttpdLogParserTest.java#L161
   > These are intended so you can ask for something like `STRING:response.cookies.jsessionid.path`
   > Here I found that these also seem to always return null.
   What I think you're getting at here is that it might be advantageous to flatten the wildcard
fields rather than putting them in a Drill map and in so doing, create many null columns.
Is that correct? If so, my thought here is that the best way to go about that would be to
add a config option called `flattenWildcardFields`, and if the user selects that, you would
get a column for every value in the wildcard fields rather than a map.
   The advantage that I see in doing this is easier queries. For instance if you wanted to
find particular values from a query string, you could do something like:
   SELECT <fields>
   FROM ...
   WHERE request_firstline_uri_query_aap = 1234
   Would that work for you?
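
As a rough illustration of the two output shapes, here is a sketch in plain Python rather than plugin code. The function `query_columns` and its `flatten_wildcard_fields` argument are hypothetical stand-ins for the proposed `flattenWildcardFields` option, and the parsing uses Python's standard library instead of the logparser library.

```python
from urllib.parse import parse_qsl, urlsplit

PREFIX = "request_firstline_uri_query_"


def query_columns(uri: str, flatten_wildcard_fields: bool) -> dict:
    """Return the query parameters either as one map or as flat columns."""
    params = dict(parse_qsl(urlsplit(uri).query))
    if flatten_wildcard_fields:
        # Flattened: one top-level column per query parameter.
        return {PREFIX + name: value for name, value in params.items()}
    # Current behavior: a single map column under the wildcard name.
    return {PREFIX + "$": params}


uri = "/icons/powered_by_rh.png?aap=noot&res=1024x768"
nested = query_columns(uri, flatten_wildcard_fields=False)
flat = query_columns(uri, flatten_wildcard_fields=True)
```

With flattening on, a name such as `request_firstline_uri_query_aap` becomes an ordinary column that can be used directly in a `WHERE` clause.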

