drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] nielsbasjes commented on pull request #2112: DRILL-7534: Convert HTTPD Format Plugin to EVF
Date Thu, 19 Nov 2020 09:11:45 GMT

nielsbasjes commented on pull request #2112:
URL: https://github.com/apache/drill/pull/2112#issuecomment-730235084


   I did some testing and found something worth discussing regarding the wildcards.
   
   _Note about all of these points: I'm fine with just putting a bit of documentation in place
that describes these as known limitations._
   
   When I do a `select *` from a table backed by this format and print the result set, the
"wildcard" fields such as the query parameters and the cookies show up like this:
   ```
   `response_cookies_$` STRUCT<`apache` VARCHAR>,
   `request_firstline_uri_query_$` STRUCT<`aap` VARCHAR, `res` VARCHAR>,
   ```
   
   The first thing I noticed is that the names that actually occur in the data (here `apache`,
`aap` and `res`) are reflected in the header. I assume this is just the way RowSet::print() works.
Do note that if you have a large variety of query parameters in your dataset, this may become a big list.
   
   What I find is that these wildcards do not work as I expected, compared with what the underlying
parser does.
   
   
   Assuming the URI `/icons/powered_by_rh.png?aap=noot&res=1024x768`
   
   When I ask for `request_firstline_uri_query_$`, the output looks like what I expect:
`{"noot", "1024x768"}`
   However, when I directly query a deeper entry such as `request_firstline_uri_query_aap`,
I consistently see a `null` value.
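
As a rough illustration of the expected behaviour for the URI above, here is a small Python sketch. It is not Drill or logparser code: the helper `query_fields` is hypothetical, and only the field-name prefix is taken from the schema printed earlier.

```python
from urllib.parse import urlsplit, parse_qsl

# Field-name prefix as it appears in the printed schema.
PREFIX = "request_firstline_uri_query_"

def query_fields(uri):
    """Hypothetical helper: map each query parameter to an explicit field name."""
    params = parse_qsl(urlsplit(uri).query, keep_blank_values=True)
    return {PREFIX + name: value for name, value in params}

uri = "/icons/powered_by_rh.png?aap=noot&res=1024x768"
fields = query_fields(uri)

# Wildcard request: every parameter value found in the line.
wildcard = list(fields.values())
# Explicit request: one named parameter, expected to be "noot" rather than null.
explicit = fields.get(PREFIX + "aap")
print(wildcard, explicit)
```

In this sketch the explicit field resolves to the same value that the wildcard exposes, which is the behaviour the comment expects from `request_firstline_uri_query_aap`.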
   
   This "explicit" way of asking for a value exists so that the system does not need
to URL-decode the "unwanted" fields (i.e. there is a bit of performance impact if there are
a lot of unwanted fields, such as query parameters or cookies, in the line at hand).
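
The performance argument can be sketched as follows. This is a simplified Python illustration under stated assumptions, not how logparser is implemented: `decode_only` is a hypothetical helper, parameter names are compared in their raw (undecoded) form, and the sample query string is made up.

```python
from urllib.parse import unquote_plus

def decode_only(raw_query, wanted):
    """Hypothetical helper: URL-decode only the parameters named in `wanted`."""
    result = {}
    for pair in raw_query.split("&"):
        name, _, raw_value = pair.partition("=")
        if name in wanted:
            # Only requested fields pay the decoding cost;
            # everything else is skipped without being decoded.
            result[name] = unquote_plus(raw_value)
    return result

print(decode_only("aap=noot%20mies&res=1024x768&junk=%7Bx%7D", {"aap"}))
```

With many unwanted query parameters or cookies per line, skipping their decoding is where the saving would come from.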
   
   Note that the underlying parser does support this; the example for Apache Pig shows this
most clearly:
   https://github.com/nielsbasjes/logparser/blob/master/examples/apache-pig/src/main/pig/demo.pig#L34
   
   Now the response cookies are special because they have limited support for a wildcard in
the middle:
   ```
   `response_cookies_$_comment` VARCHAR,
   `response_cookies_$_domain` VARCHAR,
   `response_cookies_$_expires` TIMESTAMP,
   `response_cookies_$_path` VARCHAR,
   `response_cookies_$_value` VARCHAR,
   ```
   See https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/test/java/nl/basjes/parse/httpdlog/ApacheHttpdLogParserTest.java#L161
   
   These are intended so you can ask for something like `STRING:response.cookies.jsessionid.path`
   
   Here too, I found that these always seem to return `null`.
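
To show what a wildcard in the middle is meant to address, the following Python sketch flattens a `Set-Cookie` value into dotted names of the form `response.cookies.<name>.<attribute>`. It uses the standard-library `http.cookies` module rather than logparser; the helper `cookie_fields` and the sample cookie are assumptions made for illustration.

```python
from http.cookies import SimpleCookie

def cookie_fields(set_cookie_value):
    """Hypothetical helper: flatten one Set-Cookie value into dotted field names."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    fields = {}
    for name, morsel in jar.items():
        base = "response.cookies." + name
        fields[base + ".value"] = morsel.value
        # Same attributes as the `response_cookies_$_*` columns listed above.
        for attr in ("comment", "domain", "expires", "path"):
            if morsel[attr]:  # unset attributes are empty strings
                fields[base + "." + attr] = morsel[attr]
    return fields

# Made-up sample cookie, not taken from any real log.
fields = cookie_fields("jsessionid=abc123; Path=/app; Domain=example.com")
print(fields.get("response.cookies.jsessionid.path"))
```

A request for `response.cookies.jsessionid.path` would then be expected to yield the cookie's path instead of `null`.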
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


