nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Updated] (NUTCH-2563) HTTP header spellchecking issues
Date Fri, 08 Jun 2018 15:14:00 GMT


Sebastian Nagel updated NUTCH-2563:
    Affects Version/s: 1.14

> HTTP header spellchecking issues
> --------------------------------
>                 Key: NUTCH-2563
>                 URL:
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
> When reading HTTP headers, the SpellCheckedMetadata class computes, for each header,
a Levenshtein distance between it and every known header in the HttpHeaders interface.
Not only is that slow, non-standard, and inconsistent with browser behavior, but it also causes
bugs and prevents us from accessing the real headers sent by the HTTP server.
>  * Example: [!443358/]. The server sends a *Client-Transfer-Encoding:
chunked* header, but SpellCheckedMetadata corrects it to *Transfer-Encoding: chunked*. Then,
HttpResponse (in protocol-http) tries to read the HTTP body as chunked, although it is not.
> I personally think that HTTP header spell checking is a bad idea, and
that this logic should be removed completely. If it is kept, however, the threshold divider (SpellCheckedMetadata.TRESHOLD_DIVIDER)
should be higher (we internally set it to 5 as a temporary fix for this issue).
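
To illustrate the reported behavior, here is a minimal, self-contained sketch. It is not the Nutch source. It assumes that header names are normalized to lower-case letters only, that the threshold is the normalized length divided by the divider, and that a known header matches when its Levenshtein distance is strictly below that threshold. The class name, the `correct` helper, the short `KNOWN` list, and the divider value 3 used for comparison are all hypothetical; only the value 5 comes from the report.

```java
class HeaderSpellCheckSketch {

  // Hypothetical subset of known header names (the real list is in HttpHeaders).
  static final String[] KNOWN = {"Transfer-Encoding", "Content-Type", "Content-Length"};

  // Assumed normalization: keep letters only, lower-cased.
  static String normalize(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (Character.isLetter(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  // Two-row dynamic-programming Levenshtein distance.
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Returns the first known header that is "close enough", else the name unchanged.
  static String correct(String name, int divider) {
    String searched = normalize(name);
    int threshold = searched.length() / divider; // assumed threshold rule
    for (String known : KNOWN) {
      if (levenshtein(searched, normalize(known)) < threshold) {
        return known;
      }
    }
    return name;
  }

  public static void main(String[] args) {
    // Divider 3 (assumed for comparison): the header is rewritten.
    System.out.println(correct("Client-Transfer-Encoding", 3));
    // Divider 5 (the reporter's temporary fix): the header is left alone.
    System.out.println(correct("Client-Transfer-Encoding", 5));
  }
}
```

Under these assumptions, the normalized name "clienttransferencoding" has 22 letters and is 6 edits away from "transferencoding". With a divider of 3 the threshold is 7, so the header is rewritten. With a divider of 5 the threshold is 4, so it is kept as sent.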

This message was sent by Atlassian JIRA