metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Miklavcic <michael.miklav...@gmail.com>
Subject Re: [DISCUSS] Charset PR merge effects
Date Mon, 25 Feb 2019 20:27:04 GMT
Justin, I like your proposal. I just had a couple thoughts that I think are
worth pointing out. I'm not 100% clear about this point, but I don't
believe we currently handle the possibility of various external data
sources with heterogeneous encoding schemes in our Kafka topics. As it
stands, I believe it's up to users to ensure their external systems have
the same encoding as the machines hosting Metron (since we use the default
charset). It seems we haven't run into issues with this yet, but it should
be noted that if we change to UTF-8 while the external systems continue to
use a different default, e.g. ISO-8859-1, things will work fine for ASCII
chars 0-127, and then break for 128-255 where ISO-8859-1 still uses a
single-byte representation, but UTF-8 uses a 2-byte representation.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
https://en.wikipedia.org/wiki/UTF-8

I don't believe this change breaks anything - HDFS/Hadoop already uses
UTF-8 as its default encoding and we'd likely have seen or heard about
issues, but it's worth mentioning that there may be some gotchas at the
edges outside of Metron. I did a quick search in your PR and didn't see any
README updates - the only thing I might recommend is a release notice that
Metron is changing our encoding handling to be more explicitly opinionated
and will use UTF-8 rather than the default charset.

Best,
Mike


On Wed, Feb 20, 2019 at 8:19 AM Justin Leet <justinjleet@gmail.com> wrote:

> Hey all,
>
> I wanted to bring a bit of attention to a change and its effects before I
> push to master.
>
> PR#1341 <https://github.com/apache/metron/pull/1341> Removes all uses of
> the default charset (which is platform dependent) and moves everything to
> UTF-8. This PR currently has a +1, but obviously any new input is more than
> welcome.
>
> Going forward, it will be a compiler error to use the default charset (e.g.
> getBytes without StandardCharsets.UTF_8, or using APIs that don't allow an
> encoding to be passed. I want to call attention to the fact that existing
> PRs may break after this merge.
>
> Post merging this PR, I would send another email to the list (and Slack and
> so on), to note that existing PRs should merge in master (or at least
> determine if they need to).
>
> Does anyone have any objections or concerns? Should I be more proactive
> about a heads up on PRs, or are we okay with what I'm proposing?
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message