freemarker-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthew Van Gundy <>
Subject Re: Auto-Sanitization for FreeMarker
Date Thu, 17 Mar 2016 17:56:15 GMT
On 3/16/16 5:33 AM, Daniel Dekany wrote:
> I would like to figure out if forking is indeed needed for this to
> work. And if so, if that could be avoided in the case of FM 3.


> The way it works in 2.3.24 is simply that anything that's not a
> TemplateMarkupOutputModel (which might should be called
> TemplateSanitizedContantModel) is untrusted (to use your terms), and
> so will be escaped. So strings (TemplateScalarModel-s) are always
> untrusted. When you concatenate trusted and untrusted values, the
> untrusted part is escaped, so the result is trusted. So that's how it
> works in 2.3.24.

The way our modifications work is that any escaping that is not
explicitly requested is delayed until the object is actually being
written out to the output stream because the correct escaping to apply
isn't known until the output context is known.  In order to accomplish
this, we support labeling substrings down to the character level.  Only
data that is not already escaped correctly for the current output
context will be escaped.  For example if the data model specifies:

  data = "javascript:/<script>1/(alert(1337))//</script>"

then rendering the following template:

  <#assign title = "<h1>Hello ${data}</h1>">
  <#assign link = "<a href='${data}'></a>">

will result in:

  <h1>Hello javascript:/&lt;script&gt;1/(alert(1337))//&lt;/script&gt;</h1>
  <a href='#zSoyz'></a>

This illustrates both precise data flow tracking and context-sensitive
sanitization.  First, because the <h1> and <a> tags are static strings
specified by the programmer, they are automatically labeled as
ALWAYS_SAFE, allowing them to be used in any context without further
escaping.  Because "${data}" comes from the data model, it is UNSAFE
(unsanitized) by default.  In memory, the variables are represented as:

title = "<h1>Hello ":ALWAYS_SAFE +
        "javascript:/<script>...</script>":UNSAFE + "</h1>":ALWAYS_SAFE

link = "<a href='":ALWAYS_SAFE +
       "javascript:/<script>...</script>":UNSAFE + "'></a>":ALWAYS_SAFE

When the template is rendered and DollarVariable writes its value to the
output Writer, our custom writer class iterates through each
distinctly-labeled substring in the resulting SimpleScalar.

When rendering title, in output context HTML_PCDATA, "<h1>Hello
":ALWAYS_SAFE is output without sanitization because it is safe for any
context (ALWAYS_SAFE).  It is parsed to determine the new context of the
output buffer, HTML_PCDATA in this case.
"javascript:/<script>...</script>":UNSAFE is HTML-escaped to make it
safe for HTML_PCDATA context, leaving the output buffer context
unchanged (HTML_PCDATA).  And, "</h1>":ALWAYS_SAFE is output without
sanitization because it is safe for any context.  After parsing, the the
output buffer is determined to be in HTML_PCDATA context once again.

Moving on to "${link}", "<a href='":ALWAYS_SAFE is output without
sanitization because it is safe for any context.  After parsing the
string, the output buffer context is determined to be
context, if the protocol scheme is not known to be safe, the value is
transformed into an innocuous URI.  Otherwise, any characters which
could allow the value to break out of the attribute are HTML-escaped.
Because "javascript:/<script>...</script>":UNSAFE does not begin with a
known-safe protocol scheme, it is transformed to the innocuous URI
"#zSoyz".  After parsing the result, the output buffer is determined to
"'></a>":ALWAYS_SAFE is output without escaping because it is always
safe, parsing determines that the output buffer in HTML_PCDATA context.

By delaying sanitization until output, we can ensure that any data that
has not been sanitized correctly for the context where it appears can be
sanitized specifically for that context.

> What marks a value to be trusted or untrusted in your solution?

By default, all values appearing in any template (i.e. static values
written by a template author) and all dynamic values computed only from
static values are considered trusted.  They are labeled ALWAYS_SAFE and
will not be sanitized.

All values coming from the data model are considered untrusted because
they may have been specified by an attacker.  They are labeled UNSAFE by
default.  If values in the data model have been sanitized in some way by
the application or are known to be trusted, when the value is added to
the data model it can be labeled explicitly.  Conversely, the template
author can downgrade the label on any static data.  We've also modified
all escaping built-in functions to label their output appropriately.
For example, if x = "<a>": x?html -> "&lt;a&gt;":SAFE(HTML).

> Especially, why do you track such state for non-string values? After
> all, no hacker can put HTML tags and such into a number or boolean or
> date. Is labelling all type of values is indeed unavoidable, or we
> could just store more detailed information in
> TemplateMarkupOutputModel and that could solve this problem?

Initially we labeled all data types because characters present in a
variety of types have special meaning in various contexts.  For instance
in "<div data-value=${datetime}:${result}/>" a space in the datetime
value formatted as a string could break out of the data-value attribute
and cause ${result} to be treated as an attribute rather than an
attribute value.  Similarly, in "<script>result =
value.match(/${double}/)</script>", the decimal point in a double will
be interpreted as a wildcard if it is not escaped correctly.

In hindsight, I think I might argue that non-string data types like
floats, dates, etc. should _never_ be permitted to change the context
and should always be escaped appropriately for the enclosing context.
In that case, we could probably stop tracking data flow for non-string

> How do you track the output browser's parser state and then where does
> the final sanitization happen? My first idea is passing a Writer to
> Template.process that parses what's written into it, and then ${} has
> to call some special Writer methods to which it can pass the
> sanitization information. The problem with such a solution is that if
> you do output capturing (<#assign x>...</#assign>) or apply any kind
> of filter directive, as they are working with "unlabelled" character
> streams, there will be no browser's state. In FM3, I think we will use
> some kind of FreeMarkerWriter (a term you have used too!) that treats
> the output stream more than just a stream of characters, and the plain
> Writer will be only the final stage. My initial motivation there was
> better white-space handling, but the same thing can also be useful for
> sanitization. That means that we dumb down the output to a mere stream
> of characters as late as possible. Instead, you have a stream of
> objects where some objects are TemplateMarkupOutputModel-s. Just an
> idea anyway.

This is essentially what we do.  When I mentioned FreeMarkerWriter, I
forgot that was a class that we added to make the Writer API aware of
FreeMarker TemplateModels.  If I recall correctly, Environment renders
the template by traversing the AST and for each statement that generates
output, it converts it to a string and writes it to its
 The vanilla FreeMarkerWriter which just wraps and
converts labeled TemplateScalarModels to java.lang.String and writes them.

Since our labeled strings already capture the degree to which each
substring has been sanitized, when auto-sanitization is enabled, the
vanilla writer gets replaced with a sanitizing writer that essentially does:

    public void write(TemplateScalarModel node) throws IOException {
        SimpleScalar ss = null;
        try {
            ss = node.getAsSimpleScalar();
        } catch(TemplateModelException e) {
            throw new IOException(e);
        // Separately sanitize each discretely-labeled substring
        for( SimpleScalar s : ss.substrings() ) {
            String escaped = sanitizeFor(s, context);
            context = parse(escaped, context);

As for handling <#assign x>...</#assign>, in case it's not clear from
the foregoing, a labeled TemplateScalarModel will be assigned to x and
sanitization will only happen when x is actually output (because at that
point, the context *is* defined).

> BTW, what's your fork is based on? If on 2.3.23, as opposed to on the
> 2.3-gae head from a no more than a few months ago, you will find that
> there were a lot of changes due to code formatting clean up...

Unfortunately, it's based on 2.3.20.  (We did the bulk of the work in
late 2013, early 2014 but it's taken a while to borrow enough time to
navigate the open source contribution process.)  Obviously, we'll have
to spend some time bringing it up to date with the current release if it
were going to be a parallel branch of the current 2.3 release.

If you think it's something that FreeMarker might potentially be
interested in, even for 3.0, I think that gives me enough information to
push along our contribution request.  At that point, we would be able to
share the code and have concrete discussions about design and
implementation rather than the more hand-wavy stuff above.


View raw message