A commit message from our repo
Mal Pinder, one of our Developers, shares a commit message from the code repository for our main Ruby on Rails application, which was made a few weeks ago. As well as being an interesting story about debugging a strange problem with user input, it’s also an example of our policy of writing good commit messages.
The diff of the commit itself was a very small change – only a few lines of code – but we think writing in-depth commit messages helps us understand our code better. Commit messages that detail why code was introduced, how it works, and what alternatives were considered form a very useful reference for future developers (including our future selves) who want to make changes to it.
It’s been edited for formatting, and some links to our internal sites and Git history have been expanded in place so that you can understand the reference, but otherwise it’s unchanged.
Scrub invalid UTF-8 bytes from comment text params
TL;DR Add a concern to all comment controllers that scrubs the
comment text parameter, replacing any invalid bytes with a
replacement character.
Some time ago we had a couple of errors reported in Honeybadger
with the message “ArgumentError: invalid byte sequence in UTF-8”
when saving a comment [1].
UTF-8 represents characters using between 1 and 4 bytes, but not
all combinations of bytes map onto valid characters. [2] These are
called ‘invalid byte sequences’.
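As a minimal illustration (not part of the original commit), Ruby
can report whether a string's bytes form valid UTF-8 using
`String#valid_encoding?`. The sample strings are made up for the
example.

```ruby
# Three byte strings, all tagged as UTF-8.
valid     = "caf\u00E9"                            # "é" is the two bytes C3 A9
truncated = "caf\xC3".dup.force_encoding("UTF-8")  # lead byte C3 with no continuation byte
stray     = "caf\xA9".dup.force_encoding("UTF-8")  # continuation byte A9 with no lead byte

puts valid.valid_encoding?      # true
puts truncated.valid_encoding?  # false
puts stray.valid_encoding?      # false
```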
In our stack, user input is passed through to Rack, which turns
it into a params hash, then the Rails controller uses those
params to create the comment. As Rack doesn’t care about the data
passed to it, the invalid string makes it all the way down to the
model layer. The first comment validation is checking that the
text attribute is not blank; at this point Ruby checks if the
string is valid UTF-8, and then raises the error.
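The failure described above can be approximated in plain Ruby. This
is a sketch only: `blank_text?` is a hypothetical stand-in for the
presence validation, and the comment text is invented. Matching a
regular expression against the string is what forces Ruby to check
its encoding.

```ruby
# Hypothetical stand-in for a "must not be blank" check; the real
# validation lives in Rails and is not shown in the commit message.
def blank_text?(text)
  text.match?(/\A[[:space:]]*\z/)
end

# Invented comment text followed by the three invalid bytes from the logs.
text = "Great course \xED\xA0\xBD".dup.force_encoding("UTF-8")

error_message =
  begin
    blank_text?(text)
    nil
  rescue ArgumentError => e
    e.message
  end

# Prints the message of the ArgumentError raised by the regexp match.
puts error_message
```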
Because Honeybadger can’t display the invalid bytes, we didn’t
have enough information to solve the error right then, so we added
some extra logging and waited for it to happen again. [3]
There were no errors for a while, but then a very popular course
opened, and we suddenly got quite a few more, up to a grand total
of 25. From looking at the additional logs we’d added, we could
get some more information about all these errors:
- Most of them had user agents indicating browsers on older
versions of Android
- The invalid bytes always appeared after words or phrases,
usually at the end of the comment, and the text read fine
without them
- All but one had a flash alert reading "Your comment could not be
created: Text must not contain unsupported characters"
- The invalid bytes were always “%ED%A0%BD”
The Flash Alert
All text in our database is stored in UTF-8, but for various
reasons [4], we can’t save four-byte UTF-8 characters, such as
emoji. Therefore we have a `UnicodeCharactersValidator` class,
which makes sure any text entered into our database doesn’t
contain them.
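The real `UnicodeCharactersValidator` is not shown here, so the
following is only a sketch of one way such a check could work. The
method name `contains_four_byte_characters?` is hypothetical. It
relies on the fact that any character taking four bytes in UTF-8 is
a code point above U+FFFF.

```ruby
# Hypothetical check; the real UnicodeCharactersValidator may work
# differently. Assumes the text has already passed a UTF-8 validity check.
def contains_four_byte_characters?(text)
  text.each_char.any? { |character| character.bytesize > 3 }
end

puts contains_four_byte_characters?("Nice course")           # false
puts contains_four_byte_characters?("Na\u00EFve caf\u00E9")  # false
puts contains_four_byte_characters?("Nice course \u{1F602}") # true
```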
This validator gets run before a comment is saved, but *after* the
strings are coerced and checked as valid UTF-8. It puts that error
message on the model, which the controller then turns into a flash
alert, which is shown back to the user (along with their comment).
The presence of this flash message means that the users must have
first submitted a comment with valid UTF-8, but at least one
four-byte character. It’s on the re-submit of the comment that
this error is raised.
Those Invalid Bytes
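“%ED%A0%BD” is the percent-encoded form of the bytes ED A0 BD,
which is what you get by (incorrectly) encoding the UTF-16 high
surrogate “D83D” as if it were a code point in UTF-8.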
>> ((0xED & 0xF) << 12) + ((0xA0 & 0x3F) << 6) + (0xBD & 0x3F)
=> 55357
>> _.to_s(16)
=> "d83d"
In UTF-16, values in the range [D800, DBFF] are ‘high surrogates’
which need to be followed by a low surrogate in the range [DC00,
DFFF] to represent a character.
The first code point that would begin D83D is thus encoded as D83D
DC00, giving:
>> 0x10000 + ((0xD83D - 0xD800) << 10) +
     ((0xDC00 - 0xDC00) & 0x3FF)
=> 128000
>> _.to_s 16
=> "1f400"
or, Unicode Character 'RAT' (U+1F400)[5]. This is the beginning of
a long block of codepoints assigned to emojis, including most of
the common ones you can find on your phone keyboard. All the emoji
in this range begin D83D when encoded as surrogate pairs.
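The claim about the shared high surrogate can be checked by asking
Ruby to transcode a code point to UTF-16 and inspecting the 16-bit
units. This is an illustrative sketch; `utf16_code_units` is a
helper name made up for the example.

```ruby
# Return the UTF-16 code units for a single code point, as hex strings.
def utf16_code_units(code_point)
  [code_point].pack("U")        # UTF-8 string holding the one character
              .encode("UTF-16BE")
              .unpack("n*")     # 16-bit big-endian integers
              .map { |unit| unit.to_s(16) }
end

p utf16_code_units(0x1F400) # ["d83d", "dc00"]  RAT, first code point under D83D
p utf16_code_units(0x1F602) # ["d83d", "de02"]  FACE WITH TEARS OF JOY
p utf16_code_units(0x1F7FF) # ["d83d", "dfff"]  last code point under D83D
p utf16_code_units(0x1F800) # ["d83e", "dc00"]  the next high surrogate takes over
```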
However, the request body containing %ED%A0%BD is a string encoded
as `application/x-www-form-urlencoded`, which works by encoding a
string as UTF-8 and turning any multi-byte sequences (and some
characters that need escaping like & and =) into %XX strings. If
you put multi-byte characters into a form and post it, you will
see a UTF-8 encoding of those characters arrive at the server -
for instance, `U+1F602` would properly be encoded as
`%F0%9F%98%82`, not as `%ED%A0%BD%ED%B8%82`.
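Ruby's standard `uri` library can reproduce both halves of this. It
is used here as a stand-in for what a browser and Rack do, which is
an assumption made for illustration rather than a description of
our stack's exact code path.

```ruby
require "uri"

# A well-behaved client percent-encodes the UTF-8 bytes of U+1F602.
encoded = URI.encode_www_form_component("\u{1F602}")
puts encoded # %F0%9F%98%82

# Decoding what we actually received gives bytes that are not valid UTF-8.
decoded = URI.decode_www_form_component("%ED%A0%BD")
p decoded.bytes.map { |byte| byte.to_s(16) } # ["ed", "a0", "bd"]
puts decoded.valid_encoding?                 # false
```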
It is not valid to send values corresponding to surrogate pairs as
UTF-8, because they are not necessary in UTF-8. Pairs are needed
in UTF-16, for codepoints that don't fit in 2 bytes. UTF-8, being
a variable-length encoding, doesn't need them. So ED A0 BD is an
invalid byte sequence in UTF-8.
Most browsers won't allow unpaired surrogates to be entered, but
you can generate them using JavaScript, which represents all
astral symbols (4-byte characters) as two characters:
> s = String.fromCharCode(0xD83D)
'�'
> s.length
1
> s = String.fromCharCode(0xD83D, 0xDE02)
'😂'
> s.length
2
In Chrome, inserting `String.fromCharCode(0xD83D)` into a form and
submitting it sends Unicode Character 'REPLACEMENT CHARACTER'
(U+FFFD)[6] in its place. But some user agents might retain the
'character' and send it as `application/x-www-form-urlencoded`.
What we think is going on
We know users try to enter emoji in comments (we get occasional
UserVoice messages about it), but most see the warning, delete the
emoji, and post their comment.
Given the positioning of the invalid bytes (between phrases, or at
the end) and the flash message, we think this problem is also due
to people trying to use emoji.
However, something about these user agents is breaking the emoji
into two 'characters' and only sending us the first part, rather
than either deleting both parts, or replacing the remaining
invalid bytes with the replacement symbol. This could be due to
any number of reasons; perhaps they have a misbehaving browser,
JavaScript plugin, or keyboard app, or their keyboard doesn’t
support 4-byte characters but they work around it by copy-pasting
emoji in. (It could also have been truncated due to a proxy or
network error, but since the `content-length` header is correct,
this is unlikely).
Because this is basically a problem caused by a misbehaving client
sending us an invalid request, I found it impossible to replicate
using an actual device. However it is possible to write controller
tests that approximate the situation, since it’s only our model
layer that checks the string content.
How to fix this
Now that we knew what the actual problem was, I had several
options on how to proceed.
Semantically, since it’s invalid data that’s being sent, the page
could respond with an HTTP `400` or `406`. However this would show
an error page to the user, and discard the comment they’d tried to
make.
In general it’s best to persist user input wherever possible;
comments can be very long and/or represent a lot of effort on the
user’s part, and it’s a terrible and frustrating experience to
press submit and watch everything you wrote vanish.
Given that the best guess for what’s happening is that users are
trying to delete these emoji anyway, and there’s no way to ‘guess’
what the other half of the surrogate pair should be, I decided it
would be better to replace the offending bytes with the Unicode
replacement character, as this is what browsers should be sending
us anyway.
This means that users can post their comment without errors, and
there’s a visual hint to them that something’s gone awry - they’ll
be able to edit their comment to remove the replacement character
if they want.
*ahem*
There are four controllers that handle comment text: one each for
creating replies, step comments, or pre-course discussion
comments, and one for editing comments. This commit adds a
concern, `ScrubCommentText`, for use in all of them.
This concern uses a before filter on the `create` and `update`
actions, which checks for the presence of comment text params, and
replaces any invalid bytes with the Unicode Replacement Character
� [6].
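The scrubbing step can be sketched without Rails using
`String#scrub`. This is not the actual `ScrubCommentText` code: the
module name `ScrubCommentTextSketch`, the method, and the
`params[:comment][:text]` layout are all assumptions made for
illustration.

```ruby
# Hypothetical, framework-free stand-in for the ScrubCommentText concern.
# The params layout (params[:comment][:text]) is an assumption.
module ScrubCommentTextSketch
  REPLACEMENT_CHARACTER = "\uFFFD"

  # Replace any invalid bytes in the comment text, if there is any text.
  def self.scrub_comment_text(params)
    text = params.dig(:comment, :text)
    return params unless text.is_a?(String)

    params[:comment][:text] = text.scrub(REPLACEMENT_CHARACTER)
    params
  end
end

params = { comment: { text: "Great course \xED\xA0\xBD".dup.force_encoding("UTF-8") } }
scrubbed = ScrubCommentTextSketch.scrub_comment_text(params)[:comment][:text]

puts scrubbed.valid_encoding?               # true
puts scrubbed.start_with?("Great course ")  # true
puts scrubbed.include?("\uFFFD")            # true
```

In a controller, a method like this would presumably be wired up as
a before filter on `create` and `update`, as described above.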
Want to know more about how we use Git? Read our post on how we tell stories in our commits.