# A commit message from our repo

Mal Pinder, one of our Developers, shares a commit message from the code repository for our main Ruby on Rails application, which was made a few weeks ago. As well as being an interesting story about debugging a strange problem with user input, it’s also an example of our policy of writing good commit messages.

The diff of the commit itself was a very small change – only a few lines of code – but we think writing in-depth commit messages helps us understand our code better. Commit messages that detail why code was introduced, how it works, and what alternatives were considered form a very useful reference for future developers (including our future selves) who want to make changes to it.

It’s been edited for formatting, and some links to our internal sites and Git history have been expanded in place so that you can understand the reference, but otherwise it’s unchanged.

Scrub invalid UTF-8 bytes from comment text params

TL;DR Add a concern to all comment controllers that scrubs the
comment text parameter, replacing any invalid bytes with a
replacement character.

Some time ago we had a couple of errors reported in Honeybadger
with the message “ArgumentError: invalid byte sequence in UTF-8”
when saving a comment [1].

UTF-8 represents characters using between 1 and 4 bytes, but not
all combinations of bytes map onto valid characters. [2] These are
called ‘invalid byte sequences’.
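
As a small illustration, Ruby can show both the variable byte length and an invalid sequence. This is only a sketch; the sample characters are arbitrary choices, not taken from the reported errors.

```ruby
# Byte length of a few sample characters when encoded as UTF-8.
samples = {
  "a"         => 1, # U+0061
  "\u00E9"    => 2, # U+00E9, e with acute accent
  "\u20AC"    => 3, # U+20AC, euro sign
  "\u{1F602}" => 4  # U+1F602, an emoji
}
byte_lengths = samples.keys.map(&:bytesize)

# A lone continuation byte (0x80) does not map onto any character,
# so Ruby reports the string as having an invalid encoding.
lone_continuation = "\x80".dup.force_encoding(Encoding::UTF_8)
lone_is_valid = lone_continuation.valid_encoding?
```

Here `byte_lengths` holds one entry per sample, and `lone_is_valid` is false.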

In our stack, user input is passed through to Rack, which turns
it into a params hash; the Rails controller then uses those
params to create the comment. As Rack doesn’t care about the data
passed to it, the invalid string makes it all the way down to the
model layer. The first comment validation is checking that the
text attribute is not blank; at this point Ruby checks if the
string is valid UTF-8, and then raises the error.

Because Honeybadger can’t display the invalid bytes, we didn’t
have enough information to solve the error right then, so we added
some extra logging and waited for it to happen again. [3]
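
The extra logging described in footnote [3] could look roughly like the sketch below. The helper name and hash keys are hypothetical, and the call that hands the hash to the error tracker is left out because it is not shown here.

```ruby
# Hypothetical helper: builds extra error context from a raw request body.
# The method name and key names are illustrative assumptions.
def raw_body_context(raw_body)
  {
    raw_body: raw_body,
    # "m0" is Array#pack's strict base64 directive (no line breaks), so the
    # exact bytes survive even if the raw string is scrubbed along the way.
    raw_body_base64: [raw_body].pack("m0")
  }
end

context = raw_body_context("comment%5Btext%5D=nice+course+%ED%A0%BD")
decoded = context[:raw_body_base64].unpack1("m0")
```

Decoding the base64 copy gives back exactly the bytes that were logged, which is the point of keeping both versions.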

There were no errors for a while, but then a very popular course
opened, and we suddenly got quite a few more, up to a grand total
of 25. From looking at the additional logs we’d added, we could
see that:

- Most of them had user agents indicating browsers on older
  versions of Android
- The invalid bytes always appeared after words or phrases,
  usually at the end of the comment, and the text read fine
  without them
- They carried a flash message ending “… created: Text must not
  contain unsupported characters”
- The invalid bytes were always “%ED%A0%BD”

All text in our database is stored in UTF-8, but for various
reasons [4], we can’t save four-byte UTF-8 characters, such as
emoji. Therefore we have a UnicodeCharactersValidator class,
which makes sure any text entered into our database doesn’t
contain them.
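
The validator's code is not part of this commit, so the following is only a guess at the kind of check it performs. The method name is hypothetical.

```ruby
# Hypothetical stand-in for the check UnicodeCharactersValidator performs;
# the real class is not shown in this commit message.
def contains_four_byte_characters?(text)
  # Code points above U+FFFF take four bytes in UTF-8.
  text.each_char.any? { |char| char.bytesize == 4 }
end

plain_text = "Great course, caf\u00E9 included"
emoji_text = "Great course \u{1F602}"

plain_has_four_byte = contains_four_byte_characters?(plain_text)
emoji_has_four_byte = contains_four_byte_characters?(emoji_text)
```

Accented text passes this check, while text containing an emoji is flagged.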

This validator gets run before a comment is saved, but *after* the
strings are coerced and checked as valid UTF-8. It puts that error
message on the model, which the controller then turns into a flash
alert, which is shown back to the user (along with their comment).

The presence of this flash message means that the users must have
first submitted a comment with valid UTF-8, but at least one
four-byte character. It’s on the re-submit of the comment that
this error is raised.

Those Invalid Bytes

“%ED%A0%BD” is the URL-safe hex encoding of the three bytes you
get by encoding the UTF-16 high surrogate “D83D” as if it were a
UTF-8 code point.


>> ((0xED & 0xF) << 12) + ((0xA0 & 0x3F) << 6) + (0xBD & 0x3F)
=> 55357
>> _.to_s(16)
=> "d83d"

In UTF-16, values in the range [D800, DBFF] are ‘high surrogates’
which need to be followed by a low surrogate in the range [DC00,
DFFF] to represent a character.

The first code point that would begin D83D is thus encoded as D83D
DC00, giving:


>> 0x10000 + ((0xD83D - 0xD800) << 10) +
   ((0xDC00 - 0xDC00) & 0x3FF)
=> 128000
>> _.to_s 16
=> "1f400"

or, Unicode Character 'RAT' (U+1F400)[5]. This is the beginning of
a long block of codepoints assigned to emojis, including most of
the common ones you can find on your phone keyboard. All the emoji
in this range begin D83D when encoded as surrogate pairs.
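
The arithmetic above can be wrapped in a small helper to see the whole range covered by D83D. This is a sketch; the method name is our own invention.

```ruby
# Combine a UTF-16 surrogate pair into the code point it represents.
def surrogate_pair_to_code_point(high, low)
  high_ok = (0xD800..0xDBFF).cover?(high)
  low_ok  = (0xDC00..0xDFFF).cover?(low)
  raise ArgumentError, "not a valid surrogate pair" unless high_ok && low_ok

  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

# Lowest and highest code points whose surrogate pair begins with D83D.
first = surrogate_pair_to_code_point(0xD83D, 0xDC00)
last  = surrogate_pair_to_code_point(0xD83D, 0xDFFF)
range_label = format("U+%X to U+%X", first, last)
```

So a lone D83D tells us only that the intended character was somewhere in that block of 1,024 code points.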

However, the request body containing %ED%A0%BD is a string encoded
as application/x-www-form-urlencoded, which works by encoding a
string as UTF-8 and turning any multi-byte sequences (and some
characters that need escaping like & and =) into %XX strings. If
you put multi-byte characters into a form and post it, you will
see a UTF-8 encoding of those characters arrive at the server -
for instance, U+1F602 would properly be encoded as
%F0%9F%98%82, not as %ED%A0%BD%ED%B8%82.
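
To make the two encodings concrete, here is a sketch that produces both forms for U+1F602. The "broken" branch is our reconstruction of what a misbehaving client would have to do, not code from any real browser.

```ruby
require "uri"

emoji = "\u{1F602}"

# Correct: percent-encode the four UTF-8 bytes of the code point.
proper = URI.encode_www_form_component(emoji)

# Incorrect: treat each UTF-16 surrogate (D83D, DE02) as if it were its own
# code point and write it out using the three-byte UTF-8 bit layout.
surrogates = [0xD83D, 0xDE02]
broken = surrogates.map { |unit|
  bytes = [
    0xE0 | (unit >> 12),
    0x80 | ((unit >> 6) & 0x3F),
    0x80 | (unit & 0x3F)
  ]
  bytes.map { |byte| format("%%%02X", byte) }.join
}.join
```

`proper` is the four-byte form and `broken` is the six-byte surrogate form quoted above.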

It is not valid to send values corresponding to surrogate pairs as
UTF-8, because they are not necessary in UTF-8. Pairs are needed
in UTF-16, for codepoints that don't fit in 2 bytes. UTF-8, being
a variable-length encoding, doesn't need them. So ED A0 BD is an
invalid byte sequence in UTF-8.
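
Ruby agrees with this reading of the standard, as the following short check suggests.

```ruby
# Ruby refuses to build a UTF-8 string from a surrogate code point...
surrogate_rejected =
  begin
    0xD83D.chr(Encoding::UTF_8)
    false
  rescue RangeError
    true
  end

# ...and flags the raw bytes as invalid if they arrive from elsewhere.
raw = "\xED\xA0\xBD".dup.force_encoding(Encoding::UTF_8)
raw_is_valid = raw.valid_encoding?
```

Both results point the same way: the surrogate is rejected as a code point, and its three bytes are not valid UTF-8.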

Most browsers won't allow unpaired surrogates to be entered, but
you can generate them using JavaScript, which represents all
astral symbols (4-byte characters) as two characters:


> s = String.fromCharCode(0xD83D)
'�'
> s.length
1

> s = String.fromCharCode(0xD83D, 0xDE02)
'😂'
> s.length
2

In Chrome, inserting String.fromCharCode(0xD83D) into a form and
submitting it sends Unicode Character 'REPLACEMENT CHARACTER'
(U+FFFD)[6] in its place. But some user agents might retain the
'character' and send it as application/x-www-form-urlencoded.
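
For comparison, Ruby counts code points rather than UTF-16 code units, so the same emoji has length 1 there. The two halves only appear once the string is converted to UTF-16, as this sketch shows.

```ruby
emoji = "\u{1F602}"

# Ruby strings count code points, so the emoji is a single character.
code_points = emoji.length

# Converting to UTF-16 and reading 16-bit big-endian units exposes the pair.
utf16_units = emoji.encode(Encoding::UTF_16BE).unpack("n*")
utf16_hex = utf16_units.map { |unit| unit.to_s(16) }
```

A client that works in UTF-16 code units and drops the second one is left with just the d83d half.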

What we think is going on

We know users try to enter emoji in comments (we get occasional
UserVoice messages about it), but most see the warning, delete the
emoji, and post their comment.

Given the positioning of the invalid bytes (between phrases, or at
the end) and the flash message, we think this problem is also due
to people trying to use emoji.

However, something about these user agents is breaking the emoji
into two 'characters' and only sending us the first part, rather
than either deleting both parts, or replacing the remaining
invalid bytes with the replacement symbol. This could be due to
any number of reasons; perhaps they have a misbehaving browser,
JavaScript plugin, or keyboard app, or their keyboard doesn’t
support 4-byte characters but they work around it by copy-pasting
emoji in. (It could also have been truncated due to a proxy or
network error, but since the content-length header is correct,
this is unlikely.)

Because this is basically a problem caused by a misbehaving client
sending us an invalid request, I found it impossible to replicate
using an actual device. However it is possible to write controller
tests that approximate the situation, since it’s only our model
layer that checks the string content.

How to fix this

Now that we knew what the actual problem was, I had several
options on how to proceed.

Semantically, since it’s invalid data that’s being sent, the page
could respond with an HTTP 400 or 406. However this would show
an error page to the user, and discard the comment they’d tried to
make.

In general it’s best to persist user input wherever possible;
comments can be very long and/or represent a lot of effort on the
user’s part, and it’s a terrible and frustrating experience to
press submit and watch everything you wrote vanish.

Given that the best guess for what’s happening is that users are
trying to delete these emoji anyway, and there’s no way to ‘guess’
at what the other half of the surrogate pair should be, I decided
it would be better to replace the offending bytes with the Unicode
replacement character, as this is what browsers should be sending
us anyway.

This means that users can post their comment without errors, and
there’s a visual hint to them that something’s gone awry - they’ll
be able to edit their comment to remove the replacement character
if they want.
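
Ruby's built-in String#scrub does this kind of replacement. A minimal sketch, using a made-up comment:

```ruby
# A comment ending in the stray high-surrogate bytes seen in the logs.
text = "great course \xED\xA0\xBD".dup.force_encoding(Encoding::UTF_8)

# Replace every invalid byte sequence with U+FFFD.
scrubbed = text.scrub("\uFFFD")

before_valid = text.valid_encoding?
after_valid  = scrubbed.valid_encoding?
```

The readable part of the comment is untouched; only the invalid bytes are swapped for the replacement character.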

*ahem*

There are four controllers that handle comment text, including
one each for creating replies, step comments, and pre-course
discussion comments. This commit adds a concern,
ScrubCommentText, for use in all of them.

This concern uses a before filter on the create and update
actions, which checks for the presence of comment text params, and
replaces any invalid bytes with the Unicode Replacement Character
� [6].
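
The diff itself is not reproduced here, so the following is only a plain-Ruby sketch of the scrubbing step. In the application this logic sits in a Rails concern and runs as a before filter; the nested `params[:comment][:text]` layout and the method name are assumptions.

```ruby
# Plain-Ruby sketch of the scrubbing step; not the real concern.
module ScrubCommentText
  module_function

  # Replaces invalid bytes in the comment text, if any text was sent.
  # The :comment / :text keys are an assumed params layout.
  def scrub_comment_text(params)
    text = params.dig(:comment, :text)
    return params unless text.is_a?(String)

    params[:comment][:text] = text.scrub("\uFFFD")
    params
  end
end

params = {
  comment: { text: "nice \xED\xA0\xBD".dup.force_encoding(Encoding::UTF_8) }
}
cleaned = ScrubCommentText.scrub_comment_text(params)
```

Requests without comment text pass through unchanged, so the filter is safe to include in every comment controller.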

[1] This was originally a link to Honeybadger, which we use for tracking errors.

[3] This was originally a link to the commit that added the tracking. What we did was pass the content of request.env['rack.input'] to Honeybadger as additional ‘context’ when an ArgumentError of the right type is raised in any controller, along with a base64-encoded version, in case Honeybadger did any scrubbing of the bytes we sent it.

[4] This was originally a link to the commit that added the Validator, in which the reasoning is fully explained. In short, we still use MySQL’s utf8 character set, which doesn’t support 4-byte UTF-8 characters, and there are also issues with display and moderation of emoji.

Want to know more about how we use Git? Read our post on how we tell stories in our commits.
