Mal Pinder, one of our Developers, shares a commit message from the code repository for our main Ruby on Rails application, which was made a few weeks ago. As well as being an interesting story about debugging a strange problem with user input, it’s also an example of our policy of writing good commit messages.
The diff of the commit itself was a very small change – only a few lines of code – but we think writing in-depth commit messages helps us understand our code better. Commit messages that detail why code was introduced, how it works, and what alternatives were considered form a very useful reference for future developers (including our future selves) who want to make changes to it.
It’s been edited for formatting, and some links to our internal sites and Git history have been expanded in place so that you can understand the reference, but otherwise it’s unchanged.
Scrub invalid UTF-8 bytes from comment text params

TL;DR

Add a concern to all comment controllers that scrubs the comment text parameter, replacing any invalid bytes with a replacement character.

Some time ago we had a couple of errors reported in Honeybadger with the message “ArgumentError: invalid byte sequence in UTF-8” when saving a comment.

UTF-8 represents characters using between 1 and 4 bytes, but not all combinations of bytes map onto valid characters. The combinations that don’t are called ‘invalid byte sequences’.

In our stack, user input is passed through to Rack, which turns it into a params hash, then the Rails controller uses those params to create the comment. As Rack doesn’t care about the data passed to it, the invalid string makes it all the way down to the model layer. The first comment validation checks that the text attribute is not blank; at this point Ruby checks if the string is valid UTF-8, and then raises the error.

Because Honeybadger can’t display the invalid bytes, we didn’t have enough information to solve the error right then, so we added some extra logging and waited for it to happen again. There were no errors for a while, but then a very popular course opened, and we suddenly got quite a few more, up to a grand total of 25.

From looking at the additional logs we’d added, we could get some more information about all these errors:

- Most of them had user agents indicating browsers on older versions of Android
- The invalid bytes always appeared after words or phrases, usually at the end of the comment, and the text read fine without them
- All but one had a flash alert reading "Your comment could not be created: Text must not contain unsupported characters"
- The invalid bytes were always “%ED%A0%BD”

The Flash Alert

All text in our database is stored in UTF-8, but for various reasons, we can’t save four-byte UTF-8 characters, such as emoji.
Therefore we have a `UnicodeCharactersValidator` class, which makes sure any text entered into our database doesn’t contain them. This validator gets run before a comment is saved, but *after* the strings are coerced and checked as valid UTF-8. It puts that error message on the model, which the controller then turns into a flash alert, which is shown back to the user (along with their comment).

The presence of this flash message means that the users must have first submitted a comment with valid UTF-8, but at least one four-byte character. It’s on the re-submit of the comment that this error is raised.

Those Invalid Bytes

“%ED%A0%BD” is the URL-encoded (percent-encoded) form of three bytes that, read as a three-byte UTF-8 sequence, give the UTF-16 high surrogate “D83D”.
    >> ((0xED & 0xF) << 12) + ((0xA0 & 0x3F) << 6) + (0xBD & 0x3F)
    => 55357
    >> _.to_s(16)
    => "d83d"
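To connect those three bytes back to the reported error, here is a minimal sketch in plain Ruby (2.1 or later, for `String#scrub`). It is not the code from the commit: the sample comment text and the variable names are made up for illustration. It rebuilds the byte string by hand, checks whether Ruby accepts it as UTF-8, triggers the same class of error with a regular expression match, and then scrubs the string.

```ruby
# Hypothetical comment text: ordinary words followed by the three stray bytes.
text = "Great course!".b + [0xED, 0xA0, 0xBD].pack("C*")
text.force_encoding(Encoding::UTF_8)

# Ruby does not accept an encoded surrogate as valid UTF-8.
valid = text.valid_encoding?

# Operations that have to interpret the characters, such as a regular
# expression match, raise ArgumentError on a string like this.
error_message =
  begin
    text =~ /\S/
    nil
  rescue ArgumentError => e
    e.message
  end

# String#scrub swaps the invalid bytes for a replacement character
# (U+FFFD here), leaving the rest of the text untouched.
scrubbed = text.scrub("\uFFFD")

puts valid
puts error_message
puts scrubbed.valid_encoding?
```

The scrubbed string is valid UTF-8 again, so later validations can inspect it without raising.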
In UTF-16, values in the range [D800, DBFF] are ‘high surrogates’, which need to be followed by a low surrogate in the range [DC00, DFFF] to represent a character. The first code point whose UTF-16 encoding begins with D83D is thus encoded as D83D DC00, giving:
    >> 0x10000 + ((0xD83D - 0xD800) << 10) + ((0xDC00 - 0xDC00) & 0x3FF)
    => 128000
    >> _.to_s 16
    => "1f400"
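The same arithmetic works for any surrogate pair. As a small sketch, the helper below (the name `combine_surrogates` is hypothetical, chosen for this example) combines a high and a low surrogate into a code point, and is then used to find the whole range of code points whose UTF-16 encoding begins with D83D.

```ruby
# Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
def combine_surrogates(high, low)
  raise ArgumentError, "not a high surrogate" unless (0xD800..0xDBFF).cover?(high)
  raise ArgumentError, "not a low surrogate" unless (0xDC00..0xDFFF).cover?(low)

  0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
end

first = combine_surrogates(0xD83D, 0xDC00) # lowest code point starting with D83D
last  = combine_surrogates(0xD83D, 0xDFFF) # highest code point starting with D83D

puts format("U+%X..U+%X", first, last) # prints "U+1F400..U+1F7FF"
```

That range includes many emoji, for example U+1F602, which is encoded in UTF-16 as D83D DE02.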
    > s = String.fromCharCode(0xD83D)
    '�'
    > s.length
    1
    > s = String.fromCharCode(0xD83D, 0xDE02)
    '😂'
    > s.length
    2
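Seen from the Ruby side, a complete surrogate pair transcodes to a single, valid, four-byte UTF-8 character, which is the kind of character the flash alert complains about. The sketch below illustrates this; `contains_four_byte_chars?` is a hypothetical stand-in written for this example and is not the real `UnicodeCharactersValidator`.

```ruby
# A complete surrogate pair, packed as big-endian UTF-16 and transcoded to UTF-8.
emoji = [0xD83D, 0xDE02].pack("n*")
                        .force_encoding(Encoding::UTF_16BE)
                        .encode(Encoding::UTF_8)

puts emoji.valid_encoding? # => true
puts emoji.bytesize        # => 4

# Hypothetical check for characters that need four bytes in UTF-8.
def contains_four_byte_chars?(text)
  text.each_char.any? { |char| char.bytesize == 4 }
end

puts contains_four_byte_chars?("nice \u{1F602}") # => true
puts contains_four_byte_chars?("plain text")     # => false
```

A lone high surrogate has no such valid UTF-8 form, which is why half of a pair arrives as an invalid byte sequence rather than as a rejected four-byte character.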
Want to know more about how we use Git? Read our post on how we tell stories in our commits.