Skip main navigation

How to clean data using approximate string matching

Chris Lowis, Data Lead at FutureLearn, shares a quick tip for tidying up data using approximate or fuzzy string matching.

At FutureLearn we’ve taken a very agile approach to development. We try to build the simplest feature we think will work, get it in front of our learners as quickly as possible and then change it based on the feedback we get.

Our approaches to surveys are a good example of this. At first, we wanted to know a little bit about our learners so that we could focus our efforts on providing a great first experience. We used an off-the-shelf software solution, SurveyMonkey. It allowed us to quickly build some surveys so that we could concentrate on our product.

We’ve now run close to 100 courses and we now want to use the surveys to look at what motivates all of our learners in order to help develop the best choice of courses we can in the future.

Unfortunately the names we gave the surveys in the SurveyMonkey software didn’t exactly match the names we used for courses in the rest of the FutureLearn platform. This made matching the two together a bit tricky. In FutureLearn two runs of the same course can have the same name, for example The Secret Power of Brands and The Secret Power of Brands. To cope with this we give each course a unique identifier, which we call a slug.

This table shows how the name in SurveyMonkey maps to a particular course slug:

SurveyMonkey name Course title Course slug
GBBB Parkinsons pre-course survey Good brain, bad brain: Parkinson’s disease good-brain-bad-brain-parks
Southampton Portus pre-course Archaeology of Portus: exploring the lost harbour of ancient Rome portus
The secret power of brands: pre-course survey The secret power of brands secret-power-of-brands
Secret power of brands 2014 pre-course The secret power of brands secret-power-of-brands-2014

The first thing to do was to nip the problem in the bud before it got any worse by allowing our team to link a particular survey (referred to by its URL in SurveyMonkey) to a particular course using our internal content management system. All that remained was to map the 100-odd existing surveys to their corresponding courses so that we could start to analyse them further.

Fetching a list of survey names was relatively straightforward using the SurveyMonkey API and the RestClient gem.

The response contains the survey IDs (which we can use to fetch more information about the surveys from the API), the date the survey was created and its title. At this stage we could print out a list of the survey titles and a list of our course titles and perform the matching by hand. But to make this process a bit quicker we can filter the list of potential courses by comparing the survey title and the course title.

The FuzzyMatch gem finds the closest matching string to your search term (the ‘needle’, or in our case the survey title) from a large list of possible strings (the ‘haystack’, or here a list of possible course titles). It does this by using a combination of bigram comparison and Levenshtein distance. The great thing about the gem is that the defaults are very sensible and easy to use out-of-the-box, but it can be customised in a very flexible way to suit your application.

Here’s how the gem works using the examples from the table above:

The only problem is with the final example where the needle matches two courses in the haystack. In this particular case we could use a very simple heuristic to reduce the size of the haystack. We know that the survey couldn’t have been written after the course had started, so a simple filter allows us to reject impossible results:

In this example we create a slightly more complicated collection of objects for our list of courses, and tell `FuzzyMatch` to use the `title` attribute to search against. By filtering out the course list to only include realistic results our matcher is able to find the correct slug.

All that was left was to quickly go through the results to make sure they made sense, and then generate a data migration to back-fill this data into our new database schema.

I find this approach to be useful in a lot of situations where I have to tidy up data. It’s not a Ruby specific trick either, your language of choice probably has a similar string comparison library for example stringdist in R, or difflib in Python.

FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education

Related stories on FutureLearn