Skip to 0 minutes and 12 seconds
Before we start, I thought I’d show you where I live. I told you before that I moved to New Zealand many years ago. I live in a place called Hamilton. Let me just zoom in and see if we can find Hamilton in the North Island of New Zealand, around the center of the North Island.

Skip to 0 minutes and 30 seconds
This is where the University of Waikato is.

Skip to 0 minutes and 37 seconds
Here is the university; this is where I live.

Skip to 0 minutes and 39 seconds
This is my journey to work: I cycle every morning through the countryside. As you can see, it’s really nice. I live out here in the country. I’m a sheep farmer! I’ve got four sheep, three in the paddock and one in the freezer. I cycle in – it takes about half an hour – and I get to the university. I have the distinction of being able to go from one week to the next without ever seeing a traffic light, because I live out on the same edge of town as the university. When I get to the campus of the University of Waikato, it’s a very beautiful campus. We’ve got three lakes.

Skip to 1 minute and 14 seconds
There are two of the lakes, and another lake down here. It’s a really nice place to work! So I’m very happy here.

Let’s move on to talk about data mining and ethics. In Europe, they have a lot of pretty stringent laws about information privacy. For example, if you’re going to collect any personal information about anyone, a purpose must be stated. The information should not be disclosed to others without consent. Records kept on individuals must be accurate and up to date. People should be able to review data about themselves. Data should be deleted when it’s no longer needed. Personal information must not be transmitted to other locations. Some data is too sensitive to be collected, except in extreme circumstances.

Skip to 2 minutes and 4 seconds
This is true in some countries in Europe, particularly Scandinavia. It’s not true, of course, in the United States. Data mining is about collecting and utilizing recorded information, and it’s good to be aware of some of these ethical issues. People often try to anonymize data so that it’s safe to distribute for other people to work on, but anonymization is much harder than you think. Here’s a little story for you. When Massachusetts released medical records summarizing every state employee’s hospital record in the mid-1990s, the Governor gave a public assurance that the data had been anonymized by removing all identifying information – name, address, and social security number.

Skip to 2 minutes and 45 seconds
He was surprised to receive his own health records (which included a lot of private information) in the mail shortly afterwards! People could be re-identified from the information that was left there. There’s been quite a bit of research done on re-identification techniques. For example, using publicly available records on the internet, 50% of Americans can be identified from their city, birth date, and sex. 85% can be identified if you include their zip code as well. There was some interesting work done on a movie database. Netflix released a database of 100 million records of movie ratings. They got individuals to rate movies on a scale of 1 to 5, and they had a whole bunch of people doing this – a total of 100 million records.

Skip to 3 minutes and 39 seconds
It turned out that you could identify 99% of people in the database if you knew their ratings for 6 movies and approximately when they saw them. Even if you only know their ratings for 2 movies, you can identify 70% of people. This means you can use the database to find out the other movies that these people watched. They might not want you to know that. Re-identification is remarkably powerful, and it is incredibly hard to anonymize data effectively in a way that doesn’t destroy the value of the entire dataset for data mining purposes.
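
To make the re-identification point concrete, here is a minimal Python sketch. It is not the method used in any of the studies mentioned above: the records, the field names, and the `fraction_unique` helper are all made up for illustration. It simply measures what share of "anonymized" records is still unique on a combination of remaining attributes, since anyone who knows those attributes for one person can pick out that person's record.

```python
from collections import Counter


def fraction_unique(records, quasi_identifiers):
    """Share of records whose combination of the given attributes is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Entirely hypothetical records: names are removed, other fields are kept.
records = [
    {"zip": "11111", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "A"},
    {"zip": "11111", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "B"},
    {"zip": "11111", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "C"},
    {"zip": "22222", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "D"},
    {"zip": "22222", "birth_date": "1980-11-02", "sex": "M", "diagnosis": "E"},
    {"zip": "22222", "birth_date": "1980-11-02", "sex": "F", "diagnosis": "F"},
]

only_sex = fraction_unique(records, ["sex"])
sex_and_zip = fraction_unique(records, ["sex", "zip"])
all_three = fraction_unique(records, ["zip", "birth_date", "sex"])

print(f"unique on sex alone:           {only_sex:.2f}")
print(f"unique on sex + zip:           {sex_and_zip:.2f}")
print(f"unique on zip + birth + sex:   {all_three:.2f}")
```

In this toy data nobody is unique on sex alone, but each extra attribute singles out more people, which is the effect the Massachusetts and Netflix stories illustrate on real data.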

Skip to 4 minutes and 13 seconds
Of course, the purpose of data mining is to discriminate: that’s what we’re trying to do! We’re trying to learn rules that discriminate one class from another in the data – who gets the loan? – who gets a special offer? But, of course, certain kinds of discrimination are unethical, not to mention illegal. For example, racial, sexual, and religious discrimination is certainly unethical, and in most places illegal. But it depends on the context. Sexual discrimination is usually illegal … except for doctors. Doctors are expected to take gender into account when they make their diagnoses. They don’t want to tell a man that he is pregnant, for example. Also, information that appears innocuous may not be.

Skip to 4 minutes and 58 seconds
For example, area codes – zip codes in the US – correlate strongly with race; membership of certain organizations correlates with gender. So although you might have removed the explicit racial and gender information from your database, it might still be inferred from other information that’s there.
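
The proxy problem can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `area` and `group` fields, the made-up rows, and the `proxy_accuracy` and `baseline_accuracy` helpers. The idea is to ask how well a supposedly removed sensitive attribute can be guessed from an innocuous-looking one, compared with guessing blindly.

```python
from collections import Counter, defaultdict


def proxy_accuracy(rows, proxy, sensitive):
    """Accuracy of guessing the sensitive value from the proxy value alone,
    always guessing the most common sensitive value seen for that proxy."""
    by_proxy = defaultdict(Counter)
    for row in rows:
        by_proxy[row[proxy]][row[sensitive]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_proxy.values())
    return correct / len(rows)


def baseline_accuracy(rows, sensitive):
    """Accuracy of always guessing the overall most common sensitive value."""
    counts = Counter(row[sensitive] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)


# Hypothetical data: "group" stands in for a sensitive attribute that
# would be deleted before release; "area" stays in the dataset.
rows = (
    [{"area": "north", "group": "X"}] * 4
    + [{"area": "north", "group": "Y"}] * 1
    + [{"area": "south", "group": "Y"}] * 4
    + [{"area": "south", "group": "X"}] * 1
)

baseline = baseline_accuracy(rows, "group")
accuracy = proxy_accuracy(rows, "area", "group")

print(f"guessing without the proxy: {baseline:.2f}")
print(f"guessing from area alone:   {accuracy:.2f}")
```

A large gap between the two numbers suggests that deleting the sensitive column has not really removed the information, so a learned rule that uses the proxy may still discriminate on the sensitive attribute.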

Skip to 5 minutes and 16 seconds
It’s very hard to deal with data: it has a way of revealing secrets about itself in unintended ways. Another ethical issue concerning data mining is that correlation does not imply causation.

Skip to 5 minutes and 31 seconds
Here’s a classic example: as ice cream sales increase, so does the rate of drownings. Therefore, ice cream consumption causes drowning? Probably not. They’re probably both caused by warmer temperatures – people going to beaches. What data mining reveals is simply correlations, not causation. Really, we want causation. We want to be able to predict the effects of our actions, but all we can look at using data mining techniques is correlation. To understand causation, you need a deeper model of what’s going on.

Skip to 6 minutes and 12 seconds
I just wanted to alert you to some of the ethical issues in data mining before you go away and use what you’ve learned in this course on your own datasets: issues about the privacy of personal information; the fact that anonymization is harder than you think; the fact that re-identification of individuals from supposedly anonymized data is easier than you think; data mining and discrimination – it is, after all, about discrimination; and the fact that correlation does not imply causation.

Data mining and ethics

Data mining is a powerful technology, and I urge you to be ethical in its use. Data is sensitive stuff and should be treated with care. Personal data is particularly sensitive, and surprisingly difficult to anonymize: individuals can often be “re-identified” in apparently anonymized data. Note that the very purpose of data mining is discrimination, which frequently raises legal and ethical issues. Furthermore, data mining reveals correlation, which should not be confused with causation.

This video is from the free online course:

Data Mining with Weka

The University of Waikato