1.16

## The University of Waikato

Skip to 0 minutes and 11 secondsHi! My name is Geoff Holmes, and today’s lesson is on infrared data from soil samples. Before starting to talk about the actual application we’ll develop, I thought I’d just mention something about application development in general. The top academic conference in machine learning is called ICML, International Conference on Machine Learning. This is where all the top people in the field present their work. In 2012, a paper was published at this conference which was something of a wake-up call to the machine learning community. The author was Kiri Wagstaff from the Jet Propulsion Lab in Pasadena, CA, and the paper, which is accessible to anyone with an interest in machine learning, is called “Machine Learning that Matters”.

Skip to 0 minutes and 50 secondsThe URL there on the slide will enable you to download it and read it. What the paper does is it points out that the field is focusing too much on new methods and on the accuracy of those methods and less on the kind of application that will really make a difference. What Kiri did was to suggest six challenges for machine learning applications. I’m not going to go through all the six that are listed there on the slide. I just

Skip to 1 minute and 21 secondswant to talk about the highlighted one: $100M saved through improved decision making provided by an ML system. Now, believe it or not, you can develop an ML system using near-infrared data on soil samples that will be something that could save $100 million. This lesson is a starting point for such a system, but it is possible. Before we do that, let’s just take a moment to think about what machine learning requires in order for us to develop an application of any kind. Well, it needs input and output in its training phase.

Skip to 2 minutes and 1 secondIn our case, we need a set of samples – those are going to be soil samples in some form, and you’ll see that in a while – and an output target value. In our case, this is going to be a real valued number, and will represent a property of interest of the soil. That could be organic carbon, organic nitrogen, available nitrogen, potassium. Something that we’re interested in predicting from the input. Our problem, of course, is to learn a mapping that describes the relationship between the input and the output. We refer to this mapping as a “model”.

Skip to 2 minutes and 37 secondsWe build the model on our training data, and then we use that model on unseen observations – new soil, if you like – to predict the target property of that soil that we’re interested in, such as the organic carbon. Now we need to think about where we’re going to get X and Y from for this particular application. Traditionally, soil samples are processed using techniques called “wet chemistry” techniques, and what those wet chemistry techniques are trying to do is determine the properties of the soil, such as available nitrogen, organic carbon and so forth. They will result in the Y values that we’re interested in.
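To make the idea of "build a model on training data, then apply it to new soil" concrete, here is a minimal sketch in plain Python. It is not the method used in the lesson: it assumes a single input value per sample (one reflectance band) and fits a straight line by ordinary least squares. The function names `fit_line` and `predict`, and all the numbers, are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one input)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: X is the reflectance in one band,
# Y is the soil property measured by wet chemistry.
train_x = [0.10, 0.20, 0.30, 0.40]
train_y = [1.0, 2.0, 3.0, 4.0]

model = fit_line(train_x, train_y)   # training phase
estimate = predict(model, 0.25)      # unseen sample ("new soil")
print(f"Estimated soil property: {estimate:.2f}")
```

A real spectrum has many bands rather than one, so the actual models take a whole vector of reflectance values as X, but the train-then-predict pattern is the same.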

Skip to 3 minutes and 35 secondsWhat we need for this application is for a number of samples to have been processed using wet chemistry to determine these Y values for us. Let’s say we’re interested in available nitrogen. We need, let’s say 50 or 100 different soil samples to have been processed using wet chemistry to produce 50 to 100 Y values. We need to take a portion of each of those samples from, let’s say, a thing called a “soil bank”. Suppose we’ve got a soil bank. We divide our soil sample into half. We send half off to the wet chemistry lab to get the property determined, and with the other half, we put that through a near-infrared device. That will produce the X values for our input.

Skip to 4 minutes and 25 secondsNow, the near-infrared device produces a signature, if you like, for the soil sample. I’ve got an example of one there below on the slide. These values will form the input. In the sense of an ARFF file, they represent the values or reflectance values for a given wavelength band. You’ll see in the ARFF file produced for the [quiz] that that starts at around 350 nanometers – that’s the first attribute. The next one might be 370 nanometers, 390 – or 80, 90, 400, 410, and so on.

Skip to 5 minutes and 3 secondsThe number of attributes we have, as you’ll see in the example, is something like 200 for each of those spectral wavelength bands, and then the values are numeric values, which are the amplitudes, if you like, of the spectrum, just the reflectance values that you get from the device. So as I said, you need a few hundred samples, so it’s not cheap, because you’ve got to send off – whatever number of samples you’ve got, it’s very cheap to get the X, but it’s expensive to get the Y, because you’ve got to send those off for wet chemistry analysis. So to put together a decent training set is expensive.
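As a rough illustration of the file layout described above, the following sketch builds a tiny ARFF document in Python: one numeric attribute per wavelength band, followed by a numeric target as the last (class) attribute. The helper `make_arff`, the attribute names, the 10 nm band spacing, and every value are assumptions chosen for the example; the dataset used in the activity will have its own names and roughly 200 bands.

```python
def make_arff(relation, wavelengths_nm, target_name, rows):
    """Return ARFF text with one numeric attribute per band plus a numeric target."""
    lines = [f"@relation {relation}", ""]
    for w in wavelengths_nm:
        lines.append(f"@attribute band_{w}nm numeric")
    # The target (Y) goes last, where Weka looks for the class by convention.
    lines.append(f"@attribute {target_name} numeric")
    lines += ["", "@data"]
    for reflectances, target in rows:
        if len(reflectances) != len(wavelengths_nm):
            raise ValueError("each row needs one reflectance value per band")
        values = list(reflectances) + [target]
        lines.append(",".join(str(v) for v in values))
    return "\n".join(lines) + "\n"


# Made-up example: three bands and two soil samples.
wavelengths = [350, 360, 370]
rows = [
    ([0.12, 0.15, 0.19], 4.3),
    ([0.10, 0.14, 0.17], 3.8),
]
arff_text = make_arff("soil_nir", wavelengths, "available_nitrogen", rows)
print(arff_text)
```

Each data row is one soil sample: the reflectance values come from the NIR device, and the final value comes from wet chemistry.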

Skip to 5 minutes and 42 secondsGiven that, why would you bother doing that for the soil in this particular application? Well, once you’ve, let’s say, got your 50–100 samples and you’ve built your model, and if a farmer comes in with a new soil sample and says “I want to know what the available nitrogen is”, we just get out our available nitrogen model that we built and we get the NIR spectra for that new sample – that represents new X, if you like – we run it through the model, and it will produce an estimate of Y for that soil signature. We’ll be able to tell the farmer “for your soil sample, the available nitrogen is 4.3” (or whatever that estimated Y value is).

Skip to 6 minutes and 27 secondsInstead of days for the wet chemistry to take place, we’re talking about milliseconds for the NIR device to produce the signature for us to run through the model and get the estimate of Y. That’s the first thing that makes it useful. It’s very fast. Second thing that makes it useful is that we can produce, for the same input, if we’ve got enough models, an estimate for a number of soil properties, not just one. If we’ve got, for example, wet chemistry which has determined the potassium, available nitrogen, the organic carbon, the organic nitrogen, and so on, then we can build models for each of those and for the same X value, we can produce predictions for each of those soil properties.
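The "one spectrum, several models" idea can be sketched as a dictionary of models keyed by soil property, all applied to the same input. This is only an illustration: the `linear_model` helper, the weights, and the three-band spectrum are invented, and a real system would use models trained on wet chemistry results.

```python
def linear_model(weights, bias):
    """Return a function that maps a spectrum to bias + sum(weight * reflectance)."""
    def predict(spectrum):
        return bias + sum(w * r for w, r in zip(weights, spectrum))
    return predict


# One hypothetical model per soil property (coefficients are made up).
models = {
    "available_nitrogen": linear_model([2.0, 0.0, 1.0], 0.5),
    "organic_carbon": linear_model([0.0, 4.0, 0.0], 1.0),
}

# A single NIR signature (the same X) is run through every model.
spectrum = [0.5, 0.25, 1.0]
estimates = {name: model(spectrum) for name, model in models.items()}
print(estimates)
```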

Skip to 7 minutes and 15 secondsSo we can tell the farmer with the soil sample in very short order – of the order of milliseconds – what the values are for each of those soil properties. All right, so that’s the value of it. How do we actually go about doing the modeling? Well, the training set, remember, let’s imagine it’s an ARFF file. The right-most column, or the class column, would be a set of numeric values, so we’re talking about a regression problem. Then the attributes are all these reflectance values at various wavelengths. They’re all numeric values, as well. We’ve got X numeric values, and so is Y.

Skip to 7 minutes and 56 secondsThe classifiers of interest are things such as LinearRegression, RepTree, model tree M5 prime, RandomForest, support vector machine regression, GaussianProcesses, and so on. What I’ve done there is lined up the algorithms in terms of their processing speed. What you’ll do in the [Quiz] activity is you’ll process the data using the first four, because you’ll see that it’s quite a large dataset, and the other two take too long really to be useful. We’ll be saying more about that later. The big thing – message – though is that pre-processing can make a big difference to a classifier’s performance. What you’ll do is process the data raw, and then you’ll see what happens to the results when you start applying the pre-processing techniques.

Skip to 8 minutes and 39 secondsThe classifiers respond in different ways to the different pre-processing techniques. Some get better, some get worse, some stay the same. One thing that’s worth bearing in mind is that you’re about to enter experimental machine learning, where you’re going to have lots of results, because the [Quiz] activity takes you through the first four classifiers on the previous slide, but all in default mode. Now, each of them has parameters that can be tweaked, and so can each form the basis for a separate experiment. You’ll be using four pre-processing methods, one of which is to do nothing, just use the raw spectrum. Now, some of those methods themselves have parameters, as well. Of course, you can combine the pre-processing methods, as well.
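To get a feel for how quickly the experiment space grows, the sketch below enumerates classifier and pre-processing combinations in Python. The classifier names are the Weka class names for the four algorithms mentioned; the pre-processing names other than `raw` are placeholders, since the actual methods are introduced in the activity. Parameter settings are ignored here, so the real space is larger still.

```python
from itertools import permutations, product

classifiers = ["LinearRegression", "REPTree", "M5P", "RandomForest"]

# "raw" means no pre-processing; the other names are hypothetical placeholders.
preprocessing = ["raw", "method_A", "method_B", "method_C"]

# Every classifier with every single pre-processing method, default parameters.
single_step = list(product(preprocessing, classifiers))
print(len(single_step), "experiments with one pre-processing step each")

# Combining two different (non-raw) methods in sequence adds more pipelines.
two_step_pipelines = list(permutations(preprocessing[1:], 2))
combined = list(product(two_step_pipelines, classifiers))
print(len(combined), "further experiments with two-step pipelines")
```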

Skip to 9 minutes and 27 secondsSo the space of experiments is extremely large. From all of that, you’ll be able to produce some pretty good results. Now, what you’ll be looking at is particularly the correlation coefficient. So how well does the predicted value match the known value from the training data using cross-validation? That will give you some idea of how close you are, and what you want, of course, is to produce models that get you close to 1.0, the perfect correlation with what you’ve seen in training data previously. Now, you’ll see that that’s not possible, because there’s too much error in the data typically. But it will be a starting point.
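The correlation coefficient reported for a regression model is the Pearson correlation between the known values and the predicted values. The sketch below computes it in plain Python on made-up numbers; in practice the predictions would come from cross-validation, where each sample is predicted by a model that did not see it during training.

```python
from math import sqrt
from statistics import mean


def pearson(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    a_bar, p_bar = mean(actual), mean(predicted)
    cov = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    var_a = sum((a - a_bar) ** 2 for a in actual)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


# Made-up wet chemistry values and held-out predictions for five samples.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 5.1]

r = pearson(actual, predicted)
print(f"Correlation coefficient: {r:.3f}")

# Predictions that track the actual values perfectly (here, exactly double)
# give a correlation of 1.0, even though the values themselves differ.
r_perfect = pearson(actual, [2 * a for a in actual])
```

Note that correlation measures how well predictions track the known values, not how close they are in absolute terms, which is why it is usually read alongside an error measure.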

Skip to 10 minutes and 15 secondsYou’ll mainly see the improvement you can get from that baseline or benchmarking that you do with the raw data to what happens when you apply various pre-processing techniques. I hope you enjoy that. I hope it whets your appetite for machine learning application development.

# Analyzing infrared data from soil samples

Some feel that data miners focus too much on new methods and tiny improvements in accuracy, instead of on applications that will make a real difference in practice. Geoff Holmes discusses six major challenges that have been proposed for data mining applications: one is to save \$100M through improved decision making. Inferring properties of soil samples from infrared data can save significant sums of money, because it can enable expensive wet chemistry to be replaced by an automatic process. However, achieving sufficient accuracy is a real challenge.