How can you mine continuous data streams?
This week we look at mining streams of data in real time.
Imagine taking input from a sensor, or from a webcam, or from a Twitter feed. It keeps coming in, continually, inexorably – forever! And a data mining system must keep up with this, creating and maintaining a model that you can inspect at any time – but which changes continuously as new data comes in. There’s no limit. This is bigger than big data: it’s potentially infinite.
How can you deal with this? First, let’s make clear what we will not be tackling. We won’t be talking about how to connect sensors to a data mining program, nor how to deploy models that affect the world on a continuous basis. And we’ll introduce algorithms that are real-time in principle, ones that update their models in a time that is independent of the volume of data already seen – but we won’t concern ourselves with implementing them to work fast enough to keep up with the data produced by any particular sensor.
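To make "a time that is independent of the volume of data already seen" concrete, here is a minimal sketch of a constant-time update. It is not a data mining algorithm from this course, just a running mean and variance (Welford's method) chosen for illustration; the class name `RunningStats` and the sample readings are assumptions made up for the example.

```python
class RunningStats:
    """Mean and variance maintained incrementally (Welford's method).

    Hypothetical helper for illustration: each update touches a fixed
    number of variables, so its cost does not grow with the stream.
    """

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # current mean
        self._m2 = 0.0    # sum of squared deviations from the mean

    def update(self, x):
        # Constant work per sample; no past samples are stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two samples have arrived.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up sensor values
    stats.update(reading)
    # The "model" (mean and variance) can be inspected here at any time.

print(stats.n, stats.mean, stats.variance)
```

The point of the sketch is only the shape of the computation: the state is fixed in size and is revised once per sample, which is the property that stream-oriented learners aim for.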
That still leaves plenty of open questions. For example, can you make a stream-oriented decision tree classifier? You could force J48 to work incrementally by re-running it from scratch as each new data sample arrives. But that couldn’t possibly work in real time, because no matter how quickly it operates there will come a point where the volume of data overwhelms its ability to produce a new model before the next data sample arrives. And anyway, data streams tend to evolve as time goes on, so it’s probably inappropriate to build a model of the entire stream right back to the beginning of time. And what about evaluation? Our trusty method of cross-validation won’t work any more, because you will never have the full dataset. What can take its place?
Our end-of-week example is an application of data mining to bioinformatics. We focus on signal peptide prediction. What’s the problem? And how can appropriate features be created from raw DNA?
At the end of this week you will be able to explain, at a high level, how decision trees can be modified incrementally, and compare the performance of incremental and non-incremental decision tree algorithms. You’ll be able to use Weka’s MOA (Massive Online Analysis) package. And you’ll be able to use the MOA system itself, which not only contains stream-oriented implementations of many learning algorithms, but also offers evaluation techniques designed for incremental operation. You’ll know the difference between holdout and prequential evaluation. You’ll know about adaptive windowing and how to use it for change detection. You’ll also have some experience of sentiment analysis using Twitter data.
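As a rough preview of prequential evaluation, the sketch below tests each incoming sample on the current model first and only then uses it for training, so no separate test set is needed. This is an illustrative toy, not MOA's implementation: the `MajorityClass` learner, the `prequential_accuracy` function and the toy stream are all hypothetical names and data invented for the example.

```python
from collections import Counter


class MajorityClass:
    """Toy incremental learner (hypothetical): predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training data has arrived there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Constant-time update: bump one counter.
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: score each sample before the learner sees its label."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:   # 1. test on the not-yet-seen sample
            correct += 1
        total += 1
        learner.learn(x, y)           # 2. then train on that same sample
    return correct / total if total else 0.0


# Made-up stream of (features, label) pairs; features are unused by this toy learner.
toy_stream = [(None, "a"), (None, "a"), (None, "b"),
              (None, "a"), (None, "b"), (None, "b")]
accuracy = prequential_accuracy(toy_stream, MajorityClass())
print(accuracy)
```

Holdout evaluation, by contrast, would set aside a fixed batch of samples purely for testing; in the prequential scheme every sample serves as test data once and as training data afterwards.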