Matthew Sutton

I am a PhD student with research interests in genomics, operations research, big data and machine learning.

Location: Brisbane, QLD, Australia

Activity

  • Hi, I'm Matt, a mentor for this course. I'm in the second year of my PhD, working on Big Data methods for biomedical research. I'm interested in how Big Data can be used to help design vaccinations, or identify brain regions that are affected in certain diseases. For these applications, the size of the data and the complexity of the model make data visualization a...

  • Great questions! In response to the first question: streaming, sketching and sequential updating are more advanced techniques, so we did not describe them in detail here. Broadly, these methods offer different ways to implement algorithms so that they scale with the data. An article here...

  • Hi Caroline, I haven't tried to run the software on a tablet before. However, I can still try to help with this. First, how much memory does your tablet have? Also, could you see my comment below to check whether your tablet can run the 64-bit software? If it has enough memory and can run 64-bit software, I think it should be possible to run the VM. Alternatively, the...

  • Hello Erick and Patrick,
    I think that you might be able to get the 64-bit software running if you have a 32-bit operating system but 64-bit compatible hardware. I have attempted to answer the question in more detail in the comments above:
    https://www.futurelearn.com/courses/big-data-decisions/3/steps/173560/comments#fl-comments
    In particular, take a look...

  • Hello learners,
    I've noticed a few comments on the compatibility of computers with the Cloudera virtual machine technology, in particular, whether a 32-bit machine can run the software. To help with this, I would refer you to the following article:
    http://www.memuplay.com/blog/index.php/2016/01/27/enable-hardware-virtualization/
    The article mentions a...

  • Great description Santiago. SQL is a great tool for analysing data that is structured into tables. These tables are linked by relationships: for example, one table in the database may contain CUSTOMER data, while another may contain PRODUCT data. SQL allows you to select and manipulate data from multiple tables by exploiting relationships common to the two...

  • Great answer! From the perspective of someone researching Big Data for biological datasets, I believe the effect of Big Data on the health industry will continue to grow.

    If applications of Big Data in biological research are of interest to you, I'd recommend checking out the Big Data to Knowledge (BD2K) initiative which has been delivering weekly virtual...

  • It depends on the situation. If the size of the data is truly unruly and your team doesn't have much experience, then getting someone who knows the big data platforms would help. If you have the time to train the team, then that might be the better option in the long run. Either way, it is important to keep good communication between the people implementing the...

  • Domain expertise is vital for any data analysis, and developing the communication skills to achieve this can be very challenging. I'd like to mention that data management skills (e.g. ensuring the data is reliable) are often overlooked but are very important for an analysis to be effective.

  • No worries, and good luck with the MOOCs. Here's a link for Tomasz Bednarz (lead educator): https://twitter.com/tomaszbednarz

  • Visualisation for Big Data is very important (see MOOC 4 in the series). The recent explosion in WebVR and WebAR has been very promising for applications with Big Data. If you're in Brisbane you might enjoy the Brisbane VR hackathon held at the Powerhouse in June http://vrhackathon.web3d.org/brisbane-2/. I'd also recommend following Tomasz on Twitter - he...

  • Thanks for the insight! We welcome the philosophical issues and the technical stuff :)
    Many of the existing methods, such as neural networks, did not have the computing power or the vast quantities of data needed to be effective 20-30 years ago. The amount of data now available has changed this drastically. Techniques we once thought were not effective are now state-of-the-art....

  • Unfortunately, the terminology can be tricky but don't be too put off. Even Big Data experts can find it confusing.

    Give the game at http://pixelastic.github.io/pokemonorbigdata/ a go. The aim is to try to tell which nonsense word is a Pokemon character and which is a Big Data technology.

  • I think a lot of us are thinking about the effect of our Big Data research. As you said, it is a tool that can be used for good or bad. There was a good article on the effect of Big Data on politics at The Conversation https://goo.gl/mTI02u

  • Hi, I'm Matt, a mentor for this course. I'm in the second year of my PhD working on Big Data methods for biomedical research.
    I'm interested in how Big Data can be used to help design vaccinations, or identify brain regions that are affected in certain diseases. For these applications, the size of the data often leads to a lot of statistical...

  • Thanks for pointing this out! Setting alpha = 1 corresponds to the lasso method (L1 penalisation) whereas alpha = 0 corresponds to the ridge (L2 penalisation). We've flagged this and will remove the error as soon as possible.
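
    In h2o.glm the alpha argument is the elastic-net mixing parameter between 0 and 1, so a minimal sketch of the two extremes looks like this (train is just a placeholder for your H2O training frame, with the response in column 20 as in the course data):

        library(h2o)    # assumes h2o.init() has already been run and train is an H2O frame
        lasso_model <- h2o.glm(x = 1:19, y = 20, training_frame = train, alpha = 1)  # pure L1 (lasso)
        ridge_model <- h2o.glm(x = 1:19, y = 20, training_frame = train, alpha = 0)  # pure L2 (ridge)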

  • Good suggestion, Allan. We'll pass this on and try to fix it up for future runs of the course.

  • Hi Viki, you could re-run the final instructions (pg 7) from the previous notes http://bit.ly/2aT9Z8E. Make sure you type cd ~/Desktop and then the git clone command. After that, the command cd ~/Desktop/FLbigdataD2D/data should work.

  • Hi Dylan, sorry to hear about the struggles with Cloudera. I don't think we've had the issue of the drop-down box before. Was it a browser thing? We are trying to keep a course help page http://bit.ly/2blHArj for common issues, so let us know if you think it's worth including. As for the software, the VM is probably the simplest way to get Hadoop running...

  • Sorry to hear that. It might be worth saving the downloads on a USB so that if you get your hands on another computer you can come back to the course and run the install files without waiting for the downloads.

  • Good to hear you solved the issue. Have fun with Hadoop!

  • Hi Scott, have you checked that the virtual machine settings are correct for your Windows machine? See the course help video: https://youtu.be/-Wa7TGjmn5M . If this isn't the issue, you could try following the suggestions for common Cloudera issues at:...

  • Unfortunately virtual machines do take up a fair bit of space. Hope you enjoy the rest of the course!

  • Hi Anjani, what operating system are you using? It might be a problem with the virtualisation settings on your computer. If you have a version of Windows, this link will guide you through changing these settings: https://youtu.be/-Wa7TGjmn5M

  • The exercise is really a survey for people to find out what other learners know about big data going into the course. You can check the results of what others wrote at https://futurelearn.typeform.com/report/o7rRRD/uXDP .

  • Our production team wasn't able to reproduce the error either. From the team: " Bad dns could be any computer in the chain from where you are to the server… so it could be a whole country that’s temporarily out… or similar". So the best advice we can give is to try again later. Not the best solution but we will continue to look into this. Let us know if this...

  • Hmmm, I've taken a quick look around and I'm not sure why this error is occurring. The page http://bit.ly/2b9Vhd7 suggests that it might be an issue with the firewall or your internet connection. You could follow the instructions on the webpage or try Firefox. I've sent an email to see if the issue is on the FutureLearn side and I'll let you know if...

  • Weird... I've just checked on my computer and it seemed fine. What browser are you using? I'm using Firefox and I've tried it in Chrome too.

  • Searching for good data sources and making sure the data is clean is usually more time-consuming than the analysis https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/. @NathanielBooth good list, people often forget that data collection is incredibly important.

  • Great answers here already. SQL is still used frequently in industry and can be an effective tool for the analysis of reasonably large structured data, so a base knowledge is useful. Introducing SQL also makes it easier to explain big data alternatives such as Hadoop.

  • Great to hear! We do cover a lot of material so it's up to you to decide how far you want to take it. Don't worry if you don't have the time to run through all the examples. The course introduces a large number of technologies and it's more important that you see what's available for big data analysis.

  • Great link for practical data visualisation!

  • That should be a 5e4. When you type a for command, MATLAB will wait for the corresponding end before executing the commands.

  • It's due to precision errors with different operating systems in MATLAB; check the FAQ: https://goo.gl/hIwH0N

  • Good observation that the 3rd and 4th columns of the matrix U do not match the provided answers due to the choice of null space. However, I've found that the singular values 12 and 6 match up with the square roots of the eigenvalues of C'*C. What values are you getting?
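
    If you want to check the relationship yourself, here is a quick sketch with a random matrix (not the course matrix):

        C <- matrix(rnorm(12), nrow = 4)    # any 4 x 3 matrix will do
        svd(C)$d                            # singular values of C
        sqrt(eigen(t(C) %*% C)$values)      # square roots of the eigenvalues of C'C: the same values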

  • Nice links! The challenge exercises from MIT look interesting.

  • After making the sample, try running the command summary(market_data_sample). You should have columns with names: housing, job, day, ..., emails.moth and y. If not, then the data is not being read correctly, so you could try re-downloading the data from the Dropbox, restarting R and trying the commands again.

  • The book "Elements of Statistical Learning" has a good image to show partitions (see figure 9.2). The book is free from here http://statweb.stanford.edu/~tibs/ElemStatLearn/

  • Strange error... Are you running Windows or Mac? This page might help https://support.rstudio.com/hc/en-us/community/posts/205207678-Blank-screen-when-opening-RStudio

    1) The suppressed variable importance values are all quite small; you can see them with View(h2o.varimp(glm_model)).
    2) The results will differ slightly with a new training and testing split. However, they should be similar and tell the same story. For example, my VI values are 0.97, 0.74, 0.55, 0.46 and 0.45 but the corresponding variables are in the same...

  • 1) The predictors have descriptions given at the github page http://bit.ly/1UWXhVT. This is a fictional dataset that contains variables that seem relevant for call data analysis.
    2) This was answered here http://bit.ly/1s33SSN and I'll add it to the FAQ now.
    3) Dealing with missing data is outside the scope of the course right now. Briefly, there are...

  • What was the issue? If it's something lots of people will have issues with, we'll add it to the course's living FAQ https://goo.gl/F8l06n

  • Unfortunately the exercises here require R specific packages so you won't be able to produce the graph in H2O using the UI. The error you're having might be an incorrect file path. Make sure you unzip the folder into the same directory as your documents and downloads folders. Then the commands in the instructions should work. Failing that, double check you...

  • Miles's answer is here https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81564/comments?page=1#comment_12485333 for why we remove the variable. The response variable is the 20th variable in the data i.e. market_data[,20]

  • split_data is a random split of the data. You can set a seed, e.g. h2o.splitFrame(..., seed=1), if you want to reuse the same random split or replicate the results.
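
    For example, a reproducible 75/25 split might look like this (market_data.hex is just a placeholder for whatever your H2O frame is called):

        splits <- h2o.splitFrame(market_data.hex, ratios = 0.75, seed = 1)
        train  <- splits[[1]]    # 75% used for training
        test   <- splits[[2]]    # remaining 25% for testing; seed = 1 makes the split repeatable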

  • The response variable for the glm model is actually the 20th variable ("y") and the predictor variables are the other 1:19 variables ("x=1:19" in the function). As Miles has commented (https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81564/comments?page=1#comment_12485333) the 11th variable is removed because of something called Leakage...
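
    Putting that together, the call looks roughly like the sketch below (family = "binomial" is my assumption for a yes/no response and the frame name is a placeholder; check the course notes for the exact code):

        predictors <- setdiff(1:19, 11)    # drop the 11th variable because of the leakage issue
        glm_model  <- h2o.glm(x = predictors, y = 20, training_frame = train,
                              family = "binomial")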

  • Yup, Tukey wrote a great book on the subject, Exploratory Data Analysis, and this is usually where you start for a statistical analysis of a dataset. However, in high-dimensional settings graphs and traditional approaches to visualisation can become infeasible - you don't want to do a histogram of every variable if you have several thousand variables.

  • H2O offers a large number of machine learning algorithms that are written for parallel processing and can run on a large number of different platforms - HDFS, SQL and NoSQL. Just taking advantage of the parallel aspect can give a huge gain in speed (https://infogr.am/h2o-scaling-the-limits-of-r). Moreover, H2O is able to handle data without running into...

  • Sounds like the data is not being stored in a readable place. Make sure that when you download the dataset you unzip it into the same directory as the Documents and Downloads folders - then you should be able to read the data using the file path: filePath = "~/FLbigdataStats/bank_customer_data.csv".
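
    Once the folder is in place, something like this should confirm the file can be found (plain read.csv is used here just as a quick check):

        filePath  <- "~/FLbigdataStats/bank_customer_data.csv"
        bank_data <- read.csv(filePath)
        str(bank_data)    # lists the columns so you can confirm the data was read correctly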

  • Good question. To subset data you can use "market_dataex1 <- market_data[,-c(1,2,4)]" to take everything except the 1st, 2nd and 4th variables, or you can use "market_dataex1 <- market_data[,c(1,2,4)]" to take only those variables. The results are dependent on cross-validation, which takes random splits of the data. The random splitting gives different (but...

  • H2O has a nice interface that you can use if you don't want to deal directly with R. There's a nice clip of the interface at http://blog.h2o.ai/2014/11/introducing-flow/ if you want to see it before you download. To install the file follow the instructions at: http://www.h2o.ai/download/h2o/desktop

  • Hi Marijcke, it sounds like you've downloaded the wrong dataset. Try going to the Github page https://github.com/QUT-BDA-MOOC/FLbigdataStats and clicking the download zip file. After unzipping the file to your top directory (the one with documents and downloads folders) try reading the data again.

  • What problems are you having with the software? R or H2O?

  • Good suggestion, trees are popular right now and the randomForest package in R is a good implementation. Also popular are generalized boosted machines; the GBM method is covered using H2O later in the course https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81565.
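
    If you want to try the randomForest package, a rough sketch (using the course's market_data and assuming the response column is called y) would be:

        library(randomForest)
        rf_model <- randomForest(factor(y) ~ ., data = market_data, ntree = 500)
        importance(rf_model)    # variable importance, similar in spirit to h2o.varimp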

  • Great example of machine learning in practice!

  • Sorry Abdul, I've taken another look at the H2O website - http://www.h2o.ai/product/recommended-systems-for-h2o/ and it doesn't look like the H2O app is supported on phones. I've tried to get in contact with them to see if they can/will make it available on phones and I'll let you know if there's anything available! Sorry again for any inconvenience.

  • I've never tried to run H2O from a phone before, but you might be able to get it to work using the R app: https://play.google.com/store/apps/details?id=appinventor.ai_RInstructor.R2&hl=en And then installing the H2O package.

  • It looks like the tutorial might be out of date. The key argument has been replaced by destination_frame since the last stable version of H2O. See: http://stackoverflow.com/questions/31442820/unable-to-convert-data-frame-to-h2o-object

    For importing the Iris data try: iris.hex = h2o.importFile(path = irisPath, destination_frame = "iris")

  • Hi Akram Amari, here is a direct link http://www.cs.waikato.ac.nz/ml/weka/downloading.html

  • Hadoop Distributed File System. There is more information at step 1.21 and in the linked article :)

  • Glad to hear you're doing well now.

    Big data can help inform government policy and also guide city planners who are looking to build cancer care services. The disparities in cancer incidence and survival that are evident across geographical areas are sometimes attributed to (but not restricted to): environmental factors, screening and diagnosis, migration of...

  • Machine Learning as a research discipline started with the study of Artificial Intelligence by computer scientists. Over time these methods have become intertwined with statistics, and we now see a huge overlap between the fields. Generally, it refers to analytical methods that have been influenced by computer scientists and statisticians to help make predictions and...

  • Really, any mathematical knowledge is an asset for big data since it develops logical thinking and problem-solving skills. That said, the more practical mathematics backgrounds to have are applied mathematics, statistics and operations research. You will get a better idea of the level of mathematical knowledge needed in our next course:...

  • Nice article!

  • It might have something to do with software vs hardware virtualisation. A similar problem was answered here at VirtualBox https://www.virtualbox.org/ticket/3125. You could try page 10 of the VirtualBox manual here http://download.virtualbox.org/virtualbox/2.1.2/UserManual.pdf. If you still have problems you could try posting a ticket to the VirtualBox website.

  • Maybe try a different browser?

    The script is copied below:
    Let's take a look at sample pseudo code for the Mapper and Reducer for our example of counting words from a text file, for example the one presented in the diagram. The Mapper pseudo code goes as presented on the slide. The Mapper input is a line of text (string). For each word in the input line, the...
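
    To give a feel for the logic in code, here is a toy version of the same word-count mapper and reducer written in R (just an illustration of the idea, not the pseudo code from the slide):

        mapper <- function(line) {
          words <- strsplit(tolower(line), "\\s+")[[1]]
          lapply(words, function(w) list(key = w, value = 1))    # emit a (word, 1) pair for each word
        }
        reducer <- function(key, values) {
          list(key = key, count = sum(unlist(values)))           # add up the 1s for each distinct word
        }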

  • Sorry to hear that it wasn't working. The video is a good walkthrough of what you would've been doing anyway.

  • I'm not sure I understand what you mean by countries having restricted data. Unfortunately, if you don't have the data then you can't do the analysis. Operations research might be a better way to go in this case.

  • I'll try to rattle off a couple of examples.

    In business you often hear the term “data driven”. What this means is that the business is trying to incorporate data into its decision-making process rather than relying solely on instinct or making business decisions based on a “we always do it like this” attitude. For example, in online shopping the time users spend...

  • SQL has a sharp learning curve. Once you've got the basics of joins, selects, deletes and creating tables, though, you should find that you know how to do most of the things you want to do.

  • Thanks Chaithu for the example. Big data is something that, in my opinion, will have an increasing presence in finance. In particular, algorithmic trading often deals with large amounts of financial information: http://bit.ly/1atP3jC

  • Those system specs sound good enough. What was going wrong?

  • Basically, we want to go from the massive dataset that we have (possibly petabytes or exabytes), model and analyse it, and integrate the results of the analysis into a final summary of the data (in the range of kilobytes). Hopefully this helped, but if you have any more questions let us know.

  • Hi Helen, "hashing" in computer science refers to indexing items so that they can be retrieved or sorted faster. In Hadoop, when a record has been mapped to a key-value pair it will be sent to one of the reducers. The record's key is hashed to determine which reducer the key belongs to. This process can become a bit more advanced and is related to...
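
    As a toy illustration of the idea (this is not Hadoop's actual hash function, just a sketch in R):

        n_reducers    <- 4
        toy_hash      <- function(key) sum(utf8ToInt(key))               # crude hash: sum of character codes
        which_reducer <- function(key) (toy_hash(key) %% n_reducers) + 1
        which_reducer("hadoop")    # every record with the key "hadoop" goes to the same reducer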

  • Another interesting question would be what is not big data? This article describes some of the misconceptions on what BigData actually is http://bit.ly/1Sh7BZ0 .

  • Nice find! Were you able to fix up the code?

  • Love the graphics on this site. I saw a similar article about toxicity vs supportiveness for Reddit at http://bit.ly/19hRiGA

  • Great article, it really highlights the fact that correlation doesn’t imply causation and how hard it can be to continually calibrate a model. As mentioned in the article, Big data shows a lot of promise in this area; in fact a team at Harvard recently put up a revised version of Google's flu tracker. It'll be interesting to see if the model continues to...

  • Nice link, I liked the example (on page 1) about using Hadoop for analyzing shopping patterns. Hadoop is particularly good with unstructured data like clickstream data from online shopping websites

  • You're right to be wary of data visualizations. An old quote popularised by Mark Twain says "there are three kinds of lies: lies, damned lies and statistics". This article talks about how different choices for creating data visualizations can tell completely different stories...

  • A great source of free American healthcare data is http://www.healthdata.gov. Also for those curious PII stands for personally identifiable information.

  • Genomic sequencing has become one of the most exciting areas for big data analysis. Some of the tools we introduce here (Hadoop and Spark in particular) are being used to tackle these data-related tasks. Check this paper out if you're interested https://www.mapr.com/blog/hadoop-and-genome-sequencing-perfect-match

  • Glad you're finding it interesting! If you can't download Cloudera then watching the video should provide some sense of working with HDFS and how large data can be managed. The course showcases a number of technical methods that you could return to when you encounter some truly large data. If you don't want to do the practical parts, then just reading the...

  • @MichaelRenner Thanks for the feedback. The technical challenges with downloading and installing Cloudera can be frustrating. However, we included this software in the lectures since it is both free and widely used in the big data community. And there's no easy way around the download time, unfortunately.
    We have tried to make the process a bit less daunting...

  • Hi Abraham, could you let us know what sections were hard or confusing in the course? This is the first time we've run a MOOC, so your input is appreciated.

  • Hi Roberto, a database schema provides a template for the data. In MySQL a schema is defined and the data is then stored to fit this template. Hadoop uses a "schema-on-read" process instead, so for Hadoop we start with our data and add a schema to fit our needs. See this link for more information: http://blog.cask.co/2015/03/schema-on-read-in-action/

  • @James Heron

    This was taken from the course help https://goo.gl/27KBl2

    The VM does not boot, or boots with an error relating to AMD-V or Intel VT-x.

    Your computer may have hardware virtualisation disabled. This site may help you diagnose and remedy this problem:
    http://www.sysprobs.com/disable-enable-virtualization-technology-bios

  • Great link, I really like the conclusions and future perspectives section.

  • There should be a link under the video for the transcript

  • The amount of data involved in image processing can be incredible. Facebook processes around 300 million images and handles upwards of 500 TB daily http://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily/

  • Try testing your big data terminology and Pokemon knowledge at https://pixelastic.github.io/pokemonorbigdata/

  • To get a really good introduction to Big Data I'd recommend enrolling in all four of the mini-MOOCs:

    BIG DATA: FROM DATA TO DECISIONS
    BIG DATA: STATISTICAL INFERENCE AND MACHINE LEARNING
    BIG DATA: MATHEMATICAL MODELLING
    BIG DATA: DATA VISUALISATION

    Our teaching team all find R to be an excellent platform for statistical analysis and...

  • The three V's are a good way to understand what big data is, though even data with a small volume can be very complex and can push the boundaries of current computational methods.