Matthew Sutton
I am a PhD student with research interests in genomics, operations research, big data and machine learning.
Location Brisbane, QLD, Australia
Activity
-
Matthew Sutton made a comment
Hi I'm Matt, a mentor for this course. I'm in the second year of my PhD working on Big Data methods for biomedical research. I'm interested in how Big Data can be used to help design vaccinations, or identify brain regions that are affected in certain diseases. For these applications, the size of the data and complexity of the model makes data visualization a...
-
Matthew Sutton replied to Michael Gillan
Great questions! In response to the first question; streaming, sketching and sequential updating are more advanced techniques so we did not describe them in detail here. Broadly these methods offer different ways to implement algorithms so that they scale with data. An article here...
-
Matthew Sutton replied to Matthew Sutton
Hi Caroline, I haven't tried to run the software on a tablet before. However, I can still try to help with this. First, how much memory does your tablet have? Also, could you see my comment below to check whether your tablet can run the 64-bit software? If it has enough memory and can run 64-bit software, I think it should be possible to run the VM. Alternately, the...
-
Matthew Sutton replied to Erick Keicher
Hello Erick and Patrick,
I think that you might be able to get the 64-bit software running if you have a 32-bit operating system but 64-bit compatible hardware. I have attempted to answer the question in more detail in the comments above:
https://www.futurelearn.com/courses/big-data-decisions/3/steps/173560/comments#fl-comments
In particular, take a look...
-
Matthew Sutton made a comment
Hello learners,
I've noticed a few comments on the compatibility of computers with the Cloudera virtual machine technology, in particular, whether a 32-bit machine can run the software. To help with this, I would refer you to the following article:
http://www.memuplay.com/blog/index.php/2016/01/27/enable-hardware-virtualization/
The article mentions a...
-
Matthew Sutton replied to SIISU MOHAMMED
Great description Santiago. SQL is a great tool for analysing data that is structured into tables. These tables are structured by relationships; for example, one table in the database may contain CUSTOMER data, while another may contain PRODUCT data. SQL allows you to select and manipulate data from multiple tables by exploiting relationships common to the two...
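As a rough sketch of that idea, here is a tiny in-memory database queried from Python. The table names, column names, the linking purchase table and all of the rows are made up for illustration; they are assumptions, not the course's actual database.

```python
import sqlite3

# Hypothetical schema for illustration only: these tables, columns and rows
# are assumptions, not data from the course.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, title TEXT, price REAL);
    CREATE TABLE purchase (customer_id INTEGER, product_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO product  VALUES (10, 'Kettle', 30.0), (11, 'Toaster', 25.0);
    INSERT INTO purchase VALUES (1, 10), (1, 11), (2, 11);
""")

# Join the tables through their shared id columns and total the spend per customer.
rows = conn.execute("""
    SELECT c.name, SUM(p.price) AS total_spent
    FROM customer AS c
    JOIN purchase AS o ON o.customer_id = c.customer_id
    JOIN product  AS p ON p.product_id  = o.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)
```

The joins are where the relationships do the work: each purchase row points at one customer and one product, so a single query can combine all three tables.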
-
Matthew Sutton replied to Joseph Nkfusai
Great answer! From the perspective of someone researching Big Data for biological datasets, I believe the effect of Big Data on the health industry will continue to grow.
If applications of Big Data in biological research are of interest to you, I'd recommend checking out the Big Data to Knowledge (BD2K) initiative which has been delivering weekly virtual...
-
Matthew Sutton replied to Joe Corso
It depends on the situation. If the size of the data is truly unruly and your team doesn't have much experience, then getting someone who knows the big data platforms would help. If you have the time to train the team, then that might be better in the long run. Either way, it is important to keep good communication between the people implementing the...
-
Matthew Sutton replied to Johan Boesveld
Domain expertise is vital for any data analysis, and developing the communication skills to achieve this can be very challenging. I'd like to mention that data management skills (e.g. ensuring the data is reliable) are often overlooked but are very important for an analysis to be effective.
-
Matthew Sutton replied to Nicolas Philippon
No worries, and good luck with the MOOCs. For Tomasz Bednarz (lead educator), here's a link: https://twitter.com/tomaszbednarz
-
Matthew Sutton replied to Nicolas Philippon
Visualisation for Big Data is very important (see MOOC 4 in the series). The recent explosion in WebVR and WebAR has been very promising for applications with Big Data. If you're in Brisbane you might enjoy the Brisbane VR hackathon held at the Powerhouse in June: http://vrhackathon.web3d.org/brisbane-2/. I'd also recommend following Tomasz on Twitter - he...
-
Matthew Sutton replied to Herbert Mehlhose
Thanks for the insight! We welcome the philosophical issues and the technical stuff :)
Many existing methods, such as neural networks, did not have the computing power or the vast quantity of data needed to be effective 20-30 years ago. The amount of data now available has changed this drastically. Techniques we thought were not effective are now state-of-the-art....
-
Matthew Sutton replied to Nikola Cabanova
Unfortunately, the terminology can be tricky but don't be too put off. Even Big Data experts can find it confusing.
Give the game at http://pixelastic.github.io/pokemonorbigdata/ a go. The aim is to try and tell which nonsense word is a Pokemon character and which is a Big Data technology
-
Matthew Sutton replied to Robert Dyson
I think a lot of us are thinking about the effect of our Big Data research. As you said, it is a tool that can be used for good or bad. There was a good article on the effect of Big Data on politics at The Conversation: https://goo.gl/mTI02u
-
Matthew Sutton made a comment
Hi I'm Matt, a mentor for this course. I'm in the second year of my PhD working on Big Data methods for biomedical research.
I'm interested in how Big Data can be used to help design vaccinations, or identify brain regions that are affected in certain diseases. For these applications, the size of the data often leads to a lot of statistical...
-
Thanks for pointing this out! Setting alpha = 1 corresponds to the lasso method (L1 penalisation) whereas alpha = 0 corresponds to the ridge (L2 penalisation). We've flagged this and will remove the error as soon as possible.
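To make the two extremes concrete, here is a small sketch of the elastic-net penalty as it is usually parameterised (the glmnet-style convention, which I believe H2O's GLM follows). The function name and the coefficients are made up for illustration.

```python
def elastic_net_penalty(coefs, alpha, lam=1.0):
    """Return lam * (alpha * L1 + (1 - alpha) / 2 * L2 squared) for 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefs)          # sum of absolute values
    l2_squared = sum(b * b for b in coefs)   # sum of squares
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

coefs = [3.0, -4.0]  # made-up coefficients, for illustration only

lasso_penalty = elastic_net_penalty(coefs, alpha=1.0)  # only the L1 term remains
ridge_penalty = elastic_net_penalty(coefs, alpha=0.0)  # only the L2 term remains
print(lasso_penalty, ridge_penalty)
```

Values of alpha between 0 and 1 blend the two penalties.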
-
Good suggestion Allan, we'll pass this on and try to fix this up for future runs of the course
-
Matthew Sutton replied to Björn Redlund
Hi Viki, you could re-run the final instructions (pg 7) from the previous notes http://bit.ly/2aT9Z8E. Make sure you type cd ~/Desktop and then the git clone command. After that, the command cd ~/Desktop/FLbigdataD2D/data should work.
-
Matthew Sutton replied to Dylan Kennedy
Hi Dylan, sorry to hear about the struggles with Cloudera. I don't think we've had the issue with the drop-down box before; was it a browser thing? We are trying to keep a course help page http://bit.ly/2blHArj for common issues, so let us know if you think it's worth including. As for the software, the VM is probably the simplest way to get Hadoop running...
-
Matthew Sutton replied to Oscar Lovell
Sorry to hear that. It might be worth saving the downloads on a USB so that if you get your hands on another computer you can come back to the course and run the install files without waiting for the downloads.
-
Matthew Sutton replied to Anjani Dhrangadhariya
Good to hear you solved the issue. Have fun with Hadoop!
-
Matthew Sutton replied to Scott Miller
Hi Scott, have you checked that the virtual machine settings are correct for your Windows machine? From the course help: https://youtu.be/-Wa7TGjmn5M . If this isn't the issue, you could try following the suggestions for common Cloudera issues at:...
-
Matthew Sutton replied to Becky D
Unfortunately virtual machines do take up a fair bit of space, hope you enjoy the rest of the course
-
Matthew Sutton replied to Anjani Dhrangadhariya
Hi Anjani, what operating system are you using? It might be a problem with the virtualisation settings on your computer. If you have a version of Windows, this link will guide you through changing these settings: https://youtu.be/-Wa7TGjmn5M
-
Matthew Sutton replied to Scott Miller
The exercise is really a survey for people to find out what other learners know about big data going into the course. You can check the results of what others wrote at https://futurelearn.typeform.com/report/o7rRRD/uXDP .
-
Matthew Sutton replied to Hasmin Sandoval
Our production team wasn't able to reproduce the error either. From the team: " Bad dns could be any computer in the chain from where you are to the server… so it could be a whole country that’s temporarily out… or similar". So the best advice we can give is to try again later. Not the best solution but we will continue to look into this. Let us know if this...
-
Matthew Sutton replied to Hasmin Sandoval
Hmmm, I've taken a quick look around and I'm not sure why this error is occurring. The page http://bit.ly/2b9Vhd7 suggests that it might be an issue with the firewall or your internet connection. You could follow the instructions on the webpage or try in Firefox. I've sent an email to see if the issue is on the FutureLearn side and I'll let you know if...
-
Matthew Sutton replied to Hasmin Sandoval
Weird.. I've just checked on my computer and it seemed fine. What browser are you using? I'm using firefox and I've tried it on chrome too.
-
Matthew Sutton replied to Paul Kennedy
Searching for a good data source and making sure the data is clean is usually more time-consuming than the analysis: https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/. @NathanielBooth good list, people often forget that data collection is incredibly important.
-
Great answers here already. SQL is still used frequently in industry and can be an effective tool for the analysis of reasonably large structured data, so a base knowledge is useful. Introducing SQL also makes it easier to explain big data alternatives such as Hadoop.
-
Great to hear! We do cover a lot of material so it's up to you to decide how far you want to take it. Don't worry if you don't have the time to run through all the examples. The course introduces a large number of technologies and it's more important that you see what's available for big data analysis.
-
Great link for practical data visualisation!
-
Matthew Sutton replied to Paul Haines
That should be a 5e4. When you type a for command, MATLAB will wait for the corresponding end before executing the commands.
-
It's due to precision errors in MATLAB on different operating systems; check the FAQ: https://goo.gl/hIwH0N
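I can't say this is exactly what happens inside MATLAB on each operating system, but the general effect is easy to reproduce. A minimal sketch in Python (not MATLAB): adding the same numbers in a different order changes the last few bits of the result, so answers should be compared with a tolerance rather than for exact equality.

```python
import math

# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Exact comparison is fragile; a tolerance-based comparison is not.
exactly_equal = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)

print(left_to_right, right_to_left, exactly_equal, close_enough)
```

If your answer only differs from the provided one in the last few decimal places, it is most likely this kind of rounding difference.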
-
Matthew Sutton replied to Vadim Sovetkin
Good observation that the 3rd and 4th columns of the matrix U do not match the provided answers due to the choice of null space. However, I've found that the singular values 12 and 6 match up with the square root of the eigenvalues of C'*C. What values are you getting?
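Here is a small check of that relationship on a made-up 2x2 matrix (it is not the matrix C from the exercise, and it is written in Python rather than MATLAB). It forms C'C, takes its eigenvalues from the trace and determinant, and then takes square roots to get the singular values.

```python
import math

# Hypothetical 2x2 matrix chosen for illustration; it is NOT the course's C.
C = [[3.0, 0.0],
     [4.0, 5.0]]

# Entries of the symmetric matrix A = C'C.
a = C[0][0] ** 2 + C[1][0] ** 2            # A[0][0]
b = C[0][0] * C[0][1] + C[1][0] * C[1][1]  # A[0][1] == A[1][0]
d = C[0][1] ** 2 + C[1][1] ** 2            # A[1][1]

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
trace, det = a + d, a * d - b * b
root = math.sqrt(trace * trace / 4.0 - det)
eigenvalues = [trace / 2.0 + root, trace / 2.0 - root]

# The singular values of C are the square roots of the eigenvalues of C'C.
singular_values = [math.sqrt(ev) for ev in eigenvalues]
print(eigenvalues, singular_values)
```

A handy sanity check is that the product of the singular values equals the absolute value of the determinant of C.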
-
Nice links! The challenge exercises from MIT look interesting
-
Have you tried the suggestions here https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81560/comments?page=3#comment_12500757 ?
-
After making the sample try running the command summary(market_data_sample). You should have columns with names: housing, job, day,..., emails.moth and y. If not then the data is not being read correctly so you could try re-downloading the data from the dropbox, restarting R and trying the commands again.
-
Matthew Sutton replied to Loice Atieno
The book "Elements of statistical learning" has a good image to show partitions (see figure 9.2). The book is free from here http://statweb.stanford.edu/~tibs/ElemStatLearn/
-
Strange error... Are you running windows or Mac? This page might help https://support.rstudio.com/hc/en-us/community/posts/205207678-Blank-screen-when-opening-RStudio
-
1) The suppressed variable importance values are all quite small; you can see them with View(h2o.varimp(glm_model)).
2) The results will differ slightly with a new training and testing split. However, they should be similar and tell the same story. For example my VI values are 0.97, 0.74, 0.55, 0.46 and 0.45 but the corresponding variables are in the same...
-
1) The predictors have descriptions given at the github page http://bit.ly/1UWXhVT. This is a fictional dataset that contains variables that seem relevant for call data analysis.
2) This was answered here http://bit.ly/1s33SSN and I'll add it to the FAQ now.
3) Dealing with missing data is outside of the scope of the course right now - Briefly, there are...
-
Matthew Sutton replied to Mischa Peters
What was the issue? If it's something lots of people will have issues with, we'll add it to the course living FAQ https://goo.gl/F8l06n
-
Unfortunately the exercises here require R specific packages so you won't be able to produce the graph in H2O using the UI. The error you're having might be an incorrect file path. Make sure you unzip the folder into the same directory as your documents and downloads folders. Then the commands in the instructions should work. Failing that, double check you...
-
You need to install the packages --> install.packages("dplyr") etc. Details here https://ugc.futurelearn.com/uploads/files/cf/fb/cffb606a-5e41-47a8-978e-9ee76d843e59/Instructions_SettingUpCourseSoftware.pdf
-
Miles's answer is here https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81564/comments?page=1#comment_12485333 for why we remove the variable. The response variable is the 20th variable in the data i.e. market_data[,20]
-
split_data is a random split of the data. If you want to reuse a random split or replicate the results, you can set a seed: h2o.splitFrame(..., seed=1).
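The same idea in plain Python, as an analogy only (this is not H2O code; the function name and the 75/25 ratio are assumptions for illustration): with a fixed seed the random split comes out the same every time.

```python
import random

def split_rows(rows, train_fraction=0.75, seed=None):
    """Randomly split rows into (train, test); a fixed seed repeats the split."""
    rng = random.Random(seed)   # private generator, so the global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(20))                       # stand-in for the rows of a dataset
train_a, test_a = split_rows(rows, seed=1)
train_b, test_b = split_rows(rows, seed=1)   # same seed, so the same split
print(train_a == train_b, test_a == test_b)
```

Leaving the seed unset gives a fresh split on each run, which is why results can differ slightly between learners.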
-
The response variable for the glm model is actually the 20th variable ("y") and the predictor variables are the other 1:19 variables ("x=1:19" in the function). As Miles has commented (https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81564/comments?page=1#comment_12485333) the 11th variable is removed because of something called Leakage...
-
Yup, Tukey wrote a great book on the subject of Exploratory Analysis and this is usually where you start for a statistical analysis of a dataset. However, in high dimensional settings graphs and traditional approaches to visualisation can become infeasible - you don't want to do a histogram of every variable if you have several thousands of variables.
-
H2O offers a large number of machine learning algorithms that are written for parallel processing and can run on a large number of different platforms - HDFS, SQL and NoSQL. Just taking advantage of the parallel aspect can give a huge gain in speed (https://infogr.am/h2o-scaling-the-limits-of-r). Moreover, H2O is able to handle data without running into...
-
Make sure you have installed the relevant R packages (particularly dplyr): https://ugc.futurelearn.com/uploads/files/cf/fb/cffb606a-5e41-47a8-978e-9ee76d843e59/Instructions_SettingUpCourseSoftware.pdf
-
Sounds like the data is not being stored in a readable place. Make sure that when you download the dataset you unzip it into the same directory as the Documents and Downloads folders - then you should be able to read the data using the file path filePath = "~/FLbigdataStats/bank_customer_data.csv".
-
Good question. To subset the data you can use "market_dataex1 <- market_data[,-c(1,2,4)]" to take everything except the 1st, 2nd and 4th variables, or "market_dataex1 <- market_data[,c(1,2,4)]" to take only those variables. The results are dependent on cross validation, which takes random splits of the data. The random splitting gives different (but...
-
H2O has a nice interface that you can use if you don't want to deal directly with R. There's a nice clip of the interface at http://blog.h2o.ai/2014/11/introducing-flow/ if you want to see it before you download. To install the file follow the instructions at: http://www.h2o.ai/download/h2o/desktop
-
Hi Marijcke, it sounds like you've downloaded the wrong dataset. Try going to the Github page https://github.com/QUT-BDA-MOOC/FLbigdataStats and clicking the download zip file. After unzipping the file to your top directory (the one with documents and downloads folders) try reading the data again.
-
What problems are you having with the software? R or H2O?
-
Good suggestion, trees are popular right now and the randomForest package in R is a good implementation. Gradient boosting machines are also popular; the GBM method is covered using H2O later in the course https://www.futurelearn.com/courses/big-data-machine-learning/1/steps/81565.
-
Great example of machine learning in practice!
-
Matthew Sutton replied to Abdul Malik Sulemana
Sorry Abdul, I've taken another look at the H2O website - http://www.h2o.ai/product/recommended-systems-for-h2o/ and it doesn't look like the H2O app is supported on phones. I've tried to get in contact with them to see if they can/will make it available on phones and I'll let you know if there's anything available! Sorry again for any inconvenience.
-
Matthew Sutton replied to Abdul Malik Sulemana
I've never tried to run H2O from a phone before, but you might be able to get it to work using the R app: https://play.google.com/store/apps/details?id=appinventor.ai_RInstructor.R2&hl=en And then installing the H2O package.
-
Matthew Sutton replied to Graeme Smith
It looks like the tutorial might be out of date. The key argument has been replaced by destination_frame since the last stable version of H2O. See: http://stackoverflow.com/questions/31442820/unable-to-convert-data-frame-to-h2o-object
For importing the Iris data try: iris.hex = h2o.importFile(path = irisPath, destination_frame = "iris")
-
Matthew Sutton replied to Akram Amari
Hi Akram Amari, here is a direct link http://www.cs.waikato.ac.nz/ml/weka/downloading.html
-
Matthew Sutton replied to Huang Tengda
Hadoop Distributed File System. There is more information at step 1.21 and in the linked article :)
-
Matthew Sutton replied to Simon Bourne
Glad to hear you're doing well now.
Big data can help inform government policy and also guide city planners who are looking to build cancer care services. The disparities in cancer incidence and survival that are evident across geographical areas are sometimes attributed to (but not restricted to): environmental factors, screening and diagnosis, migration of...
-
Matthew Sutton replied to Carol McKnight
Machine Learning as a research discipline started with the study of Artificial Intelligence by computer scientists. Over time these methods have become intertwined with statistics and we see a huge overlap in the fields. Generally it refers to analytical methods that have been influenced by computer scientists and statisticians to help make predictions and...
-
Matthew Sutton replied to Marcia Criollo
Really, any mathematical knowledge is an asset for big data since it develops logical thinking and problem-solving skills. That said, the more practical mathematics backgrounds to have are applied mathematics, statistics and operations research. You will get a better idea of the level of mathematical knowledge needed in our next course:...
-
Matthew Sutton replied to Wael Youssef
Nice article!
-
Matthew Sutton replied to Glynn Hinchcliffe
It might have something to do with software vs hardware virtualisation. A similar problem was answered here at VirtualBox: https://www.virtualbox.org/ticket/3125. You could try page 10 of the VirtualBox manual here: http://download.virtualbox.org/virtualbox/2.1.2/UserManual.pdf. If you still have problems you could try posting a ticket on the VirtualBox website.
-
Maybe try a different browser?
The script is copied below:
Let’s take a look at sample pseudo code for Mapper and Reducers, of our example of counting words from a text file, for example one presented about the diagram. Mapper pseudo code goes as presented on the slide. The Mapper input is a line of text (string). For each word in the input line, the... -
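For anyone who would like to see the word count example as something runnable, here is a minimal single-machine sketch in Python. It only imitates the map, shuffle and reduce steps in memory; it is not Hadoop code and not the pseudo code from the slide, and the function names are my own.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum all the counts that were emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    # Shuffle step: group the mapper output by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(word, counts) for word, counts in grouped.items())

counts = word_count(["big data is big", "data is data"])
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the grouping happens over the network, but the logic of each piece is the same.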
Matthew Sutton replied to N R
Sorry to hear that it wasn't working. The video is a good walkthrough of what you would've been doing anyway
-
Matthew Sutton replied to chaithu honey
I'm not sure I understand what you mean by countries having restricted data? Unfortunately if you don't have the data then you can't do the analysis. Operations research might be a better way to go in this case.
-
Matthew Sutton replied to chaithu honey
I'll try to rattle off a couple of examples.
In business: you often hear the term “data driven”. What this means is that the business is trying to incorporate data into its decision-making process rather than relying solely on instinct or making business decisions based on a “we always do it like this” attitude. For example, in online shopping the time users spend...
-
SQL has a steep learning curve. After getting the basics of joins, selects, deletes and creating tables, though, you should find that you know how to do most of the things you want to do.
-
Matthew Sutton replied to Swaroop Chandre
Thanks Chaithu for the example. Big data is something that in my opinion will have an increasing presence in finance. In particular, algorithmic trading often deals with large amounts of financial information: http://bit.ly/1atP3jC
-
Matthew Sutton replied to Glynn Hinchcliffe
Those system specs sound good enough. What was going wrong?
-
Matthew Sutton replied to chaithu honey
Basically, we want to start from the massive dataset that we have (possibly petabytes or exabytes), model and analyse it, and integrate the results of the analysis into a final summary of the data (in the range of kilobytes). Hopefully this helped, but if you have any more questions let us know
-
Hi Helen, "hashing" in computer science refers to indexing items so that they can be retrieved or sorted faster. In Hadoop, when a record has been mapped to a key-value pair it will be sent to one of the reducers. The record's key is hashed to determine which reducer the key belongs in. This process can become a bit more advanced and is related to...
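A minimal sketch of that routing step, assuming three reducers and a made-up list of key-value pairs. As far as I know, Hadoop's default partitioner takes the key's Java hash code modulo the number of reducers; this Python version swaps in a CRC32 checksum purely to get a hash that is stable between runs.

```python
import zlib

def choose_reducer(key, num_reducers):
    """Map a key to a reducer index using a stable hash of the key."""
    # zlib.crc32 stands in for the key's hash code here, because Python's
    # built-in hash() of strings changes from one run to the next.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Made-up mapper output, for illustration only.
pairs = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]
assignments = [(key, choose_reducer(key, 3)) for key, _ in pairs]
print(assignments)
```

The important property is that every record with the same key lands on the same reducer, so that reducer sees all of the values for that key.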
-
Matthew Sutton replied to Julie Lindsay
Another interesting question would be what is not big data? This article describes some of the misconceptions on what BigData actually is http://bit.ly/1Sh7BZ0 .
-
Matthew Sutton replied to Alexander Hanysz
Nice find! Were you able to fix up the code?
-
Matthew Sutton replied to Adam Hill
Love the graphics on this site. I saw a similar article about toxicity vs supportiveness for Reddit at http://bit.ly/19hRiGA
-
Matthew Sutton replied to Bupe Tyson
Great article, it really highlights the fact that correlation doesn’t imply causation and how hard it can be to continually calibrate a model. As mentioned in the article, Big Data shows a lot of promise in this area; in fact a team at Harvard recently put up a revised version of Google's flu tracker. It'll be interesting to see if the model continues to...
-
Matthew Sutton replied to Rosario Sotomayor
Nice link, I liked the example (on page 1) about using Hadoop for analyzing shopping patterns. Hadoop is particularly good with unstructured data like clickstream data from online shopping websites
-
Matthew Sutton replied to Bupe Tyson
You're right to be wary of data visualizations. An old Mark Twain quote says "there are three types of lies - lies, damn lies and statistics". This article talks about how different choices for creating data visualizations can tell completely different stories...
-
Matthew Sutton replied to John Hanchulak
A great source of free American healthcare data is http://www.healthdata.gov. Also for those curious PII stands for personally identifiable information.
-
Matthew Sutton replied to James Wolman
Genomic sequencing has become one of the most exciting areas for big data analysis. Some of the tools we introduce here (Hadoop and Spark in particular) are being used to tackle these data-related tasks. Check this article out if you're interested: https://www.mapr.com/blog/hadoop-and-genome-sequencing-perfect-match
-
Matthew Sutton replied to Leigh Hastwell
Glad you're finding it interesting! If you can't download Cloudera then watching the video should provide some sense of working with HDFS and how large data can be managed. The course showcases a number of technical methods that you could return to when you encounter some truly large data. If you don't want to do the practical parts, then just reading the...
-
Matthew Sutton replied to Michael Renner
@MichaelRenner Thanks for the feedback. The technical challenges with downloading and installing Cloudera can be frustrating. However, we included this software in the lectures since it is both free and widely used in the big data community. Unfortunately, there's no easy way around the download time.
We have tried to make the process a bit less daunting...
-
Matthew Sutton replied to Abraham Danquah
Hi Abraham, could you let us know which sections of the course were hard or confusing? This is the first time we've run a MOOC so your input is appreciated
-
Matthew Sutton replied to Robbie Cappuccio
Hi Roberto, a database schema provides a template for the data. In MySQL a schema is defined first and the data is then stored to fit this template. Hadoop uses a "schema-on-read" process instead, so for Hadoop we start with our data and add a schema to fit our needs when we read it. See this link for more information: http://blog.cask.co/2015/03/schema-on-read-in-action/
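A toy illustration of schema-on-read in Python (not Hadoop code; the raw text, column names and types are all made up): the data is stored as plain text with no structure enforced, and the reader decides on names and types only when the data is read.

```python
import csv
import io

# Raw data is stored as-is; nothing is validated when it is written.
raw = "1,Ana,34.5\n2,Ben,12.0\n"

# Hypothetical schema, chosen by the reader at read time.
schema = [("customer_id", int), ("name", str), ("spend", float)]

def read_with_schema(text, schema):
    """Apply column names and types while reading the raw text."""
    for fields in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

records = list(read_with_schema(raw, schema))
print(records)
```

A different analysis could read the very same raw text with a different schema, which is the flexibility that schema-on-read buys you.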
-
Matthew Sutton replied to Graciela Pérez Corral
@James Heron
This was taken from the course help https://goo.gl/27KBl2
The VM does not boot or boots with error relating to AMD-V or Intel vt-x
Your computer may have hardware virtualisation disabled. This site may help you diagnose and remedy this problem:
http://www.sysprobs.com/disable-enable-virtualization-technology-bios
-
Matthew Sutton replied to Massirfufulay Musa
Great link, I really like the conclusions and future perspectives section.
-
Matthew Sutton replied to Steve Cuddy
There should be a link under the video for the transcript
-
Matthew Sutton replied to Josh Bitossi
The amount of data involved in image processing can be incredible. Facebook processes around 300 million images and handles upwards of 500 TB daily: http://www.cnet.com/news/facebook-processes-more-than-500-tb-of-data-daily/
-
Matthew Sutton replied to taha tadmori
Try testing your big data terminology and Pokemon knowledge at https://pixelastic.github.io/pokemonorbigdata/
-
Matthew Sutton replied to Helen Tuddenham
To get a really good introduction to Big Data I'd recommend enrolling in all four of the Mini-moocs.
BIG DATA: FROM DATA TO DECISIONS
BIG DATA: STATISTICAL INFERENCE AND MACHINE LEARNING
BIG DATA: MATHEMATICAL MODELLING
BIG DATA: DATA VISUALISATION
Our teaching team all find R to be an excellent platform for statistical analysis and...
-
Matthew Sutton replied to Amy Chow
The three V's are a good way to understand what big data is. Though even data with a small volume can be very complex, and can push the boundaries of current computational methods.