Anthony (Ace) Ebert

I'm a statistics PhD student at QUT. I'm interested in Bayesian analysis and queueing theory. I script in R.

Location Brisbane, Australia

Activity

  • Yes, unless you have a statistical model. See "mixture models" https://en.wikipedia.org/wiki/Mixture_model . It's often the case that a machine-learning technique has a statistics counterpart, such as clustering and mixture models, respectively.
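
    Here's a quick illustration of the correspondence, a minimal sketch in R (assuming the mclust package is installed; the simulated data is just for show):

    # Simulate two overlapping groups of points
    set.seed(1)
    x <- c(rnorm(50, mean = 0), rnorm(50, mean = 4))

    # Machine-learning view: k-means clustering
    km <- kmeans(x, centers = 2)

    # Statistics view: a two-component Gaussian mixture model
    library(mclust)
    mm <- Mclust(x, G = 2)

    # Compare the hard cluster labels with the model-based classification
    table(km$cluster, mm$classification)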

  • Yes, the MSE is quite high; this is with "call length" as the response rather than confidence_index.

    Ideally I would start with all variables, compute the AIC (Akaike information criterion) for each model with one variable taken out, and remove the variable which, when removed, minimizes the AIC. I used the p-value instead, which is a bit naughty but...
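
    In R this procedure can be automated with step(); here's a minimal sketch (the data frame name 'bank' and the response 'call_length' are placeholders):

    # Start from the full model with all candidate variables
    full <- lm(call_length ~ ., data = bank)

    # Repeatedly drop the variable whose removal lowers the AIC the most
    reduced <- step(full, direction = "backward")
    summary(reduced)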

  • Hi Andrez! Welcome to the course. Which Python libraries do you use?

  • See 2.17 for my results

  • Thanks everyone for participating, I thought you might like to see my results for call length. I performed backward stepwise linear regression by starting with all the numeric or integer variables and deleting the one with the highest p-value until no p-value was greater than 0.05. This is the model I ended up with:

    SELECT LINEAR_REG('linreg_model11a',...

  • Hi Steven, I agree with you. I'm sorry I didn't see this comment sooner.

  • Yes, that's the correct response. You're right, the table is created in the background for future use so that you can refer to it in subsequent queries (as long as you don't close the session by closing the terminal window). If you want to export the table as a spreadsheet, run the following commands:

    \o output.txt
    SELECT * FROM bank_data;

    Click on dbadmin's...

  • It's an alternative measure to "Between-Cluster Sum of Squares".

  • Hi Myo Min Thein, I had never heard of Monitoring and Evaluation (is that right?) as a process. It's a tricky part of academia, actually; there's a lot of debate about how to measure research output, and whether the number of journal articles and citations is a good measure.

  • Hi Temitope, welcome to the course! You're most welcome. Do you have any experience with databases or SQL?

  • It's a tricky subject. See https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set . Adding more clusters will always increase "Between-Cluster Sum of Squares". So you keep adding clusters until the increase in "Between-Cluster Sum of Squares" seems insignificant. It's called the elbow method in the wiki article.
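
    If you'd like to try it yourself, here's a minimal sketch of the elbow method in R using kmeans() on simulated data:

    set.seed(1)
    x <- matrix(rnorm(200), ncol = 2)

    # Between-cluster sum of squares for k = 1, ..., 10 clusters
    bss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$betweenss)

    # Look for the 'elbow' where the curve flattens out
    plot(1:10, bss, type = "b", xlab = "Number of clusters",
         ylab = "Between-cluster SS")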

  • What do you mean exactly? In discriminant analysis (a related topic in statistics) the model is trained with a training set where the group of each observation is known and then the samples are classified into these groups. For instance Torrence et al (2003) have characteristics of starch grains for known species and then classify sample grains....
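
    For a concrete example, here's a sketch in R using MASS::lda() on the built-in iris data (not the starch-grain data from the paper):

    library(MASS)

    # Train on observations whose group (species) is known
    fit <- lda(Species ~ ., data = iris)

    # Classify samples into those known groups
    samples <- iris[c(1, 51, 101), -5]   # pretend these are unlabelled samples
    predict(fit, newdata = samples)$class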

  • There are two concepts at play here:
    - The performance of a classifier (AUC)
    - The evidence of an effect (p-value)

    It's possible to have strong evidence for a classifier that performs badly, and vice versa. The AUC tells us how useful a classifier is; the p-value tells us whether there is an effect or not. Obviously for a classifier to perform well the effect...
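
    A small simulation in R makes the distinction concrete (a sketch, assuming the pROC package is installed): with a huge sample and a tiny effect, the p-value is highly significant even though the classifier is barely better than guessing.

    set.seed(1)
    n <- 100000
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(0.05 * x))   # a real but tiny effect

    fit <- glm(y ~ x, family = binomial)
    summary(fit)$coefficients["x", "Pr(>|z|)"]   # strong evidence: tiny p-value

    library(pROC)
    auc(roc(y, fitted(fit)))   # but the AUC is barely above 0.5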

  • Did you type "vsql -w galaxy" beforehand? The error occurs because the command is interpreted as a bash command (bash is like cmd in Windows) when it should be interpreted as a vsql command. When you type "vsql -w galaxy" it starts vsql, and from then on the computer knows you are speaking vsql instead of bash.

  • Ah yes I see. You're completely right. The statement "This variable is continuous, between minus infinity and plus infinity." is false.

  • You're right. There's always a model, every summary of data assumes something about how the data was collected or how it relates to the response. If you don't think clearly about the model (explicit or implicit) that you're using then you can run into errors when you try to make predictions.

  • y = log(Pr(Pass)/(1-Pr(Pass))) can take negative values. The idea is that once you do this transformation you can set y = b0 + b1 * x for some x value. Any line built this way will predict negative values of y for some values of x. All values of y correspond to a probability between zero and one, so everything is good!
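
    You can check this in R; a short sketch (the coefficients here are made up) showing that even very negative values of y map back to valid probabilities:

    b0 <- -2; b1 <- 0.5            # hypothetical coefficients
    x <- seq(-10, 10, by = 0.5)
    y <- b0 + b1 * x               # the linear predictor: any real number

    p <- plogis(y)                 # inverse logit: exp(y) / (1 + exp(y))
    range(p)                       # always strictly between 0 and 1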

  • Correct. "yes/no" is an example of a binary categorical response. Logistic regression handles binary responses, and extensions such as multinomial logistic regression handle categorical responses with more than two levels.

  • That's a view shared by Hadley Wickham, a prominent data scientist. https://peadarcoyle.wordpress.com/2015/08/02/interview-with-a-data-scientist-hadley-wickham/ (See point number 5)

  • That's the idea! We split the table up into a training set and a test set. We estimate the parameters of the model with the training set and we test it on the test set. This is a simple form of cross-validation. The purpose of doing this, rather than just using the entire table to estimate the parameters, is to assess goodness of fit and prevent overfitting.
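
    In R a 70/30 split might look like this (a sketch; 'bank' is a placeholder data frame):

    set.seed(1)
    idx  <- sample(nrow(bank), size = floor(0.7 * nrow(bank)))
    trng <- bank[idx, ]    # estimate the model parameters on this set
    test <- bank[-idx, ]   # assess goodness of fit on this set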

  • Thank you everyone for your comments. The SQL language used by Vertica (vsql) is the industry standard for querying (getting information from) databases. The Vertica Analytics platform makes this fast and adds important machine learning tools to the SQL language.

    In week 1 we saw how to use SQL commands to query a database. In week 2 we see how to use the...

  • The galaxybank data already exists on the VM. You don't need to enable it. http://external-apps.qut.edu.au/futurelearn/resources/hpe/instructions/index.html.utf8 Can you locate the terminal icon?

  • Hi Gulnara, have a look at the next page. If you're still having trouble, have a look at www.sql-tutorial.net/sql-cheat-sheet.pdf.

  • That might not be a great way to split the data - ID is usually related to what order the customers were entered into the database. If you split the training and test data the way you described you will get more of the early entries in the training set. The idea is that the training set should look like the test set.

  • Good catch! Thank you. We'll fix that soon.

  • Yes perhaps we should leave those questions until the second week.

    You're right. Vertica is like an SQL database, but it runs on a cluster rather than on a central database server. There are also extra machine learning tools that are not available in standard SQL.

    https://www.quora.com/What-are-the-main-differences-between-Vertica-and-SQL-syntax

  • Hi Antoniet, I answered your question on the previous page.

  • The instructions are found on the previous page:

    I have changed the table names from the previous page to match this task.

    CREATE TABLE bank_data_trng AS
    SELECT * FROM bank_data
    TABLESAMPLE(70);

    CREATE TABLE bank_data_test AS
    SELECT * FROM bank_data EXCEPT
    SELECT * FROM bank_data_trng;

    :)

  • Welcome to the course :) This course is not about visualization, so there's no need to review how to make graphs or charts. The course is aimed at those who have some experience with databases, SQL or data science.

  • Welcome to the Predictive Analytics online course, a joint effort of the Queensland University of Technology (QUT) and Hewlett Packard Enterprise (HPE). In this course we use the HPE Vertica Analytics platform to teach machine learning techniques for 'big data'.

    My name is Anthony Ebert. I am a mentor within this course and a statistics PhD student at QUT....

  • Have a look at http://www.feynmanlectures.caltech.edu/II_31.html if you feel so inclined.

  • Have a look at these links:
    - https://math.stackexchange.com/questions/1134809/are-there-any-differences-between-tensors-and-multidimensional-arrays
    - https://news.ycombinator.com/item?id=9506774

    The short answer is that they're "the same" but interpreted differently. For instance you can interpret a 2x2 matrix "A" as just a bunch of numbers or you can...

  • Yes, we're testing to see how small changes in the dataset affect the structure of the trees.

  • The same row in the dataset can be represented multiple times in the sample because the sampling is done with replacement.
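
    You can see this directly in R:

    set.seed(1)
    rows <- sample(10, size = 10, replace = TRUE)
    rows          # some row numbers appear more than once
    table(rows)   # ...and some rows don't appear at all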

  • Read Francois' answer

  • Thanks E.S. Dempsey, so you tried install.packages('dplyr')? You need the quote marks for install.packages() but they're optional for library().

  • ggplot2 is based on the "Grammar of graphics". The plot is made up of "layers", ggplot brings in the data, aes specifies what plot elements are related to what variables, geom_bar specifies a bar plot. The "+" combines the plot elements....
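
    Putting those layers together, a minimal sketch using the mpg data that ships with ggplot2:

    library(ggplot2)

    # data | aesthetic mappings | geometry, combined with "+"
    ggplot(mpg, aes(x = class)) +
      geom_bar()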

  • I got the same answer for number of rows, I think the document needs to be changed. Sorry for that!

  • Sorry to hear it wasn't easy. Did you try install.packages("h2o") in RStudio?

  • You shouldn't have to download it from the website; you can install it through RStudio. Try install.packages("h2o")

  • Sorry I don't know the answer to that one. Can you restart RStudio and repeat the steps? Sorry that's the best I can do!

  • The function "%>%" is provided by the dplyr library. Did you run library(dplyr) ?

  • Otherwise you can place the FLbigdataStats folder at C:/FLbigdataStats . Then, within the instructions, run filePath = "C:/FLbigdataStats/bank_customer_data.csv" instead of filePath = "~/FLbigdataStats/bank_customer_data.csv" .

  • You may need to restart RStudio and initialise H2O again.

  • Although the programmers who built Deep Blue can't play chess as well as it can!

  • The aim of PCA is to look at where the major sources of variation are in the explanatory variables; it has nothing to do with the response variable (price). With 11 variables there are sure to be some strong correlations between the explanatory variables, so we can instead use combinations of the variables in the regression.
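
    A sketch of the idea in R ('housing' and 'price' are placeholder names, with price in the first column):

    # Principal components of the explanatory variables only (price excluded)
    pc <- prcomp(housing[, -1], scale. = TRUE)
    summary(pc)   # variance explained by each component

    # Regress the response on the first few components instead of all 11 variables
    fit <- lm(housing$price ~ pc$x[, 1:3])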

  • If you input a person's information into the model it will output a predicted value. You can use the predicted value along with a threshold to decide whether to call someone. The threshold chosen is a tradeoff between your true positive rate (tpr) and your false positive rate (fpr). That's what the ROC curve shows - the curves of good predictors are as close...

  • You have all the files already on your virtual machine

  • You have them already on your virtual machine.
    Run the following from the command terminal in the virtual machine:

    cd ~/Desktop/FLbigdataD2D/example_pig_toilets_abc
    pig -x local abc-toilets-example.pig

  • Sorry I don't quite get the first part of your question. Did you fix the problem?

    For the second part, here's one way. Create a file on the desktop called pigscript.pig

    Type these contents into the file (up to DUMP out;):

    lineoftext = LOAD 'pg200.txt' AS (line);
    words = FOREACH lineoftext GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP...

  • The error mentions wordcount.pig but we don't actually run that in this example, could you tell me what you did?

  • You type it in the terminal on your virtual machine. It should say "[cloudera@quickstart Desktop]$ " or something like that, depending on where you are.

    You then type "pig -x local" to enter the grunt shell in local mode (without the quotation marks).

    The operating system of the Virtual Machine you are using is CentOS which is a Linux distribution. The...

  • In computer-programming speak you have two types of scripts: source and binaries. This is a rough explanation: to be read by a computer, source code (Java, Pig, R, etc.) must be converted into machine code (0010011100....). You can run source code, but the computer will have to translate it while it runs; compiling code turns it into a binary - it runs faster,...

  • Edges = Lines
    Nodes = Circles
    just so everyone is clear

  • Edges are the lines. Nodes are the circles. Mathematically, the nodes can be represented by a set {a,b,c....} and the edges can be represented as pairs of nodes {a,b}, {a,d}, {c,d}, ....
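
    If you'd like to play with this representation, the igraph package in R uses exactly that idea (a sketch, assuming igraph is installed):

    library(igraph)

    # Nodes {a, b, c, d} with edges {a,b}, {a,d}, {c,d}
    g <- graph_from_literal(a - b, a - d, c - d)
    plot(g)   # nodes drawn as circles, edges as lines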

  • Is there not enough room on your hard drive? Oh, I think you mean VirtualBox

  • If you use R there's a package to extract tweets from twitter, it's called twitteR. https://www.r-bloggers.com/getting-started-with-twitter-in-r/

  • The data is distributed across a computing cluster with plenty of redundancy so if some of the computers die, the database lives on.

  • You don't need to navigate to the data directory. It's not a directory in the traditional sense.

    "The files inside HDFS (or more accurately: the blocks that make them up) are stored in a particular directory managed by the DataNode service, but the files will named only with block ids. You cannot interact with HDFS-stored files using ordinary Linux file...

  • The decision is often more complicated than yes/no

  • Yes, the risk of diagnosis is scaled for population size. Of course! The paper is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3039552/

    "The SIR is an estimate of relative risk within each area which compares the observed counts against an expected number of counts, based on the population size."...

  • This is a big question in statistics and in science generally! There are those who believe that it involves a bit of intuition, and there are those who believe that there are methods which rational people should always use. It's called the Problem of Induction, and no one agrees!

  • How big should a dataset be for it to be considered "Big Data"? Not everyone agrees! It's discussed on the next page.

  • "One hundred petabytes (which is equal to 100 million gigabytes) is a very large number indeed – roughly equivalent 700 years of full HD-quality movies. Storing it is a challenge. At CERN, the bulk of the data (about 88 petabytes) is archived on tape using the CERN Advanced Storage system (CASTOR) and the rest (13 petabytes) is stored on the EOS disk pool...

  • Thank you for your feedback.

  • Ahh sorry! 'Columns' in the interface controls the information shown horizontally; 'rows' controls the information shown vertically. Is that what you meant?

  • At the start: the columns in the "yes" pane are smaller than those in the "no" pane. This is because there are more "no" answers than "yes" answers (this can be seen by pulling "Age (bin)" off the Columns bar).

    When the 'compute using Pane' option is selected, the columns in the "yes" pane show the percentages of records that answered "yes".

  • Thanks for your feedback. Try pulling the 'Low' column of the plot to the right of 'Satisfactory'.

  • There should be entries in the Marks pane that you can delete (beneath the squares). This will reverse the changes.

  • Thanks for the feedback, I'll pass that on. There's a text file in the same folder as the bank_customer_data.csv with an explanation of the dataset.

  • French, but the lesson is don't invade Russia. It's cold. https://upload.wikimedia.org/wikipedia/commons/5/5d/Minard_map_of_napoleon.png

  • Consider the Aesthetic and minimalist design heuristic. http://www.gorillatourbooking.com has the same information repeated on the left and right.

  • On the other hand - remember that a petabyte is not a number, it's an amount of information. If 1 rice grain could store 1 byte then it wouldn't matter whether you asked for a petabyte of rice or a petabyte of rice bags. If I wanted 100 GB of storage space it wouldn't matter whether I asked for 100 GB of USB sticks or 100 GB of bags of USB sticks.

  • I see your point Alan. Thank you! I'll pass on your comment.

  • Almost, I don't think elevation is shown anywhere.