Anthony (Ace) Ebert

I'm a statistics PhD student at QUT. I'm interested in Bayesian analysis and queueing theory. I script in R.

Location Brisbane, Australia

Activity

  • Yes, unless you have a statistical model. See "mixture models" https://en.wikipedia.org/wiki/Mixture_model . It's often the case that a machine-learning technique has a statistics counterpart, such as clustering and mixture models, respectively.
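
    Here's a quick illustration of the correspondence, a minimal sketch in R (assuming the mclust package is installed; the simulated data is just for show):

    # Simulate two overlapping groups of points
    set.seed(1)
    x <- c(rnorm(50, mean = 0), rnorm(50, mean = 4))

    # Machine-learning view: k-means clustering
    km <- kmeans(x, centers = 2)

    # Statistics view: a two-component Gaussian mixture model
    library(mclust)
    mm <- Mclust(x, G = 2)

    # Compare the hard cluster labels with the model-based classification
    table(km$cluster, mm$classification)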

  • Yes, the MSE is quite high; this is with "call length" as the response rather than confidence_index.

    Ideally I would start with all variables, compute the AIC (Akaike information criterion) for each model with one variable taken out, and remove the variable which, when removed, minimizes the AIC. I used the p-value instead, which is a bit naughty but...
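
    In R this procedure can be automated with step(); here's a minimal sketch (the data frame name 'bank' and the response 'call_length' are placeholders):

    # Start from the full model with all candidate variables
    full <- lm(call_length ~ ., data = bank)

    # Repeatedly drop the variable whose removal lowers the AIC the most
    reduced <- step(full, direction = "backward")
    summary(reduced)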

  • Hi Andrez! Welcome to the course. Which Python libraries do you use?

  • See 2.17 for my results

  • Thanks everyone for participating, I thought you might like to see my results for call length. I performed backward stepwise linear regression by starting with all the numeric or integer variables and deleting the one with the highest p-value until no p-value was greater than 0.05. This is the model I ended up with:

    SELECT LINEAR_REG('linreg_model11a',...

  • Hi Steven, I agree with you. I'm sorry I didn't see this comment sooner.

  • Yes, that's the correct response. You're right, the table is created in the background for future use so that you can refer to it in subsequent queries (as long as you don't close the session by closing the terminal window). If you want to export the table as a spreadsheet, run the following commands:

    \o output.txt
    SELECT * FROM bank_data;

    Click on dbadmin's...

  • It's an alternative measure to "Between-Cluster Sum of Squares".

  • Hi Myo Min Thein, I had never heard of Monitoring and Evaluation (is that right?) as a process. It's a tricky part of academia, actually; there's a lot of debate about how to measure research output, and whether the number of journal articles and citations is a good measure.

  • Hi Temitope, welcome to the course! You're most welcome. Do you have any experience with databases or SQL?

  • It's a tricky subject. See https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set . Adding more clusters will always increase "Between-Cluster Sum of Squares". So you keep adding clusters until the increase in "Between-Cluster Sum of Squares" seems insignificant. It's called the elbow method in the wiki article.
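
    If you'd like to try it yourself, here's a minimal sketch of the elbow method in R using kmeans() on simulated data:

    set.seed(1)
    x <- matrix(rnorm(200), ncol = 2)

    # Between-cluster sum of squares for k = 1, ..., 10 clusters
    bss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$betweenss)

    # Look for the 'elbow' where the curve flattens out
    plot(1:10, bss, type = "b", xlab = "Number of clusters",
         ylab = "Between-cluster SS")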

  • What do you mean exactly? In discriminant analysis (a related topic in statistics) the model is trained with a training set where the group of each observation is known and then the samples are classified into these groups. For instance Torrence et al (2003) have characteristics of starch grains for known species and then classify sample grains....
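
    For a concrete example, here's a sketch in R using MASS::lda() on the built-in iris data (not the starch-grain data from the paper):

    library(MASS)

    # Train on observations whose group (species) is known
    fit <- lda(Species ~ ., data = iris)

    # Classify samples into those known groups
    samples <- iris[c(1, 51, 101), -5]   # pretend these are unlabelled samples
    predict(fit, newdata = samples)$class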

  • There are two concepts at play here:
    - The performance of a classifier (AUC)
    - The evidence of an effect (p-value)

    It's possible to have strong evidence for a classifier that performs badly, and vice versa. The AUC tells us how useful a classifier is; the p-value tells us whether there is an effect or not. Obviously for a classifier to perform well the effect...
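
    A small simulation in R makes the distinction concrete (a sketch, assuming the pROC package is installed): with a huge sample and a tiny effect, the p-value is highly significant even though the classifier is barely better than guessing.

    set.seed(1)
    n <- 100000
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(0.05 * x))   # a real but tiny effect

    fit <- glm(y ~ x, family = binomial)
    summary(fit)$coefficients["x", "Pr(>|z|)"]   # strong evidence: tiny p-value

    library(pROC)
    auc(roc(y, fitted(fit)))   # but the AUC is barely above 0.5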

  • Did you type "vsql -w galaxy" beforehand? The error occurs because the command is interpreted as a bash command (bash is like cmd in Windows) when it should be interpreted as a vsql command. When you type "vsql -w galaxy" it starts vsql, and from then on the computer knows you are speaking vsql instead of bash.

  • Ah yes I see. You're completely right. The statement "This variable is continuous, between minus infinity and plus infinity." is false.

  • You're right. There's always a model, every summary of data assumes something about how the data was collected or how it relates to the response. If you don't think clearly about the model (explicit or implicit) that you're using then you can run into errors when you try to make predictions.

  • y = log(Pr(Pass)/(1-Pr(Pass))) can take negative values. The idea is that once you do this transformation you can set y = b0 + b1 * x for some x value. Any line built this way will predict negative values of y for some values of x. All values of y correspond to a probability between zero and one, so everything is good!
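
    You can check this in R; a short sketch (the coefficients here are made up) showing that even very negative values of y map back to valid probabilities:

    b0 <- -2; b1 <- 0.5            # hypothetical coefficients
    x <- seq(-10, 10, by = 0.5)
    y <- b0 + b1 * x               # the linear predictor: any real number

    p <- plogis(y)                 # inverse logit: exp(y) / (1 + exp(y))
    range(p)                       # always strictly between 0 and 1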

  • Correct. "yes/no" is an example of a binary categorical response. Logistic regression handles binary responses, and extensions such as multinomial logistic regression handle categorical responses with more than two levels.

  • That's a view shared by Hadley Wickham, a prominent data scientist. https://peadarcoyle.wordpress.com/2015/08/02/interview-with-a-data-scientist-hadley-wickham/ (See point number 5)

  • That's the idea! We split the table up into a training set and a test set. We estimate the parameters of the model with the training set and we test it on the test set. This is a simple form of cross-validation. The purpose of doing this, rather than just using the entire table to estimate the parameters, is to assess goodness of fit and prevent overfitting.
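
    In R a 70/30 split might look like this (a sketch; 'bank' is a placeholder data frame):

    set.seed(1)
    idx  <- sample(nrow(bank), size = floor(0.7 * nrow(bank)))
    trng <- bank[idx, ]    # estimate the model parameters on this set
    test <- bank[-idx, ]   # assess goodness of fit on this set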

  • Thank you everyone for your comments. The SQL language used by Vertica (vsql) is the industry standard for querying (getting information from) databases. The Vertica Analytics platform makes this fast and adds important machine learning tools to the SQL language.

    In week 1 we saw how to use SQL commands to query a database. In week 2 we see how to use the...

  • The galaxybank data already exists on the VM. You don't need to enable it. http://external-apps.qut.edu.au/futurelearn/resources/hpe/instructions/index.html.utf8 Can you locate the terminal icon?

  • Hi Gulnara, have a look at the next page. If you're still having trouble, have a look at www.sql-tutorial.net/sql-cheat-sheet.pdf.

  • That might not be a great way to split the data - ID is usually related to what order the customers were entered into the database. If you split the training and test data the way you described you will get more of the early entries in the training set. The idea is that the training set should look like the test set.

  • Good catch! Thank you. We'll fix that soon.

  • Yes perhaps we should leave those questions until the second week.

    You're right. Vertica is like an SQL database, but it runs on a cluster rather than on a central database server. There are also extra machine learning tools that are not available in standard SQL.

    https://www.quora.com/What-are-the-main-differences-between-Vertica-and-SQL-syntax

  • Hi Antoniet, I answered your question on the previous page.

  • The instructions are found on the previous page:

    I have changed the table names from the previous page to match this task.

    CREATE TABLE bank_data_trng AS
    SELECT * FROM bank_data
    TABLESAMPLE(70);

    CREATE TABLE bank_data_test AS
    SELECT * FROM bank_data EXCEPT
    SELECT * FROM bank_data_trng;

    :)

  • Welcome to the course :) This course is not about visualization, so there's no need to review how to make graphs or charts. The course is aimed at those who have some experience with databases, SQL or data science.

  • Welcome to the Predictive Analytics online course, a joint effort of the Queensland University of Technology (QUT) and Hewlett Packard Enterprise (HPE). In this course we use the HPE Vertica Analytics platform to teach machine learning techniques for 'big data'.

    My name is Anthony Ebert. I am a mentor within this course and a statistics PhD student at QUT....

  • Have a look at http://www.feynmanlectures.caltech.edu/II_31.html if you feel so inclined.

  • Have a look at these links:
    - https://math.stackexchange.com/questions/1134809/are-there-any-differences-between-tensors-and-multidimensional-arrays
    - https://news.ycombinator.com/item?id=9506774

    The short answer is that they're "the same" but interpreted differently. For instance you can interpret a 2x2 matrix "A" as just a bunch of numbers or you can...

  • Yes, we're testing to see how small changes in the dataset affect the structure of the trees.

  • The same row in the dataset can be represented multiple times in the sample because the sampling is done with replacement.
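
    You can see this directly in R:

    set.seed(1)
    rows <- sample(10, size = 10, replace = TRUE)
    rows          # some row numbers appear more than once
    table(rows)   # ...and some rows don't appear at all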

  • Read Francois' answer

  • Thanks E.S. Dempsey, so you tried install.packages('dplyr')? You need the quote marks for install.packages() but they're optional for library().

  • ggplot2 is based on the "Grammar of graphics". The plot is made up of "layers", ggplot brings in the data, aes specifies what plot elements are related to what variables, geom_bar specifies a bar plot. The "+" combines the plot elements....
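
    Putting those layers together, a minimal sketch using the mpg data that ships with ggplot2:

    library(ggplot2)

    # data | aesthetic mappings | geometry, combined with "+"
    ggplot(mpg, aes(x = class)) +
      geom_bar()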

  • I got the same answer for number of rows, I think the document needs to be changed. Sorry for that!

  • Sorry to hear it wasn't easy. Did you try install.packages("h2o") in RStudio?

  • You shouldn't have to download it from the website; you can install it through RStudio. Try install.packages("h2o")

  • Sorry I don't know the answer to that one. Can you restart RStudio and repeat the steps? Sorry that's the best I can do!

  • The function "%>%" is provided by the dplyr library. Did you run library(dplyr) ?

  • Otherwise you can place the FLbigdataStats folder at C:/FLbigdataStats . Then, within the instructions, run filePath = "C:/FLbigdataStats/bank_customer_data.csv" instead of filePath = "~/FLbigdataStats/bank_customer_data.csv" .

  • You may need to restart RStudio and initialise H2O again.

  • Although the programmers who built Deep Blue can't play chess as well as it can!

  • The aim of PCA is to look at where the major sources of variation are in the explanatory variables; it has nothing to do with the response variable (price). With 11 variables there are sure to be some strong correlations between the explanatory variables, so we can instead use combinations of the variables in the regression.
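
    A sketch of the idea in R ('housing' and 'price' are placeholder names, with price in the first column):

    # Principal components of the explanatory variables only (price excluded)
    pc <- prcomp(housing[, -1], scale. = TRUE)
    summary(pc)   # variance explained by each component

    # Regress the response on the first few components instead of all 11 variables
    fit <- lm(housing$price ~ pc$x[, 1:3])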

  • If you input a person's information into the model it will output a predicted value. You can use the predicted value along with a threshold to decide whether to call someone. The threshold chosen is a tradeoff between your true positive rate (tpr) and your false positive rate (fpr). That's what the ROC curve shows - the curves of good predictors are as close...

  • You have all the files already on your virtual machine

  • You have them already on your virtual machine.
    Run the following from the command terminal in the virtual machine:

    cd ~/Desktop/FLbigdataD2D/example_pig_toilets_abc
    pig -x local abc-toilets-example.pig

  • Sorry I don't quite get the first part of your question. Did you fix the problem?

    For the second part, here's one way. Create a file on the desktop called pigscript.pig

    Type these contents into the file (up to DUMP out;):

    lineoftext = LOAD 'pg200.txt' AS (line);
    words = FOREACH lineoftext GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP...

  • The error mentions wordcount.pig but we don't actually run that in this example, could you tell me what you did?

  • You type it in the terminal on your virtual machine. It should say "[cloudera@quickstart Desktop]$ " or something like that, depending on where you are.

    You then type "pig -x local" to enter the grunt shell in local mode (without the quotation marks).

    The operating system of the Virtual Machine you are using is CentOS which is a Linux distribution. The...

  • In computer-programming speak you have two types of scripts: source and binaries. This is a rough explanation: to be read by a computer, source code (Java, Pig, R, etc.) must be converted into machine code (0010011100....). You can run source code, but the computer will have to translate it while it runs; compiling code turns it into a binary - it runs faster,...

  • Edges = Lines
    Nodes = Circles
    just so everyone is clear

  • Edges are the lines. Nodes are the circles. Mathematically, the nodes can be represented by a set {a,b,c....} and the edges can be represented as pairs of nodes {a,b}, {a,d}, {c,d}, ....
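
    If you'd like to play with this representation, the igraph package in R uses exactly that idea (a sketch, assuming igraph is installed):

    library(igraph)

    # Nodes {a, b, c, d} with edges {a,b}, {a,d}, {c,d}
    g <- graph_from_literal(a - b, a - d, c - d)
    plot(g)   # nodes drawn as circles, edges as lines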

  • Is there not enough room on your hard drive? Oh, I think you mean VirtualBox

  • If you use R there's a package to extract tweets from twitter, it's called twitteR. https://www.r-bloggers.com/getting-started-with-twitter-in-r/

  • The data is distributed across a computing cluster with plenty of redundancy so if some of the computers die, the database lives on.

  • You don't need to navigate to the data directory. It's not a directory in the traditional sense.

    "The files inside HDFS (or more accurately: the blocks that make them up) are stored in a particular directory managed by the DataNode service, but the files will named only with block ids. You cannot interact with HDFS-stored files using ordinary Linux file...

  • The decision is often more complicated than yes/no

  • Yes, the risk of diagnosis is scaled for population size. Of course! The paper is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3039552/

    "The SIR is an estimate of relative risk within each area which compares the observed counts against an expected number of counts, based on the population size."...

  • This is a big question in statistics and in science generally! There are those who believe that it involves a bit of intuition, and there are those who believe that there are methods which rational people should always use. It's called the Problem of Induction, and no one agrees!

  • How big should a dataset be for it to be considered "Big Data"? Not everyone agrees! It's discussed on the next page.

  • "One hundred petabytes (which is equal to 100 million gigabytes) is a very large number indeed – roughly equivalent 700 years of full HD-quality movies. Storing it is a challenge. At CERN, the bulk of the data (about 88 petabytes) is archived on tape using the CERN Advanced Storage system (CASTOR) and the rest (13 petabytes) is stored on the EOS disk pool...

  • Thank you for your feedback.

  • Ahh sorry! 'Columns' in the interface controls the information shown horizontally; 'rows' controls the information shown vertically. Is that what you meant?

  • At the start: the columns in the "yes" pane are smaller than those in the "no" pane. This is because there are more "no" answers than "yes" answers (this can be seen by pulling "Age (bin)" off the Columns bar).

    When the 'compute using Pane' option is selected, the columns in the "yes" pane show the percentages of records that answered "yes".

  • Thanks for your feedback. Try pulling the 'Low' column of the plot to the right of 'Satisfactory'.

  • There should be entries in the Marks pane that you can delete (beneath the squares). This will reverse the changes.

  • Thanks for the feedback, I'll pass that on. There's a text file in the same folder as the bank_customer_data.csv with an explanation of the dataset.

  • French, but the lesson is don't invade Russia. It's cold. https://upload.wikimedia.org/wikipedia/commons/5/5d/Minard_map_of_napoleon.png

  • Consider the Aesthetic and minimalist design heuristic. http://www.gorillatourbooking.com has the same information repeated on the left and right.

  • On the other hand - remember that a petabyte is not a number, it's an amount of information. If 1 rice grain could store 1 byte then it wouldn't matter whether you asked for a petabyte of rice or a petabyte of rice bags. If I wanted 100 GB of storage space it wouldn't matter whether I asked for 100 GB of USB sticks or 100 GB of bags of USB sticks.

  • I see your point Alan. Thank you! I'll pass on your comment.

  • Almost, I don't think elevation is shown anywhere.