Skip to 0 minutes and 11 seconds Today we’re going to look a bit more at the extensive plotting functionalities in R. More specifically, we’ll look at the ggplot2 package for R and what kinds of plots we can generate with this package. OK! Let’s get started. To save some time, I’ve loaded the iris data into the Explorer already. Now we go to the R console to issue our commands in the R language. The first thing we need to do is install the ggplot2 package, which is the plotting package that
Skip to 0 minutes and 44 seconds we want to use: install.packages("ggplot2"). OK, it’s finished. Now that we’ve downloaded and installed the package, we can load it into the R environment by using the library function. We use library(ggplot2). Now that the library is loaded, we can use it to plot some data. In ggplot2 we construct a plot in layers. We can add several different layers to construct very complex plots, but there’s one layer that is always present in every plot. That is the data layer, which specifies the data that needs to be plotted. The data layer is specified using the ggplot function. With the ggplot function, we specify the data we want to use. In this case, the data is
Skip to 1 minute and 51 seconds referred to using the “rdata” variable: rdata is the name of the variable that refers to the data that we’ve loaded into the Preprocess panel. Then we also need to say which attributes we want to use. This is done using the aesthetics function, the aes function. For the second argument, use the result returned by the aesthetics function. Say x = petallength to specify the petallength attribute as the attribute we want to plot. In this case, you’re just generating a plot based on a single attribute in the data. This is now the data layer for our plot. We also need to add a geometry layer, which actually specifies what type of plot we want to generate.
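The data layer described here can be sketched as follows. This is a minimal sketch, not the exact session from the video: it assumes R's built-in iris data frame as a stand-in for the rdata variable that Weka provides, with the columns renamed to the attribute names used in the video.

```r
# install.packages("ggplot2")  # run once if the package is not installed yet
library(ggplot2)

# Assumption: stand-in for Weka's rdata, built from R's own iris data
# and renamed to the attribute names mentioned in the video.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Data layer only: the data frame, plus an aesthetic mapping of
# the petallength attribute to the x-axis.
p <- ggplot(rdata, aes(x = petallength))
```

On its own, this object carries no geometry, so printing it would show only an empty set of axes.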
Skip to 2 minutes and 38 seconds Let’s say we want to generate a kernel density estimate based on this attribute that we have selected. Then we add another layer to our plot using the + operator. We call the geometry function for density estimates, geom_density(). OK, let’s try this. Right. Now we have a kernel density estimate for the petallength attribute. On the x-axis we have the value of the petallength attribute, and on the y-axis we have the density estimate. You can see that there are two peaks in this density estimate, but you can also see that the plot is not wide enough to cover the entire area that is relevant.
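Adding the geometry layer with the + operator might look like the sketch below. As before, rdata is an assumed stand-in built from R's built-in iris data rather than the data frame Weka passes to the R console.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata (see the renamed columns).
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Data layer plus a geometry layer that draws a kernel density estimate.
p <- ggplot(rdata, aes(x = petallength)) +
  geom_density()

print(p)  # draw the plot on the current graphics device
```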
Skip to 3 minutes and 19 seconds We should increase the limits of the plot, and we can do that by adding a call to the xlim function, where we specify the lower limit and the upper limit. Let’s say we use 0 as the lower limit and 8 as the upper limit. That looks better, but perhaps this kernel density estimate is still a little bit too smooth. It doesn’t show enough detail in the data, because the kernels that are used are too wide. Let’s reduce the width of each kernel. We can do that by specifying the adjust argument for the geom_density function. This multiplies the width of each kernel by the given parameter. Let’s say we halve the width of each kernel estimator.
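The two adjustments just described, wider axis limits and narrower kernels, can be combined in one call. This is a sketch under the same assumption as before: rdata is rebuilt here from R's built-in iris data.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

p <- ggplot(rdata, aes(x = petallength)) +
  geom_density(adjust = 0.5) +  # halve the default kernel bandwidth
  xlim(0, 8)                    # lower and upper limit of the x-axis

print(p)
```

Values of adjust below 1 give a more detailed, less smooth curve; values above 1 smooth it further.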
Skip to 4 minutes and 9 seconds Now we get a plot showing a little bit more detail. In Weka, we primarily deal with classification problems. So, really, we should try to take the class information into account in our plot. We can do that by generating three different plots, one for each class value, and combine them into one graph. How do we do that? It’s very simple. We just add another argument to the call of the aesthetics function. Just say the color is given by the “class” attribute in rdata. “Class” is the name of the class attribute in the iris data. We just say that the color is based on the class attribute. Now we get a separate kernel density estimate for each of the three classes.
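Mapping the color to the class attribute could be written as in the following sketch, again using an assumed stand-in for rdata built from R's iris data.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata; "class" holds the three iris species.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Mapping color to class makes ggplot2 compute one density curve per class.
p <- ggplot(rdata, aes(x = petallength, color = class)) +
  geom_density(adjust = 0.5) +
  xlim(0, 8)

print(p)
```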
Skip to 4 minutes and 59 seconds You can see that the distributions for iris_versicolor and iris_virginica overlap a little bit, but iris_setosa is nicely separated. We may want to enhance this plot by filling the area under each estimate. This is also easy. It’s again done by providing an additional argument to the aesthetics function. You just say the fill color should also be based on the class attribute. You can see that there is a little bit of a problem here. We can’t really differentiate the iris_versicolor and the iris_virginica cases. We should introduce some transparency in our plot. We can do that by providing an “alpha” value for our kernel density estimators.
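A sketch of the filled, semi-transparent version follows. It assumes the same stand-in for rdata as above, and uses an alpha of 0.5, the value chosen in the video.

```r
library(ggplot2)

# Assumption: stand-in for Weka's rdata.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# fill = class colors the area under each curve; alpha is set on the
# geometry (not inside aes) because it is a fixed value, not a mapping.
p <- ggplot(rdata, aes(x = petallength, color = class, fill = class)) +
  geom_density(adjust = 0.5, alpha = 0.5) +
  xlim(0, 8)

print(p)
```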
Skip to 5 minutes and 56 seconds This is a value between 0 and 1 that determines the amount of transparency: 1 means no transparency; 0 means totally transparent. Let’s set this to 0.5. Now we have a nice plot of the three kernel density estimates. Let’s say we want to generate the same kind of plot, but for all four attributes in the iris data, not just the petallength attribute. We can also do that, but we need to massage our data a little bit to achieve that. We need to load a library
Skip to 6 minutes and 31 seconds called “reshape2”: library(reshape2). Then, we can call the so-called “melt”
Skip to 6 minutes and 44 seconds function to transform our data into an appropriate format: melt(rdata). The new data, the new format, will be stored in ndata. Let’s just have a look at what this data looks like. We can just type in “ndata”, and it will show us the data. You can see that we have 600 instances in the transformed dataset. There are three attributes in the dataset. The class value is given as the value of the first attribute. The name of the attribute is given as the second attribute, and the attribute value is given as the third attribute.
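The reshaping step can be sketched as below. It assumes the reshape2 package is installed and uses a stand-in for rdata built from R's iris data. The id.vars argument is spelled out here for clarity; with this data, calling melt(rdata) without it should pick the same identifier column, because class is the only non-numeric attribute.

```r
library(reshape2)

# Assumption: stand-in for Weka's rdata.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))

# Go from wide format (one column per attribute) to long format
# (one row per instance/attribute pair), keeping class as identifier.
ndata <- melt(rdata, id.vars = "class")

print(head(ndata))  # columns: class, variable, value
print(nrow(ndata))  # 4 attributes x 150 instances
```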
Skip to 7 minutes and 31 seconds Scrolling all the way up to the first instance, we can see the first attribute now is called “class”, the second attribute is called “variable”, and the last attribute is called “value”. We have 600 instances because there are 4 attributes and 150 instances in the original dataset. We now have a separate dataset for each of the attributes. First we have all of the attribute values for the 150 iris flowers for sepallength. Then we have all the 150 iris flowers for sepalwidth. Then we have petallength, and finally we have petalwidth. Now that we have the data in this format we can use the “variable” attribute as a way to generate different plots for each attribute. How do we do that?
Skip to 8 minutes and 23 seconds It’s quite simple. Our X value is now based on the “value” attribute in this transformed data. That is the actual numeric value for each of the attributes. The color is still based on the class, and, at the end, we now use the facet_grid function to generate a grid of facets, where facets are subplots. Here, as arguments for the facet_grid function, we need to specify which attribute should be used for the X dimension of the grid, and which attribute should be used for the Y dimension of the grid. In this case, we only have one meaningful dimension. Let’s say we want to use “variable” as the variable determining the X dimension.
Skip to 9 minutes and 16 seconds Then we use the tilde character to separate the X dimension and the Y dimension. In this case, we don’t have a variable for the Y dimension of the grid, so we just use a full stop. This means there will be just one column in the grid. I forgot to change the name of the data. We want to plot ndata, not rdata. Now you can see that we have a different plot for each attribute. In the first facet, the first row in this case, we have the sepallength. In the second row we have the sepalwidth. In the third row we have the petallength, and in the fourth row we have the petalwidth.
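The faceted plot with one row per attribute might be written as in this sketch. It assumes ggplot2 and reshape2 are installed and rebuilds rdata and ndata from R's iris data as stand-ins for the objects in the video.

```r
library(ggplot2)
library(reshape2)

# Assumption: stand-ins for Weka's rdata and the melted ndata.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

# In facet_grid's formula, the left-hand side of the tilde defines the
# rows and the right-hand side the columns; "." means "no variable".
p <- ggplot(ndata, aes(x = value, color = class)) +
  geom_density() +
  facet_grid(variable ~ .)

print(p)
```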
Skip to 10 minutes and 5 seconds We can also use columns instead of rows simply by swapping the order of the arguments here. We can use a dot on the left-hand side of the tilde and “variable” on the right-hand side. Now we have the kernel density estimates arranged side by side, one column per attribute. Now that we have generated a nice-looking plot, we may want to save it as a PDF file. We can do that quite easily, as well. We just need to redirect the output of the plot. We do that by using the pdf function, and we specify the file name, let’s say /Users/Eibe/Documents/test.pdf. Then we simply call the plotting function again.
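Swapping the facet formula and writing the result to a PDF file could look like the sketch below. The output path is a hypothetical temporary file chosen for illustration, not the path used in the video, and rdata and ndata are again stand-ins built from R's iris data.

```r
library(ggplot2)
library(reshape2)

# Assumption: stand-ins for Weka's rdata and the melted ndata.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

# Dot on the left, "variable" on the right: one column per attribute.
p <- ggplot(ndata, aes(x = value, color = class)) +
  geom_density() +
  facet_grid(. ~ variable)

# Hypothetical output location; replace with any writable path.
out_file <- file.path(tempdir(), "test.pdf")

pdf(out_file)         # open a PDF graphics device
print(p)              # draw the plot into the file
invisible(dev.off())  # close the device so the file is written
```

Note that R is case-sensitive, so the function is pdf, in lower case. Closing the device with dev.off() is what completes the file and returns output to the previous device.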
Skip to 10 minutes and 56 seconds Now it’s actually printing the plot into the PDF file, and to redirect our plot to the window again we just call the dev.off() function. There are many other types of plots that we can generate with ggplot2. We can generate scatter plots, two-dimensional kernel density estimate plots, and many other plots. One very useful type of plot that we cannot generate with Weka’s own graphical user interfaces is a box plot. So let’s generate a box plot for the iris data for each attribute individually, using facet grids. First, we need to specify the data layer again using the ggplot function. Let’s use the ndata that I’ve already prepared.
Skip to 11 minutes and 50 seconds And then, we use the aesthetics function to specify what exactly we want to plot. We want to plot the value on the y-axis in a box plot, and we want to use the class to distinguish different box plots on the x-axis. We also want the color to be based on the class. Now, we use the geom_boxplot function to generate box plots, and we use the facet_grid function again to generate the grid of plots. In this case, let’s use “variable” to determine the columns. As you can see here, we have a really nice set of box plots. First, we have the box plot for sepallength, then for sepalwidth, then for petallength and for petalwidth.
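The box plot grid described here can be sketched as follows, assuming ggplot2 and reshape2 are installed and using stand-ins for rdata and ndata built from R's iris data.

```r
library(ggplot2)
library(reshape2)

# Assumption: stand-ins for Weka's rdata and the melted ndata.
rdata <- setNames(iris, c("sepallength", "sepalwidth",
                          "petallength", "petalwidth", "class"))
ndata <- melt(rdata, id.vars = "class")

# One box per class on the x-axis, attribute values on the y-axis,
# and one facet column per original attribute.
p <- ggplot(ndata, aes(x = class, y = value, color = class)) +
  geom_boxplot() +
  facet_grid(. ~ variable)

print(p)
```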
Skip to 12 minutes and 56 seconds So we have generated a fairly complex plot here. You can generate many more types of plots using ggplot2. Hopefully, this has given you a taster.
Using R to plot data
This video demonstrates an R package called ggplot2 that provides extensive plotting capabilities, which can be accessed from Weka. Detailed instructions are given in the accompanying download (these slides do not appear in the video itself).
© University of Waikato, New Zealand. Creative Commons Attribution 4.0 International License.