Tutorial: Positive Matrix Factorization (PMF)

A tutorial on positive matrix factorization (PMF). Dr Firoz will explain more.
Hello everyone. Here I want to demonstrate the EPA PMF procedure. The software comes from the United States Environmental Protection Agency, and its name is the Positive Matrix Factorization Model. This is the interface of the EPA PMF software, and it runs on Windows. Unfortunately, it doesn’t work on a MacBook, but it works on Windows 7 through Windows 10. To run EPA PMF, we need to follow the whole sequence step by step. To begin, click on ‘Model data’ and then ‘Data file’, and you will come to this screen. Here you need to load your concentration data file. You can load it from your own folder.
I loaded the example data provided with the EPA PMF software. In addition to the concentration file, EPA PMF needs another file, called the ‘Uncertainty data file’. We need to prepare this uncertainty data file from our concentration data file. I suggest that you take a look at the EPA PMF user guide (a PDF file), which gives suggestions on how to create uncertainty data files. For the uncertainty data, I have already linked the example uncertainty data file here.
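As a rough illustration of how an uncertainty file can be prepared from concentrations, here is a minimal Python sketch of the equation-based approach described in the EPA PMF user guide: values at or below the method detection limit (MDL) get 5/6 of the MDL, and values above it combine an error fraction with the MDL. The function name, the MDL of 0.05 and the 10% error fraction are assumptions made for this example, not values from the tutorial data.

```python
import numpy as np

def equation_based_uncertainty(conc, mdl, error_fraction):
    """Sketch of the equation-based uncertainty from the EPA PMF user guide.

    conc           : measured concentrations
    mdl            : method detection limit (same units as conc)
    error_fraction : relative analytical error, e.g. 0.10 for 10 %
    """
    conc = np.asarray(conc, dtype=float)
    mdl = np.broadcast_to(np.asarray(mdl, dtype=float), conc.shape)
    err = np.broadcast_to(np.asarray(error_fraction, dtype=float), conc.shape)

    # Concentration <= MDL: fixed fraction of the detection limit.
    below = (5.0 / 6.0) * mdl
    # Concentration > MDL: error fraction and MDL combined in quadrature.
    above = np.sqrt((err * conc) ** 2 + (0.5 * mdl) ** 2)
    return np.where(conc <= mdl, below, above)

# Hypothetical concentrations for one species (illustrative values only).
conc = np.array([0.02, 0.50, 1.20])
unc = equation_based_uncertainty(conc, mdl=0.05, error_fraction=0.10)
print(unc)
```

The resulting array has the same shape as the concentrations, so it can be written out as a matching uncertainty table with the same rows and columns.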
Next, you need to specify the output folder, where the result files will be kept after you run EPA PMF. You can create your own folder and name it ‘new’. Maybe I can show you how to create a new folder.
When we run a calculation, EPA PMF creates a number of output files. To name those files, you need to enter a prefix, which is added to the name of every file that EPA PMF creates. Here, for example, I just put ‘test’. Next, you can choose the output file format you want. There are a number of options; I choose ‘Comma Delimited Text’. Then, at the bottom, you need to specify your configuration file. This configuration file is very important, because all the settings of your calculation are stored in it.
So the next time you want to redo your calculation, you can just load your configuration file. My configuration file here is test.cfg. Now, let’s proceed. If we click here (Base Model), you can see how to proceed with our EPA PMF calculation. At the bottom, you can see the concentration data file, and we already have an uncertainty data file. Now, to proceed with the PMF calculation, we need to run the model. How many runs do you want? I simply put 20; you can choose even more than 20. Then comes the number of factors, which is the most important setting. How many factors do you want to resolve from your data?
You need to go through some trial and error to decide which number of factors is most appropriate for your data set. For my data set here, I will arbitrarily put 5, just to show you. Then click on the Run button.
It is now in progress. Because the algorithm is more powerful than those in other receptor modelling tools, it sometimes takes more time. For this 5-factor solution, results will come for all 20 runs. For each of the 20 runs there is a value called the Q value. Q is the objective function that PMF minimises: the sum of the squared residuals, each scaled by its uncertainty. It is one of the most helpful diagnostics in PMF, because it tells us how much error remains in our calculation. Here you can see the Q value in robust mode, and we can see the Q values for all the other runs as well.
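To make the Q value more concrete, here is a minimal Python sketch of how it could be computed outside the software. It assumes the usual PMF notation (X for concentrations, U for uncertainties, G for factor contributions, F for factor profiles) and follows the user guide's description that Q (robust) leaves out points whose uncertainty-scaled residual is greater than 4. The function name and the small matrices are made up for illustration.

```python
import numpy as np

def q_values(X, U, G, F, cutoff=4.0):
    """Return (Q_true, Q_robust) for a fitted factorisation X ~ G @ F.

    X : samples x species concentrations
    U : samples x species uncertainties
    G : samples x factors contributions
    F : factors x species profiles
    """
    X, U, G, F = (np.asarray(a, dtype=float) for a in (X, U, G, F))
    scaled = (X - G @ F) / U               # uncertainty-scaled residuals
    q_true = float(np.sum(scaled ** 2))    # all points included
    keep = np.abs(scaled) <= cutoff        # drop poorly fitted points
    q_robust = float(np.sum(scaled[keep] ** 2))
    return q_true, q_robust

# Hypothetical 2-sample, 2-species, 1-factor example.
G = np.array([[1.0], [2.0]])
F = np.array([[1.0, 2.0]])
X = np.array([[1.1, 2.0], [2.0, 5.0]])
U = np.array([[0.1, 0.1], [0.1, 0.2]])

q_true, q_robust = q_values(X, U, G, F)
print(q_true, q_robust)
```

In this toy case one point has a scaled residual of 5, so it counts towards Q (true) but not Q (robust). A large gap between the two values suggests that a few poorly fitted points dominate the fit.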
The software automatically selected run 2 as the most suitable one, because its Q (robust) and Q (true) values are very close; the difference between them is smaller than in the other runs. So we choose the second run for further calculation. Then, if we want to examine the error in our calculation, we need to proceed with the second and third steps here: error estimation and base model bootstrap estimation. But, as I told you, that will take more time, because the number of bootstrap runs we chose here is 100. Sometimes we need more than 100, so it takes even longer to process.
So, before I proceed with error estimation, I want to show you the results from our initial EPA PMF calculation. You can see the results here. The factor profiles are the most important. This is the factor profile for run 2, factor 1. Along the bottom you can see the variables, from cadmium (Cd) to mass concentration. The left y-axis shows the concentration of each variable, and the right y-axis shows the percentage of each variable in factor 1. We then have to decide which source factor 1 represents. It all depends on which variables contribute the most.
That is, the most predominant ones. Here we can see that lead is predominant in factor 1: lead (Pb) and cadmium (Cd). If we proceed to factor 2 and factor 3, we can see their profiles as well. Each profile from EPA PMF helps us to explain which kind of source the factor represents. The mass concentration of PM2.5 comes from many sources, and each factor represents one of those sources. This is how, for all of the factors (factor 1 to factor 5), we can look at the profile and explain which source each one represents. If you want to put factor 1 to factor 5 in one plot, you can do that too.
Just right-click here and select stacked graphs, and you will have factor 1 to factor 5 in one plot. You can use this plot in your thesis and research. To save it, just right-click and save it as JPEG, PDF and so on. Let’s proceed now. How about our EPA PMF calculation: is it acceptable? How can we make sure? If you click on the ‘Obs/Predict Scatter Plot’, you can see all the variables here, from cadmium to mass concentration. We can look at them one by one. Take cadmium, for example. We have the cadmium concentration from our experiment and the cadmium concentration that EPA PMF predicted, and we can make a correlation plot of the two.
What does the correlation plot look like? The correlation plot tells us whether the modelling is acceptable and whether it is appropriate for us to go further. For each variable, the R-squared value helps us to judge this. The R-squared value here for cadmium is very low (0.23). That is a very poor correlation between our experimental concentration and the modelled concentration from EPA PMF. Next, let’s look at copper. For copper, the correlation value between the experimental concentration and the EPA PMF predicted concentration is 0.99. That tells us that, for copper, the EPA PMF prediction is strongly correlated with the measurements. So, for copper, the model is very appropriate.
How about lead? It also shows a high R-squared value, 0.97. For many of the variables the correlation is much stronger, but not for all. Since not all the variables have a strong correlation, what do we need to do? We need to use trial and error, increasing or decreasing the number of factors over a number of attempts. Before I go on to trial and error, I want to show you the simplest way to proceed. So now you understand how to check whether your calculation is appropriate or whether it needs more trial and error.
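The observed versus predicted check above can also be reproduced outside the software. The following is a minimal Python sketch, assuming the R-squared value is the squared Pearson correlation between observed and predicted concentrations (which is what a simple linear regression reports). The species values and the 0.5 cut-off for flagging a poorly modelled species are hypothetical, chosen only for the example.

```python
import numpy as np

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(observed, predicted)[0, 1]
    return float(r ** 2)

def poorly_modelled(observed_by_species, predicted_by_species, threshold=0.5):
    """List species whose R-squared falls below a user-chosen threshold."""
    flagged = []
    for name, obs in observed_by_species.items():
        if r_squared(obs, predicted_by_species[name]) < threshold:
            flagged.append(name)
    return flagged

# Hypothetical observed and predicted concentrations (not the tutorial data).
observed = {
    "Cu": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Cd": [0.2, 0.1, 0.4, 0.3, 0.2],
}
predicted = {
    "Cu": [1.1, 2.1, 2.9, 4.2, 4.9],
    "Cd": [0.2, 0.3, 0.2, 0.2, 0.3],
}

for species in observed:
    print(species, round(r_squared(observed[species], predicted[species]), 2))
flagged = poorly_modelled(observed, predicted)
print("Needs attention:", flagged)
```

If many species end up flagged, that is a hint to rerun the base model with a different number of factors and compare again.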
Next, you may want to know how factor 1 to factor 5 contribute. Each of the factors represents one of the sources, so how much does each source contribute? We can see that here, under ‘factor contribution’. Just click here and then, say for run 2, you can see the time series of the contributions for factor 1 to factor 5. You can view them by variable or by factor. Here, in red, you can see that 79% of the mass comes from factor 1.
Factor 1 represents a particular source, and in exactly the same way we can see from this pie chart what percentage factor 2 and factor 3 contribute. You can also present the data as this time series. So this is how the steps in EPA PMF go from beginning to end. There is also a button named ‘diagnostics’. If you click on the ‘diagnostics’ button, you open a file that is already saved in the folder you created. In the diagnostics file you will find all the information from EPA PMF, including all the results you obtained. Here you can see your profiles and everything else; just scroll down.
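The percentages in the pie chart can be understood as each factor's share of the total modelled mass. Here is a minimal Python sketch of that arithmetic, assuming a contribution matrix G (samples by factors) and the mass concentration entry of each factor profile. The function name and the numbers are hypothetical and do not come from the tutorial's example data.

```python
import numpy as np

def factor_mass_shares(G, mass_profile):
    """Per-sample mass from each factor and each factor's share in percent.

    G            : samples x factors contribution matrix
    mass_profile : mass concentration of each factor profile (one per factor)
    """
    G = np.asarray(G, dtype=float)
    mass_profile = np.asarray(mass_profile, dtype=float)
    mass_by_factor = G * mass_profile      # time series of mass per factor
    totals = mass_by_factor.sum(axis=0)    # total mass attributed to each factor
    shares = 100.0 * totals / totals.sum()
    return mass_by_factor, shares

# Hypothetical two-sample, two-factor example.
G = np.array([[0.5, 1.5],
              [1.5, 0.5]])
mass_profile = np.array([8.0, 2.0])

series, shares = factor_mass_shares(G, mass_profile)
print(series)   # mass contribution of each factor in each sample
print(shares)   # percentage of total mass per factor
```

The columns of the first output correspond to the time series plot, and the second output corresponds to the slices of the pie chart.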
I hope you now have at least a basic understanding of how EPA PMF works. If you want a better or more appropriate calculation, what do you need to do? As I have said, you need to go back and change the number of factors, either increasing or decreasing it, and check the result by clicking on ‘obs/predicted’ and looking at the R-squared values. How is the R-squared value for each of the variables? Is the regression well correlated or poorly correlated? If most of the variables are well correlated, I would say that your result is more accurate than it would be with a different number of factors. So, thank you everyone.
Good luck with your EPA PMF calculations.

The positive matrix factorization (PMF) model is a robust chemometric model. PMF is the most widely used method for source resolution in chemometrics, particularly for the apportionment of airborne particulate matter. The source apportionment results that PMF produces for environmental variables are reliable, with very low uncertainty.

In theory, PMF resolves the data into non-negative distributions (factors). It constrains all elements of the factor matrices to be non-negative so that the solutions are physically meaningful. The model does not require prior information on the sources of the variables.
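To show what a non-negativity-constrained factorisation looks like in practice, here is a minimal Python sketch that fits X ≈ G F with uncertainty weights using multiplicative updates. This is a swapped-in, simpler technique (weighted non-negative matrix factorisation), not the Multilinear Engine solver that EPA PMF itself uses, and the function names and synthetic data are assumptions made for the example.

```python
import numpy as np

def q_value(X, U, G, F):
    """Sum of squared, uncertainty-scaled residuals."""
    return float(np.sum(((X - G @ F) / U) ** 2))

def weighted_nmf(X, U, n_factors, n_iter=300, seed=0, eps=1e-12):
    """Weighted NMF by multiplicative updates (illustrative, not EPA PMF's solver).

    Starts from strictly positive random G and F; each update multiplies by a
    non-negative ratio, so both matrices stay non-negative throughout.
    """
    X = np.asarray(X, dtype=float)
    W = 1.0 / np.asarray(U, dtype=float) ** 2   # weights from uncertainties
    rng = np.random.default_rng(seed)
    n_samples, n_species = X.shape
    G = rng.random((n_samples, n_factors)) + 0.1
    F = rng.random((n_factors, n_species)) + 0.1
    WX = W * X
    for _ in range(n_iter):
        G *= (WX @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ WX) / (G.T @ (W * (G @ F)) + eps)
    return G, F

# Synthetic non-negative data built from two known factors.
rng = np.random.default_rng(1)
X = rng.random((30, 2)) @ rng.random((2, 6))
U = 0.1 * X + 0.01                               # hypothetical uncertainties

G1, F1 = weighted_nmf(X, U, n_factors=2, n_iter=1)
G, F = weighted_nmf(X, U, n_factors=2, n_iter=300)
q_start = q_value(X, U, G1, F1)
q_end = q_value(X, U, G, F)
print(q_start, q_end)
```

No source profiles are supplied anywhere in the sketch; the factors are recovered from the data and uncertainties alone, which mirrors the point that the model does not need prior source information.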

This article is from the free online course Chemometrics in Air Pollution.
