Hi and welcome to the third week of the course, Process Mining with ProM. In this lecture, we will discuss conformance checking using alignments. So in the process mining spectrum, this week we will focus on both conformance checking and extension techniques. And in this lecture, we will focus and discuss the key conformance checking technique. In week 2 we’ve also shown you the four quality forces and conformance checking is actually the replay fitness quality force. So relating the event data to the process model and explaining how well the process model describes the observed events.
So how do you align traces with the process model? Well, we do this step by step. We create an alignment by choosing which steps in the trace and the process model to take. So let’s start. How can we align this trace and this process model? First, of course, we start with a token in the process model, the initial marking, and we add an arrow or a pointer in the trace where we are. So currently we are at the start of the trace and the next event in the trace is event a. And the next activity, the process model is also activity a. So these match very well.
So in our alignment, we can note down that in the trace we move on activity a, and we move the pointer one step further. And in the process model we also know that we fired activity a, and we update the marking of the process model. And in this case, three tokens are produced. Now, the process model now allows for activities b, c, or d to be executed. We also look at the trace and activity b is enabled or is the next event that is observed. Well, this again matches. So we move in the trace on activity b and also in the process model, and we’ve arrived at a new state.
The trace now says that activity c was observed, and we can follow this in the process model. And, therefore, we repeatedly do this, also indeed for activity d, we fire the necessary transitions. Now in the process model e and f can be executed. The trace has an observed activity of e. This is possible. So we write it down. We update our pointer in the trace and the process model. And then finally, we can execute activity g. And now the pointer in the trace has reached the end of the trace. The process model is in the final marking, and, therefore, our alignment is complete. But, of course, this is a perfect example where everything fits.
So let’s see what type of deviations can occur. So we want to capture deviations. So I changed the trace a little bit. I removed activity d. So let’s see how we can replay it? So we put a pointer on the trace, token in the Petri net, and let’s build our alignment. Activity a, we can fire this both in the trace and in the model, we call this a synchronous move. The trace and the process model can fire synchronously activity a, and, therefore, we call this a synchronous move. Next, we can do activity b in both the trace and the process model, an other synchronous move. And we can fire activity c.
Now we reach a state where the process model must execute activity d. However, the trace did not observe activity d but activity e. So what we can do is we can write down that we did activity d in the process model, and we update our process model, but that we did not do anything in the trace. And we indicate this, for instance, with two greater than signs. So the process model now reached the new state, and we can continue the alignment with purely synchronous moves on activity e and activity g.
The move of d in the model only and not in the trace is called a move on model only. We only moved the process model but the pointer in the trace kept steady. We did not change anything in the trace. Of course, I hope you can see that also the opposite is possible and move on trace only. So let’s look at an example. Again, I have a slightly different trace with the same process model. So let’s build the alignment, activity a, no problem.
We can fire c. We can fire d. And now the process model allows for a choice between activity e or f. Well, in the trace, we have observed activity e. So let’s fire e in both the process model and the trace.
However, now the process model expects activity g while the trace has observed activity f. So in the trace, we move one step further. In the process model we stay in the same state. And now we can correctly execute activity g. And we have reached the end of the trace and the process model. So this is a type of error that we call a move on log only. We only moved in the trace. We stayed in the same state in the process model, and then we could continue with synchronous moves. So these two type of deviations are punished. And we calculate costs for this.
However, it’s not always that obvious which explanation in the process model to take for a particular trace. I’ll show you an example. So let’s take this trace, abcdeg and this process model. So let’s find an optimal alignment. We start with a token in the process model, and we fire activity a. Now in this process model, we can take one of three branches. We can take the bdcf branch. We can also take the second branch, cbdf or the last branch, cbde. In the trace, we have already moved past activity a. So there we expect activity b. And in the process model, we can also execute activity b. So let’s create an alignment where we choose the top branch.
The full alignment looks like this. And I keep the state and the process model as is now. And as you can see, when you look at the top row in the alignment on the trace, and you remove all the move on models only then you get exactly the trace back. And if you do that similarly if you look on the bottom row, and only look at activities, you get a valid path through the process model. If we calculate a cost punishment of 1 for each log move and each model move then this alignment has the cost of four. Well, we had two other options. So let’s investigate what would happen when we take the second branch.
So in the trace after a we did not see a c, but maybe we can correct for this. And then we get this alignment. So we get a move on model on c then we can synchronously move on b. Then in the log we have a log move on c. Next we can synchronously move d, then however, f is out of sync and e. So we have a move on log only on e, move a model only on f, and then we can synchronously move g again. However, this alignment also has a cost of 4. So it’s not better than the previous one. Now let’s look at the final choice that we could do.
Let’s look what would happen if we take the bottom branch. So, again, we have to have a model move and a log move on c. But d,e, and g are all synchronous. And since this only has two incorrect moves, the cost is two. The last alignment is actually the best one of the three. It incurs the least cost, the least deviations. However, you cannot decide this after executing a. The naive approach would be to take activity b after observing a, but that would lead to a non optimal alignment. There is a better alignment with lower cost. Therefore, calculating alignments is not a trivial task.
You cannot do it on the local, you cannot do it on this instance, you have to look forward and sometimes drag back and pick other options. However, the technique in ProM guarantees to always return an optimal alignment. In case there are multiple, it returns one, but it will never return the top two alignments always the bottom one. The next step, once you have these alignments is to calculate the replay fitness. You want to assign a value. We have to normalize the cost of an alignment to a value between 0 and 1 where 1 is perfect and 0 is the worst alignment. Well, actually, what is actually the worst alignment? The worst alignment would be an alignment without any synchronous moves.
So if you take the log move cost for the trace, in this example, 6, and the move on model cost of this model which is also 6 then you get a total cost of 12 if you did not allow for any synchronous move. We can use this to normalize. So for the top alignment, with a cost 4. The replay fitness value is actually 1 minus the cost normalized to the overall cost. So in this example, 1 minus 4 divided by 12, which is 1 minus one third which is actually two third. Similarly, for the second trace, you also have a replay fitness of 2/3.
And for the last trace, you have a replay fitness of 1 minus 2 divided by 12, which is 0.83. So the replay fitness of the last alignment gives the real replay fitness of this trace on this model. And, of course, when you have multiple traces, you calculate this for all traces individually and then you aggregate for the full event log. So let’s summarize the strong points of alignments. Alignments explore multiple options to find the optimal alignment. So whatever the alignment algorithm returns, you know this is the best alignment possible. In the case that there are multiple optimal alignments, it returns an arbitrary one. Secondly, it allows flexible cost.
So in the previous slides we assumed a move on log cost and a move on model cost of 1. But you could say that a move on model, on e, can be preferred or incurred lower cost than a log move on f, for instance. And that allows you to influence the explanations for deviating behavior. Finally, it relates very explicitly the event log or actually every event in the event log to an activity in the process model. And hence enables further analysis as we will see later in this week. Therefore, alignments are the key technique to use when you want to calculate replay fitness and conformance between an event log and a process model.
In the next lecture, I will show you how you can apply conformance checking and alignments calculation in ProM, and afterwards, we will show you what you can do when you have these alignments. I hope to see you again next time.