Process Mining for Python (pm4py)

A clean and simple python library for process mining.


Disclaimer:
We aim to leave the stable features as-is, i.e. they are relatively stable and are unlikely to be refactored. The experimental features are more likely to be refactored in the near future.

For Windows users: a version of Python 3.6.x is required! Please read the Installation section for more details.

Stable Features

Importing

Exporting

Process Discovery

Replay/Conformance Checking techniques

Process model quality evaluation


Installation

pm4py has been tested successfully on Windows (64/32 bit) and on different Linux environments. Note that, due to some internally used Python libraries, pm4py requires a 3.6.x version of Python. This tutorial provides some guidance on how to get an environment ready for Windows (64/32 bit) and Linux. We start with the environment setup for Windows (one part applies to both 32 and 64 bit architectures, followed by a separate architecture-dependent part). Subsequently, we discuss the environment setup for Linux.

Windows (64/32 bit)

For some dependencies of the project, i.e. ciso8601 and cvxopt, you need a working Windows C/C++ compiler. We suggest installing the Microsoft Visual Studio 2017 compiler, which is available for free through the following link:

Microsoft Visual Studio

During installation, it is vital to install all C++ related development tools. Note that this is a large download of around 6 GB.

Moreover, for the visualizations produced by pm4py, a GraphViz installation is required. To install GraphViz, the installer on this page can be used:

Graphviz

After installation, GraphViz is located in the Program Files directory (on 64 bit systems, it is likely located in the x86 Program Files folder). The bin\ subfolder of the GraphViz directory, e.g. C:\Program Files (x86)\Graphviz2.38\bin, needs to be added manually to the system path. For Windows 10, we suggest the following article for an explanation of how to add a new folder to the system path:

Adding a folder to the system path

Windows (64 bit)

We suggest installing Miniconda (Python 3.6.x) as the Python distribution. Miniconda is a Python distribution focused on data science and machine learning applications, and can be retrieved using the following link:

Miniconda with Python 3.6.x

During the installation of the Miniconda distribution, it is important to select the option to add Miniconda Python to the system path. This provides easy access to Python from the command line / Powershell.

pm4py requires additional packages to work. In order to install them, open a command line / Powershell and browse to the folder that contains the pm4py sources. An example of the instruction to reach a specific folder in a Windows environment is the following (the path should be replaced):

C:\>cd C:\Users\johndoe\pm4py-source

The additional packages of the project are installed by issuing the following command:

pip install -r requirements.txt

Windows (32 bit)

Due to the limitations of a 32 bit architecture (limited memory for threads/processes), it is suggested to use a Windows (64 bit) installation whenever possible.

A full Anaconda (Python 3.6.x) installation is suggested. Anaconda is a free and open source distribution of the Python programming language for data science and machine learning related applications. Use the following link to get Anaconda:

https://www.anaconda.com/download/

Then, install Anaconda. To get easy access to Python from the command line / Powershell, it is important to select the option to add Anaconda Python to the system path. pm4py requires additional packages to work. In order to install them, open a command line / Powershell and reach the folder that contains the pm4py sources. An example of the instruction to reach a specific folder in a Windows environment is the following (the path should be replaced):

C:\>cd C:\Users\johndoe\pm4py-source

The following command is required to install cvxopt (a linear/integer programming solver, for further information see this link). Here the conda package manager provided by Anaconda is used:

conda install cvxopt

The other requirements needed to execute pm4py can then be installed through the following command:

pip install -r requirements.txt

Linux

In order to use pm4py under Linux, a C/C++ compiler is required. Most distributions already include the gcc and g++ compilers (compiling C and C++ code, respectively) in their default installation.

To check the presence of gcc and g++ on your current distribution, along with their version, the following commands can be used:

gcc -v
g++ -v

The output is verbose and ends with the version of the compiler that is currently installed:

Thread model: posix
gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)

If they are not installed, refer to your distribution support for instructions on how to install them. We provide some commands for the most widely used distributions:

Debian / Ubuntu

apt-get install gcc g++

Fedora

yum install gcc gcc-c++

Moreover, GraphViz must be present on the system. To check the presence of GraphViz, the following command can be used:

dot -h

If GraphViz is not installed, this command produces an error. To install GraphViz, a distribution-dependent command should be used. We provide the commands for the most widely used distributions:

Debian / Ubuntu

apt-get install graphviz

Fedora

yum install graphviz

Also on Linux, Python 3.6.x is required. Since the stable versions of Linux distributions generally ship older versions of Python 3.x, it is suggested to install the Miniconda Python distribution for Linux. Miniconda is a Python distribution focused on data science and machine learning applications, and can be retrieved using the following link:

https://conda.io/miniconda.html

The 64 bit installer (the 32 bit architecture has limited memory allocation for threads/processes) can be executed from the command line using the following instruction:

root@debian:~# bash Miniconda3-latest-Linux-x86_64.sh

Important note: it is not required to be root; any user account suffices.

As the first step of the Miniconda installation, you are asked to read the license agreement. Press the Enter key to open it, move with the up and down arrow keys to read its points, then press q to quit the license agreement and accept/deny it (yes/no). Then, a path for the installation of Miniconda is proposed (inside the user directory) and can be accepted as-is. Finally, Miniconda asks whether the executables should be added to the user path; it is convenient to answer yes here.

The additional requirements needed to run pm4py can be installed by reaching the pm4py directory and using pip:

root@debian:~# cd /root/pm4py-source/
root@debian:~/pm4py-source# pip install -r requirements.txt

Testing pm4py after installation

To test if everything is installed correctly, the following script can be run from the pm4py folder:

python imdf_example.py

On Linux environments, it may be necessary to call the proper version of Python, replacing python with python3.6.
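
For example:

python3.6 imdf_example.py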

The script works in the following way:

  1. The data/input_data/running-example.xes XES log file is loaded.
  2. A version of the Inductive Miner process discovery algorithm is applied to retrieve a sound workflow net describing the process, along with an initial and a final marking.
  3. The individual cases in the log are aligned with the process model.

If no error occurs during the execution of the script, then everything is installed correctly.


Working with Event Data

Event data, usually recorded in so-called event logs, are the primary source of data of any process mining project and/or algorithm. As such, they play a vital role in process mining.

Within pm4py we distinguish between two major event data object types: event logs, in which events are stored as a flat collection and are not grouped into cases, and trace logs, in which events are grouped into traces (cases).

In the remainder of this section, we describe how pm4py supports access and manipulation of event log data through the IEEE XES and CSV formats.

Importing IEEE XES files

IEEE XES is a standard format in which Process Mining logs are expressed. For further information about the format, please consult the IEEE website.

The following example code aims to import a log, given a path to a log file.

from pm4py.objects.log.importer.xes import factory as xes_import_factory
log = xes_import_factory.apply("<path_to_xes_file>")

A fully working version of the example script can be found in the pm4py-source project, in the examples/web/import_xes_log.py file.

The IEEE XES log is imported as a trace log; hence, the events are already grouped in traces. Trace logs are stored as an extension of the Python list: to access a given trace in the log, it is enough to provide its index. Consider the following examples of how to access the different objects stored in the imported trace log:
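
For instance (a minimal sketch, assuming the imported log contains at least one trace with at least one event):

# number of traces in the trace log (trace logs extend the Python list)
print(len(log))
# first trace of the log (a list-like collection of events)
first_trace = log[0]
# first event of the first trace (a dictionary-like object)
first_event = log[0][0]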

The apply method of the xes_import_factory, located in the pm4py.objects.log.importer.xes.factory file, accepts two additional arguments: an optional parameters object and an optional variant object.

Observe that throughout pm4py, we often use the notion of factories, which contain an apply method that takes some objects as input, an optional parameters object and an optional variant object. The parameters object is always a dictionary that contains the parameters in a key-value fashion. The variant is typically a string-valued argument.

Currently, we support two different variants, with corresponding (different) parameters: the standard importer, used by default, and a non-standard importer, which is selected by passing variant="nonstandard" (see the example below).

It is possible to access a specific value of a trace / event attribute, for example (the "concept:name" key at trace level represents the case ID; at event level, it typically represents the performed activity):

first_trace_concept_name = log[0].attributes["concept:name"]
first_event_first_trace_concept_name = log[0][0]["concept:name"]

The following code iterates over all the traces in the log writing the case id and, for each event, the performed activity:

for case_index, case in enumerate(log):
    print("\n case index: %d  case id: %s" % (case_index, case.attributes["concept:name"]))
    for event_index, event in enumerate(case):
        print("event index: %d  event activity: %s" % (event_index, event["concept:name"]))

A fragment of the output (here, for the trace at index 4) is the following:

 case index: 4  case id: 5
event index: 0  event activity: register request
event index: 1  event activity: examine casually
event index: 2  event activity: check ticket
event index: 3  event activity: decide
event index: 4  event activity: reinitiate request
event index: 5  event activity: check ticket
event index: 6  event activity: examine casually
event index: 7  event activity: decide
event index: 8  event activity: reinitiate request
event index: 9  event activity: examine casually
event index: 10  event activity: check ticket
event index: 11  event activity: decide
event index: 12  event activity: reject request

An example of invoking the non-standard variant, along with the specification of the timestamp_sort parameter, is contained in the following code:

from pm4py.objects.log.importer.xes import factory as xes_import_factory

parameters = {"timestamp_sort": True}

log = xes_import_factory.apply("<path_to_xes_file>", variant="nonstandard", parameters=parameters)

Exporting IEEE XES files

Exporting takes as input a trace log and produces an XML document that is saved into a file.

To export a trace log into a file exportedLog.xes, the following code could be used:

from pm4py.objects.log.exporter.xes import factory as xes_exporter

xes_exporter.export_log(log, "exportedLog.xes")

Importing logs from CSV files

CSV is a tabular format often used to store event logs. Excluding the first row, which contains the headers, each row in the CSV file corresponds to an event. Events in a CSV are not grouped into traces: a grouping is made by specifying a column as the case ID; events that share the same value in that column are then grouped into the same case.

Process Mining algorithms implemented in pm4py usually take a trace log as input. The logical steps to get a trace log from a CSV file are: first, import the CSV file into an event log structure; then, convert the event log into a trace log by specifying the case ID column.

In the following piece of code, the CSV file running-example.csv that can be found in the directory tests/input_data is imported into an event log structure:

from pm4py.objects.log.importer.csv import factory as csv_importer

event_log = csv_importer.import_log("tests\\input_data\\running-example.csv")

The previous code covers both the importing of the CSV through Pandas and its conversion into the event log structure. Additional parameters for the import_log method can be provided inside a dictionary passed as the optional parameters argument.
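
As a sketch, parameters can be passed as follows; the "sep" key for the column separator is an assumption here, by analogy with the sep argument of the dataframe adapter shown later in this section:

from pm4py.objects.log.importer.csv import factory as csv_importer

# "sep" (assumed parameter name) specifies the column separator of the CSV
parameters = {"sep": ","}
event_log = csv_importer.import_log("tests\\input_data\\running-example.csv", parameters=parameters)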

In an event log structure, events are not grouped into cases, so retrieving the length of an event log means retrieving the number of events. Moreover, each event in an event log is stored as a dictionary whose keys are the column names:

event_log_length = len(event_log)
print(event_log_length)
for event in event_log:
    print(event)

In particular, this is an event of the running-example.csv log:

{'Unnamed: 0': 10, 'Activity': 'check ticket', 'Costs': 100, 'Resource': 'Mike', 'case:concept:name': 2, 'case:creator': 'Fluxicon Nitro', 'concept:name': 'check ticket', 'org:resource': 'Mike', 'time:timestamp': Timestamp('2010-12-30 11:12:00')}

To eventually convert the event log structure into a trace log structure (where events are grouped into cases), the case ID column must be identified by the user (in the previous example, the case ID column is case:concept:name). To perform the conversion, the following instructions can be used:

from pm4py.objects.log import transform

trace_log = transform.transform_event_log_to_trace_log(event_log, case_glue="case:concept:name")

Sometimes it is useful to ingest the CSV into a dataframe using Pandas, perform some pre-filtering on the dataframe, and afterwards convert it into an event log (and then a trace log) structure. The following code covers the ingestion, the conversion into the event log structure and, eventually, the conversion into a trace log.

from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.objects.log.importer.csv.versions import pandas_df_imp
from pm4py.objects.log import transform

dataframe = csv_import_adapter.import_dataframe_from_path("tests\\input_data\\running-example.csv", sep=",")
event_log = pandas_df_imp.convert_dataframe_to_event_log(dataframe)
trace_log = transform.transform_event_log_to_trace_log(event_log, case_glue="case:concept:name")

Exporting logs to CSV files

Exporting capabilities into CSV files are provided for both event log and trace log formats.

The following example covers exporting of event logs into CSV. Hereby, the event log structure is converted into a Pandas dataframe, which is then exported to a CSV file:

from pm4py.objects.log.exporter.csv import factory as csv_exporter

csv_exporter.export_log(event_log, "outputFile1.csv")

Exporting trace logs into CSV is similar. The trace log is converted into an event log (the case attributes are reported into the events, adding the case: prefix to them), then the event log structure is converted into a Pandas dataframe, and the dataframe is exported to a CSV file:

from pm4py.objects.log.exporter.csv import factory as csv_exporter

csv_exporter.export_log(trace_log, "outputFile2.csv")

Petri net management

Petri nets are one of the most common formalisms to express a process model. A Petri net is a directed bipartite graph, in which the nodes represent transitions and places. Arcs connect places to transitions and transitions to places, and have an associated weight. A transition can fire if each of its input places contains a number of tokens that is at least equal to the weight of the arc connecting the place to the transition. When a transition fires, tokens are removed from the input places according to the weight of the input arcs, and are added to the output places according to the weight of the output arcs.

A marking is a state of the Petri net that associates each place with a number of tokens; it uniquely determines the set of enabled transitions that can be fired in that state.

Process Discovery algorithms implemented in pm4py return a Petri net along with an initial marking and a final marking. The initial marking is the initial state of execution of a process; a final marking is a state that should be reached at the end of the execution of the process.

Importing and exporting

Petri nets, along with their initial and final marking, can be imported from and exported to the PNML file format. The following code can be used to import a Petri net along with its initial and final marking. In particular, the Petri net related to the running-example process is loaded from the test folder:

import os
from pm4py.objects.petri.importer import pnml as pnml_importer

net, initial_marking, final_marking = pnml_importer.import_net(os.path.join("tests","input_data","running-example.pnml"))

The Petri net is visualized using the Petri net visualizer:

from pm4py.visualization.petrinet import factory as pn_vis_factory

gviz = pn_vis_factory.apply(net, initial_marking, final_marking)
pn_vis_factory.view(gviz)

A Petri net can be exported along with only its initial marking:

from pm4py.objects.petri.exporter import pnml as pnml_exporter

pnml_exporter.export_net(net, initial_marking, "petri.pnml")

And along with both its initial marking and final marking:

pnml_exporter.export_net(net, initial_marking, "petri_final.pnml", final_marking=final_marking)

Petri Net properties

The list of transitions enabled in a particular marking can be obtained using the following code:

from pm4py.objects.petri import semantics

transitions = semantics.enabled_transitions(net, initial_marking)

Printing this set (print(transitions)) reports that only the transition register request is enabled in the initial marking of the given Petri net. To obtain all places, transitions, and arcs of the Petri net, the following code can be used:

places = net.places
transitions = net.transitions
arcs = net.arcs

Each place has a name and a set of input/output arcs (connected at source/target to a transition). Each transition has a name, a label and a set of input/output arcs (connected at source/target to a place). The following code prints, for each place, the name and, for each input arc of the place, the name and the label of the corresponding transition:

for place in places:
    print("\nPLACE: "+place.name)

    for arc in place.in_arcs:
        print(arc.source.name, arc.source.label)

The output starts with the following:

PLACE: sink 47
n10 register request
n16 reinitiate request

PLACE: source 45
...

Similarly, the following code prints for each transition the name and the label, and for each output arc of the transition the name of the corresponding place:

for trans in transitions:
    print("\nTRANS: ",trans.name, trans.label)

    for arc in trans.out_arcs:
        print(arc.target.name)

For the running example the output starts with the following:

TRANS:  n14 examine thoroughly
sink 54

TRANS:  n15 decide
middle 49
...

Creating a new Petri net

In this section, an overview of the code necessary to create a new Petri net with places, transitions, and arcs is provided. A Petri net object in pm4py should be created with a name. For example, the following creates a Petri net named new_petri_net:

from pm4py.objects.petri.petrinet import PetriNet

# creating an empty Petri net
net = PetriNet("new_petri_net")

Places also need to be named upon their creation:

# creating source, p_1 and sink place
source = PetriNet.Place("source")
sink = PetriNet.Place("sink")
p_1 = PetriNet.Place("p_1")

To become part of the Petri net, the places need to be added to it:

net.places.add(source)
net.places.add(sink)
net.places.add(p_1)

Similar to the places, transitions can be created. However, they need to be assigned a name and a label:

t_1 = PetriNet.Transition("name_1", "label_1")
t_2 = PetriNet.Transition("name_2", "label_2")

They should also be added to the Petri net:

net.transitions.add(t_1)
net.transitions.add(t_2)

The following code adds arcs to the Petri net. Arcs can go from a place to a transition or from a transition to a place. The first parameter specifies the source of the arc, the second parameter its target, and the last parameter the Petri net it belongs to.

from pm4py.objects.petri import utils

utils.add_arc_from_to(source, t_1, net)
utils.add_arc_from_to(t_1, p_1, net)
utils.add_arc_from_to(p_1, t_2, net)
utils.add_arc_from_to(t_2, sink, net)

To complete the Petri net an initial and possibly a final marking need to be defined. In the following, we define the initial marking to contain 1 token in the source place and the final marking to contain 1 token in the sink place:

from pm4py.objects.petri.petrinet import Marking

initial_marking = Marking()
initial_marking[source] = 1
final_marking = Marking()
final_marking[sink] = 1

The resulting Petri net along with the initial and final marking could be exported:

from pm4py.objects.petri.exporter import pnml as pnml_exporter

pnml_exporter.export_net(net, initial_marking, "createdPetriNet1.pnml", final_marking=final_marking)

Or visualized:

from pm4py.visualization.petrinet import factory as pn_vis_factory

gviz = pn_vis_factory.apply(net, initial_marking, final_marking)
pn_vis_factory.view(gviz)

Created Petri net

To obtain a specific output format (e.g. svg or png), a format parameter should be provided to the algorithm. The following code shows how to obtain an SVG representation of the Petri net:

from pm4py.visualization.petrinet import factory as pn_vis_factory

parameters = {"format":"svg"}
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters)
pn_vis_factory.view(gviz)

Instead of opening the visualization of the model directly, it can also be saved using the following code:

from pm4py.visualization.petrinet import factory as pn_vis_factory

parameters = {"format":"svg"}
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters)
pn_vis_factory.save(gviz, "alpha.svg")

Process Discovery

Discovery Algorithms

Process Discovery using the Alpha Algorithm

Process Discovery algorithms aim to find a suitable process model that describes the order of events/activities that are executed during a process execution. The Alpha Algorithm is one of the best-known Process Discovery algorithms and is able to find a Petri net along with an initial and a final marking.

We provide an example where a log is read, the Alpha algorithm is applied and the Petri net along with the initial and the final marking are found. The log we take as input is the running-example.xes XES log that can be found in the folder tests/input_data.

The following code imports the running-example.xes log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))

Once the log is loaded in memory, the Alpha Miner algorithm can be applied:

from pm4py.algo.discovery.alpha import factory as alpha_miner

net, initial_marking, final_marking = alpha_miner.apply(log)

To export the process model, to visualize it or to save the visualization of the model, the functions presented in the Petri net management section can be used.

The following picture represents the Petri net mined from the running-example.xes log by applying the Alpha Miner:

Alpha Petri net

The place colored green is the source place and belongs to the initial marking. In the initial marking, a token is assigned to that place (indicated by the number 1 on the place). The place colored orange is the sink place and belongs to the final marking. We see that transitions here correspond to activities in the log. Models extracted by the Alpha Miner often have deadlock problems, so it is not guaranteed that each trace is replayable on this model.

Process Discovery using Inductive Miner

Mining a Petri net

The Inductive Miner is a Process Discovery algorithm that aims to construct a sound workflow net with fitness guarantees: it is assured by construction that every trace in the log is replayable on the model. The basic idea of the Inductive Miner is to detect a "cut" in the log (e.g. sequence cut, exclusive-choice cut, parallel cut and loop cut) and then recurse on the sublogs obtained by applying the cut, until a base case is found. In pm4py, a variant of the Inductive Miner is implemented (IMDF; for further details see this link) that avoids the recursion on the sublogs and instead uses the Directly-Follows graph.

Models generated by the Inductive Miner generally have greater fitness and generalization than models extracted by the Alpha Miner. Inductive Miner models usually make extensive use of hidden transitions, especially for skipping/looping over a portion of the model. Furthermore, each visible transition has a unique label (no two transitions in the model share the same label).

We provide an example where a log is read, the Inductive Miner is applied and the Petri net along with the initial and the final marking are found. The log we take as input is the running-example.xes XES log that can be found in the folder tests/input_data.

To read the running-example.xes log, the following Python code can be used:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))

Once the log is loaded in memory, the Inductive Miner algorithm is applied:

from pm4py.algo.discovery.inductive import factory as inductive_miner

net, initial_marking, final_marking = inductive_miner.apply(log)

To export the process model, to visualize it or to save the visualization of the model, the functions presented in the Petri net management section can be used.

The following picture represents the Petri net obtained on running-example.xes log by applying Inductive Miner:

Inductive Miner Petri net

The place colored green is the source place and belongs to the initial marking. In the initial marking, a token is assigned to that place (indicated by the number 1 on the place). The place colored orange is the sink place and belongs to the final marking. We see that visible transitions here correspond to activities in the log, and there are some hidden transitions.

Mining a process tree

It is also possible to obtain a process tree from the event log using the Inductive Miner. The following code can be used in order to mine a process tree from an event log:

from pm4py.algo.discovery.inductive import factory as inductive_miner

tree = inductive_miner.apply_tree(log)

from pm4py.visualization.process_tree import factory as pt_vis_factory

gviz = pt_vis_factory.apply(tree)
pt_vis_factory.view(gviz)

The following representation is obtained:

Process Tree

If needed, the process tree could be printed through print(tree).

It is also possible to convert a process tree to a Petri net:

from pm4py.objects.conversion.tree_to_petri import factory as tree_petri_converter

net, initial_marking, final_marking = tree_petri_converter.apply(tree)

Process Discovery using Directly-Follows Graphs

Process models modeled using Petri nets have well-defined semantics: a process execution starts from the places included in the initial marking and finishes at the places included in the final marking. In this section, another class of process models, Directly-Follows Graphs, is introduced. Directly-Follows Graphs are graphs where the nodes represent the events/activities in the log, and a directed edge connects two nodes if there is at least one trace in the log where the source event/activity is directly followed by the target event/activity. On top of these directed edges, it is easy to represent metrics like frequency (counting the number of times the source event/activity is followed by the target event/activity) and performance (some aggregation, for example the mean, of the time elapsed between the two events/activities).

We extract a Directly-Follows graph from the log running-example.xes.

To read the running-example.xes log, the following Python code could be used:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))

Then, the following code could be used to extract a Directly-Follows graph from the log:

from pm4py.algo.discovery.dfg import factory as dfg_factory

dfg = dfg_factory.apply(log)

A colored visualization of the Directly-Follows graph decorated with the frequency of activities and edges can be then obtained by using the following code:

from pm4py.visualization.dfg import factory as dfg_vis_factory

gviz = dfg_vis_factory.apply(dfg, log=log, variant="frequency")
dfg_vis_factory.view(gviz)

DFG with frequency

To get a Directly-Follows graph decorated with performance information on the edges, the following code can replace the previous two pieces of code. The performance variant should be specified both in the Directly-Follows application and in the visualization part:

from pm4py.algo.discovery.dfg import factory as dfg_factory
from pm4py.visualization.dfg import factory as dfg_vis_factory

dfg = dfg_factory.apply(log, variant="performance")
gviz = dfg_vis_factory.apply(dfg, log=log, variant="performance")
dfg_vis_factory.view(gviz)

DFG with performance

To save the DFG decorated with frequency or performance in SVG format, instead of displaying it on screen, the following code could be used:

from pm4py.algo.discovery.dfg import factory as dfg_factory
from pm4py.visualization.dfg import factory as dfg_vis_factory

dfg = dfg_factory.apply(log, variant="performance")
parameters = {"format":"svg"}
gviz = dfg_vis_factory.apply(dfg, log=log, variant="performance", parameters=parameters)
dfg_vis_factory.save(gviz, "dfg.svg")

Adding frequency or performance information to Petri nets

Similar to the Directly-Follows graph, it is also possible to decorate the Petri net with frequency or performance information. This is done by using a replay technique on the model and then assigning frequency/performance to the paths. The variant parameter of the visualization factory specifies which annotation should be used; the values used in the following examples are frequency and performance.

If the frequency or performance decoration is chosen, it is required to pass the log as a parameter of the visualization (the log needs to be replayed).

The following code can be used to obtain the Petri net mined by the Inductive Miner decorated with frequency information:

from pm4py.visualization.petrinet import factory as pn_vis_factory

parameters = {"format":"png"}
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters, variant="frequency", log=log)
pn_vis_factory.save(gviz, "inductive_frequency.png")

Inductive Miner Petri net with frequency

Changing the variant to performance, we obtain the following process schema:

Inductive Miner Petri net with performance

Using different activity keys

Specifying a different activity key in a Process Mining algorithm

Algorithms implemented in pm4py classify events based on their activity name, which is usually reported inside the concept:name event attribute. In some contexts, however, it is useful to use another event attribute as the activity.

The following example shows the specification of an activity key for the Alpha Miner algorithm:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.alpha import factory as alpha_miner
from pm4py.util import constants

log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))

parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

For logs imported from the XES format, a list of fields that can be used to classify events and apply Process Mining algorithms is usually reported in the classifiers section. The Standard classifier usually includes the activity name (the concept:name attribute) and the lifecycle (the lifecycle:transition attribute); the Event name classifier includes only the activity name.

In pm4py, it is assumed that algorithms work on a single activity key. In order to use multiple fields, a new attribute should be inserted for each event as the concatenation of those fields.

Classifiers: retrieval and insertion of a corresponding attribute

The following example demonstrates the retrieval of the classifiers inside a log file, using the receipt.xes log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.import_log(os.path.join("tests","input_data","receipt.xes"))
print(log.classifiers)

The classifiers are then printed to the screen:

{'Activity classifier': ['concept:name', 'lifecycle:transition'], 'Resource classifier': ['org:resource'], 'Group classifier': ['org:group']}

To use the classifier Activity classifier and write a new attribute for each event in the log, the following code can be used:

from pm4py.objects.log.util import insert_classifier

log, activity_key = insert_classifier.insert_activity_classifier_attribute(log, "Activity classifier")
print(activity_key)

Then, as before, the Alpha Miner can be applied on the log specifying the newly inserted activity key:

from pm4py.algo.discovery.alpha import factory as alpha_miner
from pm4py.util import constants

parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: activity_key}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

Insert manually a new attribute

In case the XES log specifies no classifiers and a different field should be used as the activity key, there is the option to specify it manually. For example, in the following piece of code, we read the receipt.xes log and create a new attribute, called customClassifier, that is the concatenation of the activity name and the transition:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.util import constants

log = xes_importer.import_log(os.path.join("tests","input_data","receipt.xes"))

for trace in log:
    for event in trace:
        event["customClassifier"] = event["concept:name"] + event["lifecycle:transition"]

Then, for example, the Alpha Miner can be applied specifying customClassifier as the activity key:

from pm4py.algo.discovery.alpha import factory as alpha_miner

parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "customClassifier"}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

Conformance Checking

Evaluating Petri nets

Now that it is clear how to obtain a Petri net, along with an initial and a final marking, by applying a Process Discovery algorithm, the question is how to evaluate the quality of the extracted models along the 4 dimensions of Fitness, Precision, Generalization, and Simplicity. pm4py provides algorithms to evaluate all 4 dimensions.

For the examples reported in the following sections, we assume to work with the running-example log located in the folder tests\input_data, and we apply both the Alpha Miner and the Inductive Miner:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.alpha import factory as alpha_miner
from pm4py.algo.discovery.inductive import factory as inductive_miner

log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))
alpha_petri, alpha_initial_marking, alpha_final_marking = alpha_miner.apply(log)
inductive_petri, inductive_initial_marking, inductive_final_marking = inductive_miner.apply(log)

Fitness

Fitness is a measure of the replayability of the traces used to mine the model. A fitness evaluation provides the percentage of traces in the log that fit the model, along with an average fitness value.

In pm4py we provide the following algorithms to replay traces on a process model: token-based replay and alignment-based replay.

The following code is useful to get the average fitness value and the percentage of fit traces according to the token replayer:

from pm4py.evaluation.replay_fitness import factory as replay_factory

fitness_alpha = replay_factory.apply(log, alpha_petri, alpha_initial_marking, alpha_final_marking)
fitness_inductive = replay_factory.apply(log, inductive_petri, inductive_initial_marking, inductive_final_marking)
print("fitness_alpha=",fitness_alpha)
print("fitness_inductive=",fitness_inductive)

The output shows that for the running-example log and both Alpha Miner and Inductive Miner we have perfect fitness:

fitness_alpha= {'percFitTraces': 100.0, 'averageFitness': 1.0}
fitness_inductive= {'percFitTraces': 100.0, 'averageFitness': 1.0}

To use the alignment-based replay and get the fitness values, the following code can be used on the Inductive Miner model. Since the Alpha Miner does not produce a sound workflow net, alignment-based replay cannot be applied to its model.

fitness_inductive = replay_factory.apply(log, inductive_petri, inductive_initial_marking, inductive_final_marking, variant="alignments")

Alignments use multiprocessing in order to improve performance; therefore, it is mandatory to guard the script with the following condition in order to compute alignments:

if __name__ == "__main__":
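
A minimal sketch of such a guarded script, reusing the objects defined earlier in this section:

if __name__ == "__main__":
    # the alignment computation spawns worker processes, so it must run
    # inside the __main__ guard
    fitness_inductive = replay_factory.apply(log, inductive_petri, inductive_initial_marking,
                                             inductive_final_marking, variant="alignments")
    print(fitness_inductive)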

Precision

Precision is a comparison between the behavior allowed by the model in a given state and the behavior observed in the log. A model is precise when it does not allow for paths that are not present in the log. An approach to measure precision, called ETConformance, has been proposed in the following paper:

Muñoz-Gama, Jorge, and Josep Carmona. “A fresh look at precision in process conformance.” International Conference on Business Process Management. Springer, Berlin, Heidelberg, 2010.

Basically, the idea is to build an automaton from the log, where the states are represented by prefixes of the traces in the log and a transition is inserted in the automaton if it is present in some trace of the log.

Each state of the automaton is replayed on the Petri net (assuming that the log is fit according to the Petri net), and for each state we obtain the set of transitions enabled in the model (the activated transitions) and the set of moves actually observed in the log at that state (the reflected tasks).

The set of escaping edges of a state is defined as the difference between the activated transitions and the reflected tasks. The following sums are computed over all states: the sum of the escaping edges (SUM_EE) and the sum of the activated transitions (SUM_AT).

The precision measure is then computed as 1 - SUM_EE/SUM_AT.
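
Written in formula form (a reconstruction from the definitions above, where S is the set of states of the automaton, EE(s) the escaping edges and AT(s) the activated transitions of state s):

\mathrm{precision} = 1 - \frac{\sum_{s \in S} |EE(s)|}{\sum_{s \in S} |AT(s)|}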

The following code measures the precision of the Alpha and Inductive Miner models on the running-example.xes log:

from pm4py.evaluation.precision import factory as precision_factory

precision_alpha = precision_factory.apply(log, alpha_petri, alpha_initial_marking, alpha_final_marking)
precision_inductive = precision_factory.apply(log, inductive_petri, inductive_initial_marking, inductive_final_marking)

print("precision_alpha=",precision_alpha)
print("precision_inductive=",precision_inductive)

We obtain the following values:

precision_alpha= 0.10416666666666663
precision_inductive= 0.10416666666666663

In this case, the Alpha and Inductive Miner models obtain the same precision value on this log.

Generalization

Generalization indicates the extent to which a process model avoids components that are too specific, i.e. used only in a few executions of the process. Models that overfit the log generally have a lot of components that are too specific.

In the context of measuring generalization on a Petri net, the components taken into account are the transitions (both visible and hidden). In particular, the token replayer returns, for each trace, the list of transitions that have been activated during the replay. Note that the implementation provided in pm4py is able to take hidden transitions into account. It is therefore easy to measure how many times each transition has been activated during the replay of the log.

The implemented approach is suggested in the paper:

Buijs, Joos CAM, Boudewijn F. van Dongen, and Wil MP van der Aalst. “Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplicity.” International Journal of Cooperative Information Systems 23.01 (2014): 1440001.

Accordingly, generalization is obtained using the following formula on the Petri net:

Generalization formula
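
A reconstruction of the formula, following the cited paper and the replay-based counting described above (the exact form is an assumption; T is the set of transitions of the net and exec(t) the number of times transition t was activated during the replay):

\mathrm{generalization} = 1 - \frac{\sum_{t \in T} \left(\sqrt{\mathrm{exec}(t)}\right)^{-1}}{|T|}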

The following code measures the generalization of the Alpha and Inductive Miner models on the running-example.xes log:

from pm4py.evaluation.generalization import factory as generalization_factory

generalization_alpha = generalization_factory.apply(log, alpha_petri, alpha_initial_marking, alpha_final_marking)
generalization_inductive = generalization_factory.apply(log, inductive_petri, inductive_initial_marking, inductive_final_marking)

print("generalization_alpha=",generalization_alpha)
print("generalization_inductive=",generalization_inductive)

We obtain the following values:

generalization_alpha= 0.5259294594558881
generalization_inductive= 0.4158076884525792

The generalization value provided by the Inductive Miner on this log is slightly lower than the generalization of the Alpha Miner model because of the presence of skip/loop transitions that are visited less often than visible transitions. In comparison, the Petri net constructed by the Alpha Miner only contains visible transitions.

Simplicity

A model is simple when the end user can really understand the information conveyed by the process model, i.e. when the execution paths of the model are clear. For Petri nets, the execution semantics is related to firing transitions, which remove tokens from their input places and add tokens to their output places. A model can thus be seen as simpler when the number of transitions (the possible ways to consume/insert tokens) is low in comparison to the number of places. The approach implemented in pm4py is inspired by this idea; it has been reported in the following paper and is called 'inverse arc degree':

Blum, Fabian Rojas. Metrics in process discovery. Technical Report TR/DCC-2015-6, Computer Science Department, University of Chile, 2015.

The formula applied for simplicity is the following:

Simplicity formula
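
As a sketch of the "inverse arc degree" idea (our reading, not necessarily the exact implemented formula; deg(n) is the number of arcs attached to node n, N the set of places and transitions, and k a reference degree, e.g. 2):

\mathrm{simplicity} = \frac{1}{1 + \max\left(\frac{\sum_{n \in N} \mathrm{deg}(n)}{|N|} - k,\; 0\right)}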

The following code measures the simplicity of the Alpha and Inductive Miner models mined from the running-example.xes log:

from pm4py.evaluation.simplicity import factory as simplicity_factory

simplicity_alpha = simplicity_factory.apply(alpha_petri)
simplicity_inductive = simplicity_factory.apply(inductive_petri)

print("simplicity_alpha=",simplicity_alpha)
print("simplicity_inductive=",simplicity_inductive)

We obtain the following values:

simplicity_alpha= 0.5333333333333333
simplicity_inductive= 0.6956521739130435

The simplicity of the Inductive Miner model is higher than the simplicity provided by the Alpha Miner on this log.

Getting all measures in one-line

In the previous sections, methods to calculate the fitness, precision, generalization and simplicity of a process model have been provided. In this section, code to retrieve all the measures at once is provided:

from pm4py.evaluation import factory as evaluation_factory
alpha_evaluation_result = evaluation_factory.apply(log, alpha_petri, alpha_initial_marking, alpha_final_marking)
print("alpha_evaluation_result=",alpha_evaluation_result)

inductive_evaluation_result = evaluation_factory.apply(log, inductive_petri, inductive_initial_marking, inductive_final_marking)
print("inductive_evaluation_result=",inductive_evaluation_result)

We obtain the following values:

alpha_evaluation_result= {'fitness': {'percFitTraces': 100.0, 'averageFitness': 1.0}, 'precision': 0.10416666666666663, 'generalization': 0.5259294594558881, 'simplicity': 0.5333333333333333, 'metricsAverageWeight': 0.540857364863972}
inductive_evaluation_result= {'fitness': {'percFitTraces': 100.0, 'averageFitness': 1.0}, 'precision': 0.10416666666666663, 'generalization': 0.4158076884525792, 'simplicity': 0.6956521739130435, 'metricsAverageWeight': 0.5539066322580724}

These values are the same as those reported previously; in addition, the average of the 4 measures is provided with the key 'metricsAverageWeight' and measures the overall quality of the process model (for the Alpha model: (1.0 + 0.1042 + 0.5259 + 0.5333) / 4 ≈ 0.5409). In this case, we see that the overall quality of the model extracted by the Inductive Miner is greater than the overall quality of the model extracted by the Alpha Miner.

Conformance checking techniques

Token-based replayer

Token-based replay matches a trace against a Petri net model, starting from the initial marking, in order to discover which transitions are executed and in which places there are remaining or missing tokens for the given process instance. Token-based replay is useful for Conformance Checking: indeed, a trace is fitting according to the model if, during its execution, the transitions can be fired without the need to insert any missing token. If reaching the final marking is imposed, then a trace is fitting if it reaches the final marking without any missing or remaining tokens.

Token-based replay permits both global and local Conformance Checking. For each trace, we can assign a fitness value between 0 and 1, defined as:

Token Replay formula
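
A reconstruction of the formula (the standard token-based replay fitness, where m, r, c and p are the numbers of missing, remaining, consumed and produced tokens during the replay of the trace):

\mathrm{fitness} = \frac{1}{2}\left(1 - \frac{m}{c}\right) + \frac{1}{2}\left(1 - \frac{r}{p}\right)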

pm4py contains an implementation of a token replayer that is able to go across hidden transitions (calculating shortest paths between places) and can be used with any Petri net model with unique visible transitions and hidden transitions. When a visible transition needs to be fired and not all places in its preset hold the required number of tokens, the replayer checks whether, starting from the current marking, some sequence of hidden transitions can be fired in order to enable the visible transition. The hidden transitions are then fired, reaching a marking that enables the visible transition. The following picture provides an example of the algorithm:

Token Replay hidden transitions example

The visible transition TRANS can be enabled from the current marking by firing the hidden transitions ht1 and ht2.

Aside from the fitness value, the replay algorithm can be configured to consider a trace completely fitting even if there are remaining tokens, as long as all visible transitions corresponding to events in the trace can be fired. Moreover, it can be configured to try to reach the final marking through hidden transitions. This is useful when, after the last activity, the final marking is not reached directly but can be reached by executing hidden transitions.
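
A sketch of how such options could be passed, in the same style as the enable_placeFitness parameter used below; the exact keys are an assumption and may differ across versions:

from pm4py.algo.conformance.tokenreplay import factory as token_replay

# hypothetical parameter names (assumed, not confirmed by this documentation)
parameters = {"consider_remaining_in_fitness": False,
              "try_to_reach_final_marking_through_hidden": True}
replay_result = token_replay.apply(log, net, initial_marking, final_marking, parameters=parameters)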

We provide the following example showing the application of token-based replay. The example starts as usual with the import of the running-example.xes log and the application of the Inductive Miner.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.inductive import factory as inductive_miner

log = xes_importer.import_log(os.path.join("tests", "input_data", "running-example.xes"))
net, initial_marking, final_marking = inductive_miner.apply(log)

To apply token-based replay, the following code is applied:

from pm4py.algo.conformance.tokenreplay import factory as token_replay

replay_result = token_replay.apply(log, net, initial_marking, final_marking)

The print(replay_result) command prints the results:

[{'trace_is_fit': True, 'trace_fitness': 1.0, 'activated_transitions': [register request, examine casually, check ticket, decide, reinitiate request, loop_2, examine thoroughly, check ticket, decide, pay compensation, tau_1], 'reached_marking': ['sink:1'], 'enabled_transitions_in_marking': set(), 'transitions_with_problems': []}, {'trace_is_fit': True, 'trace_fitness': 1.0, 'activated_transitions': [register request, skip_5, check ticket, loop_4, examine casually, skip_6, decide, pay compensation, tau_1], 'reached_marking': ['sink:0'], 'enabled_transitions_in_marking': set(), 'transitions_with_problems': []}, {'trace_is_fit': True, 'trace_fitness': 1.0, 'activated_transitions': [register request, examine thoroughly, check ticket, decide, reject request, tau_1], 'reached_marking': ['sink:1'], 'enabled_transitions_in_marking': set(), 'transitions_with_problems': []}, {'trace_is_fit': True, 'trace_fitness': 1.0, 'activated_transitions': [register request, examine casually, check ticket, decide, pay compensation, tau_1], 'reached_marking': ['sink:0'], 'enabled_transitions_in_marking': set(), 'transitions_with_problems': []}, {'trace_is_fit': True, 'trace_fitness': 1.0, 'activated_transitions': [register request, examine casually, check ticket, decide, reinitiate request, loop_2, skip_5, check ticket, loop_4, examine casually, skip_6, decide, reinitiate request, loop_2, examine casually, check ticket, decide, reject request, tau_1], 'reached_marking': ['sink:0'], 'enabled_transitions_in_marking': set(), 'transitions_with_problems': []}, {'trace_is_fit': True, 'trace_fitness': 1.0, 'activated_transitions': [register request, skip_5, check ticket, loop_4, examine thoroughly, skip_6, decide, reject request, tau_1], 'reached_marking': ['sink:0'], 'enabled_transitions_in_marking': set(), 'transitions_with_problems': []}]

There is one dictionary in the list for each trace. The keys provided in the dictionary for each trace are: trace_is_fit (whether the trace fits the model), trace_fitness (the fitness value of the trace), activated_transitions (the transitions activated during the replay), reached_marking (the marking reached at the end of the replay), enabled_transitions_in_marking (the transitions enabled in the reached marking) and transitions_with_problems (the transitions for which problems occurred during the replay).
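
As an example of post-processing this structure (a sketch using only the keys shown above), the percentage of fit traces can be computed directly from the replay result:

# count the traces flagged as fitting by the token-based replay
fit_traces = [res for res in replay_result if res["trace_is_fit"]]
perc_fit_traces = 100.0 * len(fit_traces) / len(replay_result)
print(perc_fit_traces)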

The following code provides the overall log fitness value:

from pm4py.evaluation.replay_fitness import factory as replay_fitness_factory

log_fitness = replay_fitness_factory.evaluate(replay_result, variant="token_replay")

If we execute print(log_fitness) then the following result is obtained:

{'percFitTraces': 100.0, 'averageFitness': 1.0}

The token-based replayer can also be configured to return local conformance information about places. This is achieved through the enable_placeFitness parameter. The following code could be applied:

from pm4py.algo.conformance.tokenreplay import factory as token_replay

replay_result, place_fitness = token_replay.apply(log, net, initial_marking, final_marking, parameters={"enable_placeFitness": True})

If we do print(place_fitness), the following result is obtained:

{({'check ticket'}, {'decide'}): {'underfedTraces': set(), 'overfedTraces': set()}, ({'examine thoroughly', 'examine casually'}, {'decide'}): {'underfedTraces': set(), 'overfedTraces': set()}, ({'reinitiate request', 'register request'}, {'examine thoroughly', 'examine casually'}): {'underfedTraces': set(), 'overfedTraces': set()}, ({'decide'}, {'reinitiate request', 'pay compensation', 'reject request'}): {'underfedTraces': set(), 'overfedTraces': set()}, start: {'underfedTraces': set(), 'overfedTraces': set()}, end: {'underfedTraces': set(), 'overfedTraces': set()}, ({'reinitiate request', 'register request'}, {'check ticket'}): {'underfedTraces': set(), 'overfedTraces': set()}}

The keys of this dictionary are places; the values are dictionaries containing the set of traces for which the place is underfed and the set of traces for which the place is overfed.
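
A small sketch of how this structure can be inspected, using only the keys shown in the output above:

# report, for each place, how many traces under- or over-feed it
for place, stats in place_fitness.items():
    print(place, "underfed:", len(stats["underfedTraces"]), "overfed:", len(stats["overfedTraces"]))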

Additional parameters of the token-replay algorithm can be passed in the parameters dictionary.

To use a different classifier, we recall the Classifiers section in the documentation of Process Discovery:

for trace in log:
    for event in trace:
        event["customClassifier"] = event["concept:name"] + event["concept:name"]

A parameters dictionary containing the activity key is constructed:

# import constants
from pm4py.util import constants
# define the activity key in the parameters
parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "customClassifier"}

Then the process model is calculated:

# calculate process model using the given classifier
net, initial_marking, final_marking = inductive_miner.apply(log, parameters=parameters)

And eventually the replay is done:

# apply token-based replay
replay_result = token_replay.apply(log, net, initial_marking, final_marking, parameters=parameters)

Alignments

Alignment-based replay aims to find one of the best alignments between each trace and the model. For each trace, the output of an alignment is a list of couples where the first element is an event (of the trace) or ">>" and the second element is a transition (of the model) or ">>". Each couple can be classified as follows: a synchronous move (an event paired with the corresponding transition, meaning that trace and model advance together), a move on log (an event paired with ">>", meaning that the trace performs a move that is not mirrored in the model) or a move on model (">>" paired with a transition, meaning that the model performs a move that is not mirrored in the trace).

The following code implements an example for obtaining alignments. First, the running-example.xes log is loaded and the Inductive Miner is applied:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.inductive import factory as inductive_miner

log = xes_importer.import_log(os.path.join("tests", "input_data", "running-example.xes"))

net, initial_marking, final_marking = inductive_miner.apply(log)

And the alignments can be obtained by this piece of code:

import pm4py
from pm4py.algo.conformance.alignments import factory as align_factory

alignments = align_factory.apply_log(log, net, initial_marking, final_marking)

If we execute print(alignments) we get the following output:

[{'alignment': [('register request', 'register request'), ('examine casually', 'examine casually'), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('decide', 'decide'), ('reinitiate request', 'reinitiate request'), ('>>', None), ('>>', None), ('examine thoroughly', 'examine thoroughly'), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('decide', 'decide'), ('pay compensation', 'pay compensation'), ('>>', None)], 'cost': 7, 'visited_states': 18, 'queued_states': 50, 'traversed_arcs': 100, 'fitness': 1.0}, {'alignment': [('register request', 'register request'), ('check ticket', 'check ticket'), ('>>', None), ('examine casually', 'examine casually'), ('>>', None), ('decide', 'decide'), ('pay compensation', 'pay compensation'), ('>>', None)], 'cost': 3, 'visited_states': 9, 'queued_states': 26, 'traversed_arcs': 45, 'fitness': 1.0}, {'alignment': [('register request', 'register request'), ('examine thoroughly', 'examine thoroughly'), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('decide', 'decide'), ('reject request', 'reject request'), ('>>', None)], 'cost': 3, 'visited_states': 9, 'queued_states': 26, 'traversed_arcs': 45, 'fitness': 1.0}, {'alignment': [('register request', 'register request'), ('examine casually', 'examine casually'), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('decide', 'decide'), ('pay compensation', 'pay compensation'), ('>>', None)], 'cost': 3, 'visited_states': 9, 'queued_states': 26, 'traversed_arcs': 45, 'fitness': 1.0}, {'alignment': [('register request', 'register request'), ('examine casually', 'examine casually'), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('decide', 'decide'), ('reinitiate request', 'reinitiate request'), ('>>', None), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('examine casually', 'examine casually'), ('>>', None), ('decide', 'decide'), ('reinitiate request', 'reinitiate request'), ('>>', None), ('>>', None), ('examine casually', 'examine casually'), ('>>', None), ('check ticket', 'check ticket'), ('>>', None), ('decide', 'decide'), ('reject request', 'reject request'), ('>>', None)], 'cost': 11, 'visited_states': 29, 'queued_states': 75, 'traversed_arcs': 157, 'fitness': 1.0}, {'alignment': [('register request', 'register request'), ('check ticket', 'check ticket'), ('>>', None), ('examine thoroughly', 'examine thoroughly'), ('>>', None), ('decide', 'decide'), ('reject request', 'reject request'), ('>>', None)], 'cost': 3, 'visited_states': 9, 'queued_states': 26, 'traversed_arcs': 45, 'fitness': 1.0}]

This list reports for each trace the corresponding alignment along with its statistics. Each trace is associated with a dictionary containing, among others, the following information: alignment (the alignment itself, as a list of couples), cost (the cost of the alignment according to the applied cost function), fitness (equal to 1 if the trace fits perfectly) and visited_states, queued_states and traversed_arcs (statistics of the underlying search).
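
As a sketch of post-processing (relying only on the output format shown above, where hidden transitions appear with label None), the moves of the first alignment can be classified as follows:

# classify each move of the first trace's alignment
for log_part, model_part in alignments[0]["alignment"]:
    if log_part == ">>":
        # move on model (on a hidden transition if the label is None)
        print("move on model:", model_part)
    elif model_part == ">>":
        print("move on log:", log_part)
    else:
        print("sync move:", log_part)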

To use a different classifier, we recall the Classifiers section in the documentation of Process Discovery. Indeed, the following code defines a custom classifier for each event of each trace in the log:

for trace in log:
    for event in trace:
        event["customClassifier"] = event["concept:name"] + event["concept:name"]

A parameters dictionary containing the activity key can be formed:

# import constants
from pm4py.util import constants
# define the activity key in the parameters
parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "customClassifier"}

Then the process model could be calculated:

# calculate process model using the given classifier
net, initial_marking, final_marking = inductive_miner.apply(log, parameters=parameters)

And eventually the replay is done:

alignments = align_factory.apply_log(log, net, initial_marking, final_marking, parameters=parameters)

To get the overall log fitness value, the following code can be used:

from pm4py.evaluation.replay_fitness import factory as replay_fitness_factory

log_fitness = replay_fitness_factory.evaluate(alignments, variant="alignments")

Using print(log_fitness) the following result is obtained:

{'percFitTraces': 100.0, 'averageFitness': 1.0}

The following parameters can also be provided to the alignments: a custom model cost function (PARAM_MODEL_COST_FUNCTION) and a custom synchronous cost function (PARAM_SYNC_COST_FUNCTION), whose usage is shown below.

Implementation of a custom model cost function and sync cost function:

model_cost_function = dict()
sync_cost_function = dict()
for t in net.transitions:
	# if the label is not None, we have a visible transition
	if t.label is not None:
		# associate cost 1000 to each move-on-model associated to visible transitions
		model_cost_function[t] = 1000
		# associate cost 0 to each move-on-log
		sync_cost_function[t] = 0
	else:
		# associate cost 1 to each move-on-model associated to hidden transitions
		model_cost_function[t] = 1

Insertion of the model cost function and sync cost function in the parameters:

parameters[pm4py.algo.conformance.alignments.versions.state_equation_a_star.PARAM_MODEL_COST_FUNCTION] = model_cost_function
parameters[pm4py.algo.conformance.alignments.versions.state_equation_a_star.PARAM_SYNC_COST_FUNCTION] = sync_cost_function

And eventually the replay is done:

alignments = align_factory.apply_log(log, net, initial_marking, final_marking, parameters=parameters)

Experimental Features

Discovery

Frequency/Performance analysis on Petri nets:

Visualization and decoration of entities:

Limiting/Filtering/Sampling entities:

Case management:

Other: