Block Diagrams with Blockdiag

Preamble

In [1]:
# used to create block diagrams
%reload_ext xdiag_magic
%xdiag_output_format svg

Introduction

You may have noticed that I programmatically generate block diagrams in many of my notebooks, such as the one below.

In [2]:
%%blockdiag
{
    orientation = portrait
    Initialisation -> Evaluation -> "Terminate?" -> Selection -> Variation -> Evaluation
    Initialisation [color = '#ffffcc']
}
blockdiag { orientation = portrait Initialisation -> Evaluation -> "Terminate?" -> Selection -> Variation -> Evaluation Initialisation [color = '#ffffcc'] } InitialisationEvaluationTerminate?SelectionVariation

I achieve this using what appears to be a relatively unknown Jupyter extension called xdiag_magic (by GitHub user kaibaesler) which provides custom magic functions for generating diagrams with blockdiag. Unfortunately, it wasn't a straight-forward conda install ... to get it up and running, but it's still quite easy.

Installation

These instructions assume you already have Jupyter Lab up and running, perhaps through Anaconda.

  1. Clone or Download this mirrored repository to your your machine. Make sure to extract it if you download it as a ZIP archive.
  2. Using the command line, navigate to the xidag_magic_extension directory, e.g. using cd.
  3. Once you're inside the directory, make sure you are within the desired environment, e.g. if you're using conda make sure you type conda activate <environment_name>.
  4. Type pip install -r requirements.txt.
  5. Type pip install -e ..

Examples

To test it, launch Jupyter Lab and open or create a notebook. You need to run the following code before generating any diagrams.

In [3]:
%reload_ext xdiag_magic
%xdiag_output_format svg

Then you can visit blockdiag's examples and try running some of their example diagrams to see the output. Your diagram code must appear in a Jupyter Lab Code cell, and be surrounded with the following.

In [4]:
%%blockdiag
{
    
}
blockdiag { }

For example:

In [5]:
%%blockdiag
{
    orientation = portrait
    A -> B -> C
}
blockdiag { orientation = portrait A -> B -> C } ABC

Single Objective Problems: Rastrigin

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation

# Optional Customisations
import plotly.io as pio              # to set shahin plot layout
pio.templates['shahin'] = pio.to_templated(
    go.Figure().update_layout(margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected" # remove when running locally 

Introduction

Before moving on, let's take some time to have a closer look at a single-objective problem. This will give us some perspective. In single-objective problems, the objective is to find a single solution which represents the global optimum in the entire search space. Determining which solutions out-perform others is a simple task when only considering a single objective because the best solution is simply the one with the highest (for maximisation problems) or lowest (for minimisation problems) objective value. Let's take the Rastrigin function as an example.

The Rastrigin Function

This is a single-objective test function which has been expressed in Equation 1.

$$ f(\mathbf{x}) = A n + \sum_{d=1}^D \left[x_i^2 - A\cos(2 \pi x_d)\right]\tag{1} $$

where $x_d \in [-5.12, 5.12]$, i.e. each problem variable should be between $-5.12$ and $5.12$. It is continuous, convex, and unimodal. It is scalable with regards to its dimensionality in the problem domain, however, we will use its two-dimensional configuration for easy visualisation throughout this notebook. Let's implement this problem in Python so we can evaluate some solutions.

In [2]:
def rastrigin(x):
    A = 10
    y = A * D + np.sum(np.square(x) - A * np.cos(2 * np.pi * x))
    return y

Now let's prepare for the initialisation of five solutions with two problem variables. We will specify the desired population size, $\mathrm{N}$, and the dimensionality of the problem domain, $\mathrm{D}$. We will also specify the lower and upper boundaries for our problem variables.

In [3]:
N = 5
D = 2
D_lower = np.ones((1, D)) * -5.12
D_upper = np.ones((1, D)) * 5.12

Let's initialise a population with random decision variables.

In [4]:
X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))
X
Out[4]:
0 1
0 -3.063373 -0.869285
1 1.593357 -1.083812
2 -1.765657 3.200566
3 -3.015333 -0.102974
4 4.715264 -3.957220

To generate $\mathrm{Y}$, we will evaluate each solution $\mathrm{X}_{\mathrm{N}}$ using the rastrigin() test function defined above.

In [5]:
Y = np.empty((0, 1))

X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))

for n in range(N):
    y = rastrigin(X.iloc[n])
    Y = np.vstack([Y, y])
    
# convert to DataFrame
Y = pd.DataFrame(Y, columns=['f'])

We only have five solutions, so it's feasible to list all our objective values.

In [6]:
Y
Out[6]:
f
0 24.550925
1 29.051883
2 51.642749
3 42.276052
4 43.339999

We can very easily select the best solution from the above population. It is simply the solution with the smallest objective value (as we are concerned with minimisation).

In [7]:
np.min(Y)
Out[7]:
f    24.550925
dtype: float64

Note

The above example relies on random numbers. This means you will see different results upon executing the code in this notebook for yourself.

Let's see if we can get a better understanding of the problem domain by generating and visualising 3000 more solutions. This means setting $\mathrm{N}=3000$. We can achieve this by repeating the same code as above with slightly different parameters to generate our population of solutions, and then evaluate them using our rastrigin() function.

In [8]:
N = 3000

X = pd.DataFrame(
    np.random.uniform(low=D_lower, high=D_upper, size=(N,D)),
    columns=['x1', 'x2']
)

Y = np.empty((0, 1))

for n in range(N):
    y = rastrigin(X.iloc[n])
    Y = np.vstack([Y, y])
    
Y = pd.DataFrame(Y, columns=['f'])

All that's left now is to visualise the results. For this, we'll use a 3D scatterplot to plot the problem variables of each solution along with each corresponding objective value. We will reserve the vertical axis for the objective value, $f$, meaning that "lower" points represent better performing solutions.

In [9]:
fig = go.Figure(layout=dict(scene = dict(xaxis_title='x1',
                                         yaxis_title='x2',
                                         zaxis_title='f')))

fig.add_scatter3d(x=X.x1, y=X.x2, z=Y.f, 
                  mode='markers',
                  marker=dict(color = -Y.f, size=4))

fig.show()

Conclusion

In this section, we covered the very basics in what we briefly covered the topic of single-objective problems using the popular Rastrigin function as an example. The most important lesson from this section is that it is trivial to determine which solution outperforms the rest when working with a single-objective problem.

In the next section, we will demonstrate why this is more difficult for multi-objective problems, and introduce some approaches to selection.

Single Objective Problems

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation

# Optional Customisations
import plotly.io as pio              # to set shahin plot layout
pio.templates['shahin'] = pio.to_templated(
    go.Figure().update_layout(margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected" # remove when running locally 

Introduction

Before moving on, let's take some time to have a closer look at a single-objective problem. This will give us some perspective. In single-objective problems, the objective is to find a single solution which represents the global optimum in the entire search space. Determining which solutions outperform others is a simple task when only considering a single-objective because the best solution is simply the one with the highest (for maximisation problems) or lowest (for minimisation problems) objective value. Let's take the Sphere function as an example.

The Sphere Function

This is a single-objective test function which has been expressed in Equation 1.

$$ f(\boldsymbol{x}) = \sum_{d=1}^{D} x_{d}^{2}\tag{1} $$

where $x_d \in [-5.12, 5.12]$, i.e. each problem variable should be between $-5.12$ and $5.12$. It is continuous, convex, and unimodal. It is scalable with regards to its dimensionality in the problem domain, however, we will use its two-dimensional configuration for easy visualisation throughout this notebook. Let's implement this problem in Python so we can evaluate some solutions.

In [2]:
def sphere(x):
    y = np.sum(np.square(x))
    return y

Now let's prepare for the initialisation of five solutions with two problem variables. We will specify the desired population size, $\mathrm{N}$, and the dimensionality of the problem domain, $\mathrm{D}$. We will also specify the lower and upper boundaries for our problem variables.

In [3]:
N = 5
D = 2
D_lower = np.ones((1, D)) * -5.12
D_upper = np.ones((1, D)) * 5.12

Let's initialise a population with random decision variables.

In [4]:
X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))
X
Out[4]:
0 1
0 0.788304 -3.793893
1 -1.902938 -3.603670
2 -4.406332 0.153391
3 5.085988 -3.172429
4 2.608262 3.743766

To generate $\mathrm{Y}$, we will evaluate each solution $\mathrm{X}_{\mathrm{N}}$ using the sphere() test function defined above.

In [5]:
Y = np.empty((0, 1))

X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))

for n in range(N):
    y = sphere(X.iloc[n])
    Y = np.vstack([Y, y])
    
# convert to DataFrame
Y = pd.DataFrame(Y, columns=['f'])

We only have five solutions, so it's feasible to list all our objective values.

In [6]:
Y
Out[6]:
f
0 5.955647
1 1.405197
2 3.452788
3 21.583944
4 0.199836

We can very easily select the best solution from the above population. It is simply the solution with the smallest objective value (as we are concerned with minimisation).

In [7]:
np.min(Y)
Out[7]:
f    0.199836
dtype: float64

Note

The above example relies on random numbers. This means you will see different results upon executing the code in this notebook for yourself.

Let's see if we can get a better understanding of the problem domain by generating and visualising 1000 more solutions. This means setting $\mathrm{N}=1000$. We can achieve this by repeating the same code as above with slightly different parameters to generate our population of solutions, and then evaluate them using our sphere() function.

In [8]:
N = 1000

X = pd.DataFrame(
    np.random.uniform(low=D_lower, high=D_upper, size=(N,D)),
    columns=['x1', 'x2']
)

Y = np.empty((0, 1))

for n in range(N):
    y = sphere(X.iloc[n])
    Y = np.vstack([Y, y])
    
Y = pd.DataFrame(Y, columns=['f'])

All that's left now is to visualise the results. For this, we'll use a 3D scatterplot to plot the problem variables of each solution along with each corresponding objective value. We will reserve the vertical axis for the objective value, $f$, meaning that "lower" points represent better performing solutions.

In [9]:
fig = go.Figure(layout=dict(scene = dict(xaxis_title='x1',
                                         yaxis_title='x2',
                                         zaxis_title='f')))

fig.add_scatter3d(x=X.x1, y=X.x2, z=Y.f, 
                  mode='markers',
                  marker=dict(color = -Y.f))

fig.show()

Conclusion

In this section, we covered the very basics on the topic of single-objective optimisation problems using the popular Sphere function as an example. The most important lesson from this section is that it is trivial to determine which solution outperforms the rest when working with a single-objective problem.

In the next section, we will demonstrate why this is more difficult for multi-objective problems, and introduce some approaches to selection.

Exercise

Repeat the same experiment in this section, but this time use the Rastrigin function (Equation 2, where $\mathrm{A} = 10$) instead of the Sphere function. $$ f(\mathbf{x}) = A n + \sum_{d=1}^D \left[x_i^2 - A\cos(2 \pi x_d)\right]\tag{2} $$

Setup Anaconda, Jupyter, and Tensorflow v1

Software Setup

We are taking a practical approach in the following sections. As such, we need the right tools and environments available in order to keep up with the examples and exercises. We will be using Python 3 along with typical packages from its scientific stack, such as numpy (for multi-dimensional containers), pandas (for DataFrames), plotly (for plotting), etc. These instructions are written for Tensorflow v1, so you will see some specific package versions listed throughout for compatibility. We will write all of our code within a Jupyter Notebook, but you are free to use other IDEs such as PyCharm or Spyder.

Figure 1 - A Jupyter Notebook being edited within Jupyter Lab.
Theme from https://github.com/shahinrostami/theme-purple-please

Get Anaconda

There are many different ways to get up and running with an environment that will facilitate our work. One approach I can recommend is to install and use Anaconda.

Anaconda® is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 1,500+ open source packages. Anaconda is free and easy to install, and it offers free community support.

https://docs.anaconda.com/anaconda/

To get up and running, just visit the Anaconda website and download a version of Anaconda for your operating system. I recommend getting the latest Anaconda version (2019.07 at the time of writing this section), and selecting a Python 3.X version.

Anaconda Download

Figure 2 - Downloading an Anaconda Distribution for your operating system.

Create Your Environment

Once Anaconda is installed, we need to create and configure our environment. Again, there are many ways to accomplish this. You can complete all the steps using the Anaconda Navigator (graphical interface), but we will use the command line interface, simply because it will give us a better report if and when something goes wrong.

If you added Anaconda to your PATH environment during the installation process, then you can run these commands directly from Terminal, Powershell, or CMD. If you didn't, then you can search for and run Anaconda Prompt.

Anaconda Prompt

Figure 3 - Searching for Anaconda Prompt on Windows 10.

Now we can create and configure our conda environment using the following commands.

conda create -n dmat python=3.6

This will create a conda environment named dmat with the latest Python 3.6 package ready to go. You should be presented with a list of packages that will be installed, and asked if you wish to proceed. To do so, just enter the character y. If this operation is successful, you should see the following output at the end:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate dmat
#
# To deactivate an active environment, use
#
#     $ conda deactivate

As the message suggests, you will need to type the following command to activate and start entering commands within our environment named dmat.

conda activate dmat

Once you do that, you should see your terminal prompt now leads with the environment name within parantheses:

(dmat) melica:~ shahin$

Note

The example above shows the macOS machine name "melica" and the user "shahin". You will see something different on your own machine, and it may appear in a different format on a different operating system such as Windows. As long as the prompt leads with "(dmat)", you are on the right track.

This will allow you to identify which environment you are currently operating in. If you restart your machine, you should be able to use conda activate dmat within your Anaconda prompt to get back into the same environment.

Install Packages

If you environment was already configured and ready, you would be able to enter the command jupyter lab to launch an instance of the Jupyter Lab IDE in the current directory. However, if we try that in our newly created environment, we will receive an error:

(dmat) melica:~ shahin$ jupyter lab
-bash: jupyter: command not found

So let's fix that. Let's install a few packages we know we will be needed:

  • Jupyter Lab
  • Numpy
  • Pandas
  • Plotly
  • Tensorflow

We will do them one-by-one to get a better idea of any errors if they occur. This time we will include the -y option which automatically says "yes" to any questions asking during the installation process.

conda install jupyterlab -y
conda install numpy=1.16 -y
conda install pandas -y
conda install plotly -y

You could do this in a single command, e.g.:

conda install jupyterlab numpy=1.16 pandas plotly -y

Let's install nodejs. This is needed to run our Jupyter Lab extension in the next section.

conda install -c conda-forge nodejs -y

Finally, let's install Tensorflow v1.

conda install -c conda-forge tensorflow=1

Install Jupyer Lab Extensions

There's one last thing we need to do before we move on, and that's installing any Jupyter Lab extensions that we may need. One particular extension that we need is the plotly extension, which will allow our Jupyter Notebooks to render our Plotly visualisations. Within your conda environment, simply run the following command:

jupyter labextension install @jupyterlab/plotly-extension

This may take some time, especially when it builds your jupyterlab assets, so keep an eye on it until you're returned control over the anaconda prompt, i.e. when you see the following:

(dmat) melica:~ shahin$

Now we're good to go!

A Quick Test

Let's test if everything is working as it should be. In your anaconda prompt, within your conda environment, run the following command

jupyter lab

This should start the Jupyter Lab server and launch a browser window with the IDE ready to use.

Jupyter Lab

Figure 4 - A fresh installation of Jupyter Lab.

Let's create a new notebook. In the Launcher tab which has opened by default, click "Python 3" under the Notebook heading. This will create a new and empty notebook named Untitled.ipynb in the current directory.

Let's try to import our packages. If everything is configured as it should be, you should see no errors. Type the following into the first cell and click "play" button to execute it and create a new cell.

In [1]:
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.express as px          # plotly express
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout
from tensorflow import keras         # neural networks API

pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'

If we followed all the instructions and didn't encounter any errors, everything should be working. The last two lines beginning with pio.templates are modifications made to the colours/margins of all figures rendered by Plotly, and do not need to be included.

Enter the following code into the new and empty cell (below the first one) to test if Plotly visualisations will render within the notebook.

In [13]:
tips = px.data.tips()
fig = px.histogram(tips, x="total_bill", color="sex")
fig.show()

If the Jupyter Lab extension is installed and functioning, you should see an interactive histogram.

Summary

In this section we've downloaded, installed, configured, and tested our environment such that we're ready to run the following examples and experiments. If you ever find that you're missing packages, you can install them in the same way as we installed numpy and the others in this section.

Using a Framework and the ZDT Test Suite

Preamble

In [1]:
# used to create block diagrams
%reload_ext xdiag_magic
%xdiag_output_format svg
    
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import platypus as plat              # multi-objective optimisation framework

Introduction

When preparing to implement multi-objective optimisation experiments, it's often more convenient to use a ready-made framework/library instead of programming everything from scratch. There are many libraries and frameworks that have been implemented in many different programming languages, but as we're using Python we will be selecting from frameworks such as DEAP, PyGMO, and Platypus.

With our focus on multi-objective optimisation, our choice is an easy one. We will choose Platypus which has a focus on multi-objective problems and optimisation.

Platypus is a framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms (MOEAs). It differs from existing optimization libraries, including PyGMO, Inspyred, DEAP, and Scipy, by providing optimization algorithms and analysis tools for multiobjective optimization.

As a first look into Platypus, let's repeat the process covered in the earlier section on "Synthetic Objective Functions and ZDT1", where we randomly initialise a solution and then evaluate it using ZDT1.

In [2]:
%%blockdiag
{
    orientation = portrait
    "Problem Variables" -> "Test Function" -> "Objective Values"
    "Test Function" [color = '#ffffcc']
}
blockdiag { orientation = portrait "Problem Variables" -> "Test Function" -> "Objective Values" "Test Function" [color = '#ffffcc'] } Problem VariablesTest FunctionObjective Values

The ZDT test function

Similar to the last time, we will be using a synthetic test problem throughout this notebook called ZDT1. It is part of the ZDT test suite, consisting of six different two-objective synthetic test problems. This is quite an old test suite, easy to solve, and very easy to visualise.

Mathematically, the ZDT11 two-objective test function can be expressed as:

$$ \begin{aligned} f_1(x_1) &= x_1 \tag{1} \\ f_2(x) &= g \cdot h \\ g(x_2,\ldots,x_{\mathrm{D}}) &= 1 + 9 \cdot \sum_{d=2}^{\mathrm{D}} \frac{x_d}{(D-1)}\\ h(f_1,g) &= 1 - \sqrt{f1/g} \end{aligned} $$

where $x$ is a solution to the problem, defined as a vector of $D$ decision variables.

$$ x= \langle x_{1},x_{2},\ldots,x_{\mathrm{D}} \rangle \tag{2} $$

and all decision variables fall between $0$ and $1$.

$$ 0 \le x_d \le 1, d=1,\ldots,\mathrm{D} \tag{3} $$

For this bi-objective test function, $f_1$ is the first objective, and $f_2$ is the second objective. This particular objective function is, by design, scalable up to any number of problem variables but is restricted to two problem objectives.

Let's start implementing this in Python, beginning with the initialisation of a solution according to Equations 2 and 3. In this case, we will have 30 problem variables $\mathrm{D}=30$.

In [3]:
D = 30
x = np.random.rand(D)
print(x)
[0.40700054 0.04071341 0.2019138  0.79142779 0.9140935  0.03581037
 0.62395881 0.85908972 0.59781233 0.12436693 0.67729958 0.31991784
 0.54294362 0.62353553 0.70098228 0.03701772 0.08176195 0.70208965
 0.40398321 0.41584089 0.71320878 0.46511698 0.28813617 0.86793829
 0.65494094 0.5299217  0.98657349 0.46780006 0.23001727 0.41083113]

Now that we have a solution to evaluate, let's implement the ZDT1 synthetic test function using Equation 1.

In [4]:
def ZDT1(x):
    f1 = x[0]  # objective 1
    g = 1 + 9 * np.sum(x[1:D] / (D-1))
    h = 1 - np.sqrt(f1 / g)
    f2 = g * h  # objective 2
    
    return [f1, f2]

Finally, let's invoke our implemented test function using our solution $x$ from earlier.

In [5]:
objective_values = ZDT1(x)
print(objective_values)
[0.4070005418391035, 3.9526573425595815]

Now we can see the two objective values that measure our solution $x$ according to the ZDT1 synthetic test function, which is a minimisation problem.

Using a Framework

We've quickly repeated our earlier exercise, where we move from our mathematical description of ZDT1 to an implementation in Python. Now, let's use the Platypus' implementation of ZDT1, which would have saves us from having to implement it in Python ourselves.

We have already imported Platypus as plat above, so to get an instance of ZDT1 all we need to do is use the object constructor.

In [6]:
problem = plat.ZDT1()

Just like that, our variable problem references an instance of the ZDT1 test problem.

Now we need to create a solution in a structure that is defined by Platypus. This solution object is what Platypus expects when performing all of the operations that it provides.

In [7]:
solution = plat.Solution(problem)

By using the Solution() constructor and passing in our earlier instantiated problem, the solution is initialised with the correct number of variables and objectives. We can check this ourselves.

In [8]:
print(f"This solution's variables are set to:\n{solution.variables}")
print(f"This solution has {len(solution.variables)} variables")
This solution's variables are set to:
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
This solution has 30 variables
In [9]:
print(f"This solution's objectives are set to:\n{solution.objectives}")
print(f"This solution has {len(solution.objectives)} objectives")
This solution's objectives are set to:
[None, None]
This solution has 2 objectives

Earlier in this section we randomly generated 30 problem variables and stored them in the variable x. Let's assign this to our variables and check that it works.

In [10]:
solution.variables = x
print(f"This solution's variables are set to:\n{solution.variables}")
This solution's variables are set to:
[0.40700054 0.04071341 0.2019138  0.79142779 0.9140935  0.03581037
 0.62395881 0.85908972 0.59781233 0.12436693 0.67729958 0.31991784
 0.54294362 0.62353553 0.70098228 0.03701772 0.08176195 0.70208965
 0.40398321 0.41584089 0.71320878 0.46511698 0.28813617 0.86793829
 0.65494094 0.5299217  0.98657349 0.46780006 0.23001727 0.41083113]

Now we can invoke the evaluate() method which will use the assigned problem to evaluate the problem variables and calculate the objective values. We can print these out afterwards to see the results.

In [11]:
solution.evaluate()
print(solution.objectives)
[0.4070005418391035, 3.9526573425595815]

These objectives values should be the same as the ones that were calculated by our own implementation of ZDT1, within some margin of error.

Conclusion

In this section, we introduced a framework for multi-objective optimisation and analysis. We used it to create an instance of the ZDT1 test problem, which we then used to initialise an empty solution. We then assigned randomly generated problem variables to this solution and evaluated it with the ZDT test function to determine the objective values.

Exercise

Using the framework introduced in this section, evaluate a number of randomly generated solutions for ZDT2, ZDT3, ZDT4, and ZDT6.


  1. E. Zitzler, K. Deb, and L. Thiele. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evolutionary Computation, 8(2):173-195, 2000 

CVSS Exploratory Data Analysis

Preamble

In [1]:
# used to create block diagrams
%reload_ext xdiag_magic
%xdiag_output_format svg
    
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation
import plotly.io as pio              # to set shahin plot layout
from wordcloud import WordCloud      # visualising word clouds
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 10]
pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected"

Dataset

In [2]:
data = pd.read_csv('https://shahinrostami.com/datasets/enisa_vuln.csv',
                   low_memory=False, index_col='id')
data['date_published'] = pd.to_datetime(data['date_published']).dt.date

data.tail()
Out[2]:
source_db source_db_id cna cvss3_bscore cvss3_severity cvss3_impact cvss3_exploitability cvss3_attack cvss3_complexity cvss3_priveleges ... EOS_product EOS_version EOS_date Type 0day Today 0day_low_y 0day_upper Today_low Today_upper
id
27466 nvd.nist.gov CVE-2018-20877 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27467 nvd.nist.gov CVE-2018-20876 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27468 nvd.nist.gov CVE-2018-20875 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27469 nvd.nist.gov CVE-2018-20874 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27470 nvd.nist.gov CVE-2018-20873 MITRE Corporation 3.3 LOW 1.4 1.8 Local Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 57 columns

Introduction

In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27471 entries, 0 to 27470
Data columns (total 57 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   source_db               27471 non-null  object 
 1   source_db_id            27471 non-null  object 
 2   cna                     27471 non-null  object 
 3   cvss3_bscore            27471 non-null  float64
 4   cvss3_severity          27471 non-null  object 
 5   cvss3_impact            27471 non-null  float64
 6   cvss3_exploitability    27471 non-null  float64
 7   cvss3_attack            27471 non-null  object 
 8   cvss3_complexity        27471 non-null  object 
 9   cvss3_priveleges        27471 non-null  object 
 10  cvss3_user_interaction  27471 non-null  object 
 11  cvss3_scope             27471 non-null  object 
 12  cvss3_confidentiality   27471 non-null  object 
 13  cvss3_integrity         27471 non-null  object 
 14  cvss3_availability      27471 non-null  object 
 15  cvss2_bscore            27471 non-null  float64
 16  cvss2_severity          27471 non-null  object 
 17  cvss2_impact            27471 non-null  float64
 18  cvss2_exploitability    27471 non-null  float64
 19  cvss2_access            27471 non-null  object 
 20  cvss2_complexity        27471 non-null  object 
 21  cvss2_authentication    27471 non-null  object 
 22  cvss2_confidentiality   27471 non-null  object 
 23  cvss2_integrity         27471 non-null  object 
 24  cvss2_availability      27471 non-null  object 
 25  cwe                     27471 non-null  object 
 26  capec                   21335 non-null  object 
 27  cpe                     27462 non-null  object 
 28  description             27471 non-null  object 
 29  n_exploits              27471 non-null  int64  
 30  technique_id            8077 non-null   object 
 31  tactic                  8067 non-null   object 
 32  date_published          27471 non-null  object 
 33  date_modified           27471 non-null  object 
 34  history_summary         308 non-null    object 
 35  date_exploit            2371 non-null   object 
 36  0day_low_x              3390 non-null   object 
 37  0day_high               3353 non-null   float64
 38  today_low               3389 non-null   float64
 39  today_high              3389 non-null   float64
 40  exp_type                2371 non-null   object 
 41  platform                2371 non-null   object 
 42  exp_verified            2371 non-null   object 
 43  vendor                  23110 non-null  object 
 44  product                 23108 non-null  object 
 45  sector                  136 non-null    object 
 46  incident                2169 non-null   float64
 47  EOS_product             381 non-null    object 
 48  EOS_version             381 non-null    object 
 49  EOS_date                381 non-null    object 
 50  Type                    3369 non-null   object 
 51  0day                    3369 non-null   object 
 52  Today                   3369 non-null   object 
 53  0day_low_y              3369 non-null   float64
 54  0day_upper              3369 non-null   float64
 55  Today_low               3369 non-null   float64
 56  Today_upper             3369 non-null   float64
dtypes: float64(14), int64(1), object(42)
memory usage: 12.2+ MB
In [4]:
print(f"Earliest date {data.date_published.min()}")
print(f"Latest date {data.date_published.max()}")
print(f"Over {(data.date_published.max() - data.date_published.min()).days} days")
Earliest date 2018-01-01
Latest date 2019-08-30
Over 606 days

Two numerical features, cvss_score3 and cvss_score2. There is a difference in severity classification and base score range between the two metrics. E.g. a score of $> 9.0$ is classified as "Critical" in CVSS3, but only "High" in CVSS2.

In [5]:
data.describe()
Out[5]:
cvss3_bscore cvss3_impact cvss3_exploitability cvss2_bscore cvss2_impact cvss2_exploitability n_exploits 0day_high today_low today_high incident 0day_low_y 0day_upper Today_low Today_upper
count 27471.000000 27471.000000 27471.000000 27471.000000 27471.000000 27471.000000 27471.000000 3353.000000 3389.000000 3389.000000 2169.000000 3369.000000 3369.0 3369.000000 3369.000000
mean 7.339012 4.416301 2.783615 5.749459 5.229668 8.122318 0.131666 24634.059052 3375.627029 8068.161700 1.946519 13223.211636 inf 3538.438706 8420.302760
std 1.612830 1.479722 0.913636 1.888787 2.552895 2.005292 1.286410 26930.853685 4372.583724 9670.596732 2.327816 18106.114904 NaN 4605.882592 10109.008754
min 1.800000 1.400000 0.100000 1.200000 2.900000 1.200000 0.000000 1000.000000 0.000000 1000.000000 0.000000 0.000000 1000.0 0.000000 1000.000000
25% 6.100000 3.600000 1.800000 4.300000 2.900000 8.000000 0.000000 5000.000000 0.000000 1000.000000 1.000000 2000.000000 5000.0 0.000000 1000.000000
50% 7.500000 3.600000 2.800000 5.000000 4.900000 8.600000 0.000000 25000.000000 2000.000000 5000.000000 2.000000 10000.000000 25000.0 2000.000000 5000.000000
75% 8.800000 5.900000 3.900000 7.200000 6.400000 10.000000 0.000000 25000.000000 5000.000000 10000.000000 2.000000 10000.000000 25000.0 5000.000000 10000.000000
max 10.000000 6.000000 3.900000 10.000000 10.000000 10.000000 123.000000 100000.000000 25000.000000 50000.000000 57.000000 100000.000000 inf 25000.000000 50000.000000

How many missing vulnerability scores?

In [6]:
data.isna().sum()
Out[6]:
source_db                     0
source_db_id                  0
cna                           0
cvss3_bscore                  0
cvss3_severity                0
cvss3_impact                  0
cvss3_exploitability          0
cvss3_attack                  0
cvss3_complexity              0
cvss3_priveleges              0
cvss3_user_interaction        0
cvss3_scope                   0
cvss3_confidentiality         0
cvss3_integrity               0
cvss3_availability            0
cvss2_bscore                  0
cvss2_severity                0
cvss2_impact                  0
cvss2_exploitability          0
cvss2_access                  0
cvss2_complexity              0
cvss2_authentication          0
cvss2_confidentiality         0
cvss2_integrity               0
cvss2_availability            0
cwe                           0
capec                      6136
cpe                           9
description                   0
n_exploits                    0
technique_id              19394
tactic                    19404
date_published                0
date_modified                 0
history_summary           27163
date_exploit              25100
0day_low_x                24081
0day_high                 24118
today_low                 24082
today_high                24082
exp_type                  25100
platform                  25100
exp_verified              25100
vendor                     4361
product                    4363
sector                    27335
incident                  25302
EOS_product               27090
EOS_version               27090
EOS_date                  27090
Type                      24102
0day                      24102
Today                     24102
0day_low_y                24102
0day_upper                24102
Today_low                 24102
Today_upper               24102
dtype: int64
In [7]:
fig = go.Figure()

fig.add_trace(go.Box(y=data.cvss3_bscore, name='CVSS3'))
fig.add_trace(go.Box(y=data.cvss2_bscore, name='CVSS2'))

fig.show()

CVSS distributions

In [8]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=data.cvss3_bscore, name='CVSS3'))
fig.add_trace(go.Histogram(x=data.cvss2_bscore, name='CVSS2'))

fig.update_traces(opacity=0.75)
fig.show()