# Block Diagrams with Blockdiag

## Preamble¶

In [1]:
# used to create block diagrams
%xdiag_output_format svg


## Introduction¶

You may have noticed that I programmatically generate block diagrams in many of my notebooks, such as the one below.

In [2]:
%%blockdiag
{
orientation = portrait
Initialisation -> Evaluation -> "Terminate?" -> Selection -> Variation -> Evaluation
Initialisation [color = '#ffffcc']
}


I achieve this using what appears to be a relatively unknown Jupyter extension called xdiag_magic (by GitHub user kaibaesler) which provides custom magic functions for generating diagrams with blockdiag. Unfortunately, it wasn't a straight-forward conda install ... to get it up and running, but it's still quite easy.

## Installation¶

These instructions assume you already have Jupyter Lab up and running, perhaps through Anaconda.

2. Using the command line, navigate to the xidag_magic_extension directory, e.g. using cd.
3. Once you're inside the directory, make sure you are within the desired environment, e.g. if you're using conda make sure you type conda activate <environment_name>.
4. Type pip install -r requirements.txt.
5. Type pip install -e ..

## Examples¶

To test it, launch Jupyter Lab and open or create a notebook. You need to run the following code before generating any diagrams.

In [3]:
%reload_ext xdiag_magic
%xdiag_output_format svg


Then you can visit blockdiag's examples and try running some of their example diagrams to see the output. Your diagram code must appear in a Jupyter Lab Code cell, and be surrounded with the following.

In [4]:
%%blockdiag
{

}


For example:

In [5]:
%%blockdiag
{
orientation = portrait
A -> B -> C
}


# Single Objective Problems: Rastrigin

## Preamble¶

In [1]:
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation

# Optional Customisations
import plotly.io as pio              # to set shahin plot layout
pio.templates['shahin'] = pio.to_templated(
go.Figure().update_layout(margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected" # remove when running locally


## Introduction¶

Before moving on, let's take some time to have a closer look at a single-objective problem. This will give us some perspective. In single-objective problems, the objective is to find a single solution which represents the global optimum in the entire search space. Determining which solutions out-perform others is a simple task when only considering a single objective because the best solution is simply the one with the highest (for maximisation problems) or lowest (for minimisation problems) objective value. Let's take the Rastrigin function as an example.

## The Rastrigin Function¶

This is a single-objective test function which has been expressed in Equation 1.

$$f(\mathbf{x}) = A n + \sum_{d=1}^D \left[x_i^2 - A\cos(2 \pi x_d)\right]\tag{1}$$

where $x_d \in [-5.12, 5.12]$, i.e. each problem variable should be between $-5.12$ and $5.12$. It is continuous, convex, and unimodal. It is scalable with regards to its dimensionality in the problem domain, however, we will use its two-dimensional configuration for easy visualisation throughout this notebook. Let's implement this problem in Python so we can evaluate some solutions.

In [2]:
def rastrigin(x):
A = 10
y = A * D + np.sum(np.square(x) - A * np.cos(2 * np.pi * x))
return y


Now let's prepare for the initialisation of five solutions with two problem variables. We will specify the desired population size, $\mathrm{N}$, and the dimensionality of the problem domain, $\mathrm{D}$. We will also specify the lower and upper boundaries for our problem variables.

In [3]:
N = 5
D = 2
D_lower = np.ones((1, D)) * -5.12
D_upper = np.ones((1, D)) * 5.12


Let's initialise a population with random decision variables.

In [4]:
X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))
X

Out[4]:
0 1
0 -3.063373 -0.869285
1 1.593357 -1.083812
2 -1.765657 3.200566
3 -3.015333 -0.102974
4 4.715264 -3.957220

To generate $\mathrm{Y}$, we will evaluate each solution $\mathrm{X}_{\mathrm{N}}$ using the rastrigin() test function defined above.

In [5]:
Y = np.empty((0, 1))

X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))

for n in range(N):
y = rastrigin(X.iloc[n])
Y = np.vstack([Y, y])

# convert to DataFrame
Y = pd.DataFrame(Y, columns=['f'])


We only have five solutions, so it's feasible to list all our objective values.

In [6]:
Y

Out[6]:
f
0 24.550925
1 29.051883
2 51.642749
3 42.276052
4 43.339999

We can very easily select the best solution from the above population. It is simply the solution with the smallest objective value (as we are concerned with minimisation).

In [7]:
np.min(Y)

Out[7]:
f    24.550925
dtype: float64

Note

The above example relies on random numbers. This means you will see different results upon executing the code in this notebook for yourself.

Let's see if we can get a better understanding of the problem domain by generating and visualising 3000 more solutions. This means setting $\mathrm{N}=3000$. We can achieve this by repeating the same code as above with slightly different parameters to generate our population of solutions, and then evaluate them using our rastrigin() function.

In [8]:
N = 3000

X = pd.DataFrame(
np.random.uniform(low=D_lower, high=D_upper, size=(N,D)),
columns=['x1', 'x2']
)

Y = np.empty((0, 1))

for n in range(N):
y = rastrigin(X.iloc[n])
Y = np.vstack([Y, y])

Y = pd.DataFrame(Y, columns=['f'])


All that's left now is to visualise the results. For this, we'll use a 3D scatterplot to plot the problem variables of each solution along with each corresponding objective value. We will reserve the vertical axis for the objective value, $f$, meaning that "lower" points represent better performing solutions.

In [9]:
fig = go.Figure(layout=dict(scene = dict(xaxis_title='x1',
yaxis_title='x2',
zaxis_title='f')))

mode='markers',
marker=dict(color = -Y.f, size=4))

fig.show()


## Conclusion¶

In this section, we covered the very basics in what we briefly covered the topic of single-objective problems using the popular Rastrigin function as an example. The most important lesson from this section is that it is trivial to determine which solution outperforms the rest when working with a single-objective problem.

In the next section, we will demonstrate why this is more difficult for multi-objective problems, and introduce some approaches to selection.

# Single Objective Problems

## Preamble¶

In [1]:
import numpy as np                   # for multi-dimensional containers
import pandas as pd                  # for DataFrames
import plotly.graph_objects as go    # for data visualisation

# Optional Customisations
import plotly.io as pio              # to set shahin plot layout
pio.templates['shahin'] = pio.to_templated(
go.Figure().update_layout(margin=dict(t=0,r=0,b=40,l=40))).layout.template
pio.templates.default = 'shahin'
pio.renderers.default = "notebook_connected" # remove when running locally


## Introduction¶

Before moving on, let's take some time to have a closer look at a single-objective problem. This will give us some perspective. In single-objective problems, the objective is to find a single solution which represents the global optimum in the entire search space. Determining which solutions outperform others is a simple task when only considering a single-objective because the best solution is simply the one with the highest (for maximisation problems) or lowest (for minimisation problems) objective value. Let's take the Sphere function as an example.

## The Sphere Function¶

This is a single-objective test function which has been expressed in Equation 1.

$$f(\boldsymbol{x}) = \sum_{d=1}^{D} x_{d}^{2}\tag{1}$$

where $x_d \in [-5.12, 5.12]$, i.e. each problem variable should be between $-5.12$ and $5.12$. It is continuous, convex, and unimodal. It is scalable with regards to its dimensionality in the problem domain, however, we will use its two-dimensional configuration for easy visualisation throughout this notebook. Let's implement this problem in Python so we can evaluate some solutions.

In [2]:
def sphere(x):
y = np.sum(np.square(x))
return y


Now let's prepare for the initialisation of five solutions with two problem variables. We will specify the desired population size, $\mathrm{N}$, and the dimensionality of the problem domain, $\mathrm{D}$. We will also specify the lower and upper boundaries for our problem variables.

In [3]:
N = 5
D = 2
D_lower = np.ones((1, D)) * -5.12
D_upper = np.ones((1, D)) * 5.12


Let's initialise a population with random decision variables.

In [4]:
X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))
X

Out[4]:
0 1
0 0.788304 -3.793893
1 -1.902938 -3.603670
2 -4.406332 0.153391
3 5.085988 -3.172429
4 2.608262 3.743766

To generate $\mathrm{Y}$, we will evaluate each solution $\mathrm{X}_{\mathrm{N}}$ using the sphere() test function defined above.

In [5]:
Y = np.empty((0, 1))

X = pd.DataFrame(np.random.uniform(low=D_lower, high=D_upper, size=(N,D)))

for n in range(N):
y = sphere(X.iloc[n])
Y = np.vstack([Y, y])

# convert to DataFrame
Y = pd.DataFrame(Y, columns=['f'])


We only have five solutions, so it's feasible to list all our objective values.

In [6]:
Y

Out[6]:
f
0 5.955647
1 1.405197
2 3.452788
3 21.583944
4 0.199836

We can very easily select the best solution from the above population. It is simply the solution with the smallest objective value (as we are concerned with minimisation).

In [7]:
np.min(Y)

Out[7]:
f    0.199836
dtype: float64

Note

The above example relies on random numbers. This means you will see different results upon executing the code in this notebook for yourself.

Let's see if we can get a better understanding of the problem domain by generating and visualising 1000 more solutions. This means setting $\mathrm{N}=1000$. We can achieve this by repeating the same code as above with slightly different parameters to generate our population of solutions, and then evaluate them using our sphere() function.

In [8]:
N = 1000

X = pd.DataFrame(
np.random.uniform(low=D_lower, high=D_upper, size=(N,D)),
columns=['x1', 'x2']
)

Y = np.empty((0, 1))

for n in range(N):
y = sphere(X.iloc[n])
Y = np.vstack([Y, y])

Y = pd.DataFrame(Y, columns=['f'])


All that's left now is to visualise the results. For this, we'll use a 3D scatterplot to plot the problem variables of each solution along with each corresponding objective value. We will reserve the vertical axis for the objective value, $f$, meaning that "lower" points represent better performing solutions.

In [9]:
fig = go.Figure(layout=dict(scene = dict(xaxis_title='x1',
yaxis_title='x2',
zaxis_title='f')))

mode='markers',
marker=dict(color = -Y.f))

fig.show()


## Conclusion¶

In this section, we covered the very basics on the topic of single-objective optimisation problems using the popular Sphere function as an example. The most important lesson from this section is that it is trivial to determine which solution outperforms the rest when working with a single-objective problem.

In the next section, we will demonstrate why this is more difficult for multi-objective problems, and introduce some approaches to selection.

Exercise

Repeat the same experiment in this section, but this time use the Rastrigin function (Equation 2, where $\mathrm{A} = 10$) instead of the Sphere function. $$f(\mathbf{x}) = A n + \sum_{d=1}^D \left[x_i^2 - A\cos(2 \pi x_d)\right]\tag{2}$$

# Setup Anaconda, Jupyter, and Tensorflow v1

## Software Setup¶

We are taking a practical approach in the following sections. As such, we need the right tools and environments available in order to keep up with the examples and exercises. We will be using Python 3 along with typical packages from its scientific stack, such as numpy (for multi-dimensional containers), pandas (for DataFrames), plotly (for plotting), etc. These instructions are written for Tensorflow v1, so you will see some specific package versions listed throughout for compatibility. We will write all of our code within a Jupyter Notebook, but you are free to use other IDEs such as PyCharm or Spyder.

Figure 1 - A Jupyter Notebook being edited within Jupyter Lab.

## Get Anaconda¶

There are many different ways to get up and running with an environment that will facilitate our work. One approach I can recommend is to install and use Anaconda.

Anaconda® is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 1,500+ open source packages. Anaconda is free and easy to install, and it offers free community support.

https://docs.anaconda.com/anaconda/

To get up and running, just visit the Anaconda website and download a version of Anaconda for your operating system. I recommend getting the latest Anaconda version (2019.07 at the time of writing this section), and selecting a Python 3.X version.

Once Anaconda is installed, we need to create and configure our environment. Again, there are many ways to accomplish this. You can complete all the steps using the Anaconda Navigator (graphical interface), but we will use the command line interface, simply because it will give us a better report if and when something goes wrong.

If you added Anaconda to your PATH environment during the installation process, then you can run these commands directly from Terminal, Powershell, or CMD. If you didn't, then you can search for and run Anaconda Prompt.

Figure 3 - Searching for Anaconda Prompt on Windows 10.

Now we can create and configure our conda environment using the following commands.

conda create -n dmat python=3.6


This will create a conda environment named dmat with the latest Python 3.6 package ready to go. You should be presented with a list of packages that will be installed, and asked if you wish to proceed. To do so, just enter the character y. If this operation is successful, you should see the following output at the end:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $conda activate dmat # # To deactivate an active environment, use # #$ conda deactivate


As the message suggests, you will need to type the following command to activate and start entering commands within our environment named dmat.

conda activate dmat


Once you do that, you should see your terminal prompt now leads with the environment name within parantheses:

(dmat) melica:~ shahin$ Note The example above shows the macOS machine name "melica" and the user "shahin". You will see something different on your own machine, and it may appear in a different format on a different operating system such as Windows. As long as the prompt leads with "(dmat)", you are on the right track. This will allow you to identify which environment you are currently operating in. If you restart your machine, you should be able to use conda activate dmat within your Anaconda prompt to get back into the same environment. ## Install Packages¶ If you environment was already configured and ready, you would be able to enter the command jupyter lab to launch an instance of the Jupyter Lab IDE in the current directory. However, if we try that in our newly created environment, we will receive an error: (dmat) melica:~ shahin$ jupyter lab


So let's fix that. Let's install a few packages we know we will be needed:

• Jupyter Lab
• Numpy
• Pandas
• Plotly
• Tensorflow

We will do them one-by-one to get a better idea of any errors if they occur. This time we will include the -y option which automatically says "yes" to any questions asking during the installation process.

conda install jupyterlab -y
conda install numpy=1.16 -y
conda install pandas -y
conda install plotly -y


You could do this in a single command, e.g.:

conda install jupyterlab numpy=1.16 pandas plotly -y


Let's install nodejs. This is needed to run our Jupyter Lab extension in the next section.

conda install -c conda-forge nodejs -y


Finally, let's install Tensorflow v1.

conda install -c conda-forge tensorflow=1


## Install Jupyer Lab Extensions¶

There's one last thing we need to do before we move on, and that's installing any Jupyter Lab extensions that we may need. One particular extension that we need is the plotly extension, which will allow our Jupyter Notebooks to render our Plotly visualisations. Within your conda environment, simply run the following command:

jupyter labextension install @jupyterlab/plotly-extension


This may take some time, especially when it builds your jupyterlab assets, so keep an eye on it until you're returned control over the anaconda prompt, i.e. when you see the following:

(dmat) melica:~ shahin Now we're good to go! ## A Quick Test¶ Let's test if everything is working as it should be. In your anaconda prompt, within your conda environment, run the following command jupyter lab  This should start the Jupyter Lab server and launch a browser window with the IDE ready to use. Figure 4 - A fresh installation of Jupyter Lab. Let's create a new notebook. In the Launcher tab which has opened by default, click "Python 3" under the Notebook heading. This will create a new and empty notebook named Untitled.ipynb in the current directory. Let's try to import our packages. If everything is configured as it should be, you should see no errors. Type the following into the first cell and click "play" button to execute it and create a new cell. In [1]: import numpy as np # for multi-dimensional containers import pandas as pd # for DataFrames import plotly.express as px # plotly express import plotly.graph_objects as go # for data visualisation import plotly.io as pio # to set shahin plot layout from tensorflow import keras # neural networks API pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),margin=dict(t=0,r=0,b=40,l=40))).layout.template pio.templates.default = 'shahin'  If we followed all the instructions and didn't encounter any errors, everything should be working. The last two lines beginning with pio.templates are modifications made to the colours/margins of all figures rendered by Plotly, and do not need to be included. Enter the following code into the new and empty cell (below the first one) to test if Plotly visualisations will render within the notebook. In [13]: tips = px.data.tips() fig = px.histogram(tips, x="total_bill", color="sex") fig.show()  If the Jupyter Lab extension is installed and functioning, you should see an interactive histogram. ## Summary¶ In this section we've downloaded, installed, configured, and tested our environment such that we're ready to run the following examples and experiments. If you ever find that you're missing packages, you can install them in the same way as we installed numpy and the others in this section. # Using a Framework and the ZDT Test Suite ## Preamble¶ In [1]: # used to create block diagrams %reload_ext xdiag_magic %xdiag_output_format svg import numpy as np # for multi-dimensional containers import pandas as pd # for DataFrames import platypus as plat # multi-objective optimisation framework  ## Introduction¶ When preparing to implement multi-objective optimisation experiments, it's often more convenient to use a ready-made framework/library instead of programming everything from scratch. There are many libraries and frameworks that have been implemented in many different programming languages, but as we're using Python we will be selecting from frameworks such as DEAP, PyGMO, and Platypus. With our focus on multi-objective optimisation, our choice is an easy one. We will choose Platypus which has a focus on multi-objective problems and optimisation. Platypus is a framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms (MOEAs). It differs from existing optimization libraries, including PyGMO, Inspyred, DEAP, and Scipy, by providing optimization algorithms and analysis tools for multiobjective optimization. As a first look into Platypus, let's repeat the process covered in the earlier section on "Synthetic Objective Functions and ZDT1", where we randomly initialise a solution and then evaluate it using ZDT1. In [2]: %%blockdiag { orientation = portrait "Problem Variables" -> "Test Function" -> "Objective Values" "Test Function" [color = '#ffffcc'] }  ## The ZDT test function¶ Similar to the last time, we will be using a synthetic test problem throughout this notebook called ZDT1. It is part of the ZDT test suite, consisting of six different two-objective synthetic test problems. This is quite an old test suite, easy to solve, and very easy to visualise. Mathematically, the ZDT11 two-objective test function can be expressed as: \begin{aligned} f_1(x_1) &= x_1 \tag{1} \\ f_2(x) &= g \cdot h \\ g(x_2,\ldots,x_{\mathrm{D}}) &= 1 + 9 \cdot \sum_{d=2}^{\mathrm{D}} \frac{x_d}{(D-1)}\\ h(f_1,g) &= 1 - \sqrt{f1/g} \end{aligned} wherex$is a solution to the problem, defined as a vector of$D$decision variables. $$x= \langle x_{1},x_{2},\ldots,x_{\mathrm{D}} \rangle \tag{2}$$ and all decision variables fall between$0$and$1$. $$0 \le x_d \le 1, d=1,\ldots,\mathrm{D} \tag{3}$$ For this bi-objective test function,$f_1$is the first objective, and$f_2$is the second objective. This particular objective function is, by design, scalable up to any number of problem variables but is restricted to two problem objectives. Let's start implementing this in Python, beginning with the initialisation of a solution according to Equations 2 and 3. In this case, we will have 30 problem variables$\mathrm{D}=30$. In [3]: D = 30 x = np.random.rand(D) print(x)  [0.40700054 0.04071341 0.2019138 0.79142779 0.9140935 0.03581037 0.62395881 0.85908972 0.59781233 0.12436693 0.67729958 0.31991784 0.54294362 0.62353553 0.70098228 0.03701772 0.08176195 0.70208965 0.40398321 0.41584089 0.71320878 0.46511698 0.28813617 0.86793829 0.65494094 0.5299217 0.98657349 0.46780006 0.23001727 0.41083113]  Now that we have a solution to evaluate, let's implement the ZDT1 synthetic test function using Equation 1. In [4]: def ZDT1(x): f1 = x[0] # objective 1 g = 1 + 9 * np.sum(x[1:D] / (D-1)) h = 1 - np.sqrt(f1 / g) f2 = g * h # objective 2 return [f1, f2]  Finally, let's invoke our implemented test function using our solution$x$from earlier. In [5]: objective_values = ZDT1(x) print(objective_values)  [0.4070005418391035, 3.9526573425595815]  Now we can see the two objective values that measure our solution$x$according to the ZDT1 synthetic test function, which is a minimisation problem. ## Using a Framework¶ We've quickly repeated our earlier exercise, where we move from our mathematical description of ZDT1 to an implementation in Python. Now, let's use the Platypus' implementation of ZDT1, which would have saves us from having to implement it in Python ourselves. We have already imported Platypus as plat above, so to get an instance of ZDT1 all we need to do is use the object constructor. In [6]: problem = plat.ZDT1()  Just like that, our variable problem references an instance of the ZDT1 test problem. Now we need to create a solution in a structure that is defined by Platypus. This solution object is what Platypus expects when performing all of the operations that it provides. In [7]: solution = plat.Solution(problem)  By using the Solution() constructor and passing in our earlier instantiated problem, the solution is initialised with the correct number of variables and objectives. We can check this ourselves. In [8]: print(f"This solution's variables are set to:\n{solution.variables}") print(f"This solution has {len(solution.variables)} variables")  This solution's variables are set to: [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None] This solution has 30 variables  In [9]: print(f"This solution's objectives are set to:\n{solution.objectives}") print(f"This solution has {len(solution.objectives)} objectives")  This solution's objectives are set to: [None, None] This solution has 2 objectives  Earlier in this section we randomly generated 30 problem variables and stored them in the variable x. Let's assign this to our variables and check that it works. In [10]: solution.variables = x print(f"This solution's variables are set to:\n{solution.variables}")  This solution's variables are set to: [0.40700054 0.04071341 0.2019138 0.79142779 0.9140935 0.03581037 0.62395881 0.85908972 0.59781233 0.12436693 0.67729958 0.31991784 0.54294362 0.62353553 0.70098228 0.03701772 0.08176195 0.70208965 0.40398321 0.41584089 0.71320878 0.46511698 0.28813617 0.86793829 0.65494094 0.5299217 0.98657349 0.46780006 0.23001727 0.41083113]  Now we can invoke the evaluate() method which will use the assigned problem to evaluate the problem variables and calculate the objective values. We can print these out afterwards to see the results. In [11]: solution.evaluate() print(solution.objectives)  [0.4070005418391035, 3.9526573425595815]  These objectives values should be the same as the ones that were calculated by our own implementation of ZDT1, within some margin of error. ## Conclusion¶ In this section, we introduced a framework for multi-objective optimisation and analysis. We used it to create an instance of the ZDT1 test problem, which we then used to initialise an empty solution. We then assigned randomly generated problem variables to this solution and evaluated it with the ZDT test function to determine the objective values. Exercise Using the framework introduced in this section, evaluate a number of randomly generated solutions for ZDT2, ZDT3, ZDT4, and ZDT6. 1. E. Zitzler, K. Deb, and L. Thiele. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evolutionary Computation, 8(2):173-195, 2000 # CVSS Exploratory Data Analysis ## Preamble¶ In [1]: # used to create block diagrams %reload_ext xdiag_magic %xdiag_output_format svg import numpy as np # for multi-dimensional containers import pandas as pd # for DataFrames import plotly.graph_objects as go # for data visualisation import plotly.io as pio # to set shahin plot layout from wordcloud import WordCloud # visualising word clouds import matplotlib.pyplot as plt plt.rcParams['figure.figsize'] = [10, 10] pio.templates['shahin'] = pio.to_templated(go.Figure().update_layout(legend=dict(orientation="h",y=1.1, x=.5, xanchor='center'),margin=dict(t=0,r=0,b=40,l=40))).layout.template pio.templates.default = 'shahin' pio.renderers.default = "notebook_connected"  ## Dataset¶ In [2]: data = pd.read_csv('https://shahinrostami.com/datasets/enisa_vuln.csv', low_memory=False, index_col='id') data['date_published'] = pd.to_datetime(data['date_published']).dt.date data.tail()  Out[2]: source_db source_db_id cna cvss3_bscore cvss3_severity cvss3_impact cvss3_exploitability cvss3_attack cvss3_complexity cvss3_priveleges ... EOS_product EOS_version EOS_date Type 0day Today 0day_low_y 0day_upper Today_low Today_upper id 27466 nvd.nist.gov CVE-2018-20877 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27467 nvd.nist.gov CVE-2018-20876 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27468 nvd.nist.gov CVE-2018-20875 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27469 nvd.nist.gov CVE-2018-20874 MITRE Corporation 5.4 MEDIUM 2.7 2.3 Network Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27470 nvd.nist.gov CVE-2018-20873 MITRE Corporation 3.3 LOW 1.4 1.8 Local Low Low ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 rows × 57 columns ### Introduction¶ In [3]: data.info()  <class 'pandas.core.frame.DataFrame'> Int64Index: 27471 entries, 0 to 27470 Data columns (total 57 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 source_db 27471 non-null object 1 source_db_id 27471 non-null object 2 cna 27471 non-null object 3 cvss3_bscore 27471 non-null float64 4 cvss3_severity 27471 non-null object 5 cvss3_impact 27471 non-null float64 6 cvss3_exploitability 27471 non-null float64 7 cvss3_attack 27471 non-null object 8 cvss3_complexity 27471 non-null object 9 cvss3_priveleges 27471 non-null object 10 cvss3_user_interaction 27471 non-null object 11 cvss3_scope 27471 non-null object 12 cvss3_confidentiality 27471 non-null object 13 cvss3_integrity 27471 non-null object 14 cvss3_availability 27471 non-null object 15 cvss2_bscore 27471 non-null float64 16 cvss2_severity 27471 non-null object 17 cvss2_impact 27471 non-null float64 18 cvss2_exploitability 27471 non-null float64 19 cvss2_access 27471 non-null object 20 cvss2_complexity 27471 non-null object 21 cvss2_authentication 27471 non-null object 22 cvss2_confidentiality 27471 non-null object 23 cvss2_integrity 27471 non-null object 24 cvss2_availability 27471 non-null object 25 cwe 27471 non-null object 26 capec 21335 non-null object 27 cpe 27462 non-null object 28 description 27471 non-null object 29 n_exploits 27471 non-null int64 30 technique_id 8077 non-null object 31 tactic 8067 non-null object 32 date_published 27471 non-null object 33 date_modified 27471 non-null object 34 history_summary 308 non-null object 35 date_exploit 2371 non-null object 36 0day_low_x 3390 non-null object 37 0day_high 3353 non-null float64 38 today_low 3389 non-null float64 39 today_high 3389 non-null float64 40 exp_type 2371 non-null object 41 platform 2371 non-null object 42 exp_verified 2371 non-null object 43 vendor 23110 non-null object 44 product 23108 non-null object 45 sector 136 non-null object 46 incident 2169 non-null float64 47 EOS_product 381 non-null object 48 EOS_version 381 non-null object 49 EOS_date 381 non-null object 50 Type 3369 non-null object 51 0day 3369 non-null object 52 Today 3369 non-null object 53 0day_low_y 3369 non-null float64 54 0day_upper 3369 non-null float64 55 Today_low 3369 non-null float64 56 Today_upper 3369 non-null float64 dtypes: float64(14), int64(1), object(42) memory usage: 12.2+ MB  In [4]: print(f"Earliest date {data.date_published.min()}") print(f"Latest date {data.date_published.max()}") print(f"Over {(data.date_published.max() - data.date_published.min()).days} days")  Earliest date 2018-01-01 Latest date 2019-08-30 Over 606 days  Two numerical features, cvss_score3 and cvss_score2. There is a difference in severity classification and base score range between the two metrics. E.g. a score of$> 9.0\$ is classified as "Critical" in CVSS3, but only "High" in CVSS2.

In [5]:
data.describe()

Out[5]:
cvss3_bscore cvss3_impact cvss3_exploitability cvss2_bscore cvss2_impact cvss2_exploitability n_exploits 0day_high today_low today_high incident 0day_low_y 0day_upper Today_low Today_upper
count 27471.000000 27471.000000 27471.000000 27471.000000 27471.000000 27471.000000 27471.000000 3353.000000 3389.000000 3389.000000 2169.000000 3369.000000 3369.0 3369.000000 3369.000000
mean 7.339012 4.416301 2.783615 5.749459 5.229668 8.122318 0.131666 24634.059052 3375.627029 8068.161700 1.946519 13223.211636 inf 3538.438706 8420.302760
std 1.612830 1.479722 0.913636 1.888787 2.552895 2.005292 1.286410 26930.853685 4372.583724 9670.596732 2.327816 18106.114904 NaN 4605.882592 10109.008754
min 1.800000 1.400000 0.100000 1.200000 2.900000 1.200000 0.000000 1000.000000 0.000000 1000.000000 0.000000 0.000000 1000.0 0.000000 1000.000000
25% 6.100000 3.600000 1.800000 4.300000 2.900000 8.000000 0.000000 5000.000000 0.000000 1000.000000 1.000000 2000.000000 5000.0 0.000000 1000.000000
50% 7.500000 3.600000 2.800000 5.000000 4.900000 8.600000 0.000000 25000.000000 2000.000000 5000.000000 2.000000 10000.000000 25000.0 2000.000000 5000.000000
75% 8.800000 5.900000 3.900000 7.200000 6.400000 10.000000 0.000000 25000.000000 5000.000000 10000.000000 2.000000 10000.000000 25000.0 5000.000000 10000.000000
max 10.000000 6.000000 3.900000 10.000000 10.000000 10.000000 123.000000 100000.000000 25000.000000 50000.000000 57.000000 100000.000000 inf 25000.000000 50000.000000

How many missing vulnerability scores?

In [6]:
data.isna().sum()

Out[6]:
source_db                     0
source_db_id                  0
cna                           0
cvss3_bscore                  0
cvss3_severity                0
cvss3_impact                  0
cvss3_exploitability          0
cvss3_attack                  0
cvss3_complexity              0
cvss3_priveleges              0
cvss3_user_interaction        0
cvss3_scope                   0
cvss3_confidentiality         0
cvss3_integrity               0
cvss3_availability            0
cvss2_bscore                  0
cvss2_severity                0
cvss2_impact                  0
cvss2_exploitability          0
cvss2_access                  0
cvss2_complexity              0
cvss2_authentication          0
cvss2_confidentiality         0
cvss2_integrity               0
cvss2_availability            0
cwe                           0
capec                      6136
cpe                           9
description                   0
n_exploits                    0
technique_id              19394
tactic                    19404
date_published                0
date_modified                 0
history_summary           27163
date_exploit              25100
0day_low_x                24081
0day_high                 24118
today_low                 24082
today_high                24082
exp_type                  25100
platform                  25100
exp_verified              25100
vendor                     4361
product                    4363
sector                    27335
incident                  25302
EOS_product               27090
EOS_version               27090
EOS_date                  27090
Type                      24102
0day                      24102
Today                     24102
0day_low_y                24102
0day_upper                24102
Today_low                 24102
Today_upper               24102
dtype: int64
In [7]:
fig = go.Figure()

fig.show()


CVSS distributions

In [8]:
fig = go.Figure()