Data Analysis with Rust Notebooks

Box Plots at the Olympics

Preamble

In [2]:
:dep darn = {version = "0.1.15"}
:dep ndarray = {version = "0.13.1"}
:dep itertools = {version = "0.9.0"}
:dep plotly = {version = "0.4.0"}
extern crate ndarray;

use ndarray::prelude::*;
use std::str::FromStr;
use itertools::Itertools;
use plotly::{Plot, Layout, BoxPlot};
use plotly::common::{Title, Font};
use plotly::layout::{Margin, Axis};

Introduction

In this section, we're going to use 120 years of Olympic history to create two visualisations. Our goal is to illustrate the age and height of athletes, grouped by the different Olympic games (the sports recorded in the dataset).

The Dataset

We'll use the "120 years of Olympic history: athletes and results" dataset, which we'll download and load with the darn crate. You're also welcome to use the mirrored copy that is loaded in the following cell.

In [3]:
let data = darn::read_csv("https://shahinrostami.com/datasets/athlete_events_known_age.csv");

We'll take a peek at what we've downloaded to make sure there were no issues with the loading.

In [4]:
darn::show_frame(&data.0, Some(&data.1));
Out[4]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
"1" "A Dijiang" "M" "24" "180" "80" "China" "CHN" "1992 Summer" "1992" "Summer" "Barcelona" "Basketball" "Basketball Men\'s Basketball" "NA"
"2" "A Lamusi" "M" "23" "170" "60" "China" "CHN" "2012 Summer" "2012" "Summer" "London" "Judo" "Judo Men\'s Extra-Lightweight" "NA"
"5" "Christine Jacoba Aaftink" "F" "21" "185" "82" "Netherlands" "NED" "1988 Winter" "1988" "Winter" "Calgary" "Speed Skating" "Speed Skating Women\'s 500 metres" "NA"
"5" "Christine Jacoba Aaftink" "F" "21" "185" "82" "Netherlands" "NED" "1988 Winter" "1988" "Winter" "Calgary" "Speed Skating" "Speed Skating Women\'s 1,000 metres" "NA"
"5" "Christine Jacoba Aaftink" "F" "25" "185" "82" "Netherlands" "NED" "1992 Winter" "1992" "Winter" "Albertville" "Speed Skating" "Speed Skating Women\'s 500 metres" "NA"
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
"135569" "Andrzej ya" "M" "29" "179" "89" "Poland-1" "POL" "1976 Winter" "1976" "Winter" "Innsbruck" "Luge" "Luge Mixed (Men)\'s Doubles" "NA"
"135570" "Piotr ya" "M" "27" "176" "59" "Poland" "POL" "2014 Winter" "2014" "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men\'s Large Hill, Individual" "NA"
"135570" "Piotr ya" "M" "27" "176" "59" "Poland" "POL" "2014 Winter" "2014" "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men\'s Large Hill, Team" "NA"
"135571" "Tomasz Ireneusz ya" "M" "30" "185" "96" "Poland" "POL" "1998 Winter" "1998" "Winter" "Nagano" "Bobsleigh" "Bobsleigh Men\'s Four" "NA"
"135571" "Tomasz Ireneusz ya" "M" "34" "185" "96" "Poland" "POL" "2002 Winter" "2002" "Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men\'s Four" "NA"

It looks like the data was loaded without any issues.

Data Wrangling

Let's assign the feature data to games and feature names to headers for readability.

In [5]:
let games = data.0;
let headers = data.1;

A quick look at the available features will give us the feature names we're after for the age and height of athletes.

In [6]:
println!("{}", &headers.iter().format("\n"));
ID
Name
Sex
Age
Height
Weight
Team
NOC
Games
Year
Season
City
Sport
Event
Medal

We've confirmed that the two features we're after are named Age and Height, and that they're at index $3$ and $4$. However, it would be better to determine these indices programmatically instead of hard-coding them.

In [7]:
let idx_age = headers.iter().position(|x| x == "Age").unwrap();
let idx_height = headers.iter().position(|x| x == "Height").unwrap();
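
The `unwrap()` calls used for the index lookup panic without much context if a column name is misspelled or missing. As a minimal sketch, assuming the headers are held as a `Vec<String>`, the lookup could be wrapped in a small helper that returns a descriptive error instead. The name `column_index` is hypothetical and is not part of darn or ndarray.

```rust
/// Hypothetical helper (not part of darn or ndarray): find a column by name,
/// returning a descriptive error instead of panicking when it is absent.
fn column_index(headers: &[String], name: &str) -> Result<usize, String> {
    headers
        .iter()
        .position(|h| h == name)
        .ok_or_else(|| format!("column '{}' not found in headers", name))
}
```

In a notebook cell this could be used as `let idx_age = column_index(&headers, "Age").expect("Age column");`, which still stops on failure but says which column was missing.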

Let's create an array of these indices and print them out to check.

In [8]:
let selected_features = [idx_age,idx_height];

println!("{}",selected_features.iter().format("\n"));
3
4

Now that we know the index of our age and height columns, let's prepare two collection variables, one named features to hold the numeric feature data, and one named feature_headers to hold the corresponding column names.

In [9]:
let mut features: Array2<f32> = Array2::zeros((games.shape()[0], 0));
let mut feature_headers = Vec::<String>::new();

Now we can parse our feature data and copy it into the initialised collections.

In [10]:
for &feature_index in selected_features.iter() {
    // Record the column name alongside the data.
    feature_headers.push(headers[feature_index].clone());
    // Parse the column to f32 and append it as a new column of `features`.
    features = ndarray::stack![Axis(1), features,
        games.column(feature_index)
            .mapv(|elem| elem.parse::<f32>().unwrap())
            .insert_axis(Axis(1))
    ];
}
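
Parsing with `unwrap()` works here because every Age and Height cell in this file is numeric, but it would panic on a placeholder such as the "NA" values visible in the Medal column. The following is a hedged sketch of a fallible alternative using only the standard library. The helper name `parse_column` and its `&[String]` input are assumptions for illustration, not part of the crates used in this section.

```rust
/// Hypothetical helper: parse every cell of a column as f32, reporting the
/// first row that fails instead of panicking.
fn parse_column(cells: &[String]) -> Result<Vec<f32>, String> {
    cells
        .iter()
        .enumerate()
        .map(|(row, cell)| {
            cell.trim()
                .parse::<f32>()
                .map_err(|e| format!("row {}: cannot parse {:?} as f32 ({})", row, cell, e))
        })
        // Collecting into a Result stops at the first error.
        .collect()
}
```

Returning a `Result` leaves the decision to the caller: stop with a clear message, skip the row, or substitute a default.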

We'll take a peek to make sure there were no obvious issues with parsing.

In [11]:
darn::show_frame(&features, Some(&feature_headers));
Out[11]:
Age Height
24.0 180.0
23.0 170.0
21.0 185.0
21.0 185.0
25.0 185.0
... ...
29.0 179.0
27.0 176.0
27.0 176.0
30.0 185.0
34.0 185.0

Looking good. Next, we'll need to determine the different games available in our dataset - we'll be using these to group the age and height data.

In [12]:
let idx_sport = headers.iter().position(|x| x == "Sport").unwrap();
let unique_games = games.column(idx_sport).iter().cloned().unique().collect_vec();

println!("{}",unique_games.iter().format(", "));
Basketball, Judo, Speed Skating, Cross Country Skiing, Athletics, Ice Hockey, Badminton, Sailing, Biathlon, Gymnastics, Alpine Skiing, Handball, Weightlifting, Wrestling, Luge, Rowing, Bobsleigh, Swimming, Football, Equestrianism, Shooting, Taekwondo, Boxing, Fencing, Diving, Canoeing, Water Polo, Tennis, Cycling, Hockey, Figure Skating, Softball, Archery, Volleyball, Synchronized Swimming, Modern Pentathlon, Table Tennis, Nordic Combined, Baseball, Rhythmic Gymnastics, Freestyle Skiing, Rugby Sevens, Trampolining, Beach Volleyball, Triathlon, Ski Jumping, Curling, Golf, Snowboarding, Short Track Speed Skating, Skeleton, Rugby, Art Competitions, Tug-Of-War

We now have the unique list of Olympic games - some of which you may not even have heard of!
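
The itertools `unique()` adaptor keeps the first occurrence of each value and drops later repeats, which is why Basketball and Judo lead the list: they belong to the first rows of the dataset. A rough standard-library equivalent, written as a hypothetical helper named `unique_in_order`, might look like this:

```rust
use std::collections::HashSet;

/// Keep the first occurrence of each value, preserving the original order.
/// This mirrors the effect of itertools' `unique()` for a slice of strings.
fn unique_in_order(values: &[String]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for v in values {
        // `insert` returns true only the first time a value is seen.
        if seen.insert(v.as_str()) {
            out.push(v.clone());
        }
    }
    out
}
```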

Visualising the Data

Now that we have prepared our data, let's put that work to use with a first box plot.

Height of Athletes in Basketball

Let's see if we can create a box plot for the height of athletes in Basketball. To do so, we're going to build a list of row indices that correspond to Basketball data.

In [13]:
// Collect the row index of every entry whose sport is Basketball.
let indices: Vec<usize> = games
    .column(idx_sport)
    .iter()
    .enumerate()
    .filter(|(_, sport)| *sport == "Basketball")
    .map(|(row, _)| row)
    .collect();

Then, we'll use these indices to select from our feature data.

In [14]:
let basketball = features.select(Axis(0), &indices);

We'll take a peek to make sure there were no obvious issues with the selection.

In [15]:
darn::show_frame(&basketball, Some(&feature_headers));
Out[15]:
Age Height
24.0 180.0
19.0 185.0
29.0 195.0
25.0 189.0
23.0 178.0
... ...
30.0 218.0
20.0 201.0
28.0 201.0
23.0 202.0
33.0 171.0

Finally, we'll create a box plot of just the height of the Basketball athletes in our dataset.

In [16]:
let mut plot = Plot::new();

let trace = BoxPlot::new(basketball.column(1).to_vec()).name("Basketball");

plot.add_trace(trace);

darn::show_plot(plot);
Out[16]:

Looking good.
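
A box plot summarises a distribution through its quartiles: the box spans the lower to the upper quartile, with a line at the median. To make that concrete, here is a minimal sketch that computes a five-number summary with the standard library only. It uses linear interpolation between neighbouring values, which is one of several quartile conventions, so the numbers may differ slightly from what a plotting library draws. Whisker and outlier rules also vary between libraries, so this sketch simply reports the minimum and maximum. The function names are hypothetical.

```rust
/// Linearly interpolated quantile of an already sorted, non-empty slice.
/// `p` is a fraction in [0, 1].
fn quantile(sorted: &[f32], p: f32) -> f32 {
    let pos = p * (sorted.len() - 1) as f32;
    let lower = pos.floor() as usize;
    let upper = pos.ceil() as usize;
    let weight = pos - lower as f32;
    sorted[lower] + weight * (sorted[upper] - sorted[lower])
}

/// Returns (min, lower quartile, median, upper quartile, max),
/// or None when the input is empty.
fn five_number_summary(values: &[f32]) -> Option<(f32, f32, f32, f32, f32)> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    // f32 is not totally ordered, so NaN values are rejected explicitly.
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN in input"));
    Some((
        sorted[0],
        quantile(&sorted, 0.25),
        quantile(&sorted, 0.5),
        quantile(&sorted, 0.75),
        sorted[sorted.len() - 1],
    ))
}
```

For the five Basketball heights shown at the top of the frame above (180, 185, 195, 189, 178), this gives a median of 185 with quartiles at 180 and 189.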

Athlete Height Grouped by Olympic Games

Now let's repeat what we've just done for Basketball, but apply it to all the games in our dataset.

In [42]:
let mut plot = Plot::new();
let layout = Layout::new()
    .title(Title::new("Athlete height grouped by Olympic games."))
    .margin(Margin::new().left(30).right(0).bottom(140).top(40))
    .xaxis(Axis::new().show_grid(true).tick_font(Font::new().size(10)))
    .show_legend(false);

plot.set_layout(layout);

for name in unique_games.iter() {
    // Row indices of every entry recorded for this sport.
    let indices: Vec<usize> = games
        .column(idx_sport)
        .iter()
        .enumerate()
        .filter(|(_, sport)| *sport == name)
        .map(|(row, _)| row)
        .collect();

    // Column 1 of `features` holds the heights.
    let game = features.select(Axis(0), &indices);
    let trace = BoxPlot::new(game.column(1).to_vec()).name(name);
    plot.add_trace(trace);
}

darn::show_plot(plot);
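
The loop above scans the whole Sport column once per sport. With the dataset's 54 sports that is perfectly workable, but the same grouping can be done in a single pass. The sketch below shows the idea with the standard library only. `group_by_label` is a hypothetical helper, and it assumes the labels and values have already been copied into plain slices of equal length.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: group one numeric column by a label column in a
/// single pass over the rows.
fn group_by_label(labels: &[String], values: &[f32]) -> BTreeMap<String, Vec<f32>> {
    let mut groups: BTreeMap<String, Vec<f32>> = BTreeMap::new();
    for (label, &value) in labels.iter().zip(values.iter()) {
        // Create the group on first sight, then append the value.
        groups.entry(label.clone()).or_default().push(value);
    }
    groups
}
```

Each group's `Vec<f32>` could then be handed to `BoxPlot::new` in the same way as in the loop above. Note that a `BTreeMap` iterates in alphabetical order of the labels rather than in order of first appearance, so the traces would be arranged differently along the axis.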