Data Analysis with Rust Notebooks

A practical book on Data Analysis with Rust Notebooks that teaches you the concepts and how they’re implemented in practice.

Get the book

Unique Array Elements and their Frequency

Preamble

In [2]:
:dep ndarray-csv = {version = "0.4.1"}
:dep ndarray = {version = "0.13.0"}
:dep darn = {version = "0.1.15"}
:dep ureq = {version = "0.11.4"}
:dep itertools = {version = "0.9.0"}
:dep plotly = {version = "0.4.0"}
extern crate csv;
extern crate itertools;

use std::io::prelude::*;
use std::fs::*;
use ndarray::prelude::*;
use ndarray_csv::Array2Reader;
use std::str::FromStr;
use itertools::Itertools;
use plotly::{Plot, Bar, Layout};
use plotly::common::{Mode, Title};
use plotly::layout::{Axis};
use std::collections::HashMap;

Introduction

In this section, we're going to take a look at some approaches to determining the unique elements in a column of data and their frequency. This is a common task which we may often apply to categorical data, and it can easily give us a good idea of the balance of categorical values across our dataset.

If you're familiar with Python and NumPy, you may have encountered the numpy.unique()function that returns a list of sorted unique elements of an array. In more recent versions of NumPy, the numpy.unique() function also takes a parameter named return_counts, which when set to True will also return the number of times each unique element appears in the array.

Let's see how we can do the same for our ndarray::Array2.

Loading our Dataset

We will continue using the Iris Flower dataset, so we need to load it into our raw string array first.

In [3]:
let file_name = "Iris.csv";

let res = ureq::get("https://shahinrostami.com/datasets/Iris.csv").call().into_string()?;

let mut file = File::create(file_name)?;
file.write_all(res.as_bytes());
let mut rdr = csv::Reader::from_path(file_name)?;
remove_file(file_name)?;

let data: Array2<String>= rdr.deserialize_array2_dynamic().unwrap();
let mut headers : Vec<String> = Vec::new();

for element in rdr.headers()?.into_iter() {
        headers.push(String::from(element));
};

Moving Data to Typed Arrays

We need to convert from String to the desired type, and move our data over to the typed arrays.

In [4]:
let mut features: Array2::<f32> =  Array2::<f32>::zeros((data.shape()[0],0));

for &f in [1, 2, 3, 4].iter() {
    features = ndarray::stack![Axis(1), features,
        data.column(f as usize)
            .mapv(|elem| f32::from_str(&elem).unwrap())
            .insert_axis(Axis(1))];
};

let feature_headers = headers[1..5].to_vec();
let labels: Array1::<String> = data.column(5).to_owned();

We will only be using our species labels, stored in labels, throughout the rest of this section.

Unique Elements

We can get the unique elements in our array using the unique() function provided by the itertools crate.

Return an iterator adaptor that filters out elements that have already been produced once during the iteration. Duplicates are detected using hash and equality.

We can then iterate through this and output the elements to the output cell.

In [5]:
for i in labels.iter().unique() {
    println!("{}",i);
};
Iris-setosa
Iris-versicolor
Iris-virginica

We can also use the .format() function on .unique() to print out our unique elements.

In [6]:
labels.iter().unique().format("\n")
Out[6]:
"Iris-setosa"
"Iris-versicolor"
"Iris-virginica"

However, we may want to store these in a vector to use later too. Let's store them in unique_elements.

In [21]:
let unique_elements = labels.iter().cloned().unique().collect_vec();

unique_elements
Out[21]:
["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

Count of Unique Elements

In this case it's easy to see we have three unique elements, but if the list is too long or we wanted to store the value for later use, we can use the len() function on our vector containing the unique elements.

In [22]:
unique_elements.len()
Out[22]:
3

We could also get the count from our iterator.

In [23]:
labels.iter().unique().count()
Out[23]:
3

Frequency of Unique Elements

To find the frequency of a specific string we can use filter() and then count().

In [24]:
labels.iter().filter(|&elem| *elem == "Iris-virginica").count()
Out[24]:
50

We may also want to find the frequency of every unique element and store it in a vector. We'll call this unique_frequency.

In [25]:
let mut unique_frequency = Vec::<usize>::new();

We can populate this by iterating through the unique elements and use the strings to find the frequency using the filter() approach above.

In [26]:
for unique_elem in unique_elements.iter() {
    unique_frequency.push(labels.iter().filter(|&elem| elem == unique_elem).count());
};

unique_frequency
Out[26]:
[50, 50, 50]

Visualise the Frequency of Unique Elements

It's useful to visualise the frequency of elements in our array, especially if we're interested in looking at the balance in our dataset. We can do this with a Bar plot using Plotly.

In [27]:
let layout = Layout::new()
    .yaxis(Axis::new().title(Title::new("Frequency")));

let freq_bars = Bar::new(unique_elements, unique_frequency);

let mut plot = Plot::new();

plot.set_layout(layout);
plot.add_trace(freq_bars);

darn::show_plot(plot)

Unique Elements and their Frequency with Hashmaps

We can do something similar with a HashMap. First we'll define a variable of type HashMap<String, i32>, where the String part will be the unique element, and the i32 will be the frequency.

In [14]:
let mut value_counts : HashMap<String, i32> = HashMap::new();

We can then populate this by iterating through our original labels array.

In [15]:
for item in labels.iter() {
    *value_counts.entry(String::from(item)).or_insert(0) += 1;
};

Printing out the results, we can see that value_counts now contains all the information that we're after.

In [16]:
println!("{:#?}", value_counts);
{
    "Iris-setosa": 50,
    "Iris-virginica": 50,

We can also get these values directly by key.

In [17]:
value_counts.get("Iris-versicolor").unwrap()
    "Iris-versicolor": 50,
}
Out[17]:
50

However, you will need to map this HashMap to separate vectors if you want to use it for plotting with Plotly.

Conclusion

In this section, we've demonstrated a few approaches to identifying the unique elements in an array, counting the number of unique elements, and the frequency of these unique elements. We also visualised the frequency of our unique elements which is useful during exploratory data analysis.

Support this work

You can access this notebook and more by getting the e-book on Data Analysis with Rust Notebooks.