Data is Beautiful
A practical book on data visualisation that shows you how to create static and interactive visualisations that are engaging and beautiful.
Get the book
Arabica Coffee Beans - Origin and Variety
Made with Chord Pro
You can create beautiful interactive visualisations like this one with Chord Pro. Learn how to make beautiful visualisations with the book, Data is Beautiful.
Preamble¶
import itertools
import pandas as pd # for DataFrames
from chord import Chord
Introduction¶
In this section, we're going to be pointing our beautifully colourful lens towards the warm and aromatic world of coffee. In particular, we're going to be visualising the co-occurrence of coffee bean variety and origin in over a thousand coffee reviews.
Note
This section uses the Chord Pro software to create a visualisation. Grab a copy to produce the same output!
The Dataset¶
We're going to use the popular Coffee Quality Institute Database, which I have forked on GitHub for posterity. The file arabica_data.csv contains the data we'll be using throughout this section. The first thing we'll do is load the data and output some samples as a sanity check.
data_url = "https://datacrayon.com/datasets/arabica_data.csv"
data = pd.read_csv(data_url)
data.head()
Data Wrangling¶
By viewing the CSV directly we can see our desired columns are named Country.of.Origin and Variety. Let's print out the columns to make sure they exist in the data we've loaded.
data.columns
Great! We can see both of these columns exist in our DataFrame.
Now let's take a peek at the unique values in both of these columns to see if any obvious issues stand out. We'll start with the Country.of.Origin column.
data["Country.of.Origin"].unique()
We can see some points that may cause issues when it comes to our visualisation.
There appears to be at least one nan value in Country.of.Origin. We're only interested in coffee bean reviews which aren't missing this data, so let's remove any samples where nan exists.
data = data[data["Country.of.Origin"].notna()]
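To see what this filter does in isolation, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and are not taken from the coffee dataset. It also shows that `dropna` with a `subset` argument should give the same result as the boolean mask.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the coffee dataset.
toy = pd.DataFrame(
    {
        "Country.of.Origin": ["Brazil", np.nan, "Ethiopia"],
        "Variety": ["Bourbon", "Caturra", np.nan],
    }
)

# Boolean-mask filtering, as used in this section.
masked = toy[toy["Country.of.Origin"].notna()]

# An equivalent spelling: drop rows with a missing value in that one column.
dropped = toy.dropna(subset=["Country.of.Origin"])

print(masked.equals(dropped))
print(len(toy), "->", len(masked))
```

Note that the row with a missing Variety survives here, because the filter only looks at Country.of.Origin.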
Also, the entries in Country.of.Origin will be used as labels on our visualisation. Ideally, we don't want these to be longer than they need to be, so let's shorten some of the longer names.
data["Country.of.Origin"] = data["Country.of.Origin"].replace(
    "United States (Hawaii)", "Hawaii"
)
data["Country.of.Origin"] = data["Country.of.Origin"].replace(
    "Tanzania, United Republic Of", "Tanzania"
)
data["Country.of.Origin"] = data["Country.of.Origin"].replace(
    "United States (Puerto Rico)", "Puerto Rico"
)
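If the list of names to shorten grows, the three calls could be collapsed into one by passing a dictionary to `replace`. The sketch below uses a toy Series rather than the real column, and the name `short_names` is our own choice for illustration.

```python
import pandas as pd

# Toy Series standing in for the Country.of.Origin column.
toy = pd.Series(
    ["United States (Hawaii)", "Tanzania, United Republic Of", "Brazil"],
    name="Country.of.Origin",
)

# Mapping from long label to short label (hypothetical helper name).
short_names = {
    "United States (Hawaii)": "Hawaii",
    "Tanzania, United Republic Of": "Tanzania",
    "United States (Puerto Rico)": "Puerto Rico",
}

# Values that match a key are swapped; everything else is left untouched.
renamed = toy.replace(short_names)
print(renamed.tolist())
```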
Now let's take a peek at the unique values in the Variety column.
data["Variety"].unique()
We can see this column also has at least one nan entry, so let's remove these too.
data = data[data["Variety"].notna()]
Also, there appears to be at least one entry of Other for the Variety. For this visualisation, we're not interested in Other, so let's remove these entries too.
data = data[data["Variety"] != "Other"]
From previous chord diagram visualisations we know that they can become crowded when there are too many different categories. With this in mind, let's choose to visualise only the 12 most frequently occurring values of Country.of.Origin and Variety.
data = data[
    data["Country.of.Origin"].isin(
        list(data["Country.of.Origin"].value_counts()[:12].index)
    )
]
data = data[data["Variety"].isin(list(data["Variety"].value_counts()[:12].index))]
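The same top-N pattern can be sketched on a toy column, using the two most frequent values instead of twelve. The data below is made up for illustration.

```python
import pandas as pd

# Toy data: Bourbon appears 3 times, Caturra twice, Typica once.
toy = pd.DataFrame(
    {"Variety": ["Bourbon"] * 3 + ["Caturra"] * 2 + ["Typica"]}
)

# value_counts sorts by frequency (descending), so the first
# entries of its index are the most common categories.
top = toy["Variety"].value_counts().head(2).index

# Keep only the rows whose category is among the most common ones.
filtered = toy[toy["Variety"].isin(top)]
print(sorted(filtered["Variety"].unique()))
```

One thing to keep in mind is that the two filters above are applied in sequence, so the variety counts are computed on rows that have already been restricted to the top countries.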
As we're creating a bipartite chord diagram, let's define what labels will be going on the left and the right.
On the left, we'll have all of our countries of origin.
left = list(data["Country.of.Origin"].value_counts().index)[::-1]
pd.DataFrame(left)
And on the right, we'll have all of our varieties.
right = list(data["Variety"].value_counts().index)
pd.DataFrame(right)
We're good to go! So let's select just these two columns and work with a DataFrame containing only them as we move forward.
origin_variety = pd.DataFrame(data[["Country.of.Origin", "Variety"]].values)
origin_variety
Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.
We can build this list of names by adding together the labels for the left and right side of our bipartite diagram.
names = left + right
pd.DataFrame(names)
Now we can create our empty co-occurrence matrix, using these names for the row and column indices.
matrix = pd.DataFrame(0, index=names, columns=names)
matrix
We can populate the co-occurrence matrix with the following approach. We'll start by creating a list with every origin and variety pairing in both its original and reversed form, so that the resulting matrix is symmetric.
origin_variety = list(
    itertools.chain.from_iterable((i, i[::-1]) for i in origin_variety.values)
)
Which we can now use to create the matrix.
for pairing in origin_variety:
    matrix.at[pairing[0], pairing[1]] += 1
matrix = matrix.values.tolist()
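To make the effect of adding each pairing in both directions concrete, here is a small self-contained sketch of the same approach on three made-up reviews. The labels and counts are illustrative only.

```python
import itertools

import pandas as pd

# Three made-up (origin, variety) reviews.
pairs = [("Brazil", "Bourbon"), ("Brazil", "Bourbon"), ("Ethiopia", "Typica")]

left = ["Brazil", "Ethiopia"]   # origins
right = ["Bourbon", "Typica"]   # varieties
names = left + right

# Empty square matrix with the same labels on rows and columns.
matrix = pd.DataFrame(0, index=names, columns=names)

# Each pairing followed by its reverse, so both (a, b) and (b, a) are counted.
both_directions = list(
    itertools.chain.from_iterable((p, p[::-1]) for p in pairs)
)
for a, b in both_directions:
    matrix.at[a, b] += 1

print(matrix)
```

The result is symmetric, and because the diagram is bipartite the origin-to-origin and variety-to-variety blocks stay at zero.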
The matrix is now a plain list of lists, so we can wrap it in a DataFrame for better presentation.
pd.DataFrame(matrix)
Chord Diagram¶
Time to visualise the co-occurrence of items using a chord diagram. We are going to use a list of custom colours that represent the items.
Let's specify some colours for the left and right sides.
colors = [
    "#ff575c", "#ff914d", "#ffca38", "#f2fa00", "#C3F500", "#94f000",
    "#00fa68", "#00C1A2", "#0087db", "#0054f0", "#5d00e0", "#2F06EB",
    "#6f1d1b", "#955939", "#A87748", "#bb9457", "#7f5e38", "#432818",
    "#6e4021", "#99582a", "#cc9f69", "#755939", "#BAA070", "#ffe6a7",
]
Chord.user = "email here"
Chord.key = "license key here"
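Before passing a palette on, it can be worth checking it against the labels. The sketch below assumes we want exactly one well-formed hex colour per label. That is our own assumption, not a documented requirement of Chord, and `check_palette` is a hypothetical helper rather than part of any library.

```python
import re

# Matches six-digit hex colours such as "#ff575c".
HEX_COLOUR = re.compile(r"^#[0-9a-fA-F]{6}$")


def check_palette(colors, names):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(colors) != len(names):
        problems.append(f"{len(colors)} colours for {len(names)} labels")
    problems += [
        f"not a 6-digit hex colour: {c!r}"
        for c in colors
        if not HEX_COLOUR.match(c)
    ]
    return problems


# Toy labels and colours for illustration.
toy_names = ["Brazil", "Ethiopia", "Bourbon", "Typica"]
toy_colors = ["#ff575c", "#ff914d", "#6f1d1b", "#955939"]

print(check_palette(toy_colors, toy_names))
print(check_palette(toy_colors[:3] + ["red"], toy_names))
```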
We then call Chord, passing in our desired customisation arguments.
Chord(
    matrix,
    names,
    colors=colors,
    width=900,
    padding=0.01,
    font_size="12px",
    font_size_large="16px",
    noun="coffee bean reviews",
    title="Coffee Bean Reviews - Variety and Origin",
    divide=True,
    divide_idx=len(left),
    divide_size=0.6,
    allow_download=True,
).show()
Conclusion¶
In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a bipartite chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!