US Mortality - Race and Manner of Death

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In previous sections, we visualised co-occurrences of Pokémon types. In this section, we're going to use a US mortality dataset (2015_data.csv) to visualise the relationship between race (race_recode_5) and manner of death (manner_of_death).

The Dataset

Each row of the dataset describes a single death, and each column holds one coded variable, such as sex, age, race, or manner of death.

Let's load the dataset from a local CSV file and have a look for ourselves.

In [2]:
data_url = '/Users/shahin/Documents/devel/data/2015_data.csv'
data = pd.read_csv(data_url)
data.head()
/Users/shahin/opt/miniconda3/envs/analytics/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (40,41,42,43,61,62,63,64) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[2]:
resident_status education_1989_revision education_2003_revision education_reporting_flag month_of_death sex detail_age_type detail_age age_substitution_flag age_recode_52 ... record_condition_18 record_condition_19 record_condition_20 race bridged_race_flag race_imputation_flag race_recode_3 race_recode_5 hispanic_origin hispanic_originrace_recode
0 1 NaN 3.0 1 1 M 1 84 NaN 42 ... NaN NaN NaN 1 NaN NaN 1 1 100 6
1 1 NaN 6.0 1 1 M 1 70 NaN 40 ... NaN NaN NaN 1 NaN NaN 1 1 100 6
2 1 NaN 3.0 1 1 F 1 91 NaN 44 ... NaN NaN NaN 1 NaN NaN 1 1 100 6
3 1 NaN 3.0 1 1 F 1 40 NaN 34 ... NaN NaN NaN 3 NaN NaN 2 3 100 8
4 1 NaN 5.0 1 1 F 1 89 NaN 43 ... NaN NaN NaN 1 NaN NaN 1 1 100 6

5 rows × 77 columns
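
The DtypeWarning above appears when pandas finds more than one value type in a column of a large file. As a minimal sketch of one way to avoid it, the example below reads a tiny in-memory CSV with an explicit dtype for the affected column. The column choice and the sample values are made up for illustration; they are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: one column mixes numbers and text.
csv_text = "record_condition_12,detail_age\nI251,84\n1234,70\n,91\n"

# Reading the mixed column as str gives every value the same type, which is
# what the warning's "Specify dtype option on import" hint refers to.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"record_condition_12": str})

print(sample.dtypes)
```

The same dtype argument could be passed for each of the columns listed in the warning, or low_memory=False could be used instead, as the warning suggests.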

In [3]:
data['race_recode_5'].value_counts()
Out[3]:
1    2311103
2     320759
4      67295
3      19041
Name: race_recode_5, dtype: int64
In [4]:
data.columns
Out[4]:
Index(['resident_status', 'education_1989_revision', 'education_2003_revision',
       'education_reporting_flag', 'month_of_death', 'sex', 'detail_age_type',
       'detail_age', 'age_substitution_flag', 'age_recode_52', 'age_recode_27',
       'age_recode_12', 'infant_age_recode_22',
       'place_of_death_and_decedents_status', 'marital_status',
       'day_of_week_of_death', 'current_data_year', 'injury_at_work',
       'manner_of_death', 'method_of_disposition', 'autopsy', 'activity_code',
       'place_of_injury_for_causes_w00_y34_except_y06_and_y07_',
       'icd_code_10th_revision', '358_cause_recode', '113_cause_recode',
       '130_infant_cause_recode', '39_cause_recode',
       'number_of_entity_axis_conditions', 'entity_condition_1',
       'entity_condition_2', 'entity_condition_3', 'entity_condition_4',
       'entity_condition_5', 'entity_condition_6', 'entity_condition_7',
       'entity_condition_8', 'entity_condition_9', 'entity_condition_10',
       'entity_condition_11', 'entity_condition_12', 'entity_condition_13',
       'entity_condition_14', 'entity_condition_15', 'entity_condition_16',
       'entity_condition_17', 'entity_condition_18', 'entity_condition_19',
       'entity_condition_20', 'number_of_record_axis_conditions',
       'record_condition_1', 'record_condition_2', 'record_condition_3',
       'record_condition_4', 'record_condition_5', 'record_condition_6',
       'record_condition_7', 'record_condition_8', 'record_condition_9',
       'record_condition_10', 'record_condition_11', 'record_condition_12',
       'record_condition_13', 'record_condition_14', 'record_condition_15',
       'record_condition_16', 'record_condition_17', 'record_condition_18',
       'record_condition_19', 'record_condition_20', 'race',
       'bridged_race_flag', 'race_imputation_flag', 'race_recode_3',
       'race_recode_5', 'hispanic_origin', 'hispanic_originrace_recode'],
      dtype='object')

The capitalisation step used in an earlier notebook is not needed here, because these columns hold numeric codes rather than text, so it is left commented out.

In [5]:
#data['manner'] = data['manner_of_death']#.str.capitalize()
#data['race_recode_5'] = data['race_recode_5']#.str.capitalize()
#data['species'] = data['species'].str.capitalize()

It looks good so far, but let's check how many records and variables we have.

In [6]:
data.shape
Out[6]:
(2718198, 77)

We have 2,718,198 records, each with 77 variables, which agrees with the 77 columns reported by data.head().

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. Looking at the column names, the two variables we are interested in are manner_of_death and race_recode_5.

In [7]:
pd.DataFrame(data.columns.values.tolist())
Out[7]:
0
0 resident_status
1 education_1989_revision
2 education_2003_revision
3 education_reporting_flag
4 month_of_death
... ...
72 race_imputation_flag
73 race_recode_3
74 race_recode_5
75 hispanic_origin
76 hispanic_originrace_recode

77 rows × 1 columns

Blank entries have been read as NaN. Let's replace them with 0 so that the coded columns can later be cast to integers.

In [8]:
data.fillna(0, inplace=True)
In [9]:
data.iloc[6572].manner_of_death
Out[9]:
7.0
In [11]:
data.manner_of_death.value_counts()
Out[11]:
7.0    2107352
0.0     388364
1.0     143961
2.0      44417
3.0      18885
5.0      11054
4.0       4165
Name: manner_of_death, dtype: int64
In [ ]:
{'143961': 'Accident',
 '44417': 'Suicide',
 '18885': 'Homicide',
 '4165': 'Pending investigation',
 '11054': 'Could not determine',
 '2107352': 'Natural',
 '388364': 'Not specified'}
In [10]:
import json
In [11]:
with open("/Users/shahin/Documents/devel/data/2015_data.json", "r") as read_file:
    codes = json.load(read_file)
In [12]:
codes['manner_of_death']
Out[12]:
{'1': 'Accident',
 '2': 'Suicide',
 '3': 'Homicide',
 '4': 'Pending investigation',
 '5': 'Could not determine',
 '7': 'Natural',
 'Blank': 'Not specified'}
In [13]:
codes['manner_of_death']['0'] = codes['manner_of_death'].pop('Blank')
In [14]:
remove = ["Natural", "Not specified", "Could not determine", "Pending investigation"]
In [15]:
list(codes['manner_of_death'].values())
Out[15]:
['Accident',
 'Suicide',
 'Homicide',
 'Pending investigation',
 'Could not determine',
 'Natural',
 'Not specified']
In [16]:
left = list(codes['race_recode_5'].values())
pd.DataFrame(left)
Out[16]:
0
0 White
1 Black
2 American Indian
3 Asian or Pacific Islander
In [17]:
right = list(codes['manner_of_death'].values())
pd.DataFrame(right)
Out[17]:
0
0 Accident
1 Suicide
2 Homicide
3 Pending investigation
4 Could not determine
5 Natural
6 Not specified
In [18]:
right = [x for x in right if x not in remove]
In [19]:
data['manner_of_death'] = data['manner_of_death'].astype('int32')
data['manner_of_death'] = data['manner_of_death'].astype('str')

data.iloc[6572].manner_of_death
Out[19]:
'7'
In [20]:
left.sort()
right.sort()
In [21]:
left
Out[21]:
['American Indian', 'Asian or Pacific Islander', 'Black', 'White']
In [22]:
data = data.replace({"manner_of_death": codes['manner_of_death']})
In [23]:
data['race_recode_5'] = data['race_recode_5'].astype('int32')
data['race_recode_5'] = data['race_recode_5'].astype('str')

data.iloc[6572].race_recode_5
Out[23]:
'2'
In [24]:
data = data.replace({"race_recode_5": codes['race_recode_5']})
In [25]:
data.iloc[6572].race_recode_5
Out[25]:
'Black'
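
The decoding steps above (fill blanks with 0, cast to integer and then to string, look up the label) can be sketched on a small example. The sample values below are hypothetical, and Series.map is swapped in here for DataFrame.replace because it performs the same lookup on a single column.

```python
import pandas as pd

# Code-to-label mapping in the same shape as codes['manner_of_death'] above.
manner_codes = {"1": "Accident", "2": "Suicide", "3": "Homicide", "Blank": "Not specified"}
# Blank entries become 0 after fillna(0), so re-key 'Blank' as '0'.
manner_codes["0"] = manner_codes.pop("Blank")

# Hypothetical sample column: floats with one missing value, as read from a CSV.
raw = pd.Series([1.0, None, 3.0, 2.0])

# NaN -> 0, float -> int -> str, then look each code up in the mapping.
decoded = raw.fillna(0).astype("int32").astype(str).map(manner_codes)

print(decoded.tolist())
```

Casting to integer before casting to string matters: without it, the code 7.0 would become the string '7.0' and would not match the key '7'.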
In [26]:
manner_race = pd.DataFrame(data[['manner_of_death', 'race_recode_5']].values)
manner_race
Out[26]:
0 1
0 Natural White
1 Natural White
2 Natural White
3 Homicide American Indian
4 Natural White
... ... ...
2718193 Natural Black
2718194 Natural White
2718195 Natural White
2718196 Natural Black
2718197 Natural Black

2718198 rows × 2 columns

Now we can create an empty matrix whose rows and columns are labelled with the race and manner-of-death categories.

In [27]:
features = left + right
d = pd.DataFrame(0, index=features, columns=features)

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by creating a list with every (manner of death, race) pairing in its original and reversed form.

In [28]:
manner_race = list(itertools.chain.from_iterable((i, i[::-1]) for i in manner_race.values))
In [29]:
for x in manner_race:
    if(x[0] not in remove and x[1] not in remove):
        d.at[x[0], x[1]] += 1
In [30]:
d
Out[30]:
American Indian Asian or Pacific Islander Black White Accident Homicide Suicide
American Indian 0 0 0 0 2067 304 582
Asian or Pacific Islander 0 0 0 0 3032 364 1334
Black 0 0 0 0 15746 9419 2518
White 0 0 0 0 123116 8798 39983
Accident 2067 3032 15746 123116 0 0 0
Homicide 304 364 9419 8798 0 0 0
Suicide 582 1334 2518 39983 0 0 0
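
The loop above visits every record twice, which can be slow on millions of rows. As an alternative sketch (not the approach used in this notebook), pandas.crosstab can count the pairings in one call, and the counts can then be copied into both off-diagonal blocks of the matrix. The sample records below are hypothetical.

```python
import pandas as pd

# Hypothetical sample of already-decoded records.
records = pd.DataFrame({
    "manner_of_death": ["Accident", "Natural", "Suicide", "Accident", "Homicide", "Accident"],
    "race_recode_5": ["White", "White", "White", "Black", "Black", "White"],
})
remove = ["Natural"]

# Drop the manners of death that we do not want to plot.
kept = records[~records["manner_of_death"].isin(remove)]
# Rows are races, columns are manners; each cell counts matching records.
counts = pd.crosstab(kept["race_recode_5"], kept["manner_of_death"])

left, right = sorted(counts.index), sorted(counts.columns)
features = left + right
# Start from zeros, then fill both off-diagonal blocks so the matrix is symmetric.
d = pd.DataFrame(0, index=features, columns=features)
d.loc[left, right] = counts.loc[left, right].values
d.loc[right, left] = counts.loc[left, right].values.T

print(d)
```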
In [31]:
for race in left:
    d.loc[ race , : ] = ((d.loc[ race , : ] / d.loc[ race , : ].sum()) * 100)
    d.loc[ : , race  ] = ((d.loc[  : , race  ] / d.loc[  : , race  ].sum()) * 100)
In [32]:
(d[race].value_counts(normalize=True)*100).astype(int)
Out[32]:
0.000000     57
71.621960    14
23.259859    14
5.118181     14
Name: White, dtype: int64
In [33]:
d
Out[33]:
American Indian Asian or Pacific Islander Black White Accident Homicide Suicide
American Indian 0.000000 0.00000 0.000000 0.000000 69.996614 10.294616 19.708771
Asian or Pacific Islander 0.000000 0.00000 0.000000 0.000000 64.101480 7.695560 28.202960
Black 0.000000 0.00000 0.000000 0.000000 56.879673 34.024492 9.095835
White 0.000000 0.00000 0.000000 0.000000 71.621960 5.118181 23.259859
Accident 69.996614 64.10148 56.879673 71.621960 0.000000 0.000000 0.000000
Homicide 10.294616 7.69556 34.024492 5.118181 0.000000 0.000000 0.000000
Suicide 19.708771 28.20296 9.095835 23.259859 0.000000 0.000000 0.000000
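
The normalisation above turns each race's counts into percentages of that race's total, so that the four groups can be compared despite their very different sizes. A minimal sketch of the same arithmetic, using two rows of counts copied from the matrix above and a vectorised division in place of the loop:

```python
import pandas as pd

# Counts copied from the count matrix above.
counts = pd.DataFrame(
    [[2067, 304, 582], [15746, 9419, 2518]],
    index=["American Indian", "Black"],
    columns=["Accident", "Homicide", "Suicide"],
)

# Divide each row by its own total, then scale to percent.
percent = counts.div(counts.sum(axis=1), axis=0) * 100

print(percent.round(2))
```

Each row of the result sums to 100, which is why the chord diagram can use "percent" as its noun.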

Chord Diagram

Time to visualise the relationship between race and manner of death using a chord diagram. We are going to use a list of custom colours, one for each of the seven categories.

In [34]:
colors = ["#2DE1FC","#1883B4","#C5DB66","#90B64D","#DB2B39","#E76926", "#DB9118"]
In [35]:
names = left + right

Before plotting, we'll shorten the two longest labels so that they fit around the diagram.

In [36]:
names[0] = "AIAN"
names[1] = "Asian or PI"

Finally, we can put it all together.

In [38]:
Chord(
    d.round(2).values.tolist(),
    names,
    colors=colors,
    credit=True,
    wrap_labels=True,
    margin=50,
    font_size_large=7,
    divide=True,
    noun="percent",
    divide_idx=len(left),
    divide_size=.2,
    width=850).show()
Chord Diagram

Chord Diagram with Names

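It would be nice to show a list of names when hovering over a chord, which the optional details parameter supports. The cells in this section are carried over from the Animal Crossing villagers notebook and expect the columns species, personality, url, and name. The mortality dataset has none of these columns, so the cell that populates the details raises a KeyError, as the traceback shows.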

Next, we'll create an empty multi-dimensional array with the same shape as our matrix.

In [39]:
details = np.empty((len(names),len(names)),dtype=object)
details_thumbs = np.empty((len(names),len(names)),dtype=object)

Now we try to populate the details array with lists of names in the correct positions. The mortality data has no species column, so this cell raises a KeyError.

In [40]:
for count_x, item_x in enumerate(names):
    for count_y, item_y in enumerate(names):
        details_urls = data[
            (data['species'].isin([item_x, item_y])) &
            (data['personality'].isin([item_y, item_x]))]['url'].to_list()
        
        details_names = data[
            (data['species'].isin([item_x, item_y])) &
            (data['personality'].isin([item_y, item_x]))]['name'].to_list()
        
        urls_names = np.column_stack((details_urls, details_names))
        if(urls_names.size > 0):
            details[count_x][count_y] = details_names
            details_thumbs[count_x][count_y] = details_urls

        else:
            details[count_x][count_y] = []
            details_thumbs[count_x][count_y] = []

details=pd.DataFrame(details).values.tolist()
details_thumbs=pd.DataFrame(details_thumbs).values.tolist()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/opt/miniconda3/envs/analytics/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-40-f54e3a0b2906> in <module>
      2     for count_y, item_y in enumerate(names):
      3         details_urls = data[
----> 4             (data['species'].isin([item_x, item_y])) &
      5             (data['personality'].isin([item_y, item_x]))]['url'].to_list()
      6 

~/opt/miniconda3/envs/analytics/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2798             if self.columns.nlevels > 1:
   2799                 return self._getitem_multilevel(key)
-> 2800             indexer = self.columns.get_loc(key)
   2801             if is_integer(indexer):
   2802                 indexer = [indexer]

~/opt/miniconda3/envs/analytics/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'
In [ ]:
len(right)
In [ ]:
Chord(d.values.tolist(), names, credit=True, colors=colors, wrap_labels=False,
      margin=40, font_size_large=7, details=details, details_thumbs=details_thumbs,
      noun="percent", details_separator="", divide=True, divide_idx=len(left),
      divide_size=.2, width=850).show()
In [ ]:
np.empty(shape=(6,1)).tolist()

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Visualisation of Co-occurring Types

Preamble

In [2]:
:dep darn = {version = "0.1.11"}
:dep ndarray = {version = "0.13.0"}
:dep itertools = {version = "0.9.0"}
:dep chord = {version = "0.1.6"}
extern crate ndarray;

use ndarray::prelude::*;
use itertools::Itertools;
use chord::{Chord, Plot};

Introduction

In this section, we're going to use the Complete Pokemon Dataset to visualise the co-occurrence of Pokémon types from generations one to eight.

The Dataset

The dataset documentation states that we can expect 1028 samples covering the first eight generations, each with two type variables: type_1 and type_2.

Let's download the mirrored dataset and have a look for ourselves.

In [3]:
let data = darn::read_csv("https://shahinrostami.com/datasets/pokemon_gen_1_to_8.csv");
In [4]:
darn::show_frame(&data.0, Some(&data.1));
Out[4]:
pokedex_number name german_name japanese_name generation status species type_number type_1 type_2 height_m weight_kg abilities_number ability_1 ability_2 ability_hidden total_points hp attack defense sp_attack sp_defense speed catch_rate base_friendship base_experience growth_rate egg_type_number egg_type_1 egg_type_2 percentage_male egg_cycles against_normal against_fire against_water against_electric against_grass against_ice against_fight against_poison against_ground against_flying against_psychic against_bug against_rock against_ghost against_dragon against_dark against_steel against_fairy
"0" "1" "Bulbasaur" "Bisasam" "フシギダネ (Fushigidane)" "1" "Normal" "Seed Pokémon" "2" "Grass" "Poison" "0.7" "6.9" "2" "Overgrow" "" "Chlorophyll" "318" "45" "49" "49" "65" "65" "45" "45" "70" "64" "Medium Slow" "2" "Grass" "Monster" "87.5" "20" "1" "2" "0.5" "0.5" "0.25" "2" "0.5" "1" "1" "2" "2" "1" "1" "1" "1" "1" "1" "0.5"
"1" "2" "Ivysaur" "Bisaknosp" "フシギソウ (Fushigisou)" "1" "Normal" "Seed Pokémon" "2" "Grass" "Poison" "1" "13" "2" "Overgrow" "" "Chlorophyll" "405" "60" "62" "63" "80" "80" "60" "45" "70" "142" "Medium Slow" "2" "Grass" "Monster" "87.5" "20" "1" "2" "0.5" "0.5" "0.25" "2" "0.5" "1" "1" "2" "2" "1" "1" "1" "1" "1" "1" "0.5"
"2" "3" "Venusaur" "Bisaflor" "フシギバナ (Fushigibana)" "1" "Normal" "Seed Pokémon" "2" "Grass" "Poison" "2" "100" "2" "Overgrow" "" "Chlorophyll" "525" "80" "82" "83" "100" "100" "80" "45" "70" "236" "Medium Slow" "2" "Grass" "Monster" "87.5" "20" "1" "2" "0.5" "0.5" "0.25" "2" "0.5" "1" "1" "2" "2" "1" "1" "1" "1" "1" "1" "0.5"
"3" "3" "Mega Venusaur" "Bisaflor" "フシギバナ (Fushigibana)" "1" "Normal" "Seed Pokémon" "2" "Grass" "Poison" "2.4" "155.5" "1" "Thick Fat" "" "" "625" "80" "100" "123" "122" "120" "80" "45" "70" "281" "Medium Slow" "2" "Grass" "Monster" "87.5" "20" "1" "1" "0.5" "0.5" "0.25" "1" "0.5" "1" "1" "2" "2" "1" "1" "1" "1" "1" "1" "0.5"
"4" "4" "Charmander" "Glumanda" "ヒトカゲ (Hitokage)" "1" "Normal" "Lizard Pokémon" "1" "Fire" "" "0.6" "8.5" "2" "Blaze" "" "Solar Power" "309" "39" "52" "43" "60" "50" "65" "45" "70" "62" "Medium Slow" "2" "Dragon" "Monster" "87.5" "20" "1" "0.5" "2" "1" "0.5" "0.5" "1" "1" "2" "1" "1" "0.5" "2" "1" "1" "1" "0.5" "0.5"
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
"1023" "888" "Zacian Hero of Many Battles" "" "" "8" "Legendary" "Warrior Pokémon" "1" "Fairy" "" "2.8" "110" "1" "Intrepid Sword" "" "" "670" "92" "130" "115" "80" "115" "138" "" "" "" "Slow" "1" "Undiscovered" "" "" "120" "1" "1" "1" "1" "1" "1" "0.5" "2" "1" "1" "1" "0.5" "1" "1" "0" "0.5" "2" "1"
"1024" "889" "Zamazenta Crowned Shield" "" "" "8" "Legendary" "Warrior Pokémon" "2" "Fighting" "Steel" "2.9" "785" "1" "Dauntless Shield" "" "" "720" "92" "130" "145" "80" "145" "128" "" "" "" "Slow" "1" "Undiscovered" "" "" "120" "0.5" "2" "1" "1" "0.5" "0.5" "2" "0" "2" "1" "1" "0.25" "0.25" "1" "0.5" "0.5" "0.5" "1"
"1025" "889" "Zamazenta Hero of Many Battles" "" "" "8" "Legendary" "Warrior Pokémon" "1" "Fighting" "" "2.9" "210" "1" "Dauntless Shield" "" "" "670" "92" "130" "115" "80" "115" "138" "" "" "" "Slow" "1" "Undiscovered" "" "" "120" "1" "1" "1" "1" "1" "1" "1" "1" "1" "2" "2" "0.5" "0.5" "1" "1" "0.5" "1" "2"
"1026" "890" "Eternatus" "" "" "8" "Legendary" "Gigantic Pokémon" "2" "Poison" "Dragon" "20" "950" "1" "Pressure" "" "" "690" "140" "85" "95" "145" "95" "130" "" "" "" "Slow" "1" "Undiscovered" "" "" "120" "1" "0.5" "0.5" "0.5" "0.25" "2" "0.5" "0.5" "2" "1" "2" "0.5" "1" "1" "2" "1" "1" "1"
"1027" "890" "Eternatus Eternamax" "" "" "8" "Legendary" "Gigantic Pokémon" "2" "Poison" "Dragon" "100" "" "0" "" "" "" "1125" "255" "115" "250" "125" "250" "130" "" "" "" "Slow" "1" "Undiscovered" "" "" "120" "1" "0.5" "0.5" "0.5" "0.25" "2" "0.5" "0.5" "2" "1" "2" "0.5" "1" "1" "2" "1" "1" "1"

It looks good so far, and we can clearly see the two type columns. Let's confirm that we have 1028 samples.

In [5]:
&data.0.shape()
Out[5]:
[1028, 51]

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the Pokémon types are split between the columns type_1 and type_2.

In [6]:
&data.1
Out[6]:
["", "pokedex_number", "name", "german_name", "japanese_name", "generation", "status", "species", "type_number", "type_1", "type_2", "height_m", "weight_kg", "abilities_number", "ability_1", "ability_2", "ability_hidden", "total_points", "hp", "attack", "defense", "sp_attack", "sp_defense", "speed", "catch_rate", "base_friendship", "base_experience", "growth_rate", "egg_type_number", "egg_type_1", "egg_type_2", "percentage_male", "egg_cycles", "against_normal", "against_fire", "against_water", "against_electric", "against_grass", "against_ice", "against_fight", "against_poison", "against_ground", "against_flying", "against_psychic", "against_bug", "against_rock", "against_ghost", "against_dragon", "against_dark", "against_steel", "against_fairy"]

So let's select just these two columns and work with a list containing only them as we move forward.

In [7]:
let types = data.0.slice(s![.., 9..11]).into_owned();
darn::show_frame(&types, None);
Out[7]:
"Grass" "Poison"
"Grass" "Poison"
"Grass" "Poison"
"Grass" "Poison"
"Fire" ""
... ...
"Fairy" ""
"Fighting" "Steel"
"Fighting" ""
"Poison" "Dragon"
"Poison" "Dragon"

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

First, we'll populate our list of type names by looking for the unique ones.

In [8]:
let mut names = types.iter().cloned().unique().collect_vec();
names
Out[8]:
["Grass", "Poison", "Fire", "", "Flying", "Dragon", "Water", "Bug", "Normal", "Dark", "Electric", "Psychic", "Ground", "Ice", "Steel", "Fairy", "Fighting", "Rock", "Ghost"]

Let's sort this alphabetically.

In [9]:
names.sort();
names
Out[9]:
["", "Bug", "Dark", "Dragon", "Electric", "Fairy", "Fighting", "Fire", "Flying", "Ghost", "Grass", "Ground", "Ice", "Normal", "Poison", "Psychic", "Rock", "Steel", "Water"]

We'll also remove the empty string that has appeared as a result of samples with only one type.

In [10]:
names.remove(0);
names
Out[10]:
["Bug", "Dark", "Dragon", "Electric", "Fairy", "Fighting", "Fire", "Flying", "Ghost", "Grass", "Ground", "Ice", "Normal", "Poison", "Psychic", "Rock", "Steel", "Water"]

Now we can create our empty co-occurrence matrix with a shape that can hold co-occurrences between our types.

In [11]:
let type_count = names.len();
let mut matrix: Vec<Vec<f64>> = vec![vec![Default::default(); type_count]; type_count];
matrix
Out[11]:
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]

We can populate a co-occurrence matrix with the following approach. Here, we're looping through every sample in our dataset and incrementing the corresponding matrix entry by one using the type_1 and type_2 indices from the names vector. To make sure we have a co-occurrence matrix, we're also doing the same in reverse, i.e. type_2 and type_1.

In [29]:
for item in types.genrows() {
    // Skip single-type samples, which have an empty second type.
    if !item[0].is_empty() && !item[1].is_empty() {
        let i = names.iter().position(|s| s == &item[0]).unwrap();
        let j = names.iter().position(|s| s == &item[1]).unwrap();
        // Count the pairing in both directions to keep the matrix symmetric.
        matrix[i][j] += 1.0;
        matrix[j][i] += 1.0;
    }
}

Chord Diagram

Time to visualise the co-occurrence of types using a chord diagram. We are going to use a list of custom colours that represent the types.

In [26]:
let colors: Vec<String> = vec![
    "#A6B91A", "#705746", "#6F35FC", "#F7D02C", "#D685AD",
    "#C22E28", "#EE8130", "#A98FF3", "#735797", "#7AC74C",
    "#E2BF65", "#96D9D6", "#A8A77A", "#A33EA1", "#F95587",
    "#B6A136", "#B7B7CE", "#6390F0"
]
.into_iter()
.map(String::from)
.collect();

Finally, we can put it all together.

In [27]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    colors: colors,
    margin: 30.0,
    wrap_labels: true,
    ..Chord::default()
}
.show();
Out[27]:
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Co-occurring Pokémon Types

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In previous sections, we visualised co-occurrences of Pokémon types. Whilst it was interesting to look at, the dataset only contained Pokémon from the first six generations. In this section, we're going to use the Pokemon with stats Generation 8 dataset to visualise the co-occurrence of Pokémon types from generations one to eight.

The Dataset

The dataset documentation states that we can expect 1028 samples covering the first eight generations, each with two type variables: type_1 and type_2.

Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/pokemon_gen_1_to_8.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
Unnamed: 0 pokedex_number name german_name japanese_name generation status species type_number type_1 ... against_ground against_flying against_psychic against_bug against_rock against_ghost against_dragon against_dark against_steel against_fairy
0 0 1 Bulbasaur Bisasam フシギダネ (Fushigidane) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
1 1 2 Ivysaur Bisaknosp フシギソウ (Fushigisou) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
2 2 3 Venusaur Bisaflor フシギバナ (Fushigibana) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
3 3 3 Mega Venusaur Bisaflor フシギバナ (Fushigibana) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
4 4 4 Charmander Glumanda ヒトカゲ (Hitokage) 1 Normal Lizard Pokémon 1 Fire ... 2.0 1.0 1.0 0.5 2.0 1.0 1.0 1.0 0.5 0.5

5 rows × 51 columns

It looks good so far, but let's confirm that we have 1028 samples.

In [3]:
data.shape
Out[3]:
(1028, 51)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the Pokémon types are split between the columns type_1 and type_2.

In [4]:
pd.DataFrame(data.columns.values.tolist())
Out[4]:
0
0 Unnamed: 0
1 pokedex_number
2 name
3 german_name
4 japanese_name
5 generation
6 status
7 species
8 type_number
9 type_1
10 type_2
11 height_m
12 weight_kg
13 abilities_number
14 ability_1
15 ability_2
16 ability_hidden
17 total_points
18 hp
19 attack
20 defense
21 sp_attack
22 sp_defense
23 speed
24 catch_rate
25 base_friendship
26 base_experience
27 growth_rate
28 egg_type_number
29 egg_type_1
30 egg_type_2
31 percentage_male
32 egg_cycles
33 against_normal
34 against_fire
35 against_water
36 against_electric
37 against_grass
38 against_ice
39 against_fight
40 against_poison
41 against_ground
42 against_flying
43 against_psychic
44 against_bug
45 against_rock
46 against_ghost
47 against_dragon
48 against_dark
49 against_steel
50 against_fairy

So let's select just these two columns and work with a list containing only them as we move forward.

In [5]:
types = pd.DataFrame(data[['type_1', 'type_2']].values)
types
Out[5]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Grass Poison
4 Fire NaN
... ... ...
1023 Fairy NaN
1024 Fighting Steel
1025 Fighting NaN
1026 Poison Dragon
1027 Poison Dragon

1028 rows × 2 columns

Without further investigation, we can see that we have at least a few NaN values in the table above. We are only interested in co-occurrence of types, so we can remove all samples which contain a NaN value.

In [6]:
types = types.dropna()

We can also see an instance where the type Fighting at index $1014$ is followed by \n. We'll strip all these out before continuing.

In [7]:
types = types.replace('\n','', regex=True)
types
Out[7]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Grass Poison
6 Fire Flying
... ... ...
1021 Dragon Ghost
1022 Fairy Steel
1024 Fighting Steel
1026 Poison Dragon
1027 Poison Dragon

542 rows × 2 columns
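
The two cleaning steps above can be sketched on a small example. The sample rows below are hypothetical and only mimic the shape of the types table: one single-type row and one label with a trailing newline.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same shape as the types table above.
types = pd.DataFrame([["Grass", "Poison"], ["Fire", np.nan], ["Fighting\n", "Steel"]])

# Drop single-type rows, then strip stray newline characters from the labels.
cleaned = types.dropna().replace("\n", "", regex=True)

print(cleaned.values.tolist())
```

With regex=True the pattern matches inside each string, so only the newline is removed and the rest of the label is kept.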

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

First we'll populate our list of type names by looking for the unique ones.

In [8]:
names = np.unique(types).tolist()
pd.DataFrame(names)
Out[8]:
0
0 Bug
1 Dark
2 Dragon
3 Electric
4 Fairy
5 Fighting
6 Fire
7 Flying
8 Ghost
9 Grass
10 Ground
11 Ice
12 Normal
13 Poison
14 Psychic
15 Rock
16 Steel
17 Water

Now we can create our empty co-occurrence matrix using these type names for the row and column indices.

In [9]:
matrix = pd.DataFrame(0, index=names, columns=names)
matrix
Out[9]:
Bug Dark Dragon Electric Fairy Fighting Fire Flying Ghost Grass Ground Ice Normal Poison Psychic Rock Steel Water
Bug 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Dark 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Dragon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Electric 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fairy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fighting 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fire 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Flying 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ghost 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Grass 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ground 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ice 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Normal 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Poison 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Psychic 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Rock 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Steel 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Water 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We can populate a co-occurrence matrix with the following approach. We'll start by creating a list with every type pairing in its original and reversed form.

In [10]:
types = list(itertools.chain.from_iterable((i, i[::-1]) for i in types.values))

We can now use this list to populate the matrix.

In [11]:
for t in types:
    matrix.at[t[0], t[1]] += 1
    
matrix = matrix.values.tolist()
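
To see why each pairing is counted in both orders, here is a minimal sketch on three invented pairs. Counting (a, b) together with (b, a) yields a symmetric matrix, which is what an undirected co-occurrence needs. The toy pairs and the names counts and both_ways are assumptions for illustration only.

```python
import itertools
import numpy as np
import pandas as pd

pairs = np.array([["Grass", "Poison"], ["Grass", "Poison"], ["Fire", "Flying"]])
labels = np.unique(pairs).tolist()

# Every pair in its original and reversed order.
both_ways = list(itertools.chain.from_iterable((p, p[::-1]) for p in pairs))

counts = pd.DataFrame(0, index=labels, columns=labels)
for a, b in both_ways:
    counts.at[a, b] += 1

# The result is symmetric: the count for (a, b) equals the count for (b, a).
print((counts.values == counts.values.T).all())
print(counts)
```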

We can list this using a DataFrame for better presentation.

In [12]:
pd.DataFrame(matrix)
Out[12]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 4 2 4 4 14 1 6 2 2 0 13 2 5 7 5
1 0 0 4 2 3 3 4 5 3 3 3 2 5 5 3 2 2 7
2 0 4 0 3 1 2 3 8 5 5 9 3 1 4 5 2 2 3
3 4 2 3 0 2 0 1 6 1 1 1 2 2 3 1 3 4 3
4 2 3 1 2 0 0 0 2 1 5 0 1 5 1 9 3 5 4
5 4 3 2 0 0 0 7 1 1 3 0 1 4 2 6 1 4 3
6 4 4 3 1 0 7 0 7 5 0 4 1 2 2 3 3 1 1
7 14 5 8 6 2 1 7 0 3 7 4 2 27 3 7 6 3 8
8 1 3 5 1 1 1 5 3 0 12 6 1 0 4 3 0 4 2
9 6 3 5 1 5 3 0 7 12 0 1 3 2 15 3 2 3 3
10 2 3 9 1 0 0 4 4 6 1 0 3 1 2 2 9 6 10
11 2 2 3 2 1 1 1 2 1 3 3 0 0 0 4 2 2 7
12 0 5 1 2 5 4 2 27 0 2 1 0 0 0 5 0 0 1
13 13 5 4 3 1 2 2 3 4 15 2 0 0 0 0 1 0 6
14 2 3 5 1 9 6 3 7 3 3 2 4 5 0 0 2 9 6
15 5 2 2 3 3 1 3 6 0 2 9 2 0 1 2 0 7 11
16 7 2 2 4 5 4 1 3 4 3 6 2 0 0 9 7 0 1
17 5 7 3 3 4 3 1 8 2 3 10 7 1 6 6 11 1 0

Chord Diagram

Time to visualise the co-occurrence of types using a chord diagram. We are going to use a list of custom colours that represent the types.

In [13]:
colors = ["#A6B91A", "#705746", "#6F35FC", "#F7D02C", "#D685AD",
          "#C22E28", "#EE8130", "#A98FF3", "#735797", "#7AC74C",
          "#E2BF65", "#96D9D6", "#A8A77A", "#A33EA1", "#F95587",
          "#B6A136", "#B7B7CE", "#6390F0"];
In [14]:
names
Out[14]:
['Bug',
 'Dark',
 'Dragon',
 'Electric',
 'Fairy',
 'Fighting',
 'Fire',
 'Flying',
 'Ghost',
 'Grass',
 'Ground',
 'Ice',
 'Normal',
 'Poison',
 'Psychic',
 'Rock',
 'Steel',
 'Water']

Finally, we can put it all together.

In [15]:
Chord(matrix, names, colors=colors).show()
Chord Diagram

Chord Diagram with Names

It would be nice to show a list of Pokémon names and images when hovering over co-occurring Pokémon types. To do this, we can make use of the optional details parameter.

Let's clean up our dataset by removing all instances of \n.

In [16]:
data = data.replace('\n','', regex=True)

Let's also add a column to our dataset to store URLs that point to the images.

In [17]:
data['URL'] = ""

for index, row in data.iterrows():
    dex = f"{row['pokedex_number']:03d}"
    url = f"https://shahinrostami.com/images/data-is-beautiful/pokemon_thumbs/{dex}.png"
    data.at[index,'URL'] = url
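
Row-by-row iteration can become slow on larger datasets. One possible alternative is to build the URL column in a single vectorised step. The sketch below assumes pokedex_number holds integers and uses a toy frame in place of the real data.

```python
import pandas as pd

# Toy frame; the real dataset is assumed to have an integer pokedex_number column.
toy = pd.DataFrame({"pokedex_number": [1, 25, 150]})

base = "https://shahinrostami.com/images/data-is-beautiful/pokemon_thumbs/"
# Zero-pad each number to three digits, then wrap it in the URL.
toy["URL"] = base + toy["pokedex_number"].astype(str).str.zfill(3) + ".png"

print(toy["URL"].tolist())
```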

Next, we'll create two empty multi-dimensional arrays with the same shape as our matrix: one for our details and one for our thumbnail images.

In [18]:
details = np.empty((len(names),len(names)),dtype=object)
details_thumbs = np.empty((len(names),len(names)),dtype=object)

Now we can populate the details array with lists of Pokémon names in the correct positions.

In [19]:
for count_x, item_x in enumerate(names):
    for count_y, item_y in enumerate(names):
        details_urls = data[
            (data['type_1'].isin([item_x, item_y])) &
            (data['type_2'].isin([item_y, item_x]))]['URL'].to_list()
        
        details_names = data[
            (data['type_1'].isin([item_x, item_y])) &
            (data['type_2'].isin([item_y, item_x]))]['name'].to_list()
        
        urls_names = np.column_stack((details_urls, details_names))
        if(urls_names.size > 0):
            details[count_x][count_y] = details_names
            details_thumbs[count_x][count_y] = details_urls

        else:
            details[count_x][count_y] = []
            details_thumbs[count_x][count_y] = []

details=pd.DataFrame(details).values.tolist()
details_thumbs=pd.DataFrame(details_thumbs).values.tolist()
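
The nested loop above filters the whole dataset once per cell. As a hedged alternative, a similar structure can be filled in a single pass over the rows. The sketch below uses a three-row toy frame and the hypothetical name toy_details; it is not the code used to produce the diagram.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name":   ["Bulbasaur", "Charizard", "Ivysaur"],
    "type_1": ["Grass", "Fire", "Grass"],
    "type_2": ["Poison", "Flying", "Poison"],
})
labels = np.unique(toy[["type_1", "type_2"]].values).tolist()

# One list of names per (row label, column label) cell, empty by default.
toy_details = [[[] for _ in labels] for _ in labels]
for name, a, b in toy[["name", "type_1", "type_2"]].itertuples(index=False):
    i, j = labels.index(a), labels.index(b)
    toy_details[i][j].append(name)   # original order
    toy_details[j][i].append(name)   # reversed order, to mirror the matrix

print(toy_details[labels.index("Grass")][labels.index("Poison")])
```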

Finally, we can put it all together but this time with the details matrix passed in.

In [20]:
Chord(
    matrix,
    names,
    colors=colors,
    details=details,
    details_thumbs=details_thumbs,
    noun="Pokémon",
    thumbs_width=70,
    thumbs_margin=1,
    popup_width=600,
    thumbs_font_size=10,
    credit=True
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Animal Crossing Villagers - Species and Personalities

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In previous sections, we visualised co-occurrences of Pokémon type. Whilst it was interesting to look at, the dataset only contained Pokémon from the first six generations. In this section, we're going to use the TidyTuesday Animal Crossing villagers dataset to visualise the relationship between Species and Personality.

The Dataset

We can expect 10 variables for each of the 391 villagers in this dataset.

Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/ac_villagers.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
id name gender species birthday personality song phrase full_id url
0 admiral Admiral male bird Jan-27 cranky Steep Hill aye aye villager-admiral https://shahinrostami.com/images/data-is-beaut...
1 agent-s Agent S female squirrel 07-Feb peppy DJ K.K. sidekick villager-agent-s https://shahinrostami.com/images/data-is-beaut...
2 agnes Agnes female pig Apr-21 uchi K.K. House snuffle villager-agnes https://shahinrostami.com/images/data-is-beaut...
3 al Al male gorilla Oct-18 lazy Steep Hill Ayyeeee villager-al https://shahinrostami.com/images/data-is-beaut...
4 alfonso Alfonso male alligator 06-Sep lazy Forest Life it'sa me villager-alfonso https://shahinrostami.com/images/data-is-beaut...

Let's capitalise the name, personality, and species of each villager.

In [3]:
data['name'] = data['name'].str.capitalize()
data['personality'] = data['personality'].str.capitalize()
data['species'] = data['species'].str.capitalize()
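
One caveat worth checking: str.capitalize() upper-cases the first character and lower-cases everything after it, so a multi-word name such as Agent S would become Agent s. The sketch below compares it with str.title() on a few sample values. Whether title-casing suits every value in the real dataset is an assumption to verify.

```python
import pandas as pd

sample = pd.Series(["Agent S", "admiral", "peppy"])

print(sample.str.capitalize().tolist())  # first letter upper, the rest lower
print(sample.str.title().tolist())       # first letter of each word upper
```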

It looks good so far, but let's confirm that we have 10 variables against 391 samples.

In [4]:
data.shape
Out[4]:
(391, 10)

So we have 391 villagers, each described by 10 variables.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the two features we are interested in are stored in the columns species and personality.

In [5]:
pd.DataFrame(data.columns.values.tolist())
Out[5]:
0
0 id
1 name
2 gender
3 species
4 birthday
5 personality
6 song
7 phrase
8 full_id
9 url

So let's select just these two columns and work with a list containing only them as we move forward.

In [6]:
species_personality = pd.DataFrame(data[['species', 'personality']].values)
species_personality
Out[6]:
0 1
0 Bird Cranky
1 Squirrel Peppy
2 Pig Uchi
3 Gorilla Lazy
4 Alligator Lazy
... ... ...
386 Horse Peppy
387 Wolf Cranky
388 Koala Snooty
389 Deer Smug
390 Octopus Lazy

391 rows × 2 columns

Now for the names of our species and personalities.

In [7]:
left = np.unique(pd.DataFrame(species_personality)[0]).tolist()
pd.DataFrame(left)
Out[7]:
0
0 Alligator
1 Anteater
2 Bear
3 Bird
4 Bull
5 Cat
6 Chicken
7 Cow
8 Cub
9 Deer
10 Dog
11 Duck
12 Eagle
13 Elephant
14 Frog
15 Goat
16 Gorilla
17 Hamster
18 Hippo
19 Horse
20 Kangaroo
21 Koala
22 Lion
23 Monkey
24 Mouse
25 Octopus
26 Ostrich
27 Penguin
28 Pig
29 Rabbit
30 Rhino
31 Sheep
32 Squirrel
33 Tiger
34 Wolf
In [8]:
right = np.unique(pd.DataFrame(species_personality)[1]).tolist()
pd.DataFrame(right)
Out[8]:
0
0 Cranky
1 Jock
2 Lazy
3 Normal
4 Peppy
5 Smug
6 Snooty
7 Uchi

We can now use these two lists to create an empty matrix.

In [9]:
features = left + right
d = pd.DataFrame(0, index=features, columns=features)

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by creating a list with every species and personality pairing in its original and reversed form.

In [10]:
species_personality = list(itertools.chain.from_iterable((i, i[::-1]) for i in species_personality.values))
In [11]:
for x in species_personality:
    d.at[x[0], x[1]] += 1
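
Because species and personality are two different kinds of label, the matrix consists of two mirrored blocks with zeros elsewhere. As a cross-check, the sketch below uses pd.crosstab on a toy frame (invented rows) to build the species-by-personality block directly, then places it and its transpose into the full square matrix. The names block and square are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"species": ["Bird", "Bird", "Cat"],
                    "personality": ["Cranky", "Jock", "Cranky"]})

# Species-by-personality counts.
block = pd.crosstab(toy["species"], toy["personality"])

left, right = block.index.tolist(), block.columns.tolist()
features = left + right

# Square matrix: zeros on the diagonal blocks, counts on the off-diagonal blocks.
square = pd.DataFrame(0, index=features, columns=features)
square.loc[left, right] = block.values
square.loc[right, left] = block.values.T
print(square)
```
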
In [12]:
d
Out[12]:
Alligator Anteater Bear Bird Bull Cat Chicken Cow Cub Deer ... Tiger Wolf Cranky Jock Lazy Normal Peppy Smug Snooty Uchi
Alligator 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 2 2 1 0 0 1 0
Anteater 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 1 2 1 1 0
Bear 0 0 0 0 0 0 0 0 0 0 ... 0 0 5 1 1 1 2 2 0 3
Bird 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 4 2 1 2 2 1 0
Bull 0 0 0 0 0 0 0 0 0 0 ... 0 0 3 1 2 0 0 0 0 0
Cat 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 3 3 3 5 1 5 1
Chicken 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 2 1 0 1 2 1
Cow 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 1 0 2 0
Cub 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 2 4 4 2 0 1 1
Deer 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 2 1 0 2 1 2
Dog 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 6 3 2 1 1 1
Duck 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 2 4 2 4 1 4 0
Eagle 0 0 0 0 0 0 0 0 0 0 ... 0 0 4 2 0 1 0 1 1 0
Elephant 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 4 3 0 0 2 0
Frog 0 0 0 0 0 0 0 0 0 0 ... 0 0 3 5 3 2 1 2 1 1
Goat 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 1 2 0 1 1 1
Gorilla 0 0 0 0 0 0 0 0 0 0 ... 0 0 3 2 1 0 0 1 1 1
Hamster 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 1 1 1 2 1 0
Hippo 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 1 0 1 1 1 1 0
Horse 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 3 2 2 3 2 1
Kangaroo 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 0 0 3 0 0 2 1
Koala 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 1 3 0 1 1 1
Lion 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 3 1 0 0 2 0 0
Monkey 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 2 1 1 0 1 1
Mouse 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 3 1 2 4 1 2 0
Octopus 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 1 0 0 0 0
Ostrich 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 2 1 1 3 1
Penguin 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 2 4 1 1 1 2 1
Pig 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 3 2 3 2 1 1 1
Rabbit 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 2 4 1 8 1 2 1
Rhino 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 1 2 0 0 0 1
Sheep 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 3 1 2 4 2
Squirrel 0 0 0 0 0 0 0 0 0 0 ... 0 0 2 1 1 5 3 1 4 1
Tiger 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 3 0 0 2 0 1 0
Wolf 0 0 0 0 0 0 0 0 0 0 ... 0 0 5 0 0 1 1 1 3 0
Cranky 1 1 5 1 3 2 1 0 2 1 ... 1 5 0 0 0 0 0 0 0 0
Jock 2 1 1 4 1 3 1 0 2 1 ... 3 0 0 0 0 0 0 0 0 0
Lazy 2 0 1 2 2 3 2 0 4 2 ... 0 0 0 0 0 0 0 0 0 0
Normal 1 1 1 1 0 3 1 1 4 1 ... 0 1 0 0 0 0 0 0 0 0
Peppy 0 2 2 2 0 5 0 1 2 0 ... 2 1 0 0 0 0 0 0 0 0
Smug 0 1 2 2 0 1 1 0 0 2 ... 0 1 0 0 0 0 0 0 0 0
Snooty 1 1 0 1 0 5 2 2 1 1 ... 1 3 0 0 0 0 0 0 0 0
Uchi 0 0 3 0 0 1 1 0 1 2 ... 0 0 0 0 0 0 0 0 0 0

43 rows × 43 columns

Chord Diagram

Time to visualise the co-occurrence of species and personalities using a chord diagram. We are going to use a list of custom colours, one for each segment.

In [13]:
colors =["#ff2200","#ffcc00","#ace6cb","#0057d9","#633366","#73341d","#665f00","#00ffcc","#001433","#e6acda","#ffa280","#eeff00","#336663","#001f73","#ff00aa","#ffd9bf","#f2ffbf","#36ced9","#737399","#73003d","#ff8800","#44ff00","#00a2f2","#6600ff","#ff0044","#99754d","#416633","#004d73","#5e008c","#bf606c","#332200","#60bf60","#acd2e6","#e680ff","#66333a","#3d005c","#6e0060","#99005d","#bd0055","#db2f48","#f05738","#fc7e23","#ffa600"]
In [14]:
names = left + right

Finally, we can put it all together.

In [15]:
Chord(d.values.tolist(), names,colors=colors, wrap_labels=False, margin=40, font_size_large=10).show()
Chord Diagram
In [16]:
Chord(d.values.tolist(), names, credit=True, colors=colors, wrap_labels=False,
      margin=40, font_size_large=7, noun="villagers",
      details_separator="", divide=True, divide_idx=len(left), divide_size=.2,
      width=850).show()
Chord Diagram

Chord Diagram with Names

It would be nice to show a list of villager names when hovering over a co-occurring species and personality. To do this, we can make use of the optional details parameter.

Next, we'll create two empty multi-dimensional arrays with the same shape as our matrix: one for the villager names and one for their thumbnail images.

In [17]:
details = np.empty((len(names),len(names)),dtype=object)
details_thumbs = np.empty((len(names),len(names)),dtype=object)

Now we can populate the details array with lists of villager names in the correct positions.

In [18]:
for count_x, item_x in enumerate(names):
    for count_y, item_y in enumerate(names):
        details_urls = data[
            (data['species'].isin([item_x, item_y])) &
            (data['personality'].isin([item_y, item_x]))]['url'].to_list()
        
        details_names = data[
            (data['species'].isin([item_x, item_y])) &
            (data['personality'].isin([item_y, item_x]))]['name'].to_list()
        
        urls_names = np.column_stack((details_urls, details_names))
        if(urls_names.size > 0):
            details[count_x][count_y] = details_names
            details_thumbs[count_x][count_y] = details_urls

        else:
            details[count_x][count_y] = []
            details_thumbs[count_x][count_y] = []

details=pd.DataFrame(details).values.tolist()
details_thumbs=pd.DataFrame(details_thumbs).values.tolist()
In [19]:
len(right)
Out[19]:
8

Finally, we can put it all together but this time with the details matrix passed in.

In [23]:
Chord(d.values.tolist(), names, credit=True, colors=colors, wrap_labels=False,
      margin=40, padding=0, font_size_large=7, details=details,
      details_thumbs=details_thumbs, noun="villagers",
      details_separator="", divide=True, divide_idx=len(left), divide_size=.2,
      width=850).show()
Chord Diagram
In [21]:
np.empty(shape=(6,1)).tolist()
Out[21]:
[[2.291755454583e-312],
 [2.22809558106e-312],
 [2.143215749443e-312],
 [2.37663528627e-312],
 [2.29175545472e-312],
 [0.0]]

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Co-occurrence of Anime Genres with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from ast import literal_eval
from chordx import Chord

Introduction

In this section, we're going to use the MyAnimeList dataset to visualise the co-occurrence of anime genres.

The Dataset

The dataset documentation states that we can expect 31 variables per each of the 14478 entries. Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://datacrayon.com/datasets/anime_list.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
anime_id title title_english title_japanese title_synonyms image_url type source episodes status ... background premiered broadcast related producer licensor studio genre opening_theme ending_theme
0 11013 Inu x Boku SS Inu X Boku Secret Service 妖狐×僕SS Youko x Boku SS https://myanimelist.cdn-dena.com/images/anime/... TV Manga 12 Finished Airing ... Inu x Boku SS was licensed by Sentai Filmworks... Winter 2012 Fridays at Unknown {'Adaptation': [{'mal_id': 17207, 'type': 'man... Aniplex, Square Enix, Mainichi Broadcasting Sy... Sentai Filmworks David Production Comedy, Supernatural, Romance, Shounen ['"Nirvana" by MUCC'] ['#1: "Nirvana" by MUCC (eps 1, 11-12)', '#2: ...
1 2104 Seto no Hanayome My Bride is a Mermaid 瀬戸の花嫁 The Inland Sea Bride https://myanimelist.cdn-dena.com/images/anime/... TV Manga 26 Finished Airing ... NaN Spring 2007 Unknown {'Adaptation': [{'mal_id': 759, 'type': 'manga... TV Tokyo, AIC, Square Enix, Sotsu Funimation Gonzo Comedy, Parody, Romance, School, Shounen ['"Romantic summer" by SUN&LUNAR'] ['#1: "Ashita e no Hikari (明日への光)" by Asuka Hi...
2 5262 Shugo Chara!! Doki Shugo Chara!! Doki しゅごキャラ!!どきっ Shugo Chara Ninenme, Shugo Chara! Second Year https://myanimelist.cdn-dena.com/images/anime/... TV Manga 51 Finished Airing ... NaN Fall 2008 Unknown {'Adaptation': [{'mal_id': 101, 'type': 'manga... TV Tokyo, Sotsu NaN Satelight Comedy, Magic, School, Shoujo ['#1: "Minna no Tamago (みんなのたまご)" by Shugo Cha... ['#1: "Rottara Rottara (ロッタラ ロッタラ)" by Buono! ...
3 721 Princess Tutu Princess Tutu プリンセスチュチュ NaN https://myanimelist.cdn-dena.com/images/anime/... TV Original 38 Finished Airing ... Princess Tutu aired in two parts. The first pa... Summer 2002 Fridays at Unknown {'Adaptation': [{'mal_id': 1581, 'type': 'mang... Memory-Tech, GANSIS, Marvelous AQL ADV Films Hal Film Maker Comedy, Drama, Magic, Romance, Fantasy ['"Morning Grace" by Ritsuko Okazaki'] ['"Watashi No Ai Wa Chiisaikeredo" by Ritsuko ...
4 12365 Bakuman. 3rd Season Bakuman. バクマン。 Bakuman Season 3 https://myanimelist.cdn-dena.com/images/anime/... TV Manga 25 Finished Airing ... NaN Fall 2012 Unknown {'Adaptation': [{'mal_id': 9711, 'type': 'mang... NHK, Shueisha NaN J.C.Staff Comedy, Drama, Romance, Shounen ['#1: "Moshimo no Hanashi (もしもの話)" by nano.RIP... ['#1: "Pride on Everyday" by Sphere (eps 1-13)...

5 rows × 31 columns

It looks good so far, but let's confirm the 31 variables against 14478 samples from the documentation.

In [3]:
data.shape
Out[3]:
(14478, 31)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that there's a single column for genres, containing comma-separated values.

Let's convert them to lists of strings.

In [4]:
def get_list(x):
    if isinstance(x, int):
        return []
    if isinstance(x,str):
        result = [s.strip() for s in x.split(',')]
        return sorted(result)

    return []
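
Missing genre entries are read by pandas as NaN, which is a float rather than a string, and that is why the function falls through to an empty list for them. The sketch below shows this behaviour on a toy Series using split_genres, a hypothetical helper that mirrors get_list.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["Comedy, Romance", "Kids", np.nan])

def split_genres(value):
    # Hypothetical helper mirroring get_list: non-strings (e.g. NaN) give [].
    if not isinstance(value, str):
        return []
    return sorted(part.strip() for part in value.split(","))

print(toy.apply(split_genres).tolist())
```
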
In [5]:
genres = data['genre'].apply(get_list)
pd.DataFrame(genres)
Out[5]:
genre
0 [Comedy, Romance, Shounen, Supernatural]
1 [Comedy, Parody, Romance, School, Shounen]
2 [Comedy, Magic, School, Shoujo]
3 [Comedy, Drama, Fantasy, Magic, Romance]
4 [Comedy, Drama, Romance, Shounen]
... ...
14473 [Kids]
14474 [Comedy]
14475 [Action, Adventure, Fantasy, Sci-Fi]
14476 [Fantasy, Kids]
14477 [Comedy]

14478 rows × 1 columns

Without further investigation, we can see that we have at least a few empty list values, [], and a few single-entry lists in the table above, so let's remove all samples which contain an empty or single-entry list.

In [6]:
genres = genres[genres.str.len() > 1]
pd.DataFrame(genres)
Out[6]:
genre
0 [Comedy, Romance, Shounen, Supernatural]
1 [Comedy, Parody, Romance, School, Shounen]
2 [Comedy, Magic, School, Shoujo]
3 [Comedy, Drama, Fantasy, Magic, Romance]
4 [Comedy, Drama, Romance, Shounen]
... ...
14467 [Drama, Kids]
14469 [Kids, School]
14471 [Drama, Fantasy, Kids]
14475 [Action, Adventure, Fantasy, Sci-Fi]
14476 [Fantasy, Kids]

10974 rows × 1 columns

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by getting all combinations within each list.

In [7]:
genres = [list(itertools.combinations(i,2)) for i in genres]
pd.DataFrame(genres)
Out[7]:
0 1 2 3 4 5 6 7 8 9 ... 68 69 70 71 72 73 74 75 76 77
0 (Comedy, Romance) (Comedy, Shounen) (Comedy, Supernatural) (Romance, Shounen) (Romance, Supernatural) (Shounen, Supernatural) None None None None ... None None None None None None None None None None
1 (Comedy, Parody) (Comedy, Romance) (Comedy, School) (Comedy, Shounen) (Parody, Romance) (Parody, School) (Parody, Shounen) (Romance, School) (Romance, Shounen) (School, Shounen) ... None None None None None None None None None None
2 (Comedy, Magic) (Comedy, School) (Comedy, Shoujo) (Magic, School) (Magic, Shoujo) (School, Shoujo) None None None None ... None None None None None None None None None None
3 (Comedy, Drama) (Comedy, Fantasy) (Comedy, Magic) (Comedy, Romance) (Drama, Fantasy) (Drama, Magic) (Drama, Romance) (Fantasy, Magic) (Fantasy, Romance) (Magic, Romance) ... None None None None None None None None None None
4 (Comedy, Drama) (Comedy, Romance) (Comedy, Shounen) (Drama, Romance) (Drama, Shounen) (Romance, Shounen) None None None None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10969 (Drama, Kids) None None None None None None None None None ... None None None None None None None None None None
10970 (Kids, School) None None None None None None None None None ... None None None None None None None None None None
10971 (Drama, Fantasy) (Drama, Kids) (Fantasy, Kids) None None None None None None None ... None None None None None None None None None None
10972 (Action, Adventure) (Action, Fantasy) (Action, Sci-Fi) (Adventure, Fantasy) (Adventure, Sci-Fi) (Fantasy, Sci-Fi) None None None None ... None None None None None None None None None None
10973 (Fantasy, Kids) None None None None None None None None None ... None None None None None None None None None None

10974 rows × 78 columns
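
A list of n genres produces n(n-1)/2 pairs. The 78 columns above therefore correspond to a longest genre list of 13 entries, since 13 × 12 / 2 = 78. A quick sketch of the count, using the four genres from the first row:

```python
import itertools

genres_example = ["Comedy", "Romance", "Shounen", "Supernatural"]
pairs = list(itertools.combinations(genres_example, 2))

print(len(pairs))      # number of pairs from 4 genres: 4 * 3 / 2
print(13 * 12 // 2)    # number of pairs from 13 genres
```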

Now we will flatten the nested lists, adding each genre pairing in both its original and reversed order.

In [8]:
genres = list(itertools.chain.from_iterable((i, i[::-1]) for c_ in genres for i in c_))
pd.DataFrame(genres)
Out[8]:
0 1
0 Comedy Romance
1 Romance Comedy
2 Comedy Shounen
3 Shounen Comedy
4 Comedy Supernatural
... ... ...
119691 Sci-Fi Adventure
119692 Fantasy Sci-Fi
119693 Sci-Fi Fantasy
119694 Fantasy Kids
119695 Kids Fantasy

119696 rows × 2 columns

We can now use these pairings to create the matrix.

In [9]:
matrix = pd.pivot_table(
    pd.DataFrame(genres), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()
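
To make the pivot step more concrete, here is a sketch on a handful of invented pairs. With aggfunc="size", each cell holds the number of rows with that (row, column) combination, and the labels come out in sorted order, the same order that np.unique returns. The toy pairs are assumptions for illustration.

```python
import pandas as pd

pairs = [("Comedy", "Romance"), ("Romance", "Comedy"),
         ("Comedy", "Drama"), ("Drama", "Comedy"),
         ("Comedy", "Romance"), ("Romance", "Comedy")]

# aggfunc="size" counts how many rows fall in each (row, column) cell.
table = pd.pivot_table(
    pd.DataFrame(pairs), index=0, columns=1, aggfunc="size", fill_value=0
)
print(table)
```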

We can list this using a DataFrame for better presentation.

In [10]:
pd.DataFrame(matrix)
Out[10]:
0 1 2 3 4 5 6 7 8 9 ... 33 34 35 36 37 38 39 40 41 42
0 0 1022 34 985 12 165 550 201 860 103 ... 13 38 232 141 390 475 32 56 3 2
1 1022 0 22 994 6 97 428 67 1054 80 ... 7 70 134 39 136 229 6 10 0 0
2 34 22 0 11 0 0 12 0 1 8 ... 0 2 1 36 0 1 0 0 0 0
3 985 994 11 0 28 103 499 500 952 65 ... 41 835 79 246 249 439 11 52 12 11
4 12 6 0 28 0 1 21 3 18 1 ... 1 1 2 1 0 16 4 0 0 0
5 165 97 0 103 1 0 41 24 166 8 ... 0 3 1 1 25 194 4 11 0 1
6 550 428 12 499 21 41 0 54 369 19 ... 37 336 137 148 58 236 36 23 23 1
7 201 67 0 500 3 24 54 0 136 11 ... 0 35 10 22 45 97 0 13 0 8
8 860 1054 1 952 18 166 369 136 0 101 ... 11 132 17 11 110 360 7 33 0 3
9 103 80 8 65 1 8 19 11 101 0 ... 0 20 3 7 1 14 6 0 0 0
10 73 19 0 225 0 18 53 156 74 1 ... 1 20 6 1 10 55 0 11 0 2
11 41 15 0 52 2 57 28 0 68 0 ... 0 1 4 3 5 63 0 3 11 32
12 224 214 1 195 3 56 335 7 184 4 ... 7 82 3 6 15 135 3 3 2 0
13 139 62 1 71 30 74 88 14 94 2 ... 3 1 11 2 10 201 21 35 1 0
14 17 8 0 39 0 4 33 0 18 3 ... 3 29 0 3 0 17 0 3 0 0
15 134 476 23 596 1 22 224 3 588 34 ... 0 122 19 35 36 48 0 5 0 0
16 316 278 0 407 0 57 131 89 538 19 ... 4 60 5 0 52 168 7 7 0 0
17 236 103 0 101 1 20 41 26 72 3 ... 1 15 2 20 86 26 0 0 0 1
18 623 302 10 224 8 4 188 35 73 16 ... 0 13 182 13 44 22 1 0 1 0
19 304 90 0 61 3 14 193 19 52 5 ... 0 9 137 4 18 31 2 6 1 1
20 64 35 3 137 51 4 114 9 89 7 ... 3 113 28 19 6 23 0 5 3 1
21 206 154 0 210 11 19 140 10 79 14 ... 5 24 6 3 44 188 48 23 0 0
22 88 27 2 453 8 6 6 40 55 15 ... 3 29 11 11 44 18 1 3 1 2
23 105 73 5 114 2 1 29 6 3 0 ... 1 5 3 5 1 11 9 1 0 0
24 64 24 0 32 29 6 117 6 32 16 ... 4 12 7 2 2 74 38 0 1 0
25 277 244 2 838 6 64 606 243 306 13 ... 45 241 45 39 32 224 9 24 21 3
26 114 38 0 53 0 10 39 8 23 0 ... 0 1 0 0 11 21 0 1 2 0
27 227 43 1 831 3 25 238 209 125 29 ... 13 361 7 135 62 140 4 16 3 6
28 1143 695 10 676 20 32 464 110 262 40 ... 6 72 377 41 143 98 18 6 3 2
29 256 105 21 373 2 20 144 90 65 12 ... 0 153 24 40 26 102 16 13 0 3
30 76 77 0 242 1 31 191 0 205 1 ... 15 119 1 23 9 66 1 14 0 0
31 13 1 0 33 1 0 20 14 9 0 ... 0 22 0 0 2 4 0 0 0 2
32 809 675 19 963 1 73 266 114 431 63 ... 0 85 63 276 183 249 7 24 0 0
33 13 7 0 41 1 0 37 0 11 0 ... 0 10 0 2 1 16 1 4 1 0
34 38 70 2 835 1 3 336 35 132 20 ... 10 0 7 39 8 93 1 1 2 0
35 232 134 1 79 2 1 137 10 17 3 ... 0 7 0 3 9 7 0 0 0 0
36 141 39 36 246 1 1 148 22 11 7 ... 2 39 3 0 11 3 0 0 2 0
37 390 136 0 249 0 25 58 45 110 1 ... 1 8 9 11 0 90 3 8 0 0
38 475 229 1 439 16 194 236 97 360 14 ... 16 93 7 3 90 0 39 86 2 0
39 32 6 0 11 4 4 36 0 7 6 ... 1 1 0 0 3 39 0 1 0 0
40 56 10 0 52 0 11 23 13 33 0 ... 4 1 0 0 8 86 1 0 0 0
41 3 0 0 12 0 0 23 0 0 0 ... 1 2 0 2 0 2 0 0 0 0
42 2 0 0 11 0 1 1 8 3 0 ... 0 0 0 0 0 0 0 0 0 0

43 rows × 43 columns

Now for the names of our genres.

In [11]:
names = np.unique(genres).tolist()
pd.DataFrame(names)
Out[11]:
0
0 Action
1 Adventure
2 Cars
3 Comedy
4 Dementia
5 Demons
6 Drama
7 Ecchi
8 Fantasy
9 Game
10 Harem
11 Hentai
12 Historical
13 Horror
14 Josei
15 Kids
16 Magic
17 Martial Arts
18 Mecha
19 Military
20 Music
21 Mystery
22 Parody
23 Police
24 Psychological
25 Romance
26 Samurai
27 School
28 Sci-Fi
29 Seinen
30 Shoujo
31 Shoujo Ai
32 Shounen
33 Shounen Ai
34 Slice of Life
35 Space
36 Sports
37 Super Power
38 Supernatural
39 Thriller
40 Vampire
41 Yaoi
42 Yuri

We may wish to remove some genres for our visualisation. The example below removes every genre named in the discarded_categories list from both the co-occurrence matrix and the list of names; add or remove names in that list to change which genres are discarded.

In [12]:
matrix = pd.DataFrame(matrix)
names = pd.DataFrame(names)

discarded_categories = ["Hentai", "Yaoi", "Yuri", "Ecchi",
                        "Shounen Ai", "Shoujo Ai"]

discard_mask = names.isin(discarded_categories).values
discard_indices = names[discard_mask].index

for drop_idx in discard_indices:
    matrix = matrix.drop(drop_idx, axis=1)
    matrix = matrix.drop(drop_idx, axis=0)
    names = names.drop(drop_idx, axis=0)   
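
The loop above can also be written as a single drop call that removes the same positions from the rows and the columns. The sketch below shows this on a 3 × 3 toy matrix; the values and the names toy_matrix and toy_names are invented for illustration.

```python
import pandas as pd

labels = ["Action", "Comedy", "Drama"]
toy_matrix = pd.DataFrame([[0, 3, 1], [3, 0, 2], [1, 2, 0]])
toy_names = pd.Series(labels)

discarded = ["Comedy"]
drop_idx = toy_names[toy_names.isin(discarded)].index

# Remove the same positions from rows and columns, and from the names.
toy_matrix = toy_matrix.drop(index=drop_idx, columns=drop_idx)
toy_names = toy_names.drop(drop_idx)

print(toy_matrix.values.tolist())
print(toy_names.tolist())
```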

Chord Diagram

Time to visualise the co-occurrence of genres using a chord diagram. We are going to use a list of custom colours that represent the genres.

In [13]:
colors = ["#660000", "#734139", "#e59173", "#ff4400", "#332b26", "#593000",
          "#998773", "#d97400", "#8c5e00", "#f2ca79", "#ffcc00", "#59562d",
          "#736b00", "#c2cc33", "#245900", "#8cff40", "#269926", "#ace6ac",
          "#40ffa6", "#336655", "#008c5e", "#39e6da", "#ace6e2", "#566d73",
          "#39c3e6", "#1d5673", "#3d9df2", "#163159", "#acc3e6", "#000f73",
          "#565a73", "#000033", "#8273e6", "#6d00cc", "#633366", "#e2ace6",
          "#f23de6", "#cc0088", "#590024", "#cc0036", "#f27999", "#e6acb4"];

Finally, we can put it all together.

In [14]:
Chord(
    matrix.values.tolist(),
    names.values.tolist(),
    padding=0.03,
    colors=colors,
    wrap_labels=False,
    margin=40,
    font_size="14px",
    font_size_large="14px",
    credit=True,
    noun = "Anime",
    allow_download=True
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Co-occurrence of Movie Genres with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from ast import literal_eval
from chord import Chord

Introduction

In this section, we're going to use the TMDB 5000 Movie Dataset to visualise the co-occurrence of movie genres.

The Dataset

The dataset documentation states that we can expect 20 variables per each of the 4803 movies. Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://datacrayon.com/datasets/tmdb_5000_movies.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 [{"name": "Legendary Pictures", "id": 923}, {"... [{"iso_3166_1": "US", "name": "United States o... 2012-07-16 1084939099 165.0 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.6 9106
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.926995 [{"name": "Walt Disney Pictures", "id": 2}] [{"iso_3166_1": "US", "name": "United States o... 2012-03-07 284139100 132.0 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.1 2124

It looks good so far, but let's confirm that the data has the 4803 samples and 20 variables stated in the documentation.

In [3]:
data.shape
Out[3]:
(4803, 20)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that there's a single column for genres, and each value in it is a string representation of a list of dictionaries.

The first thing we need to do is evaluate these from strings into a type we can work with.

In [4]:
genres = data['genres'].apply(literal_eval)
genres
Out[4]:
0       [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
1       [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2       [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
3       [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4       [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
                              ...                        
4798    [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...
4799    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
4800    [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4801                                                   []
4802                  [{'id': 99, 'name': 'Documentary'}]
Name: genres, Length: 4803, dtype: object
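
To make the conversion concrete, here is a minimal sketch on a single hand-written value. It assumes that literal_eval is the function from Python's ast module (the import is not shown in this part of the section), and the raw string is illustrative rather than taken from the dataset.

```python
from ast import literal_eval

# A single raw value in the same shape as the entries of the 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)       # str -> list of dicts
first_name = parsed[0]["name"]   # look up the name of the first genre
empty = literal_eval("[]")       # a film with no genres parses to an empty list

print(type(parsed).__name__, first_name, empty)
```

Because literal_eval only accepts Python literals, it is a safer choice than eval for strings read from a file.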

The genres are now in lists of dictionaries. Let's convert them to lists of strings.

In [5]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        return sorted(names)

    return []
In [6]:
genres = genres.apply(get_list)
pd.DataFrame(genres)
Out[6]:
genres
0 [Action, Adventure, Fantasy, Science Fiction]
1 [Action, Adventure, Fantasy]
2 [Action, Adventure, Crime]
3 [Action, Crime, Drama, Thriller]
4 [Action, Adventure, Science Fiction]
... ...
4798 [Action, Crime, Thriller]
4799 [Comedy, Romance]
4800 [Comedy, Drama, Romance, TV Movie]
4801 []
4802 [Documentary]

4803 rows × 1 columns

Without further investigation, we can see at least one empty list, [], in the table above (index 4801). A film with no genres cannot contribute to a co-occurrence, so we can remove all samples which contain an empty list.

In [7]:
genres = genres[genres.str.len() > 0]
pd.DataFrame(genres)
Out[7]:
genres
0 [Action, Adventure, Fantasy, Science Fiction]
1 [Action, Adventure, Fantasy]
2 [Action, Adventure, Crime]
3 [Action, Crime, Drama, Thriller]
4 [Action, Adventure, Science Fiction]
... ...
4797 [Foreign, Thriller]
4798 [Action, Crime, Thriller]
4799 [Comedy, Romance]
4800 [Comedy, Drama, Romance, TV Movie]
4802 [Documentary]

4775 rows × 1 columns
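
The filter relies on .str.len() working element-wise on lists as well as strings. A small sketch on toy data (the values are made up for illustration) shows the mask and the fact that the surviving rows keep their original index labels:

```python
import pandas as pd

# Toy Series of genre lists; the middle entry is empty.
toy = pd.Series([["Action", "Crime"], [], ["Documentary"]])

lengths = toy.str.len()        # element-wise len(), also valid for lists
non_empty = toy[lengths > 0]   # boolean mask; index labels are not renumbered

print(lengths.tolist())
print(non_empty.index.tolist())
```

This is why the filtered table above still ends at index 4802 even though only 4775 rows remain.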

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by getting all combinations within each list.

In [8]:
genres = [list(itertools.combinations(i,2)) for i in genres]
pd.DataFrame(genres)
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 (Action, Adventure) (Action, Fantasy) (Action, Science Fiction) (Adventure, Fantasy) (Adventure, Science Fiction) (Fantasy, Science Fiction) None None None None ... None None None None None None None None None None
1 (Action, Adventure) (Action, Fantasy) (Adventure, Fantasy) None None None None None None None ... None None None None None None None None None None
2 (Action, Adventure) (Action, Crime) (Adventure, Crime) None None None None None None None ... None None None None None None None None None None
3 (Action, Crime) (Action, Drama) (Action, Thriller) (Crime, Drama) (Crime, Thriller) (Drama, Thriller) None None None None ... None None None None None None None None None None
4 (Action, Adventure) (Action, Science Fiction) (Adventure, Science Fiction) None None None None None None None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4770 (Foreign, Thriller) None None None None None None None None None ... None None None None None None None None None None
4771 (Action, Crime) (Action, Thriller) (Crime, Thriller) None None None None None None None ... None None None None None None None None None None
4772 (Comedy, Romance) None None None None None None None None None ... None None None None None None None None None None
4773 (Comedy, Drama) (Comedy, Romance) (Comedy, TV Movie) (Drama, Romance) (Drama, TV Movie) (Romance, TV Movie) None None None None ... None None None None None None None None None None
4774 None None None None None None None None None None ... None None None None None None None None None None

4775 rows × 21 columns
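
As a quick illustration of what itertools.combinations produces for one film, here is a sketch using the genre list of the row at index 3 above. A list of n genres yields n(n-1)/2 pairs, so the 21 columns in the output are consistent with at least one film listing seven genres, and a single-genre film yields no pairs at all.

```python
import itertools

# Genres of one film, already sorted alphabetically.
genres_of_one_film = ["Action", "Crime", "Drama", "Thriller"]

pairs = list(itertools.combinations(genres_of_one_film, 2))
single = list(itertools.combinations(["Documentary"], 2))  # no pair possible

n = len(genres_of_one_film)
print(len(pairs), n * (n - 1) // 2)
print(single)
```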

Now we will flatten the nested lists, adding every genre pairing in both its original and reversed order.

In [9]:
genres = list(itertools.chain.from_iterable((i, i[::-1]) for c_ in genres for i in c_))
pd.DataFrame(genres)
Out[9]:
0 1
0 Action Adventure
1 Adventure Action
2 Action Fantasy
3 Fantasy Action
4 Action Science Fiction
... ... ...
24655 Romance Drama
24656 Drama TV Movie
24657 TV Movie Drama
24658 Romance TV Movie
24659 TV Movie Romance

24660 rows × 2 columns
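
The flattening step can be sketched on a toy input as follows. The per-film pair lists here are made up; the point is that each pair is emitted twice, once as-is and once reversed, and that films with no pairs simply contribute nothing.

```python
import itertools

# Pair lists for three hypothetical films; the second has no pairs.
per_film = [
    [("Action", "Crime"), ("Action", "Drama"), ("Crime", "Drama")],
    [],
    [("Comedy", "Romance")],
]

# For every pair p of every film, yield p and its reverse, then chain them.
flat = list(
    itertools.chain.from_iterable((p, p[::-1]) for film in per_film for p in film)
)

print(len(flat))
print(flat[:2])
```

Emitting both orders is what later makes the co-occurrence matrix symmetric.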

We can now use this list of pairs to create the matrix.

In [10]:
matrix = pd.pivot_table(
    pd.DataFrame(genres), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()
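
To see what the pivot table is doing, here is the same call on a handful of made-up pairs. With aggfunc="size" each cell counts how often a (row, column) pair occurs, and fill_value=0 covers pairs that never occur.

```python
import pandas as pd

# Toy pairs in both directions: Action/Crime twice, Action/Drama once.
pairs = [
    ("Action", "Crime"), ("Crime", "Action"),
    ("Action", "Drama"), ("Drama", "Action"),
    ("Action", "Crime"), ("Crime", "Action"),
]

table = pd.pivot_table(
    pd.DataFrame(pairs), index=0, columns=1, aggfunc="size", fill_value=0
)
matrix_toy = table.values.tolist()

print(table)
```

The diagonal stays at zero because a genre is never paired with itself.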

We can list this using a DataFrame for better presentation.

In [11]:
pd.DataFrame(matrix)
Out[11]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 465 26 258 276 3 339 62 144 5 58 76 9 57 63 277 1 547 55 35
1 465 0 114 223 56 2 183 211 190 2 27 20 7 37 66 205 0 203 30 22
2 26 114 0 125 0 0 19 195 61 1 0 0 14 1 8 30 1 3 1 2
3 258 223 125 0 180 11 576 299 166 9 11 78 84 27 484 109 4 113 8 17
4 276 56 0 180 0 1 381 8 10 4 13 36 11 105 47 18 1 414 4 9
5 3 2 0 11 1 0 7 5 0 2 6 1 15 1 0 0 0 1 0 0
6 339 183 19 576 381 7 0 121 99 27 175 84 106 175 603 102 5 554 118 34
7 62 211 195 299 8 5 121 0 149 3 1 1 32 6 52 58 3 7 0 3
8 144 190 61 166 10 0 99 149 0 0 0 53 11 19 64 85 0 63 3 2
9 5 2 1 9 4 2 27 3 0 0 3 0 0 0 9 0 0 3 1 0
10 58 27 0 11 13 6 175 1 0 3 0 1 6 3 30 0 0 21 59 8
11 76 20 0 78 36 1 84 1 53 0 1 0 3 91 15 95 1 291 1 1
12 9 7 14 84 11 15 106 32 11 0 6 3 0 3 61 2 1 5 1 3
13 57 37 1 27 105 1 175 6 19 0 3 91 3 0 24 47 0 242 3 2
14 63 66 8 484 47 0 603 52 64 9 30 15 61 24 0 31 3 64 26 12
15 277 205 30 109 18 0 102 58 85 0 0 95 2 47 31 0 0 211 2 1
16 1 0 1 4 1 0 5 3 0 0 0 1 1 0 3 0 0 1 0 0
17 547 203 3 113 414 1 554 7 63 3 21 291 5 242 64 211 1 0 24 7
18 55 30 1 8 4 0 118 0 3 1 59 1 1 3 26 2 0 24 0 3
19 35 22 2 17 9 0 34 3 2 0 8 1 3 2 12 1 0 7 3 0

Now for the names of our genres.

In [12]:
names = np.unique(genres).tolist()
pd.DataFrame(names)
Out[12]:
0
0 Action
1 Adventure
2 Animation
3 Comedy
4 Crime
5 Documentary
6 Drama
7 Family
8 Fantasy
9 Foreign
10 History
11 Horror
12 Music
13 Mystery
14 Romance
15 Science Fiction
16 TV Movie
17 Thriller
18 War
19 Western
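
The labels only line up with the matrix if both are in the same order. The sketch below, on made-up pairs, checks that the sorted order returned by np.unique matches the row and column order produced by the pivot table, including the case-sensitive ordering that places "TV Movie" before "Thriller".

```python
import numpy as np
import pandas as pd

# Toy pairs chosen so that case-sensitive sorting matters.
pairs = [
    ("Thriller", "TV Movie"), ("TV Movie", "Thriller"),
    ("Action", "Thriller"), ("Thriller", "Action"),
]

table = pd.pivot_table(
    pd.DataFrame(pairs), index=0, columns=1, aggfunc="size", fill_value=0
)
names_toy = np.unique(pairs).tolist()   # flattened, de-duplicated, sorted

print(names_toy)
print(table.index.tolist())
```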

Chord Diagram

Time to visualise the co-occurrence of genres using a chord diagram. We are going to use a list of custom colours that represent the genres.

In [13]:
colors = ["#e6194B", "#3cb44b", "#ffe119", "#4363d8", "#f58231",
          "#911eb4", "#42d4f4", "#f032e6", "#bfef45", "#fabebe",
          "#469990", "#e6beff", "#9A6324", "#fffac8", "#800000",
          "#aaffc3", "#a9a9a9", "#ffd8b1", "#000075", "#a9a9a9"]
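
Before plotting, it can be worth checking the palette. The following sketch, which is an addition rather than part of the original workflow, confirms that there is one colour per genre and looks for repeated colours. It assumes the colours are applied to the names by position.

```python
from collections import Counter

colors = ["#e6194B", "#3cb44b", "#ffe119", "#4363d8", "#f58231",
          "#911eb4", "#42d4f4", "#f032e6", "#bfef45", "#fabebe",
          "#469990", "#e6beff", "#9A6324", "#fffac8", "#800000",
          "#aaffc3", "#a9a9a9", "#ffd8b1", "#000075", "#a9a9a9"]

# Compare case-insensitively, since "#9A6324" and "#9a6324" are the same colour.
counts = Counter(c.lower() for c in colors)
duplicates = [c for c, n in counts.items() if n > 1]
positions = [i for i, c in enumerate(colors) if c.lower() in duplicates]

print(len(colors), duplicates, positions)
```

With the list above, the grey "#a9a9a9" appears at positions 16 and 19, which correspond to TV Movie and Western in the names list, so those two segments would share a colour.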

Finally, we can put it all together.

In [14]:
Chord(
    matrix,
    names,
    colors=colors,
    wrap_labels=False,
    margin=50
).show()
Chord Diagram

Conclusion

In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!

Interactive Chord Diagrams

Preamble

In [2]:
:dep chord = { version = "0.1.6" }
use chord::{Chord, Plot};

Introduction

In a chord diagram (or radial network), entities are arranged radially as segments with their relationships visualised by arcs that connect them. The size of the segments illustrates the numerical proportions, whilst the size of the arc illustrates the significance of the relationships [1].

Chord diagrams are useful when trying to convey relationships between different entities, and they can be beautiful and eye-catching.

Get Chord Pro

Chord Pro gives lifetime access to the full-featured chord visualization API, producing beautiful interactive visualizations, e.g. those featured on the front page of Reddit.

  • Produce beautiful interactive Chord diagrams.
  • Customize colours and font-sizes.
  • Access Divided mode, enabling two sides to your diagram.
  • Use Symmetric and Asymmetric modes.
  • Add images and text on hover.
  • Access finer customisations, including HTML injection.
  • Allows commercial use without open source requirement.
  • Currently supports Python, JavaScript, and Rust, with many more to come (accepting requests).

The Chord Package

With Python in mind, there are many libraries available for creating Chord diagrams, such as Plotly, Bokeh, and a few that are lesser-known. However, I wanted to use the implementation from d3 because it can be customised to be highly interactive and to look beautiful.

I couldn't find anything that ticked all the boxes, so I made a wrapper around d3-chord myself. It took some time to get it working, but I wanted to hide away everything behind a single constructor and method call. The tricky part was enabling multiple chord diagrams on the same page, and then loading resources in a way that would support Jupyter Notebooks.

You can get the package either from PyPI using pip install chord or from the GitHub repository. With your processed data, you should be able to plot something beautiful with just a single line, Chord(data, names).show(). To enable the pro features of the chord package, get Chord Pro.

The Chord Crate

I wasn't able to find any Rust crates for plotting chord diagrams, so I ported my own from Python to Rust.

You can get the crate either from crates.io or from the GitHub repository. With your processed data, you should be able to plot something beautiful with just a single line, Chord{ matrix : matrix, names : names, .. Chord::default() }.show(). To enable the pro features of the chord crate, get Chord Pro.

The Dataset

The focus for this section will be the demonstration of the chord crate. To keep it simple, we will use synthetic data that illustrates the co-occurrences between movie genres within the same movie.

In [3]:
let matrix: Vec<Vec<f64>> = vec![
    vec![0., 5., 6., 4., 7., 4.],
    vec![5., 0., 5., 4., 6., 5.],
    vec![6., 5., 0., 4., 5., 5.],
    vec![4., 4., 4., 0., 5., 5.],
    vec![7., 6., 5., 5., 0., 4.],
    vec![4., 5., 5., 5., 4., 0.],
];

let names: Vec<String> = vec![
    "Action",
    "Adventure",
    "Comedy",
    "Drama",
    "Fantasy",
    "Thriller",
]
.into_iter()
.map(String::from)
.collect();
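
As a quick sanity check on this synthetic data, sketched in Python (the language of the other sections) rather than Rust, we can confirm that the matrix is symmetric with a zero diagonal and compute each genre's row total. Assuming the crate follows d3-chord in sizing each segment by its row total, these totals indicate the relative segment sizes to expect.

```python
import numpy as np

# The same synthetic co-occurrence matrix as in the Rust cell above.
matrix = np.array([
    [0, 5, 6, 4, 7, 4],
    [5, 0, 5, 4, 6, 5],
    [6, 5, 0, 4, 5, 5],
    [4, 4, 4, 0, 5, 5],
    [7, 6, 5, 5, 0, 4],
    [4, 5, 5, 5, 4, 0],
])
names = ["Action", "Adventure", "Comedy", "Drama", "Fantasy", "Thriller"]

is_symmetric = np.array_equal(matrix, matrix.T)   # co-occurrence is undirected
zero_diagonal = not matrix.diagonal().any()       # no genre pairs with itself
row_totals = dict(zip(names, matrix.sum(axis=1).tolist()))

print(is_symmetric, zero_diagonal)
print(row_totals)
```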

Chord Diagrams

Let's see what the Chord defaults produce when we invoke the show() method.

In [4]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    ..Chord::default()
}
.show();
Out[4]:
Chord Diagram

Different Colours

The defaults are nice, but what if we want different colours? You can pass in almost anything from d3-scale-chromatic, or you could pass in a list of hexadecimal colour codes.

In [21]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: vec![String::from("d3.schemeSet2")],
    ..Chord::default()
}
.show();
Out[21]:
Chord Diagram
In [20]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: vec![format!("d3.schemeGnBu[{}]", names.len())],
    ..Chord::default()
}
.show();
Out[20]:
Chord Diagram
In [19]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: vec![String::from("d3.schemeSet3")],
    ..Chord::default()
}
.show();
Out[19]:
Chord Diagram
In [17]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: vec![format!("d3.schemePuRd[{}]", names.len())],
    ..Chord::default()
}
.show();
Out[17]:
Chord Diagram
In [18]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: vec![format!("d3.schemeYlGnBu[{}]", names.len())],
    ..Chord::default()
}
.show();
Out[18]:
Chord Diagram
In [14]:
let hex_colours : Vec<String> = vec!["#222222", "#333333", "#4c4c4c", "#666666", "#848484", "#9a9a9a"].into_iter()
.map(String::from)
.collect();

Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: true,
    colors: hex_colours,
    ..Chord::default()
}
.show();
Out[14]:
Chord Diagram

Label Styling

We can disable wrapped labels, and even change the colour.

In [11]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    wrap_labels: false,
    label_color: "#4c40bf".to_string(),
    ..Chord::default()
}
.show();
Out[11]:
Chord Diagram

Opacity

We can also change the default opacity of the relationships.

In [12]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    opacity: 0.1,
    ..Chord::default()
}
.show();
Out[12]:
Chord Diagram

Width

We can also change the maximum width of the plot.

In [13]:
Chord {
    matrix: matrix.clone(),
    names: names.clone(),
    width: 400.0,
    wrap_labels: true,
    ..Chord::default()
}
.show()
Out[13]:
Chord Diagram

Conclusion

In this section, we've introduced the chord diagram and chord crate. We used the crate and some synthetic data to demonstrate several chord diagram visualisations with different configurations. The chord Rust crate is available for free from crates.io or from the GitHub repository.


  1. Tintarev, N., Rostami, S., & Smyth, B. (2018, April). Knowing the unknown: visualising consumption blind-spots in recommender systems. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (pp. 1396-1399). 

Co-occurrence of Pokémon Types (Gen 1-8) with Chord Diagrams

Preamble

In [1]:
import numpy as np                   # for multi-dimensional containers 
import pandas as pd                  # for DataFrames
import itertools
from chord import Chord

Introduction

In previous sections, we visualised co-occurrences of Pokémon types. Whilst it was interesting to look at, the dataset only contained Pokémon from the first six generations. In this section, we're going to use the Pokemon with stats Generation 8 dataset to visualise the co-occurrence of Pokémon types from generations one to eight.

The Dataset

The dataset documentation states that we can expect 51 variables for each of the 1028 Pokémon of the first eight generations.

Let's download the mirrored dataset and have a look for ourselves.

In [2]:
data_url = 'https://shahinrostami.com/datasets/pokemon_gen_1_to_8.csv'
data = pd.read_csv(data_url)
data.head()
Out[2]:
Unnamed: 0 pokedex_number name german_name japanese_name generation status species type_number type_1 ... against_ground against_flying against_psychic against_bug against_rock against_ghost against_dragon against_dark against_steel against_fairy
0 0 1 Bulbasaur Bisasam フシギダネ (Fushigidane) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
1 1 2 Ivysaur Bisaknosp フシギソウ (Fushigisou) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
2 2 3 Venusaur Bisaflor フシギバナ (Fushigibana) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
3 3 3 Mega Venusaur Bisaflor フシギバナ (Fushigibana) 1 Normal Seed Pokémon 2 Grass ... 1.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5
4 4 4 Charmander Glumanda ヒトカゲ (Hitokage) 1 Normal Lizard Pokémon 1 Fire ... 2.0 1.0 1.0 0.5 2.0 1.0 1.0 1.0 0.5 0.5

5 rows × 51 columns

It looks good so far, but let's confirm that the data has the 1028 samples and 51 variables stated in the documentation.

In [3]:
data.shape
Out[3]:
(1028, 51)

Perfect, that's exactly what we were expecting.

Data Wrangling

We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the Pokémon types are split between the columns type_1 and type_2.

In [4]:
pd.DataFrame(data.columns.values.tolist()).head(20)
Out[4]:
0
0 Unnamed: 0
1 pokedex_number
2 name
3 german_name
4 japanese_name
5 generation
6 status
7 species
8 type_number
9 type_1
10 type_2
11 height_m
12 weight_kg
13 abilities_number
14 ability_1
15 ability_2
16 ability_hidden
17 total_points
18 hp
19 attack

So let's select just these two columns and work with a DataFrame containing only them as we move forward.

In [5]:
types = pd.DataFrame(data[['type_1', 'type_2']].values)
types
Out[5]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Grass Poison
4 Fire NaN
... ... ...
1023 Fairy NaN
1024 Fighting Steel
1025 Fighting NaN
1026 Poison Dragon
1027 Poison Dragon

1028 rows × 2 columns

Without further investigation, we can see that we have at least a few NaN values in the table above. We are only interested in co-occurrence of types, so we can remove all samples which contain a NaN value.

In [6]:
types = types.dropna()

We can also see an instance where the type Fighting at index $1014$ is followed by \n. We'll strip all these out before continuing.

In [7]:
types = types.replace('\n','', regex=True)
types
Out[7]:
0 1
0 Grass Poison
1 Grass Poison
2 Grass Poison
3 Grass Poison
6 Fire Flying
... ... ...
1021 Dragon Ghost
1022 Fairy Steel
1024 Fighting Steel
1026 Poison Dragon
1027 Poison Dragon

542 rows × 2 columns
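
The two cleaning steps can be sketched together on a toy frame. The rows below are illustrative and not taken from the dataset: dropna removes any row with a missing second type, and replace with regex=True strips the newline wherever it appears inside a string.

```python
import numpy as np
import pandas as pd

# Toy type table: one single-type row and one value with a trailing newline.
toy = pd.DataFrame([
    ["Grass", "Poison"],
    ["Fire", np.nan],
    ["Fighting\n", "Steel"],
])

cleaned = toy.dropna().replace("\n", "", regex=True)

print(cleaned)
```

Without regex=True, replace would only match cells whose entire value is "\n", so "Fighting\n" would be left untouched.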

Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.

We can build a co-occurrence matrix with the following approach. We'll start by creating a list with every type pairing in its original and reversed form.

In [8]:
types = list(itertools.chain.from_iterable((i, i[::-1]) for i in types.values))

We can now use this list of pairs to create the matrix.

In [9]:
matrix = pd.pivot_table(
    pd.DataFrame(types), index=0, columns=1, aggfunc="size", fill_value=0
).values.tolist()
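
Two properties follow from adding each pairing in both directions: the matrix is symmetric, and its entries sum to twice the number of dual-type rows. A minimal check on made-up rows (not the real dataset) might look like this:

```python
import itertools
import numpy as np
import pandas as pd

# Three toy dual-type rows; Grass/Poison appears twice.
toy_types = pd.DataFrame([["Grass", "Poison"], ["Fire", "Flying"], ["Grass", "Poison"]])

# Each row of .values is a NumPy array, so row[::-1] is the reversed pairing.
both_ways = list(
    itertools.chain.from_iterable((row, row[::-1]) for row in toy_types.values)
)
toy_matrix = pd.pivot_table(
    pd.DataFrame(both_ways), index=0, columns=1, aggfunc="size", fill_value=0
).values

is_symmetric = np.array_equal(toy_matrix, toy_matrix.T)
total = int(toy_matrix.sum())

print(is_symmetric, total, 2 * len(toy_types))
```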

We can list this using a DataFrame for better presentation.

In [10]:
pd.DataFrame(matrix)
Out[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 4 2 4 4 14 1 6 2 2 0 13 2 5 7 5
1 0 0 4 2 3 3 4 5 3 3 3 2 5 5 3 2 2 7
2 0 4 0 3 1 2 3 8 5 5 9 3 1 4 5 2 2 3
3 4 2 3 0 2 0 1 6 1 1 1 2 2 3 1 3 4 3
4 2 3 1 2 0 0 0 2 1 5 0 1 5 1 9 3 5 4
5 4 3 2 0 0 0 7 1 1 3 0 1 4 2 6 1 4 3
6 4 4 3 1 0 7 0 7 5 0 4 1 2 2 3 3 1 1
7 14 5 8 6 2 1 7 0 3 7 4 2 27 3 7 6 3 8
8 1 3 5 1 1 1 5 3 0 12 6 1 0 4 3 0 4 2
9 6 3 5 1 5 3 0 7 12 0 1 3 2 15 3 2 3 3
10 2 3 9 1 0 0 4 4 6 1 0 3 1 2 2 9 6 10
11 2 2 3 2 1 1 1 2 1 3 3 0 0 0 4 2 2 7
12 0 5 1 2 5 4 2 27 0 2 1 0 0 0 5 0 0 1
13 13 5 4 3 1 2 2 3 4 15 2 0 0 0 0 1 0 6
14 2 3 5 1 9 6 3 7 3 3 2 4 5 0 0 2 9 6
15 5 2 2 3 3 1 3 6 0 2 9 2 0 1 2 0 7 11
16 7 2 2 4 5 4 1 3 4 3 6 2 0 0 9 7 0 1
17 5 7 3 3 4 3 1 8 2 3 10 7 1 6 6 11 1 0

Now for the names of our types.

In [11]:
names = np.unique(types).tolist()
pd.DataFrame(names)
Out[11]:
0
0 Bug
1 Dark
2 Dragon
3 Electric
4 Fairy
5 Fighting
6 Fire
7 Flying
8 Ghost
9 Grass
10 Ground
11 Ice
12 Normal
13 Poison
14 Psychic
15 Rock
16 Steel
17 Water
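
With the names in hand, the numeric row and column positions of the matrix can be given readable labels. For example, the largest entry shown above, 27, sits at row 7 and column 12, which is Flying with Normal. The sketch below uses a small made-up matrix; with the real data the equivalent call would be pd.DataFrame(matrix, index=names, columns=names).

```python
import pandas as pd

# Toy names and matrix standing in for the full 18 x 18 case.
toy_names = ["Fire", "Flying", "Grass", "Poison"]
toy_matrix = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 2, 0],
]

labelled = pd.DataFrame(toy_matrix, index=toy_names, columns=toy_names)
grass_poison = int(labelled.loc["Grass", "Poison"])   # look up a pairing by name

print(labelled)
print(grass_poison)
```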

Chord Diagram

Time to visualise the co-occurrence of types using a chord diagram. We are going to use a list of custom colours that represent the types.

In [12]:
colors = ["#A6B91A", "#705746", "#6F35FC", "#F7D02C", "#D685AD",
          "#C22E28", "#EE8130", "#A98FF3", "#735797", "#7AC74C",
          "#E2BF65", "#96D9D6", "#A8A77A", "#A33EA1", "#F95587",
          "#B6A136", "#B7B7CE", "#6390F0"]
In [13]:
names
Out[13]:
['Bug',
 'Dark',
 'Dragon',
 'Electric',
 'Fairy',
 'Fighting',
 'Fire',
 'Flying',
 'Ghost',
 'Grass',
 'Ground',
 'Ice',
 'Normal',
 'Poison',
 'Psychic',
 'Rock',
 'Steel',
 'Water']
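
Printing the names next to the colours is a manual way to check that they line up. The same check can be sketched in code, assuming that the colours are applied to the names by position:

```python
names = ["Bug", "Dark", "Dragon", "Electric", "Fairy", "Fighting",
         "Fire", "Flying", "Ghost", "Grass", "Ground", "Ice",
         "Normal", "Poison", "Psychic", "Rock", "Steel", "Water"]

colors = ["#A6B91A", "#705746", "#6F35FC", "#F7D02C", "#D685AD",
          "#C22E28", "#EE8130", "#A98FF3", "#735797", "#7AC74C",
          "#E2BF65", "#96D9D6", "#A8A77A", "#A33EA1", "#F95587",
          "#B6A136", "#B7B7CE", "#6390F0"]

# One colour per type, paired up by position.
same_length = len(names) == len(colors)
colour_of = dict(zip(names, colors))

print(same_length, colour_of["Fire"], colour_of["Water"])
```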

Finally, we can put it all together.

In [14]:
Chord(matrix, names, colors=colors).show()
Chord Diagram

Chord Diagram with Names

Note

The following example uses a customised version of Chord that supports the presentation of additional information.

It would be nice to show a list of Pokémon names when hovering over co-occurring Pokémon types. To do this, we can make use of the optional details parameter.

Let's clean up our dataset by removing all instances of \n.

In [15]:
data = data.replace('\n','', regex=True)

Next, we'll create an empty multi-dimensional array with the same shape as our matrix.

In [16]:
details = np.empty((len(names),len(names)),dtype=object)

Now we can populate the details array with lists of Pokémon names in the correct positions.

In [17]:
for count_x, item_x in enumerate(names):
    for count_y, item_y in enumerate(names):
        details[count_x][count_y] = data[
            (data['type_1'].isin([item_x, item_y])) &
            (data['type_2'].isin([item_y, item_x]))]['name'].to_list()

details = pd.DataFrame(details).values.tolist()
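
To see what ends up in each cell, here is the same loop on a four-row toy frame. The rows are hand-written for illustration, and the result is converted with ndarray.tolist() rather than via a DataFrame, which is a small simplification of the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: three dual-type Pokémon and one single-type Pokémon.
toy = pd.DataFrame({
    "name": ["Bulbasaur", "Charizard", "Roselia", "Charmander"],
    "type_1": ["Grass", "Fire", "Grass", "Fire"],
    "type_2": ["Poison", "Flying", "Poison", np.nan],
})
toy_names = ["Fire", "Flying", "Grass", "Poison"]

details_toy = np.empty((len(toy_names), len(toy_names)), dtype=object)
for i, a in enumerate(toy_names):
    for j, b in enumerate(toy_names):
        # Both types must come from the pair {a, b}, in either order.
        mask = toy["type_1"].isin([a, b]) & toy["type_2"].isin([a, b])
        details_toy[i][j] = toy[mask]["name"].to_list()

details_toy = details_toy.tolist()
print(details_toy[2][3], details_toy[0][1], details_toy[0][0])
```

The single-type Charmander never appears because a missing type_2 fails the isin test, and the diagonal cells stay empty as long as no Pokémon lists the same type twice.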

Finally, we can put it all together but this time with the details matrix passed in.

In [18]:
Chord(
    matrix,
    names,
    colors=colors,
    details=details,
    credit=True
).show()
Chord Diagram