US Mortality - Race and Manner of Death
Preamble¶
import numpy as np # for multi-dimensional containers
import pandas as pd # for DataFrames
import itertools # for flattening lists of pairs
from chord import Chord # for chord diagrams
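The chord package is available from PyPI, so if the import above fails it should be installable with pip:
pip install chord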
Introduction¶
In previous sections, we visualised co-occurrences of Pokémon types. In this section, we're going to use the 2015 US mortality dataset to visualise the relationship between race and manner of death.
The Dataset¶
The dataset documentation states that we can expect 77 variables for each of the 2,718,198 deaths recorded in the United States in 2015.
Let's load the mirrored dataset and have a look for ourselves.
data_url = '/Users/shahin/Documents/devel/data/2015_data.csv'
data = pd.read_csv(data_url)
data.head()
/Users/shahin/opt/miniconda3/envs/analytics/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (40,41,42,43,61,62,63,64) have mixed types. Specify dtype option on import or set low_memory=False.
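The DtypeWarning is harmless for our purposes, but we can follow the warning's own suggestion and pass low_memory=False so pandas infers each column's dtype from the whole file. A minimal sketch:
# re-read the CSV in a single pass to avoid the mixed-type warning
data = pd.read_csv(data_url, low_memory=False)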
resident_status | education_1989_revision | education_2003_revision | education_reporting_flag | month_of_death | sex | detail_age_type | detail_age | age_substitution_flag | age_recode_52 | ... | record_condition_18 | record_condition_19 | record_condition_20 | race | bridged_race_flag | race_imputation_flag | race_recode_3 | race_recode_5 | hispanic_origin | hispanic_originrace_recode | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | NaN | 3.0 | 1 | 1 | M | 1 | 84 | NaN | 42 | ... | NaN | NaN | NaN | 1 | NaN | NaN | 1 | 1 | 100 | 6 |
1 | 1 | NaN | 6.0 | 1 | 1 | M | 1 | 70 | NaN | 40 | ... | NaN | NaN | NaN | 1 | NaN | NaN | 1 | 1 | 100 | 6 |
2 | 1 | NaN | 3.0 | 1 | 1 | F | 1 | 91 | NaN | 44 | ... | NaN | NaN | NaN | 1 | NaN | NaN | 1 | 1 | 100 | 6 |
3 | 1 | NaN | 3.0 | 1 | 1 | F | 1 | 40 | NaN | 34 | ... | NaN | NaN | NaN | 3 | NaN | NaN | 2 | 3 | 100 | 8 |
4 | 1 | NaN | 5.0 | 1 | 1 | F | 1 | 89 | NaN | 43 | ... | NaN | NaN | NaN | 1 | NaN | NaN | 1 | 1 | 100 | 6 |
5 rows × 77 columns
data['race_recode_5'].value_counts()
1    2311103
2     320759
4      67295
3      19041
Name: race_recode_5, dtype: int64
data.columns
Index(['resident_status', 'education_1989_revision', 'education_2003_revision', 'education_reporting_flag', 'month_of_death', 'sex', 'detail_age_type', 'detail_age', 'age_substitution_flag', 'age_recode_52', 'age_recode_27', 'age_recode_12', 'infant_age_recode_22', 'place_of_death_and_decedents_status', 'marital_status', 'day_of_week_of_death', 'current_data_year', 'injury_at_work', 'manner_of_death', 'method_of_disposition', 'autopsy', 'activity_code', 'place_of_injury_for_causes_w00_y34_except_y06_and_y07_', 'icd_code_10th_revision', '358_cause_recode', '113_cause_recode', '130_infant_cause_recode', '39_cause_recode', 'number_of_entity_axis_conditions', 'entity_condition_1', 'entity_condition_2', 'entity_condition_3', 'entity_condition_4', 'entity_condition_5', 'entity_condition_6', 'entity_condition_7', 'entity_condition_8', 'entity_condition_9', 'entity_condition_10', 'entity_condition_11', 'entity_condition_12', 'entity_condition_13', 'entity_condition_14', 'entity_condition_15', 'entity_condition_16', 'entity_condition_17', 'entity_condition_18', 'entity_condition_19', 'entity_condition_20', 'number_of_record_axis_conditions', 'record_condition_1', 'record_condition_2', 'record_condition_3', 'record_condition_4', 'record_condition_5', 'record_condition_6', 'record_condition_7', 'record_condition_8', 'record_condition_9', 'record_condition_10', 'record_condition_11', 'record_condition_12', 'record_condition_13', 'record_condition_14', 'record_condition_15', 'record_condition_16', 'record_condition_17', 'record_condition_18', 'record_condition_19', 'record_condition_20', 'race', 'bridged_race_flag', 'race_imputation_flag', 'race_recode_3', 'race_recode_5', 'hispanic_origin', 'hispanic_originrace_recode'], dtype='object')
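As an aside, since we only ever use two of these 77 columns, we could have loaded just those with the usecols parameter of pd.read_csv; here we keep the full frame so that the shape check below still matches the documentation. A sketch:
# load only the two columns needed for this visualisation
# data = pd.read_csv(data_url, usecols=['manner_of_death', 'race_recode_5'])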
It looks good so far, but let's confirm the 77 variables against 2,718,198 samples from the documentation.
data.shape
(2718198, 77)
Perfect, that's exactly what we were expecting.
Data Wrangling¶
We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the two variables we're interested in are race_recode_5 and manner_of_death.
pd.DataFrame(data.columns.values.tolist())
0 | |
---|---|
0 | resident_status |
1 | education_1989_revision |
2 | education_2003_revision |
3 | education_reporting_flag |
4 | month_of_death |
... | ... |
72 | race_imputation_flag |
73 | race_recode_3 |
74 | race_recode_5 |
75 | hispanic_origin |
76 | hispanic_originrace_recode |
77 rows × 1 columns
Before we select the columns we need, let's replace any missing values with 0 so that every record has a usable code.
data.fillna(0, inplace=True)
data.iloc[6572].manner_of_death
7.0
data.manner_of_death.value_counts()
7.0    2107352
0.0     388364
1.0     143961
2.0      44417
3.0      18885
5.0      11054
4.0       4165
Name: manner_of_death, dtype: int64
Cross-referencing these counts against the documentation: 2,107,352 deaths were Natural, 388,364 Not specified, 143,961 Accident, 44,417 Suicide, 18,885 Homicide, 11,054 Could not determine, and 4,165 Pending investigation. The code meanings are stored in an accompanying JSON file, so let's load it.
import json
with open("/Users/shahin/Documents/devel/data/2015_data.json", "r") as read_file:
codes = json.load(read_file)
codes['manner_of_death']
{'1': 'Accident', '2': 'Suicide', '3': 'Homicide', '4': 'Pending investigation', '5': 'Could not determine', '7': 'Natural', 'Blank': 'Not specified'}
The documentation uses 'Blank' for an unspecified manner of death, but because we filled missing values with 0 earlier, we'll move that entry to the key '0'.
codes['manner_of_death']['0'] = codes['manner_of_death'].pop('Blank')
For this visualisation we'll focus on accidents, suicides, and homicides, so let's list the other categories for removal.
remove = ["Natural", "Not specified", "Could not determine", "Pending investigation"]
list(codes['manner_of_death'].values())
['Accident', 'Suicide', 'Homicide', 'Pending investigation', 'Could not determine', 'Natural', 'Not specified']
The race labels will sit on the left-hand side of our diagram, so we'll call that list left.
left = list(codes['race_recode_5'].values())
pd.DataFrame(left)
0 | |
---|---|
0 | White |
1 | Black |
2 | American Indian |
3 | Asian or Pacific Islander |
The manner-of-death labels will sit on the right.
right = list(codes['manner_of_death'].values())
pd.DataFrame(right)
0 | |
---|---|
0 | Accident |
1 | Suicide |
2 | Homicide |
3 | Pending investigation |
4 | Could not determine |
5 | Natural |
6 | Not specified |
Now we can drop the excluded categories from the right-hand labels.
right = [x for x in right if x not in remove]
The manner_of_death column is numeric, so let's convert it to strings of integer codes to match the keys of our code table.
data['manner_of_death'] = data['manner_of_death'].astype('int32')
data['manner_of_death'] = data['manner_of_death'].astype('str')
data.iloc[6572].manner_of_death
'7'
Let's sort both label lists so the segment order is predictable.
left.sort()
right.sort()
left
['American Indian', 'Asian or Pacific Islander', 'Black', 'White']
Now we can replace the numeric manner-of-death codes with their labels.
data = data.replace({"manner_of_death": codes['manner_of_death']})
The race_recode_5 column needs the same treatment, so let's convert it to integer strings first.
data['race_recode_5'] = data['race_recode_5'].astype('int32')
data['race_recode_5'] = data['race_recode_5'].astype('str')
data.iloc[6572].race_recode_5
'2'
data = data.replace({"race_recode_5": codes['race_recode_5']})
data.iloc[6572].race_recode_5
'Black'
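Incidentally, for a whole-column recode like this, Series.map with the code dictionary is usually faster than DataFrame.replace. The sketch below assumes every value in the column has an entry in the code table, as unmapped values would become NaN:
# equivalent to the replace above, but vectorised over one column
# data['race_recode_5'] = data['race_recode_5'].map(codes['race_recode_5'])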
Now we can select just the two columns we need and work with them as pairs.
manner_race = pd.DataFrame(data[['manner_of_death', 'race_recode_5']].values)
manner_race
0 | 1 | |
---|---|---|
0 | Natural | White |
1 | Natural | White |
2 | Natural | White |
3 | Homicide | American Indian |
4 | Natural | White |
... | ... | ... |
2718193 | Natural | Black |
2718194 | Natural | White |
2718195 | Natural | White |
2718196 | Natural | Black |
2718197 | Natural | Black |
2718198 rows × 2 columns
Now for the names of our features, which combine the race and manner-of-death labels. We can use them to create an empty matrix.
features = left + right
d = pd.DataFrame(0, index=features, columns=features)
Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.
We can build the co-occurrence matrix with the following approach. We'll start by creating a list with every race and manner-of-death pairing in its original and reversed form.
# every (manner, race) pair plus its reverse, flattened into one list
manner_race = list(itertools.chain.from_iterable((i, i[::-1]) for i in manner_race.values))
for x in manner_race:
    # count the pairing unless it involves an excluded category
    if x[0] not in remove and x[1] not in remove:
        d.at[x[0], x[1]] += 1
d
American Indian | Asian or Pacific Islander | Black | White | Accident | Homicide | Suicide | |
---|---|---|---|---|---|---|---|
American Indian | 0 | 0 | 0 | 0 | 2067 | 304 | 582 |
Asian or Pacific Islander | 0 | 0 | 0 | 0 | 3032 | 364 | 1334 |
Black | 0 | 0 | 0 | 0 | 15746 | 9419 | 2518 |
White | 0 | 0 | 0 | 0 | 123116 | 8798 | 39983 |
Accident | 2067 | 3032 | 15746 | 123116 | 0 | 0 | 0 |
Homicide | 304 | 364 | 9419 | 8798 | 0 | 0 | 0 |
Suicide | 582 | 1334 | 2518 | 39983 | 0 | 0 | 0 |
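It's worth noting that iterating over roughly 5.4 million pairs in pure Python is slow. A vectorised alternative (a sketch, assuming the same left and right label lists as above) builds the same counts with pd.crosstab and mirrors them into the square matrix:
# cross-tabulate race against manner of death in one vectorised pass
counts = pd.crosstab(data['race_recode_5'], data['manner_of_death'])
counts = counts.loc[left, right]      # keep only the categories we plot
d.loc[left, right] = counts.values    # race rows, manner-of-death columns
d.loc[right, left] = counts.values.T  # mirror into the lower block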
The raw counts are dominated by the largest population group, so let's normalise each race's row and column into percentages. The chord widths will then show how manner of death is distributed within each group.
for race in left:
    # convert this race's counts into percentages of its total
    d.loc[race, :] = (d.loc[race, :] / d.loc[race, :].sum()) * 100
    d.loc[:, race] = (d.loc[:, race] / d.loc[:, race].sum()) * 100
# frequency of the distinct values in the last normalised column ('White')
(d[race].value_counts(normalize=True) * 100).astype(int)
0.000000     57
71.621960    14
23.259859    14
5.118181     14
Name: White, dtype: int64
d
American Indian | Asian or Pacific Islander | Black | White | Accident | Homicide | Suicide | |
---|---|---|---|---|---|---|---|
American Indian | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 69.996614 | 10.294616 | 19.708771 |
Asian or Pacific Islander | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 64.101480 | 7.695560 | 28.202960 |
Black | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 56.879673 | 34.024492 | 9.095835 |
White | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 71.621960 | 5.118181 | 23.259859 |
Accident | 69.996614 | 64.10148 | 56.879673 | 71.621960 | 0.000000 | 0.000000 | 0.000000 |
Homicide | 10.294616 | 7.69556 | 34.024492 | 5.118181 | 0.000000 | 0.000000 | 0.000000 |
Suicide | 19.708771 | 28.20296 | 9.095835 | 23.259859 | 0.000000 | 0.000000 | 0.000000 |
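As a quick sanity check, each race's row of percentages should now sum to roughly 100:
# every race row should total ~100 after normalisation
d.loc[left, right].sum(axis=1)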
Chord Diagram¶
Time to visualise the relationship between race and manner of death using a chord diagram. We are going to use a list of custom colours, one for each segment.
colors = ["#2DE1FC","#1883B4","#C5DB66","#90B64D","#DB2B39","#E76926", "#DB9118"]
names = left + right
Before we do, let's shorten the two longest race labels so they fit neatly around the diagram.
names[1] = "Asian or PI"  # abbreviate 'Asian or Pacific Islander'
names[0] = "AIAN"         # abbreviate 'American Indian'
d
American Indian | Asian or Pacific Islander | Black | White | Accident | Homicide | Suicide | |
---|---|---|---|---|---|---|---|
American Indian | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 69.996614 | 10.294616 | 19.708771 |
Asian or Pacific Islander | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 64.101480 | 7.695560 | 28.202960 |
Black | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 56.879673 | 34.024492 | 9.095835 |
White | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 71.621960 | 5.118181 | 23.259859 |
Accident | 69.996614 | 64.10148 | 56.879673 | 71.621960 | 0.000000 | 0.000000 | 0.000000 |
Homicide | 10.294616 | 7.69556 | 34.024492 | 5.118181 | 0.000000 | 0.000000 | 0.000000 |
Suicide | 19.708771 | 28.20296 | 9.095835 | 23.259859 | 0.000000 | 0.000000 | 0.000000 |
Finally, we can put it all together.
Chord(
    d.round(2).values.tolist(),
    names,
    colors=colors,
    credit=True,
    wrap_labels=True,
    margin=50,
    font_size_large=7,
    divide=True,
    noun="percent",
    divide_idx=len(left),
    divide_size=.2,
    width=850).show()
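If you'd like to embed the diagram elsewhere, recent versions of the chord package can also write a standalone HTML file; a sketch, assuming your installed version provides the to_html method:
# write the diagram to an HTML file instead of displaying it inline
# Chord(d.round(2).values.tolist(), names, colors=colors).to_html()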
Conclusion¶
In this section, we demonstrated how to conduct some data wrangling on a downloaded dataset to prepare it for a chord diagram. Our chord diagram is interactive, so you can use your mouse or touchscreen to investigate the co-occurrences!