Week 02: Introduction to working with data#

Learning objectives#

In this lesson we learn how to summarize data in a DataFrame, and do basic data management tasks such as making new variables, recoding data, and dealing with missing data.

After completing this lesson, you should be able to

  • identify missing values in a dataset

  • summarize variables contained in a DataFrame

  • make new variables and insert them into a DataFrame

  • edit variables contained in a DataFrame

Throughout this notebook, we’ll use the dataset msleep from the packages plotnine.

Preparation#

To follow along with this Lesson, please open the Colab notebook Week 02 Notes. The first code cell of this notebook calls to the remote computer, on which the notebook is running, and installs the necessary packages. For practice, you are repsonible for importing the necessary packages.

Missing data#

Missing data occurs when the value for a variable is missing. Think of it as a blank cell in an spreadsheet. Missing values can cause some problems during analysis, so let’s see how to detect missing values and how to work around them.

Let’s use the dataset msleep from the package plotnine.

from plotnine.data import msleep

import plotnine as pn
import numpy as np

msleep
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
0 Cheetah Acinonyx carni Carnivora lc 12.1 NaN NaN 11.9 NaN 50.000
1 Owl monkey Aotus omni Primates NaN 17.0 1.8 NaN 7.0 0.01550 0.480
2 Mountain beaver Aplodontia herbi Rodentia nt 14.4 2.4 NaN 9.6 NaN 1.350
3 Greater short-tailed shrew Blarina omni Soricomorpha lc 14.9 2.3 0.133333 9.1 0.00029 0.019
4 Cow Bos herbi Artiodactyla domesticated 4.0 0.7 0.666667 20.0 0.42300 600.000
... ... ... ... ... ... ... ... ... ... ... ...
78 Tree shrew Tupaia omni Scandentia NaN 8.9 2.6 0.233333 15.1 0.00250 0.104
79 Bottle-nosed dolphin Tursiops carni Cetacea NaN 5.2 NaN NaN 18.8 NaN 173.330
80 Genet Genetta carni Carnivora NaN 6.3 1.3 NaN 17.7 0.01750 2.000
81 Arctic fox Vulpes carni Carnivora NaN 12.5 NaN NaN 11.5 0.04450 3.380
82 Red fox Vulpes carni Carnivora NaN 9.8 2.4 0.350000 14.2 0.05040 4.230

83 rows × 11 columns

We can see that a few variables contain the value NaN, which stands for not-a-number. These are the missing data. Notice that only the first five and last five rows of the DataFrame are printed. The ellipsis, …, in the middle row indicates not all rows are printed by default. So we may not be able to see all the variables with NaNs. Each column/Series has a property .hasnans which returns True if the Series on which it is called has any NaNs.

msleep["conservation"].hasnans
True

Missing values are tricky to deal with in general. Much of the time in Pandas, it seems like things just work out well for you. For instance, calculating a mean or a standard deviation automatically ignores NaNs.

np.mean(msleep["brainwt"])
np.float64(0.28158142857142854)

But other times, NaNs are not automatically ignored, and nothing about the code or its output tells you when NaNs are ignored or not.

np.size(msleep["brainwt"])
83

The plotting package plotnine, by default, includes NaNs as its own category, which can be undesirable.

p = pn.ggplot(data = msleep) + pn.geom_bar(pn.aes(x = "conservation"))
p.draw()
_images/abab35f4a72907d1014edea32eb0dfa0d5c15c712cfa4f0a5ae7230424b36381.png

To remove the missing data from a plot, create a new DataFrame by dropping the mising data from the column of interest. You should always specify the keyword argument subset, lest you drop all missing data, which might drop a row where the data is missing in a column you care less about and subsequently drops data from the row you do care about.

df = msleep.dropna(subset = "conservation")
p = pn.ggplot(data = df) + pn.geom_bar(pn.aes(x = "conservation"))
p.draw()
_images/4c176d0ee20373cd8a7559761810b31019cc4e31c051fb177b08bde8fc96e873.png

The code equivalent of the last plot above is as follows.

msleep["conservation"].value_counts(dropna = True)
conservation
lc              27
domesticated    10
vu               7
en               4
nt               4
cd               2
Name: count, dtype: int64

Summarize data#

In the subsection above, Missing data, we learned that accounting for missing data can be tricky. The Pandas property describe which excludes NaN values. Thus, you have a choice. Use np.size if you want a count of the number of elements including missing data, NaNs. Or use .describe if you want a count of the number of elements excluding missing data.

Here, we call .describe on a Series of type category elements, that is a categorical variable named conservation.

msleep["conservation"].describe()
count     54
unique     6
top       lc
freq      27
Name: conservation, dtype: object

Here, we call .describe on a Series of type float64 elements, that is a numeric variables named brainwt.

msleep["brainwt"].describe()
count    56.000000
mean      0.281581
std       0.976414
min       0.000140
25%       0.002900
50%       0.012400
75%       0.125500
max       5.712000
Name: brainwt, dtype: float64

Since there is not much to be done with missing values other than count them, here is a line of code you can use to count the number of missing values.

nan_idx = msleep["conservation"].isna()
np.sum(nan_idx)
np.int64(29)

The property/function .isna returns a boolean Series where each missing value is indexed by True and each non-missing value is indexed by False. Since True is equivalent to the number 1 and False is equivalent to 0 in Python, we can simply sum up the number of Trues and Falses to count the number of missing data.

Creating/Editing new variables#

Imagine you want to create a new variable brnbdywt in the DataFrame msleep. This new variable might tell you something about how smart the animal is. The bigger the brain to body weight ratio, maybe the smarter the animal is.

Just like for a dict, you can index into a DataFrame with a column name that does not yet exist, so long as you first assign to that column with some expression. For instance, let’s create a new variable brnbdywt inside the DataFrame msleep, and assign to it the ratio of the columns brainwt and bodywt.

msleep["brnbdywt"] = msleep["brainwt"] / msleep["bodywt"]

The division on the right hand side of = is done element-wise. Therefore, the Series created by the division of two Series has the same size as every column in msleep. We have thus assigned to the new column brnbdywt a new variable that was created by the ratio of brainwt to bodywt.

msleep["brnbdywt"].describe()
count    56.000000
mean      0.010356
std       0.009251
min       0.000705
25%       0.003794
50%       0.006956
75%       0.015244
max       0.039604
Name: brnbdywt, dtype: float64

We can take this one step further. Let’s create a new boolean variable named smrt, which will hold True whenever an animal is in the top 25% of the brain to body weight ratio and False otherwise. We start by creating the new column and making every value equal to False.

msleep["smrt"] = False

Since the right hand side of the equals is scalar, just one value, it is recycled as much as necessary to create a new column in msleep to keep the requirement that every column of a DataFrame has the same size.

Next, we create a boolean Series that indicates when an animal qualifies as smrt. Then, we over-write with the value True just those rows for which we’ve identified a smrt animal. Hence, we can edit a variable in a DataFrame using the indexing strategy we learned about in Week 01: DataFrames.

smrt_idx = msleep["brnbdywt"] >= 0.015
msleep.loc[smrt_idx, "smrt"] = True

Categorical variables#

A categorical variable is a variable that has names or labels as values. We basically created a categorical variable above, with only two values True and False, and named it smrt. The only thing we haven’t done is convinced Python/Pandas to treat the variable smrt as if the elements are of type category. Let’s fix this.

msleep["smrt"] = msleep["smrt"].astype("category")
msleep["smrt"]
0     False
1      True
2     False
3      True
4     False
      ...  
78     True
79    False
80    False
81    False
82    False
Name: smrt, Length: 83, dtype: category
Categories (2, bool): [False, True]

Invariably with categorical variables, the categories you have are not what you want. Let’s change the categories from False and True to something else.

msleep["smrt"] = msleep["smrt"].cat.rename_categories({False: "nope", True: "yup"})

There’s two things to note here. First, the property .cat can only be called on a categorical Series; the elements must be recognized by Python/Pandas as type category. Second, the argument to .rename_categories is a dict with keys equal to the categories that you have, but don’t want, and values equal to the categories you want. The pattern in pseudo-code might be written as {"old": "new"}. The problem with this is that the keys don’t have to be of type str, as we saw above.

We now have a variable with categories as names or labels, instead of False and True, we can develop more categories. Let’s break up the "Nope"s into two groups: "meh" for so-so smart animals and "Nope". The "meh" group is in the middle. In order to add this new category, we need to prime the Series smrt for the fact that we want to add a category that doesn’t yet exist.

msleep["smrt"] = msleep["smrt"].cat.add_categories(["meh", "doh"])

Next, let’s create a boolean Series which indicates the exact rows which should be labeled as "meh". The variable meh_idx is one boolean Series created with logical and & of two boolean Series. In an expression such as a & b, where a and b are both boolean Series, each element of a is compared to each element of b. When both elements are True, the corresponding element of meh_idx is True. If either element is False, the corresponding element of meh_idx id False.

meh_idx = (0.004 <= msleep["brnbdywt"]) & (msleep["brnbdywt"] < 0.015)
msleep.loc[meh_idx, "smrt"] = "meh"
msleep["smrt"]
0     nope
1      yup
2     nope
3      yup
4     nope
      ... 
78     yup
79    nope
80     meh
81     meh
82     meh
Name: smrt, Length: 83, dtype: category
Categories (4, object): ['nope', 'yup', 'meh', 'doh']

The character &, read as ampersand, above is the logical equivalent of the word and. The character |, read as pipe, is the logical equivalent of the word or. To type an ampersand hold shift and press the number 7 above the letters Y and U on your keyboard. To type a pipe hold shift and press the backslash key, which is in between enter/return and backspace/delete.

Did you notice that I added the unused category "doh"? On the one hand, this shows that you can add multiple new categories at once, just use a list. On the other hand, you may very well end up with an unsed category after renaming various categories. Here’s how you can remove any unused categories.

msleep["smrt"] = msleep["smrt"].cat.remove_unused_categories()
msleep["smrt"]
0     nope
1      yup
2     nope
3      yup
4     nope
      ... 
78     yup
79    nope
80     meh
81     meh
82     meh
Name: smrt, Length: 83, dtype: category
Categories (3, object): ['nope', 'yup', 'meh']

The function remove_unused_categories() is a safe bet, because no used category will be removed. Alternatively, the function remove_categories([...]) will remove any specified categories, whether or not they are used. The function documentation warns “Values which were in the removed categories will be set to NaN”.