Week 01: Python basics#

Learning objectives#

This lesson is designed to explain the basics of programming in Python.

  • learn and practice computer science terms: variable, comment, call, function, arguments, keyword argument, positional argument, default values, float, int, string, boolean, list, dict, object, package, array, DataFrame, Series

  • calculate arithmetic operations

  • call functions and change default values

  • indexing

  • logical statements

  • array

  • DataFrame, Series, some details

Preparation#

To follow along with this Lesson, please open the Colab notebook Week 01 Notes. The first code cell of this notebook calls to the remote computer, on which the notebook is running, and installs the necessary packages. For practice, you are responsible for importing the necessary packages.

Variable#

A variable consists of a name and value, where the name references the value. The equals sign = assigns to a name a specific value. The code below creates a variable named weight_kg which holds the value 55.

weight_kg = 55

The code above follows the general pattern in Python for creating variables: variable = expression. The left hand side of the = specifies a name and the right hand side can be any valid Python expression.

weight_lb = weight_kg * 2.2

Notice that the right hand side is now a Python/mathematical expression that uses the variable weight_kg to create a new value (55 * 2.2) and creates a new variable, named weight_lb. We have thus assigned to the variable weight_lb the value as determined by the expression weight_kg * 2.2.

If we want to leave future readers of our code a note, within the code, to remind them of what this code is doing, we can leave a comment. Comments are ignored by Python, and are intended to help readers of your code.

weight_lb = weight_kg * 2.2 # convert kg to pounds

It is easy to get carried away with comments. When you are starting, this is not necessarily a bad thing. As your programming skills progress, your comments should focus less and less on how to read Python code and should instead focus more and more on the details surrounding why the code is written as it is. This is a fine line.
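As a small illustration of that shift, the sketch below contrasts a comment that only restates the code with one that records why the code is written as it is. The name KG_TO_LB is a hypothetical name chosen for this example and is not used elsewhere in the lesson.

```python
weight_kg = 55

# A "how" comment restates the code: multiply weight_kg by 2.2
weight_lb = weight_kg * 2.2

# A "why" comment records the reasoning: 2.2 is a rounded conversion
# factor (1 kg is closer to 2.20462 lb), which is fine for a rough estimate.
KG_TO_LB = 2.2  # hypothetical name for the conversion factor
weight_lb = weight_kg * KG_TO_LB
print(weight_lb)
```

Both assignments compute the same value; only the second comment tells a future reader something the code itself does not.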

Functions#

This section will focus on using Python functions. We will look at writing functions later. A Python function is a named and bundled sequence of actions to perform on one or more variables. The inputs are called arguments and the outputs (if there are any) are called return values.

Consider the function abs, which computes the absolute value of its argument.

x = -3.1415
abs(x)
3.1415

The input is the variable x, which holds the value -3.1415, and the output is the value 3.1415, which itself is not assigned to any variable.

The general pattern in Python to call a function named f on a single argument x is f(x). If there is more than one argument, say x1, x2, ..., xN, the arguments are separated by commas: f(x1, x2, ..., xN).

The function round accepts a number as its first argument and, as a second, named argument, the number of digits to round the first argument to. If a function argument is named in the call, it is called a keyword argument. If a function argument is not named, it is called a positional argument.

round(x, ndigits = 1)
-3.1

Keyword arguments have a few benefits. They can make code easier to read, so that you and the readers of your code are (somewhat) reminded of what the argument 1 is doing. Keyword arguments can also have default values. The argument ndigits of the function round defaults to None if no other value is specified, which instructs round to round the first positional argument to the nearest integer.

round(x)
-3

If you put arguments in the correct order, you do not need to specify the name of the keyword arguments, but this isn’t always good practice.

round(x, 3)
-3.142
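The sketch below puts the two calling styles side by side and shows a default value being changed. It uses round as above, plus the built-in function sorted, which is not otherwise covered in this lesson and is included only because its keyword argument reverse defaults to False.

```python
x = -3.1415

# Positional: the meaning of the 2 comes only from its position.
a = round(x, 2)

# Keyword: the name ndigits documents what the 2 is for.
b = round(x, ndigits=2)

print(a, b)  # -3.14 -3.14

# sorted has a keyword argument reverse whose default value is False.
ascending = sorted([3, 1, 2])                 # default: reverse=False
descending = sorted([3, 1, 2], reverse=True)  # default changed

print(ascending)   # [1, 2, 3]
print(descending)  # [3, 2, 1]
```

Leaving out reverse gives the default behavior; naming it in the call is how you change a default value.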

Variable types#

Each variable in Python is of a specific type. Although we may think of the numbers 3 and 3.0 as mathematically equivalent, these numbers are of different types in Python (and most other programming languages).

type(x)
float
type(3)
int

The Python type float is for variables that reference (something like) real valued numbers, numbers with decimals, like the value -3.1415. The type int is for integer values.

There are many other Python types. Here’s an incomplete list of Python specific types: str, bool, list, and dict. We’ll look at str and bool immediately, and save list and dict for the next section.

The type str stands for string. Strings are any sequence of characters between double " or single ' quotes.

x = "MATH 131"
type(x)
str

Notice that I re-assigned the variable x to now reference the value "MATH 131". Most programming languages allow re-assignment of variables. Dynamic languages, such as Python, go further and allow re-assignment of variables even across two different types: float -> str.

The type bool stands for boolean. Boolean is the computer science term for the two specific values, which are spelled True and False in Python.

y = True
z = "True"
y == z
False

Spend a second with the code above. The variable y is of type bool because it references one of the special Boolean values, specifically True. The variable z is of type str because it references a string which contains the word True. Python considers y and z as different, since they are different types and have different values. The last line indicates that Python treats y and z as not equal. The Python code y == z is like asking the question, is y equal to z?, to which Python returns the Boolean value False.
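A short sketch of how == interacts with types may help here. The variable names same_number and same_text are hypothetical names used only for this example. Note that the numbers 3 and 3.0 compare as equal even though their types differ, whereas a bool and a str do not.

```python
# int and float are different types ...
print(type(3), type(3.0))

# ... but == compares numeric values across int and float.
same_number = 3 == 3.0
print(same_number)  # True

# A bool and a str are different types with different values.
same_text = True == "True"
print(same_text)    # False
```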

Novice programmers sometimes think the different types are burdensome. With some practice though, you will learn that types help programmers write good and logical code. For instance, calling abs on x, which now references a string, raises an error.

abs(x)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 abs(x)

TypeError: bad operand type for abs(): 'str'

The TypeError is a friendly reminder that you re-assigned the variable x to hold a str value and the function abs has no meaning when applied to a str type. If nothing else, we must get used to dealing with types appropriately.
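One way of dealing with types appropriately is to convert a value to the type a function expects. The sketch below assumes a string that happens to look like a number; the names s and s_num are hypothetical. A string such as "MATH 131" cannot be converted this way, and float would raise a ValueError for it.

```python
s = "-3.1415"      # a str, even though it looks like a number
print(type(s))

s_num = float(s)   # convert the str to a float
print(type(s_num))

print(abs(s_num))  # 3.1415, abs now has a meaningful argument
```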

There are in fact many more variable types in Python and in programming more generally. For now, we’ll skip to variable types that store (plural) data.

Data structures#

Variable types that contain more than one datum (that is, data) are generally referred to as data structures. Two important Python types for storing data are list and dict.

Both list and dict are types that contain multiple elements. For both types, the contained elements themselves do not need to be of the same type. What differentiates a list from a dict is how the contained elements are indexed (and also how the data structures are stored in memory).

List#

A list indexes its elements with ints or slices of integers. The code below creates a variable named lst which references a list of heterogeneous types, and then retrieves the element at index 1 of the list lst.

lst = [12.4, 7.3, "Chico State", True]
lst[1]
7.3

To define the list lst, the expression on the right hand side of = surrounds, with square brackets, a comma separated list of values. Any left square bracket [ must be followed by a right square bracket ] to complete a list-defining expression.

The syntax lst[1] indexes the list and retrieves the first (not the zero-th) element of the list, 7.3. Lists in Python are said to be zero indexed because counting starts at 0: the zero-th element, lst[0], is the value 12.4.

One can also index a list with a slice of integers, e.g. 0:2.

lst[0:2]
[12.4, 7.3]

Notice though, despite the slice 0:2, the second element of lst (the element at index 2) is not retrieved. The slice a:b indexes from a list all elements from position a up to but not including position b.

By not specifying the beginning of a slice, one can index from the beginning up to but not including the specified end point b.

lst[:2]
[12.4, 7.3]

Or the opposite, where the beginning a is specified but the end is not.

lst[2:]
['Chico State', True]

For sure, indexing lists with slices takes a fair bit of practice.
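For a little more practice, the sketch below repeats the list from above and tries a few more slices. One rule of thumb, offered here as a study aid rather than anything official: when a and b are both valid non-negative positions with a no larger than b, the slice a:b contains b - a elements.

```python
lst = [12.4, 7.3, "Chico State", True]

print(lst[0])    # 12.4, the zero-th element
print(lst[1:3])  # [7.3, 'Chico State'], positions 1 and 2 but not 3
print(lst[:3])   # [12.4, 7.3, 'Chico State']
print(lst[3:])   # [True]

# The slice 1:3 holds 3 - 1 = 2 elements.
print(len(lst[1:3]))  # 2
```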

Advanced and optional (not required)#

[lst[0:2], lst[1:], lst[:1], lst[0:3:2]]
[[12.4, 7.3], [7.3, 'Chico State', True], [12.4], [12.4, 'Chico State']]

The above code creates a new unreferenced list, which contains as elements sub-lists of the list lst. The zero-th sub-list consists of the zero-th and first elements of lst.

The slice 1: indexes the first element up to and including the last element.

The slice :1 indexes the zero-th element up to but not including the first element.

The slice 0:3:2 indexes the zero-th and the second element, since the :2 indicates steps of size 2 starting from index 0 and going up to but not including index 3.

One can also index a list in reverse order. Imagine wrapping the indices around the back of the list starting from the zero-th element. Thus the -1-th element is the last element.

lst[-1]
True

And for fun, even crazier indexing expressions exist. This effectively reverses a list.

lst[::-1]
[True, 'Chico State', 7.3, 12.4]

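The slice ::-1 says to take steps of size -1. Because the step is negative, the omitted start defaults to the last index and the omitted end runs past index 0, so the slice walks from the last element back to the zero-th. Hence, one gets a list in reverse order.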
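The sketch below checks the claims about negative indices and reversed slices against the same list. The built-in functions len and reversed are not introduced in this lesson; they are used here only to confirm the results, and n is a hypothetical name for the length of the list.

```python
lst = [12.4, 7.3, "Chico State", True]
n = len(lst)  # 4 elements, so the valid indices are 0, 1, 2, 3

# Index -1 refers to the same element as index n - 1.
print(lst[-1] == lst[n - 1])  # True
print(lst[-2])                # Chico State

# The slice ::-1 gives the same result as reversing the list.
backwards = lst[::-1]
print(backwards == list(reversed(lst)))  # True

# Slicing creates a new list; lst itself is unchanged.
print(lst)
```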

Dictionary#

A dictionary indexes its elements with keys. The type of a dictionary in Python is dict. In other computer science worlds, a Python dict might be called an associative container because it associates to each key a value.

dct = {
  "MATH 131": "Introduction to Python",
  "MATH 450": True,
  "pi": x,
  "list": lst
  # key: value,
}
dct["pi"]
'MATH 131'

The dict referenced by the name dct is created using curly braces, instead of square brackets like for a list. Between the curly braces the pattern repeats key: value pairs separated by commas, where the colon : distinguishes the key from the value. There are lots of options for the key types, but it will be easiest if we think of the keys as specifically type str for now.

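The dict dct associates with the key "pi" the value stored in the variable x that we created earlier. Because x was re-assigned to the string "MATH 131", that string is the value returned by dct["pi"]. The key "list" is associated with the list lst.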

The options for indexing dicts are, in some sense, more restrictive than for lists. There are no slices with dicts. However, one can mutate (think edit) a dict by assigning to a key that doesn’t exist yet, which adds the new key to the dict dct and associates it with the specified value. If the key happens to exist already, the equals sign below will instead over-write the value previously associated with the key.

dct["python"] = "seems fancy"
dct["python"]
'seems fancy'

These data structures show up all over Python code. Python lists are helpful when you want to reference data of not necessarily the same type, with one variable name, in an order that makes sense to index by sequential integers. Python dicts are helpful when the order of the contained elements does not necessarily make sequential sense and instead it is easier to associate the contained elements by key/name.
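The sketch below walks through adding a key, over-writing it, and checking whether a key exists. It uses a smaller dict than the one above. The in operator and the method get are not covered in this lesson; they are shown as one common way to avoid the KeyError that Python raises when you index a key that is missing.

```python
dct = {
    "MATH 131": "Introduction to Python",
    "MATH 450": True,
}

dct["python"] = "seems fancy"  # new key: the dict grows to 3 pairs
dct["python"] = "is fancy"     # existing key: the value is over-written
print(dct["python"])           # is fancy
print(len(dct))                # 3

# Check for a key before indexing, or ask for a fallback value.
print("pi" in dct)                   # False
print(dct.get("pi", "no such key"))  # no such key
```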

Packages#

There are many more types built-in to Python and many more types not directly built-in. Communities of programmers, like that of the data science community and before them the numerical computing community, have built types into packages for very specific use-cases. Think of packages as additional features one can add-on to Python when you need/want.

In the last sections of Week 01, we’ll introduce the type ndarray from the Python package NumPy and the types DataFrame and Series from the package Pandas. We’ll explore the type DataFrame by example, using a dataset that is bundled with the plotting package plotnine. All of these packages are important for any data analysis using the programming language Python.

One must install and then import a package in order to use it. At the top of each Colab notebook, I will have placed code to install the necessary packages into the notebook environment. Once installed, to import the packages plotnine, NumPy, and Pandas, the following code is common.

import plotnine

import numpy as np
import pandas as pd

The general code used to import a package is import package_name, like shown for plotnine, but for some packages it is common to rename the packages. NumPy is often imported as np, so that anytime you need to reference any part of the NumPy package, you lead with np, e.g. np.array, np.mean, or np.size. Similarly, the package Pandas is often renamed as pd.

Array#

From the Python package NumPy, the type ndarray is used to contain multiple elements, all of the same type. The name of the array type, ndarray, is meant to stand for n-dimensional array. The most common element types of NumPy ndarrays are bool, int, or float, although, for reasons we won’t cover, NumPy uses its own types, analogous to the ones just mentioned. The table below lists some of the Python types we discussed above and the analogous NumPy types that are often contained within a NumPy ndarray.

Python    NumPy
bool      bool_
int       int_ or int64
float     double or float64

The code below creates a NumPy array of np.float64s, since at least one element of our input list is a float.

import numpy as np # this line shows up once at the top of each notebook,
# not once per code cell

a = np.array([1., 2, 3])
a
array([1., 2., 3.])
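To see how the input list determines the element type, the sketch below builds one array from ints only and one from a list that includes a float, then inspects the property dtype of each. The names ints and floats are hypothetical. The exact integer type can differ by platform, so the comment on that line is deliberately loose.

```python
import numpy as np

ints = np.array([1, 2, 3])     # every input element is an int
floats = np.array([1., 2, 3])  # one float is enough to make them all floats

print(ints.dtype)    # an integer type, int64 on most platforms
print(floats.dtype)  # float64
```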

With a NumPy array, like a above, you can calculate the mean, the sum, or the number of elements in the array with code like the following.

Note

Note that only the last line of code actually prints, even if all lines of code are executed.

np.mean(a)
np.sum(a)
np.size(a)
3

The array a above is only one dimensional. np.ndarrays, read as NumPy arrays, can be multidimensional, hence the name ndarray. The array A below is initialized to hold 2 dimensions of ones; two rows and three columns of ones, for a total of 6 ones. Then A is multiplied by the mean of a. Notice that the multiplication happens element-wise, that is each element of A is multiplied by np.mean(a), and the output is a new array with the same shape that A was initialized to.

A = np.ones(shape = (2, 3))
A * np.mean(a)
array([[2., 2., 2.],
       [2., 2., 2.]])

The shape of A, and this newly created, un-referenced array, is effectively 2 rows, each of size 3.

The function np.shape returns the shape of the array, while np.size returns the total number of elements in the array.

print(np.shape(A)) # print(...) forces printing
np.size(A)
(2, 3)
6
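The sketch below contrasts element-wise multiplication of an array with the same operator applied to a plain list, and confirms how shape and size relate. The list comparison is an addition for illustration and is not part of the lesson text above.

```python
import numpy as np

lst = [1, 2, 3]
a = np.array(lst)

print(lst * 2)  # [1, 2, 3, 1, 2, 3], a list is repeated
print(a * 2)    # [2 4 6], an array is multiplied element-wise

A = np.ones(shape=(2, 3))
rows, cols = np.shape(A)
print(rows, cols)                  # 2 3
print(rows * cols == np.size(A))   # True, size is rows times columns
```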

There is a lot more to say about NumPy’s type ndarray. For now, this is enough and we’ll move on to Pandas DataFrame and Series types, since we’ll spend most of our time in this class using dataframes.

DataFrames#

Note

To use the package Pandas, you normally need to execute import pandas as pd. Such a line of code only needs to happen once per notebook, and it should occur near the top of the notebook. We will instead import a dataframe from the package plotnine for our example.

The Python package Pandas has two important types for data analysis: pd.Series and pd.DataFrame. The pd.DataFrame type is used for tabular or rectangular data; think data that would fit well in a spreadsheet.

The type pd.Series is intended to contain multiple elements, all of the same type. In this aspect, pd.Series is similar to np.ndarray. Pandas created their own homogeneous type container, pd.Series, to more naturally wrap multiple pd.Series objects into a DataFrame. And that’s exactly what pd.DataFrame is, one container of multiple Series, each of which has its own element type and makes up one column of the tabular data. So pd.DataFrame is a heterogeneous type container, made up of multiple homogeneous type columns. A further requirement of pd.DataFrame is that each contained Series must have the same size, hence pd.DataFrame holds rectangular data.

Let’s see an example. We’ll import the DataFrame diamonds from the Python package plotnine.

from plotnine.data import diamonds

diamonds
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

Each of the columns

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

has a name and each row corresponds to a row index, starting at index 0, just like a NumPy array. Further, each column has one specific type for all the elements contained in that column. The non-numeric type category is for columns which contain names or labels.

diamonds.dtypes
carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object

Notice that I accessed the properties .columns and .dtypes of the dataset diamonds, just like we previously accessed types and functions from the packages NumPy or Pandas, e.g. np.array, np.mean, pd.DataFrame. We’ll cover the details of such . access later. For now, let’s continue to use properties associated with a DataFrame to learn about the type pd.DataFrame.

Each DataFrame is two dimensional, so it has a shape (a number of rows and a number of columns) and a size. There are 53,940 observations and 10 variables in the diamonds dataset.

print(diamonds.shape)
diamonds.size
(53940, 10)
539400

To index one column from a DataFrame, use the associated property .loc. Because the associated property .loc is not a function, but instead a property for which you index rows and/or columns, you should use square brackets as we did for indexing lists and dicts. Inside the square brackets goes an index for rows and then columns, separated by a comma, e.g. pd.DataFrame.loc[rows, cols].

diamonds.loc[:, "depth"]
0        61.5
1        59.8
2        56.9
3        62.4
4        63.3
         ... 
53935    60.8
53936    63.1
53937    62.8
53938    61.0
53939    62.2
Name: depth, Length: 53940, dtype: float64

The colon : acts as a slice of ints indexing rows 0 up to and including the last row. Check for yourself that the type of the retrieved column is pd.Series.

Indexing a DataFrame by column is so common that you can even retrieve a column by indexing a DataFrame like a dict, where the key is the name of the column. Check for yourself that the type of the retrieved column is pd.Series.

diamonds["depth"]
0        61.5
1        59.8
2        56.9
3        62.4
4        63.3
         ... 
53935    60.8
53936    63.1
53937    62.8
53938    61.0
53939    62.2
Name: depth, Length: 53940, dtype: float64

You can also retrieve more than one column at once. The key in this case is a list of column names, and the return value is a subset of the DataFrame you are indexing.

diamonds[["depth", "cut"]]
depth cut
0 61.5 Ideal
1 59.8 Premium
2 56.9 Good
3 62.4 Premium
4 63.3 Good
... ... ...
53935 60.8 Ideal
53936 63.1 Good
53937 62.8 Very Good
53938 61.0 Premium
53939 62.2 Ideal

53940 rows × 2 columns
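The three ways of selecting columns can be compared on a tiny table. The sketch below builds a hypothetical three-row DataFrame named df by hand, using the first three depth and cut values printed above, so that it runs without plotnine. The names one_a, one_b, and two are also hypothetical.

```python
import pandas as pd

# A small hand-built stand-in for the diamonds dataset.
df = pd.DataFrame({
    "depth": [61.5, 59.8, 56.9],
    "cut": ["Ideal", "Premium", "Good"],
})

one_a = df.loc[:, "depth"]   # all rows, one column, via .loc
one_b = df["depth"]          # the same column, dict-style
two = df[["depth", "cut"]]   # a list of names selects several columns

print(type(one_a).__name__)  # Series
print(type(two).__name__)    # DataFrame
print(one_a.equals(one_b))   # True, both styles return the same column
```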

Since each column is a Series, which is analogous to np.ndarray, you can calculate the size, mean, standard deviation, minimum, percentile/quantiles, or maximum of a DataFrame column. Or you can use the function describe which is associated with both pd.Series and pd.DataFrame types.

depth = diamonds["depth"]
print(np.size(depth))
print(np.mean(depth))
print(np.std(depth))
print(np.quantile(depth, [0.25, 0.5, 0.75]))
depth.describe()
53940
61.749404894327036
1.4326080390046028
[61.  61.8 62.5]
count    53940.000000
mean        61.749405
std          1.432621
min         43.000000
25%         61.000000
50%         61.800000
75%         62.500000
max         79.000000
Name: depth, dtype: float64
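You may have noticed that np.std(depth) and the std row of describe agree only to four decimal places. The likely reason is a difference in defaults: np.std divides by the number of elements n (ddof = 0), while the Pandas standard deviation divides by n - 1 (ddof = 1). The sketch below illustrates this on a hypothetical five-element Series built from the first five depth values printed earlier; the names pop_sd and samp_sd are hypothetical.

```python
import numpy as np
import pandas as pd

depth = pd.Series([61.5, 59.8, 56.9, 62.4, 63.3])

pop_sd = np.std(depth)           # default ddof=0, divides by n
samp_sd = np.std(depth, ddof=1)  # ddof=1, divides by n - 1

print(pop_sd < samp_sd)                  # True
print(np.isclose(samp_sd, depth.std()))  # True, Pandas defaults to ddof=1
```

With 53,940 rows the two divisors are nearly equal, which is why the two values in the output above differ only slightly.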

When describe is called on a whole DataFrame, only the columns with numeric types, float64 or int64, are described. The columns of type category are automatically excluded before the numeric summaries are calculated.

diamonds.describe()
carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000

You can index a subset of rows with slices, but with .loc on a pd.DataFrame, a slice like start:stop will include the value stop. The following code therefore retrieves rows 2 up to and including 10, and both columns "depth" and "clarity". Notice that to get multiple columns, you must wrap the column names in a list. Check for yourself that the type of the object returned is pd.DataFrame.

diamonds.loc[2:10, ["depth", "clarity"]]
depth clarity
2 56.9 VS1
3 62.4 VS2
4 63.3 SI2
5 62.8 VVS2
6 62.3 VVS1
7 61.9 SI1
8 65.1 VS2
9 59.4 VS1
10 64.0 SI1

Note

Quoted straight from the Pandas documentation: “Note that contrary to usual python slices, both the start and the stop are included”
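The difference between the two slicing rules is easy to check by counting rows. The sketch below uses a hypothetical five-row DataFrame named df with the default row index 0 through 4. It also shows the property .iloc, which is not covered in this lesson and which selects by position using the usual Python slice rule.

```python
import pandas as pd

df = pd.DataFrame({"depth": [61.5, 59.8, 56.9, 62.4, 63.3]})
lst = list(df["depth"])

print(len(lst[1:3]))      # 2, a list slice excludes the stop
print(len(df.loc[1:3]))   # 3, .loc includes the row labeled 3
print(len(df.iloc[1:3]))  # 2, .iloc follows the list rule
```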

The options for indexing pd.DataFrame and pd.Series objects are, in some sense, more powerful than for lists. Let’s consider indexing by a Series of booleans. The variable idx is a boolean Series that marks with True all the rows where the logical statement is true, that is when a diamond’s cut is equal to “Very Good”. All rows where a diamond’s cut is not equal to “Very Good” are marked with False.

idx = diamonds.loc[:, "cut"] == "Very Good"
idx
0        False
1        False
2        False
3        False
4        False
         ...  
53935    False
53936    False
53937     True
53938    False
53939    False
Name: cut, Length: 53940, dtype: bool

We can then use the variable idx to index another column of diamonds. Say we want to calculate the mean of the column price, for only the “Very Good” cut diamonds.

np.mean(diamonds.loc[idx, "price"])
np.float64(3981.7598907465654)

One can also quickly flip all of idx’s Trues to Falses and all Falses to Trues, so as to compare the mean price of not “Very Good” cut diamonds. Notice the ~, read tilde, in front of idx.

np.mean(diamonds.loc[~idx, "price"])
np.float64(3918.667733766544)
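Boolean indexing is easier to verify by hand on a table small enough to read. The sketch below uses a hypothetical four-row DataFrame named df; the prices are made up for the example and are not taken from the diamonds dataset. The names mean_vg and mean_other are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "cut": ["Ideal", "Very Good", "Good", "Very Good"],
    "price": [326, 400, 327, 500],  # made-up prices
})

idx = df.loc[:, "cut"] == "Very Good"
print(idx.tolist())  # [False, True, False, True]

mean_vg = df.loc[idx, "price"].mean()      # (400 + 500) / 2
mean_other = df.loc[~idx, "price"].mean()  # (326 + 327) / 2

print(mean_vg)     # 450.0
print(mean_other)  # 326.5
```

Summing a boolean Series, as in idx.sum(), counts the True values, which is a quick way to see how many rows a condition selects.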

See also

For more about indexing in Pandas see Selection by label and Selection by position. Such tools have a steep learning curve and a huge payoff.