Week 06 and beyond#
Python w/out Google Colab#
Creating your own dataset#
When you start creating your own small to medium-sized datasets, this section is for you. It offers some general advice on creating a dataset.
Entering data into a spreadsheet is easy. And that’s good. But there are some gotchas that you should avoid. Below you’ll find a list of dos followed by explanations, then a list of don’ts followed by explanations, for creating your own datasets.
DOs#
be consistent
use simple variable names
prefer all lower case letters
minimize numbers and special characters
use underscore _ instead of space
organize files within directories
Be consistent. When programming, having to repeatedly look back at your spreadsheet to figure out your variable names is beyond annoying, because it interrupts your programming. Programming is hard enough. Try to minimize the interruptions that can be avoided simply by being consistent.
Use simple variable names. Consider two variables you might want to name with multiple words, like miles per gallon and brain to body weight ratio. It is easy to name one variable using camel case, e.g. MilesPerGallon, and another with mixed capitalization, e.g. Brain(g)bodyWeight(KG). The first name is fine, so long as you are consistent and choose camel case for all of your variable names. The second variable name is both not simple and inconsistent. Camel case would have you capitalize each new word, as in BrainBodyWeight. In this case, even the units are not capitalized the same. This is a recipe for frustration. Also see below, don’t put units in variable names.
It is recommended to make yourself a simple rule, like prefer all lowercase letters. Maybe that’s not the rule for you, but don’t get caught up on the rule. The rule itself doesn’t matter. Just be simple and consistently so.
My go-to rule is all lowercase letters, no numbers or special characters other than _, and to separate words only where joining them would put repeated letters next to each other, e.g. ee or ss, and otherwise don’t separate words. The separator I prefer is underscore _ instead of space, which is mostly a carry-over rule from programming in R. Remember, the rule matters less than consistency with the rule.
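As a rough illustration of applying one naming rule consistently, the sketch below converts spreadsheet headers to all lowercase with underscores between words. The function name simplify_name and the example headers are made up for this illustration, and the rule it applies (an underscore between every word) is just one possible choice, not the only reasonable one.

```python
import re

def simplify_name(name: str) -> str:
    """Turn a spreadsheet header into a lowercase, underscore-separated name."""
    # Insert an underscore at camel-case boundaries: "MilesPerGallon" -> "Miles_Per_Gallon"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace every run of characters other than letters and digits with one underscore
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    # Lowercase everything and drop underscores left at either end
    return name.lower().strip("_")

# Hypothetical headers as they might appear in a spreadsheet
headers = ["MilesPerGallon", "Brain Body Weight", "body weight"]
print([simplify_name(h) for h in headers])
# → ['miles_per_gallon', 'brain_body_weight', 'body_weight']
```

Running every header through the same function is one way to make sure the rule, whatever it is, gets applied the same way every time.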
Organize files within directories. When editing files, it is tempting to write metadata into the file name. For instance, it is unfortunately common for people to write file names such as draft_manuscript.docx, draft2_manuscript.docx, draft3_manuscript.docx, final_manuscript.docx, and final_final_manuscript.docx. File names are not intended to carry the metadata associated with draft versions.
If you really need to maintain copies of drafts, and most often you do not, then create directories such as draft and final. Each directory should contain a single copy of only the files that belong to that version. Any files, such as data, that are the same for all versions should have their own directory. It might help future you to put a separate notes file in each directory that reminds you of the exact purpose of the directory.
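The layout described above could be set up with a few lines of Python. This is only a sketch: the project name manuscript_project, the directory names, and the notes.txt file name are assumptions chosen for illustration, and the files are created in a temporary location so nothing real is touched.

```python
from pathlib import Path
import tempfile

# A throwaway location so the sketch does not touch real files
project = Path(tempfile.mkdtemp()) / "manuscript_project"

for name in ["draft", "final", "data"]:
    directory = project / name
    directory.mkdir(parents=True, exist_ok=True)
    # A short notes file reminds future you what the directory is for
    (directory / "notes.txt").write_text(f"Purpose of the {name} directory: ...\n")

# Both versions keep the same plain file name; the directory carries the version
(project / "draft" / "manuscript.docx").touch()
(project / "final" / "manuscript.docx").touch()

# List everything that was created, relative to the project directory
for path in sorted(project.rglob("*")):
    print(path.relative_to(project))
```

Note that the file is called manuscript.docx in both places. Whether it is a draft or the final version is recorded by the directory it lives in, not by its name.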
DON’Ts#
don’t start a variable name with a number
don’t use special characters in variable names
don’t put units in variable names
don’t use abbreviations
don’t organize through file names
don’t put dates in your file names
don’t have multiple copies of your data
Don’t start a variable name with a number. In most programming languages, you can’t start a variable name with a number. So it’s easiest to just avoid putting numbers in variable names altogether. Occasionally, it makes sense to use a number in a variable name. Just don’t start your variable name with a number.
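In Python specifically, the built-in string method isidentifier() reports whether a piece of text could be used as a variable name. The short check below is a sketch, and the candidate names in it are invented examples.

```python
# Hypothetical candidate variable names
candidates = ["2nd_trial", "trial_2", "weight", "body weight"]

# str.isidentifier() is True only for text Python accepts as a name
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'fine' if ok else 'not a valid Python name'}")
# '2nd_trial' and 'body weight' are rejected; 'trial_2' and 'weight' are fine
```

The name starting with a digit and the name containing a space both fail, while a digit later in the name is accepted.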
Don’t use special characters in variable names. This rule is much like the rule above. In my experience, special characters, e.g. ~!@#$%^&*()+=,<>/|\, only make remembering a variable name more difficult. The only special character that you should allow, when necessary, in your variable names is underscore _. See Use simple variable names above.
Don’t put units in variable names. Units in variable names just open the door for inconsistent variable names. It is easiest to just avoid putting units or other metadata into variable names. Your data should instead have a separate file of all the associated metadata.
Don’t use abbreviations. Abbreviated variable names are attractive because they save typing. For instance, one could imagine abbreviating micrograms as ug, mcg, or μg. This creates opportunity for misremembering and inconsistency. Such abbreviations in variable names also break the rule Don’t put units in variable names. Further, see Use simple variable names above. Instead, put such metadata in a separate file.
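One way to keep units and other metadata out of variable names is a small metadata file with one row per variable. The sketch below is an assumption about what such a file might look like: the column names variable, units, and description, and the example rows, are invented for illustration. An in-memory buffer stands in for a real file such as metadata.csv.

```python
import csv
import io

# Hypothetical metadata: one row per variable in the dataset
metadata = [
    {"variable": "brain_weight", "units": "grams", "description": "weight of the brain"},
    {"variable": "body_weight", "units": "kilograms", "description": "weight of the body"},
]

# An in-memory buffer stands in for a separate file such as metadata.csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["variable", "units", "description"])
writer.writeheader()
writer.writerows(metadata)
print(buffer.getvalue())

# Later, read the metadata back and look up units by variable name
reader = csv.DictReader(io.StringIO(buffer.getvalue()))
units = {row["variable"]: row["units"] for row in reader}
print(units["brain_weight"])
```

The variable names stay short and consistent, and the units are still written down in exactly one place.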
Don’t organize through file names. The only metadata a file name should contain is the name of the file. Instead, use directories to organize your files. See Organize files within directories above.
Don’t put dates in your file names. Dates are metadata; see Don’t organize through file names above.
Don’t have multiple copies of your data. Generally, you should only have one copy of your dataset. See Don’t put dates in your file names above. If there are necessary edits to your data for a specific analysis, then you should program those edits in Python code and save that code for future re-use. This way you can re-create data changes as necessary, and you minimize introducing permanent errors into your dataset.
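As a minimal sketch of programming edits instead of saving a second copy of the data, the example below reads the raw data and applies two analysis-specific edits in memory. The dataset contents, the column names, the function name load_for_analysis, and the particular edits are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical raw data; in practice this would be the single saved copy of the dataset
RAW_CSV = """animal,body_weight,brain_weight
cow,465,423
mouse,0.023,0.4
unknown,,
"""

def load_for_analysis(raw_text: str) -> list:
    """Read the raw data and apply analysis-specific edits in memory only."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Edit 1: skip rows with a missing measurement
        if not row["body_weight"] or not row["brain_weight"]:
            continue
        # Edit 2: convert the measurements from text to numbers
        rows.append({
            "animal": row["animal"],
            "body_weight": float(row["body_weight"]),
            "brain_weight": float(row["brain_weight"]),
        })
    return rows

rows = load_for_analysis(RAW_CSV)
print(rows)
```

The raw data is never modified. Re-running the function re-creates the edited version whenever it is needed, and a mistake in an edit can be fixed in the code rather than in the dataset.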
tidy data#
The most complete reference for the advice above, and more, is Hadley Wickham’s paper Tidy Data (pdf). The paper lays out a framework for tidying data so that subsequent analysis is easier.