Exploring and understanding data

Last updated on 2025-10-29

Overview

Questions

  • How can I do exploratory data analysis in Python?
  • How do I get help when I am stuck?
  • What impact does an object’s type have on what I can do with it?
  • How are expressions evaluated and values assigned to variables?

Objectives

  • Explore the structure and content of pandas dataframes
  • Convert data types and handle missing data
  • Interpret error messages and develop strategies to get help with Python
  • Trace how Python assigns values to objects

The pandas DataFrame


We just spent quite a bit of time learning how to create visualisations from the samples data, but we did not talk much about what samples is. Let’s first load the data again:

PYTHON

import pandas as pd

samples = pd.read_csv('../data/surveys_complete_77_89.csv', index_col='record_id') # use the record_id column as the row index

You may remember that we loaded the data into Python with the pandas.read_csv function. The output of read_csv is a data frame: a common way of representing tabular data in a programming language. To be precise, samples is an object of type DataFrame. In Python, pretty much everything you work with is an object of some type. The type function can be used to tell you the type of any object you pass to it.

PYTHON

type(samples)

OUTPUT

pandas.core.frame.DataFrame

This output tells us that the DataFrame object type is defined by pandas, i.e. it is a special type of object not included in the core functionality of Python.

Exploring data in a dataframe

We encountered the plot, head and tail methods in the previous episode. DataFrame objects carry many other methods, including some that are useful when exploring a dataset for the first time. Consider the output of describe:

PYTHON

samples.describe()

OUTPUT

              month           day          year       plot_id  hindfoot_length        weight
count  16878.000000  16878.000000  16878.000000  16878.000000     14145.000000  15186.000000
mean       6.382214     15.595805   1983.582119     11.471442        31.982114     53.216647
std        3.411215      8.428180      3.492428      6.865875        10.709841     44.265878
min        1.000000      1.000000   1977.000000      1.000000         6.000000      4.000000
25%        3.000000      9.000000   1981.000000      5.000000        21.000000     24.000000
50%        6.000000     15.000000   1983.000000     11.000000        35.000000     42.000000
75%        9.000000     23.000000   1987.000000     17.000000        37.000000     53.000000
max       12.000000     31.000000   1989.000000     24.000000        70.000000    278.000000

These summary statistics give an immediate impression of the distribution of the data. It is always worth performing an initial “sniff test” with these: if there are major issues with the data or its formatting, they may become apparent at this stage.

info provides an overview of the columns included in the dataframe:

PYTHON

samples.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
Index: 16878 entries, 1 to 16878
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   month            16878 non-null  int64
 1   day              16878 non-null  int64
 2   year             16878 non-null  int64
 3   plot_id          16878 non-null  int64
 4   species_id       16521 non-null  object
 5   sex              15578 non-null  object
 6   hindfoot_length  14145 non-null  float64
 7   weight           15186 non-null  float64
 8   genus            16521 non-null  object
 9   species          16521 non-null  object
 10  taxa             16521 non-null  object
 11  plot_type        16878 non-null  object
dtypes: float64(2), int64(4), object(6)
memory usage: 1.7+ MB

We get quite a bit of useful information here too. First, we are told that we have a DataFrame of 16878 entries, or rows, and 12 variables, or columns.

Next, we get a bit of information on each variable, including its column title, a count of the non-null values (that is, values that are not missing), and something called the dtype of the column.

Data types

The dtype property of a dataframe column describes the data type of the values stored in that column. There are three in the example above:

  • int64: this column contains integer (whole number) values.
  • object: this column contains string (non-numeric sequence of characters) values.
  • float64: this column contains “floating point” values i.e. numeric values containing a decimal point.

The 64 after int and float represents the level of precision with which the values in the column are stored in the computer’s memory. Other types with lower levels of precision are available for numeric values, e.g. int32 and float16, which will take up less memory on your system but limit the size and level of precision of the numbers they can store.
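
If you are curious about those limits, you can ask NumPy (the library that provides these dtypes to pandas, and is installed alongside it) to report them. A quick sketch:

PYTHON

import numpy as np

np.iinfo('int64').max  # 9223372036854775807, the largest value an int64 can hold
np.iinfo('int32').max  # 2147483647: int32 uses half the memory, but has a much smaller range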

The dtype of a column is important because it determines the kinds of operation that can be performed on the values in that column. Let’s work with a couple of the columns independently to demonstrate this.

The Series object

To work with a single column of a dataframe, we can refer to it by name in two different ways:

PYTHON

samples['species_id']

or

PYTHON

samples.species_id # this only works if there are no spaces in the column name (note the underscore used here)

OUTPUT

record_id
1        NL
2        NL
3        DM
4        DM
5        DM
         ..
16874    RM
16875    RM
16876    DM
16877    DM
16878    DM
Name: species_id, Length: 16878, dtype: object
Callout

Tip: use tab completion on column names

Tab completion, where you start typing the name of a variable, function, etc. before hitting Tab to auto-complete the rest, also works on the column names of a dataframe. Since tab completion saves time and reduces the chance of typos, we recommend using it as frequently as possible.

The result of that operation is a series of data: a one-dimensional sequence of values that all share the same dtype (object in this case). DataFrame objects are collections of series “glued together” with a shared index: the column of unique identifiers we associate with each row. record_id is the index of the series summarised above; the values carried by the series are NL, DM, RM, etc. (short species identification codes).
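
You can inspect that shared index directly; for our dataframe it shows the record_id values (1 to 16878):

PYTHON

samples.index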

If we choose a different column of the dataframe, we get another series with a different data type:

PYTHON

samples['weight']

OUTPUT

record_id
1         NaN
2         NaN
3         NaN
4         NaN
5         NaN
         ...
16874    15.0
16875     9.0
16876    31.0
16877    50.0
16878    42.0
Name: weight, Length: 16878, dtype: float64

The data type of the series influences the things that can be done with/to it. For example, sorting works differently for these two series: the numeric values in the weight series are sorted from smallest to largest, while the character strings in species_id are sorted alphabetically (with missing values placed at the end in both cases):

PYTHON

samples['weight'].sort_values()

OUTPUT

record_id
9790     4.0
5346     4.0
4052     4.0
9853     4.0
7084     4.0
        ...
16772    NaN
16777    NaN
16808    NaN
16846    NaN
16860    NaN
Name: weight, Length: 16878, dtype: float64

PYTHON

samples['species_id'].sort_values()

OUTPUT

record_id
12345     AB
9861      AB
10970     AB
10963     AB
5759      AB
        ...
16453    NaN
16454    NaN
16488    NaN
16489    NaN
16539    NaN
Name: species_id, Length: 16878, dtype: object

This pattern of behaviour, where the type of an object determines what can be done with it and influences how it is done, is a defining characteristic of Python. As you gain more experience with the language, you will become more familiar with this way of working with data. For now, as you begin on your learning journey with the language, we recommend using the type function frequently to make sure that you know what kind of data/object you are working with, and do not be afraid to ask for help whenever you are unsure or encounter a problem.
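
For example, checking a single column confirms that it is a different type of object from the dataframe it came from:

PYTHON

type(samples['weight'])

OUTPUT

pandas.core.series.Series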

Aside: Getting Help


You may have already encountered several errors while following the lesson and this is a good time to take a step back and discuss good strategies to get help when something goes wrong.

The built-in help function

Use help to view documentation for an object or function. For example, if you want to see documentation for the round function:

PYTHON

help(round)

OUTPUT

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.

    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.

If you are working in Jupyter (Notebook or Lab), the platform offers some additional ways to see documentation/get help:

  • Option 1: Type the function name in a cell with a question mark after it, e.g. round?. Then run the cell.
  • Option 2: (Not available on all systems) Place the cursor near where the function is invoked in a cell (i.e., the function name or its parameters),
    • Hold down Shift, and press Tab.
    • Do this several times to expand the information returned.

Understanding error messages

The error messages returned when something goes wrong can be (very) long but contain information about the problem, which can be very useful once you know how to interpret it. For example, you might receive a SyntaxError if you mistyped a line and the resulting code was invalid:

PYTHON

# Forgot to close the quote marks around the string.
name = 'Feng

ERROR

  Cell In[129], line 1
    name = 'Feng
           ^
SyntaxError: unterminated string literal (detected at line 1)

There are three parts to this error message:

ERROR

  Cell In[129], line 1

This tells us where the error occurred. That is of limited help in Jupyter, since we know the error is in the cell we just ran (Cell In[129]), although the line number can be helpful when the cell is quite long. When running a larger program written in Python, perhaps built up from multiple individual scripts, this information becomes much more useful, e.g.

ERROR

  data_visualisation.py, line 42

Next, we see a copy of the line where the error was encountered, often annotated with an arrow pointing out exactly where Python thinks the problem is:

ERROR

    name = 'Feng
           ^

Python is not exactly right in this case: from context you might be able to guess that the issue is really the lack of a closing quotation mark at the end of the line. But an arrow pointing to the opening quotation mark can give us a push in the right direction. Sometimes Python gets these annotations exactly right. Occasionally, it gets them completely wrong. In the vast majority of cases they are at least somewhat helpful.

Finally, we get the error message itself:

ERROR

SyntaxError: unterminated string literal (detected at line 1)

This always begins with a statement of the type of error encountered: in this case, a SyntaxError. That provides a broad categorisation for what went wrong. The rest of the message is a description of exactly what the problem was from Python’s perspective. Error messages can be loaded with jargon and quite difficult to understand when you are first starting out. In this example, unterminated string literal is a technical way of saying “you opened some quotes, which I think means you were trying to define a string value, but the quotes did not get closed before the end of the line.”

It is normal not to understand exactly what these error messages mean the first time you encounter them. Since programming involves making lots of mistakes (for everyone!), you will start to become familiar with many of them over time. As you continue learning, we recommend that you ask others for help: more experienced programmers have made all of these mistakes before you and will probably be better at spotting what has gone wrong. (More on asking for help below.)

Error output can get really long!

Especially when using functions from libraries you have imported into your program, the middle part of the error message (the traceback) can get rather long. For example, what happens if we try to access a column that does not exist in our dataframe?

PYTHON

samples['wegiht'] # misspelling the 'weight' column name

ERROR

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'wegiht'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[131], line 1
----> 1 samples['wegiht']

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/frame.py:4107, in DataFrame.__getitem__(self, key)
   4105 if self.columns.nlevels > 1:
   4106     return self._getitem_multilevel(key)
-> 4107 indexer = self.columns.get_loc(key)
   4108 if is_integer(indexer):
   4109     indexer = [indexer]

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'wegiht'

(This is still relatively short compared to some error messages we have seen!)

When you encounter a long error like this one, do not panic! Our advice is to focus on the first couple of lines and the last couple of lines. Everything in the middle (as the name traceback suggests) is retracing steps through the program, identifying where problems were encountered along the way. That information is only really useful to somebody interested in the inner workings of the pandas library, which is well beyond the scope of this lesson! If we ignore everything in the middle, the parts of the error message we want to focus on are:

ERROR

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

[... skipping these parts ...]

KeyError: 'wegiht'

This tells us that the problem is the “key”: the value we used to lookup the column in the dataframe. Hopefully, the repetition of the value we provided would be enough to help us realise our mistake.

Other ways to get help

There are several other ways that people often get help when they are stuck with their Python code.

  • Search the internet: paste the last line of your error message, or the word “python” and a short description of what you want to do, into your favourite search engine and you will usually find several examples where other people have encountered the same problem and come looking for help.
    • StackOverflow can be particularly helpful for this: answers to questions are presented as a ranked thread ordered according to how useful other users found them to be.
    • Take care: copying and pasting code written by somebody else is risky unless you understand exactly what it is doing!
  • Ask somebody “in the real world”. If you have a colleague or friend with more expertise in Python than you, show them the problem you are having and ask them for help.
  • Sometimes, the act of articulating your question can help you to identify what is going wrong. This is known as “rubber duck debugging” among programmers.

Generative AI

It is increasingly common for people to use generative AI chatbots such as ChatGPT to get help while coding. You will probably receive some useful guidance by presenting your error message to the chatbot and asking it what went wrong. However, the way a chatbot provides this help is different. Answers on StackOverflow have (probably) been given by a human as a direct response to the question asked. But generative AI chatbots, which are based on an advanced statistical model, respond by generating the most likely sequence of text that would follow the prompt they are given.

While responses from generative AI tools can often be helpful, they are not always reliable. These tools sometimes generate plausible but incorrect or misleading information, so (just as with an answer found on the internet) it is essential to verify their accuracy. You need the knowledge and skills to understand these responses, to judge whether or not they are accurate, and to fix any errors in the code they offer you.

In addition to asking for help, programmers can use generative AI tools to generate code from scratch; extend, improve and reorganise existing code; translate code between programming languages; figure out what terms to use in a search of the internet; and more. However, there are drawbacks that you should be aware of.

The models used by these tools have been “trained” on very large volumes of data, much of it taken from the internet, and the responses they produce reflect that training data and may recapitulate its inaccuracies or biases. The environmental costs (energy and water use) of LLMs are much higher than those of other technologies, both during development (known as training) and during use (known as inference). For more information see the AI Environmental Impact Primer developed by researchers at HuggingFace, an AI hosting platform. Concerns also exist about how the data for this training was obtained, with questions raised about whether the people developing the LLMs had permission to use it. Other ethical concerns have also been raised, such as reports that workers were exploited during the training process.

We recommend that you avoid getting help from generative AI during the workshop for several reasons:

  1. For most problems you will encounter at this stage, help and answers can be found among the first results returned by searching the internet.
  2. The foundational knowledge and skills you will learn in this lesson by writing and fixing your own programs are essential to be able to evaluate the correctness and safety of any code you receive from online help or a generative AI chatbot. If you choose to use these tools in the future, the expertise you gain from learning and practising these fundamentals on your own will help you use them more effectively.
  3. As you start out with programming, the mistakes you make will be the kinds that have also been made – and overcome! – by everybody else who learned to program before you. Since these mistakes and the questions you are likely to have at this stage are common, they are also better represented than other, more specialised problems and tasks in the data that was used to train generative AI tools. This means that a generative AI chatbot is more likely to produce accurate responses to questions that novices ask, which could give you a false impression of how reliable they will be when you are ready to do things that are more advanced.

Data input within Python


Although it is more common (and faster) to enter data elsewhere, e.g. in a spreadsheet, and read it in, Series and DataFrame objects can be created directly within Python. Before we can make a new Series, we need to learn about another type of data in Python: the list.

Lists

Lists are one of the standard data structures built into Python. A data structure is an object that contains more than one piece of information. (DataFrames and Series are also data structures.) The list is designed to contain multiple values in an ordered sequence: they are a great choice if you want to build up and modify a collection of values over time and/or handle each of those values one at a time. We can create a new list in Python by capturing the values we want it to store inside square brackets []:

PYTHON

years_list = [2020, 2025, 2010]
years_list

OUTPUT

[2020, 2025, 2010]

New values can be added to the end of a list with the append method:

PYTHON

years_list.append(2015)
years_list

OUTPUT

[2020, 2025, 2010, 2015]
Challenge

Exploring list methods

The append method allows us to add a value to the end of a list but how could we insert a new value into a given position instead? Applying what you have learned about how to find out the methods that an object has, can you figure out how to place the value 2019 into the third position in years_list (shifting the values after it up one more position)? Recall that the indexing used to specify positions in a sequence begins at 0 in Python.

Using tab completion, the help function, or looking up the documentation online, we can discover the insert method and learn how it works. insert takes two arguments: the position for the new list entry and the value to be placed in that position:

PYTHON

years_list.insert(2, 2019)
years_list

OUTPUT

[2020, 2025, 2019, 2010, 2015]

Among many other methods is sort, which can be used to sort the values in the list:

PYTHON

years_list.sort()
years_list

OUTPUT

[2010, 2015, 2019, 2020, 2025]

The easiest way to create a new Series is from a list:

PYTHON

years_series = pd.Series(years_list)
years_series

OUTPUT

0    2010
1    2015
2    2019
3    2020
4    2025
dtype: int64

With the data in a Series, we can no longer do some of the things we were able to do with the list, such as adding new values. But we do gain access to some new possibilities, which can be very helpful. For example, if we wanted to increase all of the values by 1000, this would be easy with a Series but more complicated with a list:

PYTHON

years_series + 1000

OUTPUT

0    3010
1    3015
2    3019
3    3020
4    3025
dtype: int64

PYTHON

years_list + 1000

ERROR

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[126], line 1
----> 1 years_list + 1000

TypeError: can only concatenate list (not "int") to list

This illustrates an important principle of Python: different data structures are suitable for different “modes” of working with data. It can be helpful to work with a list when building up an initial set of data from scratch, but when you are ready to begin operating on that dataset as a whole (performing calculations with it, visualising it, etc.), you will be rewarded for switching to a more specialised data type like a Series or DataFrame from pandas.
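
For comparison, here is one way to achieve the same adjustment with a plain list, using a construct called a list comprehension (not covered in detail in this lesson):

PYTHON

[year + 1000 for year in years_list]

OUTPUT

[3010, 3015, 3019, 3020, 3025]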

Unexpected data types


Operations like the addition of 1000 we performed on years_series work because pandas knows how to add a number to the integer values in the series. That behaviour is determined by the dtype of the series, which makes the dtype really important to how you can work with your data. Let’s explore how the dtype is chosen. Returning to the years_series object we created above:

PYTHON

years_series

OUTPUT

0    2010
1    2015
2    2019
3    2020
4    2025
dtype: int64

The dtype: int64 was determined automatically based on the values passed in. But what if the values provided are of several different types?

PYTHON

ages_series = pd.Series([2, 3, 5.5, 6, 8])
ages_series

OUTPUT

0    2.0
1    3.0
2    5.5
3    6.0
4    8.0
dtype: float64

Pandas assigns a dtype that allows it to account for all of the values it is given, converting some values to another dtype if needed, in a process called coercion. In the case above, all of the integer values were coerced to floating point numbers to account for the 5.5.

Challenge

Exercise: Coercion

Can you guess the dtype of each series created below? Run the code to check whether you were right.

PYTHON

int_str = pd.Series([1, 'two', 3])
str_flt = pd.Series(['four', 5.0, 'six'])

PYTHON

int_str

OUTPUT

0      1
1    two
2      3
dtype: object

PYTHON

str_flt

OUTPUT

0   four
1    5.0
2    six
dtype: object

In both cases, pandas falls back to the object dtype, which can accommodate values of mixed types. When automatically coercing values between types like this, pandas aims to minimise the amount of information lost.

In practice, it is much more common to read data from elsewhere (e.g. with read_csv) than to enter it manually within Python. When reading data from a file, pandas tries to guess the appropriate dtype to assign to each column (series) of the dataframe. This is usually very helpful but the process is sensitive to inconsistencies and data entry errors in the input: a stray character in one cell can cause an entire column to be coerced to a different dtype than you might have wanted.

For example, if the raw data includes a stray symbol introduced by a typo (= instead of -):

name,latitude,longitude
Superior,47.7,-87.5
Victoria,-1.0,33.0
Tanganyika,=6.0,29.3

We see a non-numeric dtype for the latitude column (object) when we load the data into a dataframe.

PYTHON

lakes = pd.read_csv('../data/lakes.csv')
print(lakes)
lakes.info()

OUTPUT

         name latitude  longitude
0    Superior     47.7      -87.5
1    Victoria     -1.0       33.0
2  Tanganyika     =6.0       29.3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   name       3 non-null      object
 1   latitude   3 non-null      object
 2   longitude  3 non-null      float64
dtypes: float64(1), object(2)
memory usage: 200.0+ bytes

It is a good idea to run the info method on a new dataframe after you have loaded data for the first time: if one or more of the columns has a different dtype than you expected, this may be a signal that you need to clean up the raw data.
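
One way to track down the offending values is with pandas’ to_numeric function. A sketch: ask it to coerce anything non-numeric to NaN, then show the rows where that happened.

PYTHON

# values that fail to parse as numbers become NaN...
coerced = pd.to_numeric(lakes['latitude'], errors='coerce')
# ...so rows that were not missing before, but are NaN now, contain the bad entries
lakes[coerced.isna() & lakes['latitude'].notna()]

OUTPUT

         name latitude  longitude
2  Tanganyika     =6.0       29.3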

Recasting

A column can be manually coerced (or recast) into a different dtype, provided that pandas knows how to handle that conversion. For example, the integer values in the plot_id column of our dataframe can be converted to floating point numbers:

PYTHON

samples['plot_id'] = samples['plot_id'].astype('float')
samples['plot_id']

OUTPUT

record_id
1         2.0
2         3.0
3         2.0
4         7.0
5         3.0
         ...
16874    16.0
16875     5.0
16876     4.0
16877    11.0
16878     8.0
Name: plot_id, Length: 16878, dtype: float64

But the string values of species_id cannot be converted to numeric data:

PYTHON

samples.species_id = samples.species_id.astype('int64')

ERROR

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[101], line 1
----> 1 samples.species_id = samples.species_id.astype('int64')

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/generic.py:6662, in NDFrame.astype(self, dtype, copy, errors)
   6656     results = [
   6657         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6658     ]
   6660 else:
   6661     # else, only a single dtype is given
-> 6662     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6663     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6664     return res.__finalize__(self, method="astype")

[... a lot more lines of traceback ...]

File ~/miniforge3/envs/carpentries/lib/python3.11/site-packages/pandas/core/dtypes/astype.py:133, in _astype_nansafe(arr, dtype, copy, skipna)
    129     raise ValueError(msg)
    131 if copy or arr.dtype == object or dtype == object:
    132     # Explicit copy, or required since NumPy can't view from / to object.
--> 133     return arr.astype(dtype, copy=True)
    135 return arr.astype(dtype, copy=copy)

ValueError: invalid literal for int() with base 10: 'NL'
Challenge

Changing Types

  1. Convert the values in the column plot_id back to integers.
  2. Now try converting weight to an integer. What goes wrong here? What is pandas telling you? We will talk about some solutions to this later.

PYTHON

samples['plot_id'].astype('int')

OUTPUT

record_id
1         2
2         3
3         2
4         7
5         3
         ..
16874    16
16875     5
16876     4
16877    11
16878     8
Name: plot_id, Length: 16878, dtype: int64

PYTHON

samples['weight'].astype('int')

ERROR

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

Pandas cannot convert types from float to int if the column contains missing values.
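
If you need integer values but also need to keep the missing data, pandas provides a nullable integer dtype, Int64 (note the capital I), that can hold both. A sketch, assuming the non-missing weights are all whole numbers:

PYTHON

samples['weight'].astype('Int64')  # NaN values become <NA>; the rest become integers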

Missing Data

In addition to data entry errors, it is common to encounter missing values in large volumes of data. A value may be missing because it was not possible to make a complete observation, because data was lost, or for any number of other reasons. It is important to consider missing values while processing data because they can influence downstream analysis – that is, data analysis that will be done later – in unwanted ways when not handled correctly.

Depending on the dtype of the column/series, missing values may appear as NaN (“Not a Number”), NA, <NA>, or NaT (“Not a Time”). You may have noticed some during our initial exploration of the dataframe. (Note the NaN values in the first five rows of the weight column below.)

PYTHON

samples.head()

OUTPUT

           month  day  year  plot_id  species_id  sex  hindfoot_length  weight      genus   species    taxa                 plot_type
record_id
1              7   16  1977        2          NL    M             32.0     NaN    Neotoma  albigula  Rodent                   Control
2              7   16  1977        3          NL    M             33.0     NaN    Neotoma  albigula  Rodent  Long-term Krat Exclosure
3              7   16  1977        2          DM    F             37.0     NaN  Dipodomys  merriami  Rodent                   Control
4              7   16  1977        7          DM    M             36.0     NaN  Dipodomys  merriami  Rodent          Rodent Exclosure
5              7   16  1977        3          DM    M             35.0     NaN  Dipodomys  merriami  Rodent  Long-term Krat Exclosure

The output of the info method includes a count of the non-null values – that is, the values that are not missing – for each column:

PYTHON

samples.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
Index: 16878 entries, 1 to 16878
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   month            16878 non-null  int64
 1   day              16878 non-null  int64
 2   year             16878 non-null  int64
 3   plot_id          16878 non-null  int64
 4   species_id       16521 non-null  object
 5   sex              15578 non-null  object
 6   hindfoot_length  14145 non-null  float64
 7   weight           15186 non-null  float64
 8   genus            16521 non-null  object
 9   species          16521 non-null  object
 10  taxa             16521 non-null  object
 11  plot_type        16878 non-null  object
dtypes: float64(2), int64(4), object(6)
memory usage: 1.7+ MB

From this output we can tell that almost 1700 weight measurements and more than 2700 hindfoot length measurements are missing. Many of the other columns also have missing values.
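
We can also ask for these missing-value counts directly, by combining the isna method (which flags each missing value as True) with sum:

PYTHON

samples.isna().sum()

OUTPUT

month                 0
day                   0
year                  0
plot_id               0
species_id          357
sex                1300
hindfoot_length    2733
weight             1692
genus               357
species             357
taxa                357
plot_type             0
dtype: int64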

The output above demonstrates that pandas can distinguish these NaN values from the actual data, and indeed they will be ignored for some tasks, such as the calculation of the summary statistics provided by describe.

PYTHON

samples.describe()

OUTPUT

              month           day          year       plot_id  hindfoot_length        weight
count  16878.000000  16878.000000  16878.000000  16878.000000     14145.000000  15186.000000
mean       6.382214     15.595805   1983.582119     11.471442        31.982114     53.216647
std        3.411215      8.428180      3.492428      6.865875        10.709841     44.265878
min        1.000000      1.000000   1977.000000      1.000000         6.000000      4.000000
25%        3.000000      9.000000   1981.000000      5.000000        21.000000     24.000000
50%        6.000000     15.000000   1983.000000     11.000000        35.000000     42.000000
75%        9.000000     23.000000   1987.000000     17.000000        37.000000     53.000000
max       12.000000     31.000000   1989.000000     24.000000        70.000000    278.000000

In some circumstances, like the recasting we attempted in the previous exercise, the missing values can cause trouble. It is up to us to decide how best to handle those missing values. We could remove the rows containing missing data, accepting the loss of all data for that observation:

PYTHON

samples.dropna().head()

OUTPUT

           month  day  year  plot_id species_id sex  hindfoot_length  weight      genus   species    taxa                 plot_type
record_id
63             8   19  1977        3         DM   M             35.0    40.0  Dipodomys  merriami  Rodent  Long-term Krat Exclosure
64             8   19  1977        7         DM   M             37.0    48.0  Dipodomys  merriami  Rodent          Rodent Exclosure
65             8   19  1977        4         DM   F             34.0    29.0  Dipodomys  merriami  Rodent                   Control
66             8   19  1977        4         DM   F             35.0    46.0  Dipodomys  merriami  Rodent                   Control
67             8   19  1977        7         DM   M             35.0    36.0  Dipodomys  merriami  Rodent          Rodent Exclosure

But we should take note that this removes more than 3000 rows from the dataframe!

PYTHON

len(samples)

OUTPUT

16878

PYTHON

len(samples.dropna())

OUTPUT

13773

Instead, we could fill all of the missing values with something else. For example, let’s make a copy of the samples dataframe then populate the missing values in the weight column of that copy with the mean of all the non-missing weights. There are a few parts to that operation, which are tackled one at a time below.

PYTHON

mean_weight = samples['weight'].mean() # the 'mean' method calculates the mean of the non-null values in the column
df1 = samples.copy() # making a copy to work with so that we do not edit our original data
df1['weight'] = df1['weight'].fillna(mean_weight) # the 'fillna' method fills all missing values with the provided value
df1.head()

OUTPUT

           month  day  year  plot_id species_id sex  hindfoot_length     weight      genus   species    taxa                 plot_type
record_id
1              7   16  1977        2         NL   M             32.0  53.216647    Neotoma  albigula  Rodent                   Control
2              7   16  1977        3         NL   M             33.0  53.216647    Neotoma  albigula  Rodent  Long-term Krat Exclosure
3              7   16  1977        2         DM   F             37.0  53.216647  Dipodomys  merriami  Rodent                   Control
4              7   16  1977        7         DM   M             36.0  53.216647  Dipodomys  merriami  Rodent          Rodent Exclosure
5              7   16  1977        3         DM   M             35.0  53.216647  Dipodomys  merriami  Rodent  Long-term Krat Exclosure

The choice to fill in these missing values rather than removing the rows that contain them can have implications for the result of your analysis. It is important to consider your approach carefully. Think about how the data will be used and how these values will impact the scientific conclusions made from the analysis. pandas gives us all of the tools that we need to account for these issues. But we need to be cautious about how the decisions that we make impact scientific results.

Assignment, evaluation, and mutability


Stepping away from dataframes for a moment, the time has come to explore the behaviour of Python a little more.

Challenge

Exercise: variable assignments

What is the value of y after running the following lines?

PYTHON

x = 2
y = x*3
x = 10

PYTHON

x = 2
y = x*3
x = 10
y

OUTPUT

6

Understanding what’s going on here will help you avoid a lot of confusion when working in Python. When we assign something to a variable, the first thing that happens is that the right-hand side gets evaluated. So when we first ran the line y = x*3, x*3 was evaluated to the value 6, and this was assigned to y. The variables x and y are independent objects, so when we change the value of x to 10, y is unaffected. This behaviour may be different from what you are used to, e.g. from working with data in spreadsheets, where cells can be linked such that modifying the value in one cell triggers changes in others.

Multiple evaluations can take place in a single line of Python code and learning to trace the order and impact of these evaluations is a key skill.

PYTHON

x = 10
y = 5
z = 3-(x/y)
z

OUTPUT

1.0

In the example above, x/y is evaluated first, before the result is subtracted from 3 and the final calculated value is assigned to z. (The brackets () are not needed in the calculation above but are included to make the order of evaluation clearer.) Python makes each evaluation as it needs to in order to proceed with the next, before assigning the final result to the variable on the left-hand side of the = operator.

This means that we could have filled the missing values in the weight column of our dataframe copy in a single line:

PYTHON

df1['weight'] = df1['weight'].fillna(df1['weight'].mean())

First, the mean weight is calculated (df1['weight'].mean() is evaluated). Then the result of that evaluation is passed into fillna and the result of the filling operation (df1['weight'].fillna(<RESULT OF PREVIOUS>)) is assigned to df1['weight'].

Variable naming

You are going to name a lot of variables in Python! There are some rules you have to stick to when doing so, as well as recommendations that will make your life easier.

  • Make names clear without being too long
    • wkg is probably too short.
    • weight_in_kilograms is probably too long.
    • weight_kg is good.
  • Names cannot begin with a number.
  • Names cannot contain spaces; use underscores instead.
  • Names are case sensitive: weight_kg is a different name from Weight_kg. Avoid uppercase characters at the beginning of variable names.
  • Names cannot contain most non-letter characters: +&-/*. etc.
  • Two common formats of variable name are snake_case and camelCase. A third “case” of naming convention, kebab-case, is not allowed in Python (see the rule above).
  • Aim to be consistent in how you name things within your projects. It is easier to follow an established style guide, such as Google’s, than to come up with your own.
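
A few of these rules and recommendations in action (a minimal sketch; the commented-out lines would cause a SyntaxError if run):

PYTHON

weight_kg = 2.3      # good: short, descriptive, snake_case
weightKg = 2.3       # also allowed (camelCase); pick one convention and be consistent
# 2nd_weight = 2.3   # not allowed: names cannot begin with a number
# weight-kg = 2.3    # not allowed: '-' cannot appear in a name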
Challenge

Exercise: good variable names

Identify at least one good variable name and at least one variable name that could be improved in this episode. Refer to the rules and recommendations listed above to suggest how these variable names could be better.

mean_weight and samples are examples of reasonably good variable names: they are relatively short yet descriptive.

df2 is not descriptive enough and could be potentially confusing if encountered by somebody else/ourselves in a few weeks’ time. The name could be improved by making it more descriptive, e.g. samples_duplicate.

Mutability

Why did we need to use the copy method to duplicate the dataframe above if variables are not linked to each other? Why not assign a new variable with the value of the existing dataframe object?

PYTHON

df2 = samples

This gets to mutability: a feature of Python that has caused headaches for many novices in the past! In the interests of memory management, Python avoids making copies of objects unless it has to. Some types of objects are immutable, meaning that their value cannot be modified once set. Immutable object types we have already encountered include strings, integers, and floats. If we want to adjust the value of an integer variable, we must explicitly overwrite it.

Other types of object are mutable, meaning that their value can be changed “in-place” without needing to be explicitly overwritten. This includes lists and pandas DataFrame objects, which can be reordered etc. after they are created.

When a new variable is assigned the value of an existing immutable object, Python duplicates the value and assigns it to the new variable.

PYTHON

a = 3.5 # new float object, called "a"
b = a   # another new float object, called "b", which also has the value 3.5

When a new variable is assigned the value of an existing mutable object, Python makes a new “pointer” towards the value of the existing object instead of duplicating it.

PYTHON

some_species = ['NL', 'DM', 'PF', 'PE', 'DS'] # new list object, called "some_species"
some_more_species = some_species # another name for the same list object

This can have unintended consequences and lead to much confusion!

PYTHON

some_more_species[2] = 'CV'
some_species

OUTPUT

['NL', 'DM', 'CV', 'PE', 'DS']

As you can see, the 'PF' value was replaced by 'CV' under both names, even though we only made the change through some_more_species: both variables point to the same list object.
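
Lists have a copy method that produces a true duplicate rather than a second pointer, avoiding this problem:

PYTHON

some_species = ['NL', 'DM', 'PF', 'PE', 'DS']
independent_species = some_species.copy() # a real duplicate of the list
independent_species[2] = 'CV'
some_species # unchanged

OUTPUT

['NL', 'DM', 'PF', 'PE', 'DS']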

This takes practice and time to get used to. The key thing to remember is that you should use the copy method to make a copy of your dataframes to avoid accidentally modifying the data in the original.

PYTHON

df2 = samples.copy()

Groups in Pandas


We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.

We can calculate basic statistics for all records in a single column using the syntax below:

PYTHON

samples['weight'].describe()

OUTPUT

count    15186.000000
mean        53.216647
std         44.265878
min          4.000000
25%         24.000000
50%         42.000000
75%         53.000000
max        278.000000
Name: weight, dtype: float64

We can also extract one specific metric if we wish:

PYTHON

samples['weight'].min()
samples['weight'].max()
samples['weight'].mean()
samples['weight'].std()
samples['weight'].count()

But if we want to summarize by one or more variables, for example sex, we can use pandas’ .groupby method. Once we’ve created a GroupBy object, we can quickly calculate summary statistics for each group of our choice.

PYTHON

# Group data by sex
grouped_data = samples.groupby('sex')

The describe method will return descriptive stats including mean, median, max, min, std, and count for a particular column in the data. Note that describe will only return summary values for columns containing numeric data.

PYTHON

# Summary statistics for all numeric columns by sex
grouped_data.describe()
# Provide the mean for each numeric column by sex
grouped_data.mean(numeric_only=True)

OUTPUT

        month        day         year    plot_id  hindfoot_length     weight
sex
F    6.475266  15.411998  1983.520361  11.418147         31.83258  53.114471
M    6.306295  15.763317  1983.679177  11.248305         32.13352  53.164464

The groupby command is powerful in that it allows us to quickly generate summary stats.

Challenge

Challenge - Summary Data

  1. How many recorded individuals are female (F) and how many are male (M)?
  2. What happens when you group by two columns using the following syntax and then calculate mean values?
  • grouped_data2 = samples.groupby(['plot_id', 'sex'])
  • grouped_data2.mean(numeric_only=True)
  3. Summarize weight values for each site in your data. HINT: you can create summary statistics for a single column of grouped data with syntax like by_site['weight'].describe(), where by_site = samples.groupby('plot_id').

  1. The count column of the output from grouped_data.describe() tells us how many records there are for female and for male individuals.
    • Note that these two counts do not sum to 16878, the total number of rows we know to be in the samples DataFrame. Why do you think some records were excluded from the grouping?
  2. Calling the mean() method on data grouped by these two columns calculates and returns the mean value for each combination of plot and sex.
    • Note that the mean is not meaningful for some variables, e.g. day, month, and year. You can specify particular columns and particular summary statistics using the agg() method (short for aggregate), e.g. to obtain the last survey year, median hindfoot length, and mean weight for each plot/sex combination:

PYTHON

samples.groupby(['plot_id', 'sex']).agg({"year": 'max',
                                         "hindfoot_length": 'median',
                                         "weight": 'mean'})

  3. samples.groupby(['plot_id'])['weight'].describe()

OUTPUT

          count       mean        std  min   25%   50%     75%    max
plot_id
1         909.0  65.974697  45.807013  4.0  39.0  46.0   99.00  223.0
2         962.0  59.911642  50.234865  5.0  31.0  44.0   55.00  278.0
3         641.0  38.338534  50.623079  4.0  11.0  20.0   34.00  250.0
4         828.0  62.647343  41.208190  4.0  37.0  45.0  102.00  200.0
5         788.0  47.864213  36.739691  5.0  28.0  42.0   50.00  248.0
6         686.0  49.180758  36.620356  5.0  25.0  42.0   52.00  243.0
7         257.0  25.101167  31.649778  4.0  11.0  19.0   24.00  235.0
8         736.0  64.593750  43.420011  5.0  39.0  48.0  102.25  178.0
9         893.0  65.346025  41.928699  6.0  40.0  48.0   99.00  275.0
10        159.0  21.188679  25.744403  4.0  10.0  12.0   24.50  237.0
11        905.0  50.260773  37.034074  5.0  29.0  42.0   49.00  212.0
12       1086.0  55.978821  45.675559  7.0  31.0  44.0   53.00  264.0
13        647.0  56.316847  42.464628  5.0  30.5  44.0   54.00  241.0
14        798.0  52.909774  33.993126  5.0  38.0  45.0   51.00  216.0
15        357.0  35.011204  47.396960  4.0  10.0  19.0   33.00  259.0
16        232.0  26.185345  22.040403  4.0  11.0  20.0   40.00  158.0
17        788.0  59.426396  44.751988  4.0  30.0  45.0   61.25  216.0
18        690.0  56.000000  44.368296  5.0  29.0  42.0   53.00  256.0
19        369.0  19.059621  15.320905  4.0   9.0  15.0   23.00  139.0
20        662.0  65.531722  53.234713  6.0  30.0  44.0  110.75  223.0
21        342.0  24.964912  32.230001  4.0   8.0  12.0   27.00  190.0
22        611.0  70.163666  45.955603  5.0  37.0  48.0  118.00  212.0
23        209.0  21.502392  19.647158  4.0  10.0  16.0   24.00  131.0
24        631.0  50.123613  47.017531  4.0  23.0  40.0   47.00  251.0

Quickly Creating Summary Counts in Pandas

Let’s next count the number of samples for each species. We can do this in a few ways, but we’ll use groupby combined with a count() method.

PYTHON

# Count the number of samples by species.
# record_id is our index, so reset_index first turns it back into a column we can count.
species_counts = samples.reset_index().groupby('species_id')['record_id'].count()
print(species_counts)

Quick & Easy Plotting Data Using Pandas


We can plot our summary stats using Pandas, too.

PYTHON

# Create a quick bar chart
species_counts.plot(kind='bar')
[Figure: bar chart of the number of samples recorded for each species]

We can also look at how many animals were captured in each site:

PYTHON

total_count = samples.reset_index().groupby('plot_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar')
Challenge

Challenge - Plots

  1. Create a plot of average weight across all species per site.
  2. Create a plot of total males versus total females for the entire dataset.

  1. samples.groupby('plot_id')['weight'].mean().plot(kind='bar')
  [Figure: average weight across all species for each plot]
  2. samples.reset_index().groupby('sex')['record_id'].count().plot(kind='bar')
  [Figure: total males versus total females for the entire dataset]
Key Points
  • pandas DataFrames carry many methods that can help you explore the properties and distribution of data.
  • Using the help function, reading error messages, and asking for help are all good strategies when things go wrong.
  • The type of an object determines what kinds of operations you can perform on and with it.
  • Python evaluates expressions in a line one by one before assigning the final result to a variable.