Step Testing with pytest
Cauldron includes a cauldron.steptest.create_test_fixture function that allows steps to be "unit" tested using standard Python testing methods in pytest. This function wraps the functionality needed for automatically setting up and tearing down the Cauldron project state before and after each test.
A Simple Example
The code for this example can be found in the Cauldron Gallery at:
This example is highly simplified to emphasize the key concepts of step testing. We start with a notebook containing two steps.
example-notebook/
    cauldron.json
    S01-Load-Data.py
    S02-Create-Total.py
The first step loads data from a CSV file into a pandas DataFrame:
STEP 1: S01-Load-Data.py
    import cauldron as cd
    import pandas as pd

    # Load the CSV data into a DataFrame.
    df: pd.DataFrame = pd.read_csv('data.csv')

    # Show the DataFrame in the display.
    cd.display.table(df)

    # Store the DataFrame in shared variables for other steps
    # to access.
    cd.shared.df = df
The second step adds a new "total" column to that DataFrame, computed as the sum of two existing columns:
STEP 2: S02-Create-Total.py
    import cauldron as cd
    import pandas as pd

    # Retrieve the stored data DataFrame.
    df: pd.DataFrame = cd.shared.df
    df['total'] = df['part_one'] + df['part_two']

    # Show the DataFrame in the display.
    cd.display.table(df)

    # Share the updated data frame.
    cd.shared.df = df
If data.csv contains missing values in either the 'part_one' or 'part_two' column, we will end up with a NaN value in the new 'total' column. That's not the behavior that we want. Instead, any NaN value in the 'part_one' or 'part_two' columns should be treated as zero during the summation of the total value.
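To make the problem concrete, the following minimal sketch reproduces the NaN propagation outside of Cauldron. The sample values are hypothetical stand-ins for the contents of data.csv, which is not shown here; only the column names come from the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv (assumed values).
df = pd.DataFrame({
    'part_one': [1.0, np.nan, 3.0],
    'part_two': [10.0, 20.0, np.nan],
})

# Element-wise addition propagates NaN from either operand.
total = df['part_one'] + df['part_two']

# Count the rows whose total could not be computed.
missing_totals = int(total.isna().sum())
print(total.tolist())
print(missing_totals)
```

Only the first row, where both parts are present, gets a numeric total. The other two rows end up as NaN, whereas the desired behavior is to treat each missing part as zero.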
If you're familiar with Pandas, you probably already have ideas on how to achieve this. But first we're going to create a step unit test that will validate our solution and fail given the current code in the second step.
We begin by creating a Python file to contain our unit test.
This file must be placed somewhere within the notebook folder or Cauldron will be unable to locate the notebook and automatically initialize it for running the tests.
For this example we will place it within a step_tests subdirectory beneath the root notebook directory and call it ./step_tests/test_notebook.py. Inside this file we include the following:
    import cauldron as cd
    from cauldron import steptest
    import pandas as pd
    import numpy as np

    test_fixture = steptest.create_test_fixture(__file__)


    def test_missing_values(tester: steptest.CauldronTest):
        """ should not have NaN values in the total column """
        # Assign to the shared df variable a fictional data frame with only
        # a single row in which the part_one column value is missing
        cd.shared.df = pd.DataFrame(dict(
            part_one=[None],
            part_two=[12]
        ))

        # Run the step
        tester.run_step('S02-Create-Total.py')

        # Retrieve the modified data frame from the shared variables
        df = cd.shared.df

        # Confirm that the total column value is not NaN
        assert not np.isnan(df['total'].values[0])
There are many ways to run Python unit tests depending on your choice of development tools. In this example we'll run the test from the command line using the command:
    $ python -m pytest test_notebook.py
This command must be executed from the step_tests folder inside the root directory of the notebook. It yields the following console output:
    ================================ FAILURES =========================================
    ____________________________ test_missing_values __________________________________

    tester = ...

    >       assert not np.isnan(df['total'].values[0])
    E       AssertionError: assert not True
    E        +  where True = (nan)
    E        +    where  = np.isnan

    test_notebook.py:26: AssertionError
    ================== 1 failed, 0 warnings in 3.27 seconds ===========================
The test has failed because the total column contains a NaN value. We can now go back to the second step and change our code so that it handles missing values within the source columns. The updated code looks like this:
STEP 2: S02-Create-Total.py
    import cauldron as cd
    import pandas as pd

    # Retrieve the stored data DataFrame.
    df: pd.DataFrame = cd.shared.df
    df['total'] = df['part_one'].fillna(0) + df['part_two'].fillna(0)

    # Show the DataFrame in the display.
    cd.display.table(df)

    # Share the updated data frame.
    cd.shared.df = df
Running the test again with these changes yields a successful output:
================== 1 passed, 0 warnings in 2.61 seconds ===========================
We now have a test that validates the desired behavior of avoiding NaN in the 'total' column. Run it regularly as you change the notebook to confirm that the code still behaves properly and that no change has introduced an unintended issue.
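One case the single-row test does not cover is a row in which both parts are missing. The sketch below, which uses hypothetical sample values and runs outside of Cauldron, compares filling each column with fillna(0) before adding against the pandas alternative Series.add with fill_value=0. It is an illustration of why the choice matters, not a prescribed implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row is missing both parts.
df = pd.DataFrame({
    'part_one': [1.0, np.nan, np.nan],
    'part_two': [10.0, 20.0, np.nan],
})

# Fill each column before adding: every missing value counts as zero.
filled_total = df['part_one'].fillna(0) + df['part_two'].fillna(0)

# Alternative: Series.add with fill_value substitutes 0 only when
# one side is missing; a row missing on both sides stays NaN.
add_total = df['part_one'].add(df['part_two'], fill_value=0)

print(filled_total.tolist())
print(add_total.tolist())
```

Because the requirement here is that no NaN appears in the 'total' column, filling each column first is the variant that satisfies it for every row. A further test with both values missing would guard against swapping in the fill_value form by mistake.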