Single Instruction Multiple Data (SIMD) Parallelism
A common data analysis task is to carry out the same analysis (Single Instruction) on multiple segments of data (Multiple Data). This is known among the parallel computing community as a Single Instruction Multiple Data (SIMD) execution task. Mapping operations, which abound in data analyses, are essentially SIMD tasks.
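As a rough, Cauldron-free illustration of the pattern, a plain Python parallel map applies one function (the single instruction) to many data segments (the multiple data). The analyze function and file paths below are hypothetical placeholders, not part of Cauldron:

from multiprocessing import Pool

def analyze(segment_path):
    # The single 'instruction': the same analysis applied to one data segment.
    # ... load the file at segment_path and compute a result ...
    return segment_path

# The 'multiple data': one input per segment, all processed the same way.
segment_paths = ['census_data/California.csv', 'census_data/Texas.csv']

if __name__ == '__main__':
    with Pool() as pool:
        results = pool.map(analyze, segment_paths)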
Cauldron notebooks can be used for SIMD tasks by invoking them in production mode with additional arguments. Consider a case where we want to build a per-state model from census data. We start by building a model in a Cauldron notebook for a particular state, loading that state's data:
01  import pandas as pd
02
03  df = pd.read_csv('census_data/California.csv')
04
05  # Analysis and modelling of the data...
Once we're satisfied with the model, we want to run it on all 50 states. Doing this manually would be tedious, but with Cauldron you can easily run production executions of the notebook for each state. First, let's modify the notebook step that loads the data from what we had above to:
01  import pandas as pd
02  import cauldron as cd
03
04  state_name = cd.shared.fetch('state_name', default_value='California')
05
06  df = pd.read_csv('census_data/{}.csv'.format(state_name))
07
08  # Analysis and modelling of the data...
We create a state_name variable on line 4 that gets its value from the notebook's shared data. Since the shared state_name variable has not been set, fetch returns the default value of 'California'. This is exactly the same behavior as the previous version of this step, where 'California' was hard-coded into the path string.
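As a small sketch of that fallback behavior (the 'Texas' value is just a hypothetical injected value, not something set in this notebook):

import cauldron as cd

# Interactive run: nothing has been injected, so fetch falls back to the default.
state_name = cd.shared.fetch('state_name', default_value='California')
# state_name == 'California'

# Production run where state_name='Texas' was passed to the execution call:
# the injected value is found in shared data and the default is ignored.
# state_name == 'Texas'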
When the notebook is run in production, we can set the state_name shared variable from outside the notebook. The same notebook can then run on different data for each state:
01  import cauldron
02
03  for state_name in states:  # states: a list of all 50 state names
04      project_directory = '/directory/of/my/cauldron/notebook/project'
05      output_directory = '/results/{}'.format(state_name)
06      logging_path = '/log/{}.log'.format(state_name)
07      cauldron.run_project(
08          project_directory,
09          output_directory,
10          logging_path,
11          state_name=state_name
12      )
We wrap the production execution call in a for loop that iterates over all of the state names and produces state-name-specific output and logging paths, so that each state's output doesn't overwrite the previous ones. We then add the state_name variable to the execution call on line 11. Any parameter specified here will be added to Cauldron's shared object before the notebook is run.
Now the notebook will be run once for each state, with a different value for the state_name shared variable in each run. This is non-destructive: you can always open the notebook later for interactive editing to update and improve the analysis, and then re-run it in production without modification.
There's also no reason these calls have to be made within a single for loop as shown above. They don't even have to run on the same computer: they could be run as 50 different Python scripts on 50 different nodes of a cluster, or made into 50 Luigi tasks as part of a larger data pipeline. The point is that an analysis built in a Cauldron notebook can be used directly in production in a larger context.
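For instance, here is a minimal sketch of the Luigi variant; the task name, state list, and paths are illustrative assumptions rather than anything prescribed by Cauldron or Luigi:

import cauldron
import luigi

class StateModelTask(luigi.Task):
    # Hypothetical task that runs the Cauldron notebook for a single state.
    state_name = luigi.Parameter()

    def output(self):
        # One result directory per state keeps runs from overwriting each other.
        return luigi.LocalTarget('/results/{}'.format(self.state_name))

    def run(self):
        cauldron.run_project(
            '/directory/of/my/cauldron/notebook/project',
            self.output().path,
            '/log/{}.log'.format(self.state_name),
            state_name=self.state_name
        )

if __name__ == '__main__':
    states = ['California', 'Texas']  # ...and the other 48
    luigi.build(
        [StateModelTask(state_name=name) for name in states],
        local_scheduler=True
    )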