Single Instruction Multiple Data (SIMD) Parallelism
A common data analysis task is to carry out the same analysis
(Single Instruction) on multiple segments of data (Multiple Data). This is
known in the parallel computing community as a Single Instruction
Multiple Data (SIMD) execution task. Mapping operations, which abound in
data analyses, are essentially SIMD tasks.
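For instance, a per-group computation over a DataFrame applies one instruction to many independent data segments. Here is a minimal sketch, assuming a census file with hypothetical county and population columns and a placeholder analyze function:

import pandas as pd

def analyze(segment):
    # Placeholder for the single instruction applied to every segment.
    return segment['population'].mean()

df = pd.read_csv('census_data/California.csv')

# One instruction (analyze), multiple data (one segment per county).
results = {county: analyze(segment) for county, segment in df.groupby('county')}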
Cauldron notebooks can be used in SIMD tasks by invoking them in production
mode with additional arguments. Consider a case where we want to build a
per-state model from census data. We start by building the model in a
Cauldron notebook for a single state, loading that state's data:
import pandas as pd

df = pd.read_csv('census_data/California.csv')
Once we're satisfied with the model, we want to run it on all 50 states.
Doing this manually would be tedious, but with Cauldron you can easily run
production executions of the notebook for each state. First, let's modify
the notebook step that loads the data from the version above to:
import pandas as pd
import cauldron as cd

state_name = cd.shared.fetch('state_name', default_value='California')
df = pd.read_csv('census_data/{}.csv'.format(state_name))
We create a state_name variable on line 4 that gets its value from
the notebook's shared data. Since the shared state_name variable has
not been set, the fetch call returns the default value of 'California'.
This is exactly the same behavior as the previous version of this step,
where 'California' was hard-coded into the path string.
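To see the fetch behavior in isolation, here is a minimal sketch; the put call used to simulate a stored value is assumed from Cauldron's shared-cache API:

import cauldron as cd

# No shared value has been set yet, so fetch falls back to the default.
cd.shared.fetch('state_name', default_value='California')  # -> 'California'

# Simulate a stored value (in production this comes from run_project).
cd.shared.put(state_name='Texas')

# Now fetch returns the stored value and ignores the default.
cd.shared.fetch('state_name', default_value='California')  # -> 'Texas'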
When the notebook is run in production mode, we can set the state_name
shared variable from outside the notebook. The same notebook can then be run
on different data for each state:
01  import cauldron
02
03  for state_name in states:
04      project_directory = '/directory/of/my/cauldron/notebook/project'
05      output_directory = '/results/{}'.format(state_name)
06      logging_path = '/log/{}.log'.format(state_name)
07      cauldron.run_project(
08          project_directory,
09          output_directory,
10          logging_path,
11          state_name=state_name
12      )
We wrap the production execution call in a for loop that iterates over all of
the state names (states here is a list of the 50 state names) and produces
state-specific output and logging paths so that each state's output doesn't
overwrite the previous ones. We then pass the state_name variable to the
execution call on line 11. Any keyword argument specified here will be added
to Cauldron's shared object before the notebook is run.
Now the notebook will be run once for each state, with a different value for
the state_name shared variable in each run. All of this is non-destructive:
you can always open the notebook later for interactive editing to update and
improve the analysis, and then re-run it in production without modification.
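Because each state's run is independent, the loop itself doesn't have to be serial either. Here is a minimal sketch that dispatches the same calls across worker processes with Python's standard concurrent.futures module, assuming cauldron.run_project is safe to invoke from separate processes:

from concurrent.futures import ProcessPoolExecutor

import cauldron

states = ['Alabama', 'Alaska', 'Arizona']  # ...through 'Wyoming'

def run_state(state_name):
    # Identical to one iteration of the loop above.
    cauldron.run_project(
        '/directory/of/my/cauldron/notebook/project',
        '/results/{}'.format(state_name),
        '/log/{}.log'.format(state_name),
        state_name=state_name
    )

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        list(executor.map(run_state, states))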
There's also no reason that these calls have to be made within a single
for loop as shown above. They don't even have to be run on the same
computer. They could be run as 50 different Python scripts on 50 different
nodes of a cluster if you want. They could also be made into 50 Luigi tasks
as part of a larger data pipeline. The point is that an analysis built in
a Cauldron notebook can be used directly in production in a larger context.
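As a sketch of the Luigi variant, each state's run can be wrapped in a task whose target is that state's results directory; the StateModel class, paths, and truncated states list below are hypothetical:

import cauldron
import luigi

class StateModel(luigi.Task):
    """Run the Cauldron notebook in production mode for one state."""
    state_name = luigi.Parameter()

    def output(self):
        # The per-state results directory doubles as the task's target.
        return luigi.LocalTarget('/results/{}'.format(self.state_name))

    def run(self):
        cauldron.run_project(
            '/directory/of/my/cauldron/notebook/project',
            self.output().path,
            '/log/{}.log'.format(self.state_name),
            state_name=self.state_name
        )

if __name__ == '__main__':
    states = ['Alabama', 'Alaska', 'Arizona']  # ...through 'Wyoming'
    luigi.build(
        [StateModel(state_name=name) for name in states],
        local_scheduler=True
    )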