I don’t know about you, but I find caterpillars to be a bit creepy. On the other hand, I find butterflies to be beautiful. Oddly enough, this aligns to my views on the different stages of data in relation to model development.
As a financial institution (FI) prepares for CECL, it is strongly suggested (by me at least) to know which stage the data falls into. Knowing its stage provides one with guidance on how to proceed.
At FRG we use the term dirty data to describe data that is ugly. Dirty data typically has these following characteristics (the list is not comprehensive):
- Unexplainable missing values: The key word is unexplainable. Missing values can mean something (e.g., a value has not been captured yet) but often they indicate a problem. See this article for more information.
- Inconsistent values: For example, a character variable that holds values for state might have Missouri, MO, or MO. as values. A numeric variable for interest rate might have a value as a percent (7.5) and a decimal (0.075)
- Poor definitional consistency: This occurs when a rule that is used to classify some attribute of an account changes during history. For example, at one point in history a line of credit might be indicated by a nonzero original commitment amount, but at a different point it might be indicated by whether a revolving flag is non-missing.
You should not model or perform analysis using dirty data. Therefore, the next step in the process is to transition dirty data into clean data.
Transitioning to clean data, as the name implies, requires scrubbing the information. The main purpose of this step is to address the issues identified in the dirty data. That is, one would want to fix missing values (e.g., imputation), standardized variable values (e.g., all states are identified by a two-character code), and correct inconsistent definitions (e.g., a line indicator is always based on nonzero original commitment amount).
A final step must be taken before data can be used for modeling. This step takes clean data and converts it to model-ready data.
At FRG we use the term model-ready to describe clean data with the application of relevant business definitions. An example of a relevant business definition would be how an FI defines default. Once the definition has been created the corresponding logic needs to be applied to the clean data in order to create, say, a default indicator variable.
Just like a caterpillar metamorphosing to a butterfly, dirty data needs to morph to model-ready for an FI to enjoy its true beauty. And, only then, can an FI move forward on model development.
Jonathan Leonardelli, FRM, Director of Business Analytics for the Financial Risk Group, leads the group responsible for model development, data science, documentation, testing, and training. He has over 15 years’ experience in the area of financial risk.
 E.g., is it 90+ days past due (DPD) or 90+ DPD or in bankruptcy or in non-accrual or …?