# Roadmap

This is an overview of the evaluations planned for integration into PRESC. It is intended to give a high-level description of how these will work and sample use cases. Prioritization and implementation details are maintained in the repo issues.

At the core of PRESC is a collection of evaluations that can be run on a given statistical model and dataset pair to inform the developer on different aspects of the model’s performance and behaviour. The two main intended uses are a graphical presentation in a report and the detection of potential issues by comparing the results against threshold values determined by the user. In either case, results will require some degree of interpretation in the context of the problem domain, and it will be up to the user to decide on a course of action to correct deficiencies in the model surfaced by these evaluations.

Planned evaluations are described below, grouped by theme. Some of these will lend themselves to multiple possible visualizations or summaries, while others will be applicable in a single clear way. The first step in developing many of these will be to build a prototype and test them out against different models and datasets to get an idea of how they behave. Related literature or implementations that we are aware of are referenced below. Contributions that link additional references to related work are welcome.

For each one, we list expected inputs and output structure, as well as the ways we expect it to be used. The description focuses on the underlying computation rather than the ways results should be presented or visualized. For some of these, we will want to further summarize the outputs, while others will be reported as is.

## Misclassifications

Many common accuracy metrics involve scalar counts or rates computed from the confusion matrix. However, the misclassified points themselves carry much more information about the model behaviour. They are indicative of failures in the model, and understanding why they were misclassified can help improve it.

For example:

- Is the misclassification due to the characteristics of the point itself or to the model?
    - It may not be surprising for an outlier to get misclassified by most reasonable models.
    - A point in an area of high overlap between the classes may get misclassified by some candidate models and not by others, depending on where the decision boundary lands.
- How different is the distribution of misclassified points in feature space from that for correctly classified points?
- Is there evidence of systematic bias?

**Application scope:** These generally apply to the predictions on a test set by
a trained model, such as the final evaluation on a held-out test set or model
selection on a validation set.

### Conditional metrics

This is implemented in the `conditional_metric` module.

Standard performance metrics such as accuracy, precision and recall are computed by summarizing overall differences between predicted and true labels. PRESC will additionally compute these differences restricted to subsets of the feature space or test set. This way, the confusion matrix and related metrics can be viewed as they vary across the values of a feature. This is similar to calibration, which considers accuracy as a function of predicted score.

**Input:**

- Predicted labels for a test set from a trained model
- Scheme for partitioning the test set
    - eg. binning values of a given feature
- Metric
    - function of predicted and true labels

**Output:** Metric values for each partition

**Applications:**

- Performance metrics as a function of partitions:
    - Misclassification counts by class
    - Standard accuracy metrics (eg. accuracy, precision, recall)
    - Proportion of misclassified points belonging to a specific class
- Deviation of these per-partition values from the value over the entire test set

**Type**: Model performance metric
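As an illustrative sketch (the function name and signature here are hypothetical, not the `conditional_metric` module's actual API), the computation amounts to binning a feature and evaluating a metric within each bin:

```python
import numpy as np

def conditional_metric(feature, y_true, y_pred, bin_edges, metric):
    """Evaluate `metric(y_true, y_pred)` restricted to each bin of `feature`.

    Illustrative sketch only; names and signature are hypothetical.
    """
    bin_ids = np.digitize(feature, bin_edges)
    return {
        int(b): metric(y_true[bin_ids == b], y_pred[bin_ids == b])
        for b in np.unique(bin_ids)
    }

# Toy example: accuracy within each half of the feature range.
feature = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0])
accuracy = lambda t, p: float(np.mean(t == p))
per_bin = conditional_metric(feature, y_true, y_pred, bin_edges=[0.5], metric=accuracy)
# per_bin maps each bin index to the accuracy within that bin
```

Any function of true and predicted labels can be dropped in as the metric, which is what makes the per-partition view composable with the standard metrics listed above.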

### Conditional feature distributions

This is implemented in the `conditional_distribution` module.

In a sense this reverses the conditioning of the conditional confusion matrix. We compute the distribution of a feature over the test set restricted to each cell of the confusion matrix. This allows us to compare distributions between misclassified and correctly classified points.

**Input:**

- Predicted labels for a test set from a trained model
- Column of data from the test set
    - eg. values of a feature
    - could also be predicted scores or a function of the features

**Output:** Distributional representation (eg. value counts, histogram or
density estimate) for each cell in the confusion matrix

**Applications:**

- Feature exploration conditional on test set predicted outcomes
- Assessment of differences between misclassified and correctly classified points in terms of their distribution in feature space
    - Within one class, between multiple classes, or relative to the training set
- Evidence of bias in the misclassifications
    - Are misclassifications concentrated in an area of strong overlap between the classes in the training set?
    - Are misclassifications clustered, eg. separated from the majority of training points of that class?

**Type**: Feature distributions
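A minimal sketch of the computation (hypothetical names, not the `conditional_distribution` module's API): histogram a single column over the points falling in each confusion-matrix cell, using shared bin edges so the cells are comparable:

```python
import numpy as np

def conditional_distributions(column, y_true, y_pred, bins=5):
    """Histogram of `column` restricted to each confusion-matrix cell.

    Returns {(true_label, predicted_label): bin counts}; sketch only.
    Shared bin edges make the per-cell histograms directly comparable.
    """
    edges = np.histogram_bin_edges(column, bins=bins)
    return {
        (int(t), int(p)): np.histogram(column[(y_true == t) & (y_pred == p)], bins=edges)[0]
        for t in np.unique(y_true)
        for p in np.unique(y_pred)
    }

rng = np.random.default_rng(0)
column = rng.normal(size=100)
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)
dists = conditional_distributions(column, y_true, y_pred)
```

Comparing, say, the (0, 0) cell against the (0, 1) cell then shows how the feature distribution differs between correctly classified and misclassified members of class 0.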

### Counterfactual accuracy

How much does the performance of an optimal model which correctly classifies a misclassified point differ from that of the current model? This is measured by searching (the parameter space) for the best performing model, subject to the constraint that a specific point is correctly classified (Bhatt et al. (2020)).

**Input:**

- Trainable model
    - ie. model specification and training set
- Labeled sample point
    - ie. misclassified test set point

**Output:** Trained model which correctly classifies the sample point

**Applications:**

- Cost measure for correcting misclassifications
- Measure of whether a misclassification is more likely due to its inherent characteristics or to the choice of model
    - If forcing a correct classification substantially decreases accuracy, then it is likely an unusual point relative to the training set (ie. an influential point in the statistical sense).
    - If the change in accuracy is minimal, the misclassification may be an artifact of the methodology used to select the model.

**Type**: Per-sample metric applied to misclassifications
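Bhatt et al. describe a search over the parameter space; as a much cruder stand-in (an assumption for illustration, not the paper's method), one can upweight the misclassified point via scikit-learn's `sample_weight` until a retrained model accommodates it, then compare accuracies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Label noise guarantees that some points end up misclassified.
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

base = LogisticRegression().fit(X, y)
base_acc = base.score(X, y)

# Pick one misclassified point and force the model to accommodate it by
# giving it a very large sample weight -- a crude approximation of the
# constrained search over the parameter space.
i = int(np.flatnonzero(base.predict(X) != y)[0])
weights = np.ones(len(y))
weights[i] = 1000.0
forced = LogisticRegression().fit(X, y, sample_weight=weights)

# The accuracy given up to correct this point is the "cost" of the fix.
cost = base_acc - forced.score(X, y)
```

A large `cost` suggests an influential or unusual point; a negligible one suggests the original misclassification was an artifact of model selection.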

### Class fit

In addition to looking at distributions across misclassifications, it is useful to have a distributional goodness-of-fit metric for how much a misclassified point “belongs” to each class. We compute entropy-based goodness-of-fit between a misclassified point and each “true” class as represented by the training set.

**Input:**

- Sample point
    - ie. misclassified test set point
- Dataset
    - ie. training set

Datapoints can refer to either the original feature space or an embedding/transformation.

**Output:** Scalar goodness-of-fit measure for each class

**Applications:**

- Measure of surprisal for misclassifications
    - Was the point misclassified because it looks much more like a member of its predicted class than its true class?
    - If it fits well in multiple classes, it may be in an area of high overlap.
    - If it doesn’t fit well in any class, it may be an outlier.
- Deviation from a baseline distribution computed using the same approach for correctly classified points

**Type**: Per-sample metric applied to misclassifications
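One simple realization of a per-class fit score (an assumption for illustration, not necessarily the entropy-based measure PRESC will settle on) is the log-density of the point under a diagonal Gaussian fit to each class:

```python
import numpy as np

def class_fit_scores(x, X_train, y_train):
    """Log-density of `x` under a diagonal Gaussian fit to each class.

    Higher values indicate a better fit; a stand-in for the
    entropy-based goodness-of-fit measure described above.
    """
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9  # guard against zero variance
        scores[int(c)] = float(
            -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        )
    return scores

# Two well-separated classes; the point clearly "belongs" to class 0.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
scores = class_fit_scores(np.array([0.05, 0.05]), X_train, y_train)
```

Comparing the scores across classes answers the surprisal question above: a point scoring well only under its predicted class "looks like" that class rather than its true one.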

### Spatial distributions

This is partially implemented in the `spatial_distribution` module.

In some cases, it will be helpful to understand where a misclassified point lies in the feature space in relation to other training points. While this does not translate to intuition about model behaviour for all types of model, it can still be useful as a view into the geometry of the feature space. PRESC does this by computing the distribution of pairwise distances between a misclassified point and other training points split by class.

**Input:**

- Sample point
    - ie. misclassified test set point
- Dataset
    - ie. training set
- Metric to measure distances in the feature space
    - eg. Euclidean

Datapoints and metric can refer to either the original feature space or an embedding/transformation.

**Output:** Distributional representation (histogram or density estimate) for
each class

**Applications:**

- Geometric class-fit measure for misclassifications
    - Can help to distinguish between misclassifications that are outliers (far from all training points), those which lie in an area of high overlap, and those which are closer to a different class
- Deviation from a baseline distribution computed using the same approach for correctly classified points

**Type**: Per-sample metric applied to misclassifications
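The core computation is short; a sketch with a Euclidean metric (hypothetical names, not the `spatial_distribution` module's API):

```python
import numpy as np

def distance_distributions(x, X_train, y_train):
    """Euclidean distances from `x` to the training points of each class.

    Sketch only; any other metric or an embedding could be substituted.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    return {int(c): dists[y_train == c] for c in np.unique(y_train)}

# A point near class 0 and far from class 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
by_class = distance_distributions(np.array([0.05, 0.05]), X_train, y_train)
```

Summarizing each per-class distance array as a histogram or density estimate gives the distributional output described above.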

## Robustness to unseen data

Ideally, our model should perform well for unseen data points for which predictions are requested. Standard performance measures, computed by averaging results over a random split of the data, represent average-case performance for data which is identically distributed to the training data, an assumption which is often not met in practice. Here we consider performance evaluations that account for distributional differences between the training and test sets.

**Application scope:** These can be applied either as strategies for model
selection or as an evaluation methodology on a final test set.

### Feature-based splitting

While we don’t know along which dimensions the unseen data will differ from our training set, we can take into account possible training set bias by validating over explicitly biased splits. These are generated by holding out training points whose values for a given feature fall in a particular range.

**Input:**

- Dataset
    - ie. training set
- Scheme for partitioning the dataset
    - eg. binning values of a given feature

Datapoints and partitioning can refer to either the original feature space or an embedding/transformation.

**Output:** Sequence of splits of the dataset holding out one partition each time

**Applications:**

- Model selection using cross-validation taking training set bias into account
- Feature selection using susceptibility to bias as a criterion
- Model performance range estimate in the face of biased data
    - ie. evaluate on test-set data belonging to the held-out partition, having trained on training data in the other partitions
- Deviation from overall performance metric over the entire test set

**Type**: Dataset splitting scheme for validation
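Generating the splits is a small amount of code; a sketch (hypothetical names) that holds out one feature bin at a time:

```python
import numpy as np

def feature_based_splits(feature, bin_edges):
    """Yield (train_idx, test_idx) index pairs, holding out the points
    falling in one bin of `feature` each time. Sketch only."""
    bin_ids = np.digitize(feature, bin_edges)
    for b in np.unique(bin_ids):
        yield np.flatnonzero(bin_ids != b), np.flatnonzero(bin_ids == b)

# Two bins (below/above 0.5) give two explicitly biased splits.
feature = np.array([0.1, 0.2, 0.6, 0.7, 0.9])
splits = list(feature_based_splits(feature, bin_edges=[0.5]))
```

Each split deliberately trains on data whose feature range does not cover the held-out bin, which is what makes the evaluation a stress test for training set bias.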

### Entropy-based splitting

Another approach to non-random splitting is to partition in terms of distributional differences rather than the values of a specific feature. Here we generate splits achieving a target distributional dissimilarity value between the train and test portions.

**Input:**

- Dataset
    - ie. training set

Datapoints can refer to either the original feature space or an embedding/transformation.

**Output:** Sequence of randomized splits of the dataset achieving target dissimilarity (K-L divergence) values

**Applications:**

- Model selection using cross-validation taking into account robustness to unseen data
- Model performance range estimate in the face of data shift
    - ie. select one subset from the training set and another from the test set
- Deviation from overall performance metric over the entire test set

**Type**: Dataset splitting scheme for validation
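A brute-force sketch of the idea (hypothetical names; a real implementation would search more cleverly than resampling at random): draw random splits and keep the one whose train/test histograms are closest to the target K-L divergence:

```python
import numpy as np

def kl_divergence(counts_p, counts_q):
    """K-L divergence between two histograms, with +1 smoothing."""
    p = (counts_p + 1.0) / (counts_p + 1.0).sum()
    q = (counts_q + 1.0) / (counts_q + 1.0).sum()
    return float(np.sum(p * np.log(p / q)))

def split_with_target_divergence(X, target, test_frac=0.3, n_tries=200, seed=0):
    """Search random splits for one whose train/test histograms (of the
    first feature, for simplicity) are about `target` apart in K-L
    divergence. Sketch only."""
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(X[:, 0], bins=10)
    n_test = int(test_frac * len(X))
    best, best_gap = None, np.inf
    for _ in range(n_tries):
        perm = rng.permutation(len(X))
        test, train = perm[:n_test], perm[n_test:]
        d = kl_divergence(
            np.histogram(X[train, 0], bins=edges)[0],
            np.histogram(X[test, 0], bins=edges)[0],
        )
        if abs(d - target) < best_gap:
            best, best_gap = (train, test), abs(d - target)
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
train_idx, test_idx = split_with_target_divergence(X, target=0.05)
```

Sweeping the target value then produces the sequence of splits at increasing dissimilarity described above.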

### Label flipping

How many training labels would have to change for the decision boundary to move? A more robust model which is not overfit should be able to sustain more label changes without a significant impact on its performance. Label flipping could occur in practice, for example, if some of the training data is mislabeled, or certain areas of the feature space can reasonably belong to multiple classes. To measure this, we compute the change in model performance as more and more training point labels are flipped.

**Input:**

- Training set
- Model specification
- Performance measurement scheme
    - ie. test set and performance metric

**Output:** Function (sequence of tuples) mapping number of flipped labels to
performance metric values

**Applications:**

- Measure of robustness to mislabeled data
- Measure of overfitting
- Influence measure for points or clusters in the training data

**Type**: Model performance metric
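The measurement loop might look like the following sketch (scikit-learn is used for illustration only; the model, data, and flip counts are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)  # cleanly separable labels
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Retrain with progressively more flipped training labels and record
# the resulting test performance.
curve = []
for n_flips in (0, 20, 60, 100):
    flipped = y_train.copy()
    idx = rng.choice(len(flipped), size=n_flips, replace=False)
    flipped[idx] = 1 - flipped[idx]  # flip the selected labels
    model = LogisticRegression().fit(X_train, flipped)
    curve.append((n_flips, model.score(X_test, y_test)))
```

The resulting `curve` is the function described in the output: flatter curves indicate a model more robust to mislabeled data.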

### Novelty

Scenarios in which classification models are deployed generally evolve over time, and eventually the data used to train the model may no longer be representative of the cases for which predictions are requested. PRESC will include functionality to determine how much a new set of labeled data (if available) has diverged from the current training set. This will help to inform when a model update is needed.

**Input:**

- Previous dataset
    - ie. current training set
- New dataset
    - ie. newly available labeled data

**Output:** Similarity measure between the two datasets (scalar or
distributional)

**Applications:**

- Measure of novelty for new labeled data (eg. available as a result of ongoing human review)
- Measure of appropriateness of the model on new data
    - eg. improvement in performance on the new data between a model trained including the new data and the original model, as a function of novelty
- Decision rule for when to trigger a model update
- Deviation from baseline computed from subsets of the existing training set

**Type**: Dataset comparison metric
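One candidate similarity measure (an assumption, not a committed design) is a symmetrized K-L divergence between per-feature histograms, averaged over features:

```python
import numpy as np

def dataset_divergence(old, new, bins=10):
    """Symmetrized K-L divergence between per-feature histograms of two
    datasets, averaged over features. One simple novelty measure."""
    total = 0.0
    for j in range(old.shape[1]):
        # Shared bin edges so the two histograms are comparable.
        edges = np.histogram_bin_edges(np.concatenate([old[:, j], new[:, j]]), bins=bins)
        p = np.histogram(old[:, j], bins=edges)[0] + 1.0  # +1 smoothing
        q = np.histogram(new[:, j], bins=edges)[0] + 1.0
        p, q = p / p.sum(), q / q.sum()
        total += float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))) / 2
    return total / old.shape[1]

rng = np.random.default_rng(0)
old = rng.normal(size=(500, 2))
similar = rng.normal(size=(500, 2))        # same distribution
shifted = rng.normal(size=(500, 2)) + 3.0  # large data shift
```

Comparing the divergence of new data against a baseline computed between subsets of the existing training set gives the decision rule mentioned above.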

## Stability of methodology

While models are generally selected to maximize some measure of performance, the final choice of model also carries an inherent dependence on the methodology used to train it. For example, if a different train-test split were used, the final model would likely be slightly different. These analyses measure the effects of methodological choices on model performance, with the goal of minimizing them. Note that, for any of these approaches that use resampling to estimate variability, computational cost needs to be taken into account as a part of the design.

**Application scope:** These generally require a model specification and
training set, and can be applied post-hoc (to assess error in reported
results) or prior to training (to help select methodology, using an assumed
prior model choice).

### Train-test splitting

This is implemented in the `train_test_splits` module.

When training a model, a test set is typically held out at the beginning so as to provide an unbiased estimate of model performance on unseen data. However, the size of this test set itself influences the quality of this estimate. To assess this, we consider the variability and bias in performance metrics as the test set size varies. This is estimated by splitting the input dataset at different proportions and training and testing a model across these.

**Input:**

- Dataset
    - ie. training set
- Model specification
- Performance metric

**Output:** Function (sequence of tuples) mapping train-test split size to
performance estimates represented as a mean with confidence bounds

**Applications:**

- Measure of error (variance/bias) in test set performance evaluations
- Selection of train-test split size to minimize bias and variance
- Deviation from average-case performance

**Type**: Model performance confidence metric
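A sketch of the estimation procedure (scikit-learn for illustration; the split sizes, repeat count, and model are assumptions, not the `train_test_splits` module's API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# For each candidate test-set size, repeat the split to estimate the
# mean and spread of the resulting performance estimate.
results = {}
for test_size in (0.1, 0.3, 0.5):
    scores = []
    for seed in range(20):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        scores.append(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))
    results[test_size] = (float(np.mean(scores)), float(np.std(scores)))
```

The per-size standard deviations expose the variance component directly; comparing the means against performance on a large held-out set would expose the bias component.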

### Cross-validation folds

Similarly, the choice of validation methodology will influence the quality of estimates obtained using it. PRESC assesses this by computing cross-validation (CV) model performance estimates across different numbers of folds.

**Input:**

- Dataset
    - ie. training set
- Model specification
- Performance metric

**Output:** Function (sequence of tuples) mapping number of CV folds to
performance estimates represented as a mean with confidence bounds

**Applications:**

- Measure of error (variance/bias) in CV performance evaluations
- Selection of the number of CV folds to minimize bias and variance
- Deviation from average-case performance

**Type**: Model performance confidence metric
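The corresponding sketch varies the number of folds passed to scikit-learn's `cross_val_score` (the fold counts and model here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

# CV performance estimate (mean and spread across folds) as a function
# of the number of folds.
fold_results = {}
for k in (2, 5, 10):
    scores = cross_val_score(LogisticRegression(), X, y, cv=k)
    fold_results[k] = (float(scores.mean()), float(scores.std()))
```

With few folds, each model trains on less data (pessimistic bias); with many folds, the per-fold estimates grow noisier. The mapping from `k` to mean and spread makes that trade-off visible.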