PRESC is a toolkit for the evaluation of machine learning classification models. Its goal is to provide insights into model performance that extend beyond standard scalar accuracy-based measures, into areas that tend to be underexplored in applications, including:

  • Generalizability of the model to unseen data for which the training set may not be representative

  • Sensitivity to statistical error and methodological choices

  • Performance evaluation localized to meaningful subsets of the feature space

  • In-depth analysis of misclassifications and their distribution in the feature space

As a tool, PRESC is intended for use by ML engineers to assist in the development and updating of models. Given a dataset and a machine learning classifier, it runs evaluations covering different aspects of model performance. These can be explored individually, e.g. in a Jupyter notebook, or they can be viewed collectively in a standalone graphical report.
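As an illustration of the kind of evaluation involved, the sketch below (plain scikit-learn, not the PRESC API) computes accuracy localized to subsets of the feature space rather than a single global score. The choice of dataset, model, and binning feature are arbitrary assumptions for the example.

```python
# Hypothetical sketch of localized performance evaluation, the kind of
# analysis PRESC is designed to automate. This is NOT the PRESC API.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Split the test set on a single feature (column 0) at its median and
# report accuracy within each half, exposing variation that the overall
# accuracy score hides.
bins = np.quantile(X_test[:, 0], [0.0, 0.5, 1.0])
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (X_test[:, 0] >= lo) & (X_test[:, 0] <= hi)
    acc = (pred[mask] == y_test[mask]).mean()
    print(f"feature 0 in [{lo:.2f}, {hi:.2f}]: accuracy {acc:.3f} (n={mask.sum()})")
```

A notebook exploration would typically repeat this over many features and subsets; the report collects such views in one place.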

An eventual goal is integration into a Continuous Integration workflow: evaluations would run as part of CI, for example on regular model updates, and fail the build if metrics produce unacceptable values.
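The gating step described above might look something like the following sketch (a hypothetical helper, not a shipped PRESC feature): compare computed metrics against minimum thresholds and fail the CI job when any falls short.

```python
# Hypothetical CI gating sketch -- not part of PRESC as it ships today.
def check_metrics(metrics: dict, thresholds: dict) -> list:
    """Return names of metrics below their required minimum.

    A metric missing from `metrics` is treated as failing (value 0.0).
    """
    return [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]

# Example values; in practice these would come from the evaluation run
# and a project-level config file.
failures = check_metrics(
    {"accuracy": 0.91, "recall": 0.78},
    {"accuracy": 0.90, "recall": 0.85},
)
print("failing metrics:", failures)
# In a CI job one would exit nonzero here (e.g. sys.exit(1)) when
# `failures` is non-empty, causing the pipeline to fail.
```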