{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Selection\n", "In this notebook, the model selection capabilities of pyPESTO are demonstrated, which facilitate the selection of the best model from a set of possible models. This includes examples of forward, backward, and brute force methods, as well as criteria such as AIC, AICc, and BIC. Various additional options and convenience methods are also demonstrated.\n", "\n", "All specification files use the [PEtab Select](https://github.com/PEtab-dev/petab_select) format, which is a model selection extension to the parameter estimation specification format [PEtab](https://github.com/PEtab-dev/PEtab).\n", "\n", "Dependencies can be installed with `pip3 install pypesto[select]`." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "In this notebook:\n", "\n", "* :class:`pypesto.select.Problem `\n", "* :func:`pypesto.optimize.minimize `" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Model\n", "This example involves a reaction system with two species (`A` and `B`), with their growth, conversion and decay rates controlled by three parameters ($\\theta_1$, $\\theta_2$, and $\\theta_3$). Many different hypotheses will be considered, which are described in the model specifications file. There, a parameter fixed to zero indicates that the associated reaction(s) should not be in the model which resembles a hypothesis.\n", "\n", "Synthetic measurement data is used here, which was generated with the \"true\" model. The comprehensive model includes additional behavior involving a third parameter ($\\theta_3$). Hence, during model selection, models without $\\theta_3=0$ should be preferred." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](model_selection_network.jpeg)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Space Specifications File\n", "\n", "The model selection specification file can be written in the following compressed format.\n", "\n", "| model_subspace_id | petab_yaml | $\\theta_1$ | $\\theta_2$ | $\\theta_3$ |\n", "|:---------|:----------------------------------|:----|:----|:----|\n", "| M1 | example_modelSelection.yaml | 0;estimate | 0;estimate | 0;estimate |\n", "\n", "Alternatively, models can be explicitly specified. The below table is equivalent to the above table.\n", "\n", "| model_subspace_id | petab_yaml | $\\theta_1$ | $\\theta_2$ | $\\theta_3$ |\n", "|:---------|:----------------------------------|:----|:----|:----|\n", "| M1_0\t| example_modelSelection.yaml\t| 0\t | 0 |\t0 |\n", "| M1_1\t| example_modelSelection.yaml\t| 0\t | 0\t| estimate |\n", "| M1_2\t| example_modelSelection.yaml\t| 0\t | estimate |\t0 |\n", "| M1_3\t| example_modelSelection.yaml\t| estimate |\t0\t| 0 |\n", "| M1_4\t| example_modelSelection.yaml\t| 0\t | estimate |\testimate |\n", "| M1_5\t| example_modelSelection.yaml\t| estimate |\t0 |\testimate |\n", "| M1_6\t| example_modelSelection.yaml\t| estimate |\testimate |\t0 |\n", "| M1_7\t| example_modelSelection.yaml\t| estimate | estimate |\testimate |\n", "\n", "Either of the above tables (as [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) files) are valid inputs. Any combinations of cells in the compressed or explicit format is also acceptable, including the following example.\n", "\n", "| model_subspace_id | petab_yaml | $\\theta_1$ | $\\theta_2$ | $\\theta_3$ |\n", "|:---------|:----------------------------------|:----|:----|:----|\n", "| M1 | example_modelSelection.yaml | 0;estimate | 0;estimate | 0 |\n", "| M2 | example_modelSelection.yaml | 0;estimate | 0;estimate | estimate |\n", "\n", "Due to the topology of the example model, setting $\\theta_1$ to zero can result in a model with no dynamics. Hence, for this example, some parameters are set to non-zero fixed values. These parameters are considered as fixed (not estimated) values in criterion (e.g. AIC) calculations.\n", "\n", "The model specification table used in this notebook is shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from IPython.display import HTML, display\n", "\n", "df = pd.read_csv(\"model_selection/model_space.tsv\", sep=\"\\t\")\n", "display(HTML(df.to_html(index=False)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forward Selection, Multiple Searches\n", "Here, we show a typical workflow for model selection. First, a [PEtab Select](https://github.com/petab-dev/petab_select) problem is created, which is used to initialize a pyPESTO model selection problem." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Disable some logged messages that make the model selection log more\n", "# difficult to read.\n", "import tqdm\n", "\n", "\n", "def nop(it, *a, **k):\n", " return it\n", "\n", "\n", "tqdm.tqdm = nop" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import petab_select\n", "from petab_select import ESTIMATE, Criterion, Method\n", "\n", "import pypesto.select\n", "\n", "petab_select_problem = petab_select.Problem.from_yaml(\n", " \"model_selection/petab_select_problem.yaml\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "import pypesto.logging\n", "\n", "pypesto.logging.log(level=logging.WARNING, name=\"pypesto.petab\", console=True)\n", "\n", "pypesto_select_problem_1 = pypesto.select.Problem(\n", " petab_select_problem=petab_select_problem\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Models can be selected with a model selection algorithm (here: [forward](https://en.wikipedia.org/wiki/Stepwise_regression)) and a comparison criterion (here: [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion)). The forward method starts with the smallest model. Within each following iteration it tests all models with one additional estimated parameter.\n", "\n", "To perform a single iteration, use `select` as shown below. Later in the notebook, `select_to_completion` is demonstrated, which performs multiple consecutive iterations automatically.\n", "\n", "As no initial model is specified here, a virtual initial model with no estimated parameters is automatically used to find the \"smallest\" (in terms of number of estimated parameters) models. In this example, this is the model `M1_0`, which has no estimated parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reduce notebook runtime\n", "model_problem_options = {\n", " \"minimize_options\": {\n", " \"n_starts\": 10,\n", " \"filename\": None,\n", " \"progress_bar\": False,\n", " },\n", "}\n", "\n", "best_model_1, _ = pypesto_select_problem_1.select(\n", " method=Method.FORWARD,\n", " criterion=Criterion.AIC,\n", " model_problem_options=model_problem_options,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To search more of the model space, hence models with more parameters, the algorithm can be repeated. As models with no estimated parameters have already been tested, subsequent `select` calls will begin with the next simplest model (in this case, models with exactly 1 estimated parameter, if they exist in the model space), and move on to more complex models.\n", "\n", "The best model from the first iteration is supplied as the predecessor (initial) model here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_model_2, _ = pypesto_select_problem_1.select(\n", " method=Method.FORWARD,\n", " criterion=Criterion.AIC,\n", " model_problem_options=model_problem_options,\n", " predecessor_model=best_model_1,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting routines to visualize the best model at each iteration of the selection process, or to visualize the graph of models that have been visited in the model space are available." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import petab_select.plot as plot\n", "\n", "selected_models = [best_model_1, best_model_2]\n", "\n", "plot_data = plot.PlotData(\n", " models=petab_select_problem.state.models,\n", " criterion=Criterion.AIC,\n", " relative_criterion=False,\n", ")\n", "plot.line_best_by_iteration(plot_data=plot_data)\n", "\n", "plot_data.relative_criterion = True\n", "plot.line_best_by_iteration(plot_data=plot_data);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data = plot.PlotData(\n", " models=petab_select_problem.state.models,\n", " criterion=Criterion.AIC,\n", ")\n", "plot_data.augment_labels(criterion=True)\n", "plot.graph_history(plot_data=plot_data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Backward Selection, Custom Initial Model\n", "Backward selection is specified by changing the algorithm from `Method.FORWARD` to `Method.BACKWARD` in the `select()` call.\n", "\n", "A custom initial model is specified with the optional `predecessor_model` argument of `select()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from petab_select import Model\n", "\n", "# The PEtab Select problem contains information about calibrated models,\n", "# so that they are not recalibrated. In order to do a completely new\n", "# model selection, one can reload the original PEtab Select problem.\n", "petab_select_problem = petab_select.Problem.from_yaml(\n", " \"model_selection/petab_select_problem.yaml\"\n", ")\n", "pypesto_select_problem_2 = pypesto.select.Problem(\n", " petab_select_problem=petab_select_problem\n", ")\n", "\n", "model_subspace_petab_yaml = \"model_selection/example_modelSelection.yaml\"\n", "initial_model = Model(\n", " model_id=\"myModel\",\n", " model_subspace_petab_yaml=model_subspace_petab_yaml,\n", " model_subspace_id=\"dummy_myModel\",\n", " model_subspace_indices=[0, 0, 0],\n", " parameters={\n", " \"k1\": 0.1,\n", " \"k2\": ESTIMATE,\n", " \"k3\": ESTIMATE,\n", " },\n", " criteria={petab_select_problem.criterion: np.inf},\n", ")\n", "\n", "print(\"Initial model:\")\n", "print(initial_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pypesto_select_problem_2.select(\n", " method=Method.BACKWARD,\n", " criterion=Criterion.AIC,\n", " predecessor_model=initial_model,\n", " model_problem_options=model_problem_options,\n", ");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data = plot.PlotData(\n", " models=petab_select_problem.state.models,\n", " criterion=Criterion.AIC,\n", ")\n", "plot_data.augment_labels(criterion=True)\n", "plot.graph_history(plot_data=plot_data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Options\n", "There exist additional options that can be used to further customise selection algorithms.\n", "\n", "### Select First Improvement\n", "At each selection step, as soon as a model that improves on the previous model is encountered (by the specified criterion), it is selected and immediately used as the previous model in the next iteration of the selection. This is unlike the default behaviour, where all test models at each iteration are optimized, and the best of these is selected.\n", "\n", "### Use Previous Maximum Likelihood Estimate as Startpoint\n", "The maximum likelihood estimate from the previous model is used as one of the startpoints in the multistart optimization of the test models. The default behaviour is that all startpoints are automatically generated by pyPESTO.\n", "\n", "### Minimize Options\n", "Optimization can be customised with a dictionary that specifies values for the corresponding keyword arguments of [minimize](https://github.com/ICB-DCM/pyPESTO/blob/master/pypesto/optimize/optimize.py).\n", "\n", "### Criterion Options\n", "Currently implemented options are: `Criterion.AIC` ([Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion)), `Criterion.AICC` ([corrected AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion)), and `Criterion.BIC` ([Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion)).\n", "\n", "#### Criterion Threshold\n", "A threshold can be specified, such that only models that improve on previous models by the threshold amount in the chosen criterion are accepted." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "petab_select_problem = petab_select.Problem.from_yaml(\n", " \"model_selection/petab_select_problem.yaml\"\n", ")\n", "pypesto_select_problem_3 = pypesto.select.Problem(\n", " petab_select_problem=petab_select_problem\n", ")\n", "best_models = pypesto_select_problem_3.select_to_completion(\n", " method=Method.FORWARD,\n", " criterion=Criterion.BIC,\n", " select_first_improvement=True,\n", " startpoint_latest_mle=True,\n", " model_problem_options=model_problem_options,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data = plot.PlotData(\n", " petab_select_problem.state.models,\n", " criterion=Criterion.BIC,\n", ")\n", "plot.line_best_by_iteration(plot_data=plot_data);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data.relative_criterion = False\n", "plot_data.augment_labels(criterion=True)\n", "plot.graph_history(plot_data=plot_data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Postprocessors\n", "\n", "You can optionally provide postprocessing methods that operate on calibrated models. For example, there are postprocessors to save the pyPESTO result to disk, or save a summarized report of models to a TSV file. You can also write your own method, which should take only a single `pypesto.select.ModelProblem` object as input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Repeat with AICc and criterion_threshold == 10\n", "petab_select_problem = petab_select.Problem.from_yaml(\n", " \"model_selection/petab_select_problem.yaml\"\n", ")\n", "pypesto_select_problem_4 = pypesto.select.Problem(\n", " petab_select_problem=petab_select_problem\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setup a postprocessor that saves the pyPESTO result objects to disk, as well as a summarized report.\n", "import shutil\n", "from functools import partial\n", "from pathlib import Path\n", "\n", "from pypesto.select.postprocessors import (\n", " multi_postprocessor,\n", " report_postprocessor,\n", " save_postprocessor,\n", ")\n", "\n", "output_path = Path() / \"output_select\"\n", "if output_path.exists():\n", " shutil.rmtree(str(output_path))\n", "output_path.mkdir(exist_ok=True, parents=True)\n", "model_postprocessor = partial(\n", " multi_postprocessor,\n", " postprocessors=[\n", " partial(save_postprocessor, output_path=output_path),\n", " partial(\n", " report_postprocessor, output_filepath=output_path / \"report.tsv\"\n", " ),\n", " ],\n", ")\n", "\n", "best_models = pypesto_select_problem_4.select_to_completion(\n", " method=Method.FORWARD,\n", " criterion=Criterion.AICC,\n", " select_first_improvement=True,\n", " startpoint_latest_mle=True,\n", " model_problem_options=model_problem_options,\n", " criterion_threshold=10,\n", " model_postprocessor=model_postprocessor,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(output_path / \"report.tsv\", sep=\"\\t\")\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data = plot.PlotData(\n", " models=petab_select_problem.state.models,\n", " criterion=Criterion.AICC,\n", ")\n", "plot.line_best_by_iteration(plot_data=plot_data);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data.relative_criterion = False\n", "plot_data.augment_labels(criterion=True)\n", "plot.graph_history(plot_data=plot_data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving results\n", "\n", "Although pyPESTO results and summaries can be saved to disk via the postprocessors, most analysis methods will expect a `petab_select.Models` object, which is created during model selection and can also be saved to disk for later analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "petab_select_problem.state.models.to_yaml(output_path / \"my_models.yaml\")\n", "loaded_models = petab_select.Models.from_yaml(output_path / \"my_models.yaml\")\n", "\n", "plot_data = plot.PlotData(\n", " models=loaded_models, criterion=Criterion.AICC, relative_criterion=False\n", ")\n", "plot_data.augment_labels(criterion=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional plotting methods\n", "\n", "There are additional plotting methods provided by the PEtab Select library, which are demonstrated in another notebook: https://petab-select.readthedocs.io/en/stable/examples/visualization.html\n", "\n", "An example is this ordering of models according to the iteration in which they were calibrated. N.B. The predecessor of M1_7 is actually M1_1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot.graph_iteration_layers(\n", " plot_data=plot_data,\n", " draw_networkx_kwargs={\n", " \"arrowstyle\": \"-|>\",\n", " \"node_shape\": \"s\",\n", " \"node_size\": 4000,\n", " \"edgecolors\": \"k\",\n", " },\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multistart\n", "Multiple model selections can be run by specifying multiple initial models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "petab_select_problem = petab_select.Problem.from_yaml(\n", " \"model_selection/petab_select_problem.yaml\"\n", ")\n", "pypesto_select_problem_5 = pypesto.select.Problem(\n", " petab_select_problem=petab_select_problem\n", ")\n", "\n", "initial_model_1 = Model(\n", " model_id=\"myModel1\",\n", " model_subspace_petab_yaml=model_subspace_petab_yaml,\n", " model_subspace_id=\"dummy_myModel1\",\n", " model_subspace_indices=[0, 0, 0],\n", " parameters={\n", " \"k1\": 0,\n", " \"k2\": 0,\n", " \"k3\": 0,\n", " },\n", " criteria={petab_select_problem.criterion: np.inf},\n", ")\n", "\n", "initial_model_2 = Model(\n", " model_id=\"myModel2\",\n", " model_subspace_petab_yaml=model_subspace_petab_yaml,\n", " model_subspace_id=\"dummy_myModel2\",\n", " model_subspace_indices=[0, 0, 0],\n", " parameters={\n", " \"k1\": ESTIMATE,\n", " \"k2\": ESTIMATE,\n", " \"k3\": 0,\n", " },\n", " criteria={petab_select_problem.criterion: np.inf},\n", ")\n", "\n", "initial_models = [initial_model_1, initial_model_2]\n", "best_model, best_models = pypesto_select_problem_5.multistart_select(\n", " method=Method.FORWARD,\n", " criterion=Criterion.AIC,\n", " predecessor_models=initial_models,\n", " model_problem_options=model_problem_options,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_data = plot.PlotData(\n", " models=petab_select_problem.state.models,\n", " criterion=Criterion.AIC,\n", " relative_criterion=False,\n", ")\n", "plot.graph_history(plot_data);" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "vscode": { "interpreter": { "hash": "44a9cdcbdccbf05a880e90d2e6fe72470baab4d1b82472d890be0596ed887a6b" } } }, "nbformat": 4, "nbformat_minor": 4 }