StrainDesign Developer’s Guide

Reference paper: Schneider P., Bekiaris P. S., von Kamp A., Klamt S. — StrainDesign: a comprehensive Python package for computational design of metabolic networks. Bioinformatics, btac632 (2022). DOI: 10.1093/bioinformatics/btac632


Table of Contents

  1. Architecture Overview

  2. Repository Structure

  3. Public API & Entry Points

  4. Preprocessing Pipeline

    • 4.1 Model Cleaning

    • 4.2 GPR Integration

    • 4.3 Network Compression

    • 4.4 Regulatory Interventions

  5. MILP Construction

    • 5.1 Overview & z-map Matrices

    • 5.2 Primal LP from COBRApy Model

    • 5.3 SUPPRESS Module — Farkas’ Lemma Dual

    • 5.4 PROTECT Module

    • 5.5 OPTKNOCK — LP Duality for Bilevel Optimization

    • 5.6 ROBUSTKNOCK — Three-Level Duality

    • 5.7 OPTCOUPLE — Growth Coupling Potential

    • 5.8 Linking Binary Variables: link_z

    • 5.9 Big-M Bounding

    • 5.10 Indicator Constraints vs. Big-M

  6. Solver Backends

    • 6.1 MILP_LP Unified Interface

    • 6.2 CPLEX

    • 6.3 Gurobi

    • 6.4 SCIP

    • 6.5 GLPK

  7. Computation Approaches — SDMILP

    • 7.1 ANY

    • 7.2 BEST

    • 7.3 POPULATE

    • 7.4 Exclusion Constraints

    • 7.5 Solution Verification

  8. Post-processing & Result Container

    • 8.1 Solution Decompression

    • 8.2 GPR Translation

    • 8.3 SDSolutions

  9. Constraint & Expression Parsing

  10. Supporting Utilities

  11. Known Issues, Urgent Actions & Future Work

  12. Testing


1. Architecture Overview

StrainDesign solves metabolic strain design problems: given a genome-scale metabolic model (COBRApy Model) and a set of biological objectives, find the minimal sets of gene/reaction knockouts (and additions) that force the network into a desired flux state.

All methods reduce to Mixed-Integer Linear Programs (MILP). The key intellectual contribution is the translation of nested/bilevel biological objectives into a flat MILP via LP duality and Farkas’ lemma, so a single MILP call suffices to find strain designs. The pipeline is:

User (model + SDModules)
        │
        ▼
   Preprocessing        networktools.py, compression.py, lptools.py
   ─ FVA cleanup
   ─ GPR extension
   ─ Network compression
        │
        ▼
   MILP Construction    strainDesignProblem.py  (SDProblem)
   ─ build_primal_from_cbm
   ─ farkas_dualize / LP_dualize
   ─ link_z (bounding + indicator constraints)
        │
        ▼
   Solver Execution     strainDesignMILP.py  (SDMILP)
   ─ ANY / BEST / POPULATE loops
   ─ verify_sd
   ─ exclusion constraints
        │
        ▼
   Post-processing      compute_strain_designs.py, strainDesignSolutions.py
   ─ expand_sd (decompress)
   ─ GPR translation
   ─ SDSolutions

All methods (MCS/suppress/protect, OptKnock, RobustKnock, OptCouple) use the same pipeline — only how addModule builds the LP blocks differs.


2. Repository Structure

straindesign/
├── __init__.py                     # Package exports & avail_solvers detection
├── names.py                        # All string constants (solver names, module types, etc.)
│
├── compute_strain_designs.py       # Main user-facing orchestration function
├── strainDesignModule.py           # SDModule: problem specification (a dict subclass)
├── strainDesignProblem.py          # SDProblem: translates model+modules → MILP matrices
├── strainDesignMILP.py             # SDMILP: solves the MILP, manages solution loop
├── strainDesignSolutions.py        # SDSolutions: result container, GPR translation
│
├── compression.py                  # Network compression (RREF, RationalMatrix, nullspace)
├── networktools.py                 # GPR extension, regulatory extension, compress wrappers
├── lptools.py                      # FVA, flux space plotting, solver selection
├── parse_constr.py                 # Constraint/expression string → matrix conversion
├── indicatorConstraints.py         # IndicatorConstraints data class
├── solver_interface.py             # MILP_LP factory: instantiates the right solver
├── cplex_interface.py              # CPLEX backend (Cplex_MILP_LP)
├── gurobi_interface.py             # Gurobi backend (Gurobi_MILP_LP)
├── scip_interface.py               # SCIP backend (SCIP_MILP_LP)
├── glpk_interface.py               # GLPK backend (GLPK_MILP_LP)
├── efmtool_cmp_interface.py        # EFMtool JAR interface (legacy compression backend)
├── pool.py                         # SDPool: Windows-compatible multiprocessing pool
└── efmtool.jar                     # Bundled EFMtool binary

Tests are in:

tests/
└── *.py   (pytest test files, one per feature area)

Configuration / packaging:

setup.py           # python_requires=">=3.7"; classifiers include 3.9–3.13
conda-recipe/meta.yaml
.github/workflows/CI-test.yml

3. Public API & Entry Points

Primary function

from straindesign import compute_strain_designs, SDModule

sols = compute_strain_designs(
    model,
    sd_modules   = [sd_module1, sd_module2],
    solver       = 'gurobi',      # 'cplex' | 'gurobi' | 'scip' | 'glpk'
    max_cost     = 3,
    max_solutions= 10,
    solution_approach = 'best',   # 'any' | 'best' | 'populate'
    ko_cost      = {'r1': 1, 'r2': 2},   # reaction knockouts
    ki_cost      = {'r3': 1},             # reaction additions
    gko_cost     = {'geneA': 1},          # gene knockouts
    gki_cost     = {'geneB': 1},          # gene additions
    time_limit   = 3600,
)

SDModule specification

# MCS-style: suppress undesired subspace, protect desired one
SDModule(model, 'suppress', constraints='EX_product_e >= 0.1')
SDModule(model, 'protect',  constraints='BIOMASS >= 0.05')

# With inner optimization (force to inner optimum first)
SDModule(model, 'suppress',
         constraints='EX_product_e >= 0.1',
         inner_objective='BIOMASS', inner_opt_sense='maximize')

# OptKnock
SDModule(model, 'optknock',
         inner_objective='BIOMASS',
         outer_objective='EX_product_e')

# RobustKnock
SDModule(model, 'robustknock',
         inner_objective='BIOMASS',
         outer_objective='EX_product_e')

# OptCouple
SDModule(model, 'optcouple',
         inner_objective='BIOMASS',
         prod_id='EX_product_e',
         min_gcp=0.1)

Result access

sols.status           # 'optimal', 'infeasible', 'time_limit', ...
sols.reaction_sd      # list of dicts: {reac_id: -1 (KO) | +1 (KI) | 0 (not added KI)}
sols.gene_sd          # list of dicts: gene-level interventions (if gene_kos=True)
sols.get_strain_designs()          # same as reaction_sd
sols.get_gene_reac_sd_assoc()      # (reac_sd, assoc, gene_sd) with GPR linkage
sols.get_strain_design_costs()     # list of floats
sols.get_reaction_sd_bnds()        # intervention expressed as flux bounds

Lower-level access

from straindesign import SDProblem, SDMILP, MILP_LP
from straindesign import fva, select_solver
from straindesign.compression import compress_cobra_model

4. Preprocessing Pipeline

compute_strain_designs (compute_strain_designs.py) orchestrates all preprocessing before constructing the MILP.

4.1 Model Cleaning

Functions: remove_ext_mets, remove_dummy_bounds, bound_blocked_or_irrevers_fva (all in networktools.py, which re-exports them from compression.py).

Steps:

  1. Deep-copy the model so the user’s model is never mutated.

  2. Remove external metabolites — metabolites with no stoichiometric coupling to the rest of the network (boundary metabolites already handled by COBRApy exchange reactions, but orphaned ones are pruned).

  3. Remove dummy bounds — COBRApy uses ±1000 as a proxy for ±∞. Anything above a bound_thres (from cobra.Configuration) is replaced with ±inf so big-M bounding works correctly.

  4. FVA (fva from lptools.py) — flux variability analysis identifies:

    • Blocked reactions (FVA min = FVA max = 0): forced to zero, removed from intervention candidates.

    • Irreversible reactions (FVA min ≥ 0 or FVA max ≤ 0): bounds tightened.

    • Essential reactions (cannot be zero): excluded from knockout candidates, which helps compression and reduces MILP size.

FVA is run in parallel using SDPool for large models.

4.2 GPR Integration

File: networktools.py Key functions: remove_irrelevant_genes, extend_model_gpr

When gene-level interventions are requested (gko_cost or gki_cost provided), the model is extended with artificial reactions that represent gene-level knockouts. This allows the MILP’s binary variables z to operate on genes rather than reactions.

Step 1 — Remove irrelevant genes (remove_irrelevant_genes)

Uses AST-based GPR evaluation (evaluate_gpr_ast):

  • Each reaction’s GPR rule is parsed to an AST (ast.parse).

  • Genes are removed if their knockout/addition cannot affect any non-blocked, non-essential reaction.

  • This is determined by evaluating the DNF/CNF GPR expression with gene states {True, False, None (undetermined)}.

  • Reduces the gene intervention space before extension.

Step 2 — Extend model with artificial gene reactions (extend_model_gpr)

For each gene/isoenzyme group that participates in the GPR rules, artificial reactions and metabolites are added to encode the boolean logic:

  • AND gate: If a reaction requires g1 AND g2, an artificial metabolite and two artificial reactions are added. Knocking out either gene removes the pathway.

  • OR gate: If g1 OR g2, parallel paths are created. Both must be knocked out to abolish the reaction.

The function returns a reac_map dict mapping original reactions to the new artificial ones in the extended model. This map is used later to:

  1. Translate gko_cost/gki_cost into reaction costs on the extended model.

  2. Expand computed solutions back to gene-level interventions.

Important: Gene names starting with digits are prefixed with 'g' (cobra.manipulation.rename_genes) because Python AST cannot parse identifiers starting with digits.

Step 3 — GPR solution translation (SDSolutions.__init__)

After computation, the artificial reactions in solutions are translated back to gene-level interventions using gpr_eval. This evaluates each solution’s intervention set against the DNF GPR rules to determine which reactions are truly knocked out (or added).

gpr_eval(cj_terms, interv):

  • cj_terms: the GPR rule as a list of “conjunctive clauses” (AND-groups within OR expression).

  • interv: current gene intervention state.

  • Returns True/False/nan (active/inactive/undetermined).

4.3 Network Compression

File: compression.py Orchestrated by: networktools.compress_model

Network compression reduces model size while preserving the strain design problem. It is lossless — all solutions in the compressed space correspond to solutions in the original, and vice versa (via the cmp_mapReac mapping).

RationalMatrix

The core data structure for numerically exact RREF computation. It uses sparse dual-integer storage (numerator matrix + denominator matrix, both scipy.sparse.csr_matrix). Operations are performed in rational arithmetic using fractions.Fraction and sympy.Rational.

Key operations: row operations, scaling, finding pivots, extracting nullspace.

Floating-point coefficients are converted via float_to_rational with configurable precision (detect_max_precision scans the model’s stoichiometry).

Compression Algorithm (StoichMatrixCompressor)

The main class StoichMatrixCompressor performs the following:

  1. Remove conservation relations (remove_conservation_relations): Compute the left nullspace of the stoichiometric matrix S over the rationals (homogeneous dependencies among metabolite rows). Rows that are linearly dependent are removed. This is the RREF step.

  2. Remove blocked reactions (remove_blocked_reactions): Reactions with zero flux under any feasible steady state are removed. Uses FVA for reliable detection.

  3. Detect coupled reactions (compress_model_coupled): Two reactions r_i and r_j are coupled if there exists a scalar α such that in every feasible steady state, v_i = α * v_j. These can be lumped into one reaction. Coupling is detected by computing the right nullspace and identifying proportional columns. Lumpable reactions are merged and their costs combined.

  4. Parallel reactions (compress_model_parallel): Reactions with identical stoichiometry (same column in S) are merged. The resulting “super-reaction” carries the combined cost, and the z-variable controls all of them simultaneously.

Compression Result

cmp_mapReac: maps each compressed reaction to a list of original reactions it represents (with scaling factors).

This is used by:

  • compress_modules(sd_modules, cmp_mapReac) — rewrites constraints/objectives in terms of compressed reaction IDs.

  • compress_ki_ko_cost(cmp_ko_cost, cmp_ki_cost, cmp_mapReac) — maps intervention costs.

  • expand_sd(sd, cmp_mapReac) — after solving, maps binary solutions back to original reactions.

Backend Selection

The compression backend is chosen by the backend parameter:

  • 'sparse_rref' (default): Pure Python, uses RationalMatrix for exact arithmetic. No Java dependency.

  • 'efmtool_rref' (legacy): Uses the bundled efmtool.jar via JPype for RREF. Requires pip install straindesign[java]. Available via compress_model_efmtool.

4.4 Regulatory Interventions

Function: extend_model_regulatory (networktools.py)

Regulatory interventions represent changes that activate or silence a pathway without directly knocking out a gene. They are encoded similarly to gene interventions: artificial reactions are added to the model, and their “presence” or “absence” is controlled by binary z-variables.

After solving, postprocess_reg_sd in compute_strain_designs.py converts regulatory intervention values from numeric (1/0) to boolean (True/False) in the solution dict.


5. MILP Construction

File: strainDesignProblem.py Classes: SDProblem, ContMILP Key functions: build_primal_from_cbm, LP_dualize, farkas_dualize, reassign_lb_ub_from_ineq, prevent_boundary_knockouts, link_z

5.1 Overview & z-map Matrices

The MILP has the form:

minimize   c' * x
subject to:
  A_ineq * x  <=  b_ineq
  A_eq   * x   =  b_eq
  lb  <=  x  <=  ub
  x[0..num_z-1]  ∈ {0,1}         (binary z-variables: interventions)
  x[num_z..]     ∈ ℝ             (continuous: flux variables + dual variables)
  indicator constraints or big-M

The first num_z variables are binary: z_i = 1 means reaction i is knocked out (or added, if z_inverted[i]).

Three z-map matrices track which z-variables affect which parts of the LP:

  • z_map_vars[i, j]: z-variable i controls continuous variable j (knockout/addition of that reaction flux).

  • z_map_constr_ineq[i, j]: z-variable i controls inequality constraint j.

  • z_map_constr_eq[i, j]: z-variable i controls equality constraint j.

A value of +1 means “z=1 removes this constraint/variable” (knockout); -1 means “z=1 adds this” (knock-in).

These maps are propagated through all dualization steps (see below) and ultimately used in link_z to wire up big-M coefficients or indicator constraints.

MILP header rows (fixed positions):

  • Row 0 (idx_row_maxcost): −cost' z 0 (ensures non-negative cost)

  • Row 1 (idx_row_mincost): cost' z max_cost

  • Row 2 (idx_row_obj): objective constraint (used by fixObjective)

5.2 Primal LP from COBRApy Model

Function: build_primal_from_cbm(model, V_ineq, v_ineq, V_eq, v_eq, c=None)

Constructs the standard LP from a COBRApy model:

A_eq  * v  =  b_eq    (stoichiometric matrix S stacked with V_eq)
A_ineq* v  <= b_ineq  (V_ineq constraints, e.g. flux ratio constraints)
lb <= v <= ub         (reaction bounds from model)
  • A_eq = [S; V_eq], b_eq = [0; v_eq]

  • z_map_vars = I (identity — each z directly maps to its reaction variable)

  • z_map_constr_ineq = 0, z_map_constr_eq = 0 (in the primal, knockouts affect variables, not constraints)

prevent_boundary_knockouts: If a variable has lb < 0 or ub > 0, knocking it out to zero requires converting these non-zero bounds into inequality constraints. This prevents situations where a “zero flux” constraint on a reaction with lb = −1000 would be applied incorrectly.

5.3 SUPPRESS Module — Farkas’ Lemma Dual

Function: farkas_dualize(A_ineq_p, b_ineq_p, A_eq_p, b_eq_p, lb_p, ub_p, ...)

The SUPPRESS module wants to make a certain flux subspace infeasible. Farkas’ lemma provides a certificate of infeasibility:

Farkas’ Lemma: The system {A_ineq x b_ineq, A_eq x = b_eq, lb x ub} is infeasible if and only if there exist multipliers y such that:

A_ineq' y_ineq + A_eq' y_eq = 0   (dual feasibility)
b_ineq' y_ineq + b_eq' y_eq < 0   (certificate: b'y < 0)
y_ineq ≥ 0

The Farkas dual is constructed by calling LP_dualize with zero objective (c = 0), yielding dual variables y. Then the constraint c_dual' y −1 is appended (equivalently: b_ineq' y_ineq + b_eq' y_eq −1), which enforces the Farkas infeasibility certificate.

When an intervention (z) removes a primal constraint or variable, the corresponding dual variable is also removed — this is the mechanism by which knockouts “activate” the SUPPRESS condition. When the dual system becomes feasible with the −1 constraint, the corresponding primal flux subspace is infeasible, i.e., the targeted flux state is suppressed.

Known limitation: For the special case A x = b, x ℝ, b 0 (homogeneous equality with unrestricted variable), the Farkas certificate requires b'y 0 rather than < 0. This edge case is not currently implemented (see section 11).

5.4 PROTECT Module

The PROTECT module keeps a flux subspace feasible. It uses the primal LP directly (build_primal_from_cbm) without dualization, then calls reassign_lb_ub_from_ineq to tighten bounds. As long as the primal system is feasible, the protected state is reachable. The objective is set to zero (feasibility only).

5.5 OPTKNOCK — LP Duality for Bilevel Optimization

OptKnock solves:

maximize  outer_objective(v)
s.t.      v ∈ argmax { inner_objective(v) : v feasible }

This bilevel program is converted to a single-level MILP using LP duality:

LP Duality Theorem: For a minimization LP {min c'x : Ax b, x 0}, the dual is {max b'y : A'y c, y 0}. At optimality, primal and dual objectives are equal: c'x* = b'y*.

LP_dualize(A_ineq_p, b_ineq_p, A_eq_p, b_eq_p, lb_p, ub_p, c_p, ...)

Constructs the dual LP. The primal constraints become dual variables and vice versa:

Primal

Dual

x 0

y c_i (inequality constraint)

x

y = c_i (equality constraint)

x 0

y c_i (flipped inequality)

A_ineq row

y_ineq 0

A_eq row

y_eq

Objective b'y

max min, b becomes new objective

The OptKnock MILP block is assembled as:

# Build: [primal (with constraints) | dual (without constraints)]
# Connect: c_inner_primal == c_inner_dual (strong duality)
A_eq_combined = block_diag(A_eq_v, A_eq_dual) stacked with
                hstack(c_primal, c_dual_objective) == 0

The outer objective is set as the MILP objective. The strong duality equality guarantees that v is at the inner optimum.

5.6 ROBUSTKNOCK — Three-Level Duality

RobustKnock solves a min-max problem:

maximize  min_v { outer(v) : v ∈ argmax inner(v) }

This requires two applications of LP duality:

  1. First, the inner bilevel problem {inner(v) at max} is dualized into primal+dual pair (like OptKnock).

  2. The combined primal+dual system is treated as a new primal, and dualized again to convert the outer minimization to a maximization.

The result is a three-layer LP construction assembled into a single MILP block. The outer objective (minimized production, maximized over worst case) becomes the MILP objective.

This is the most complex module and the most sensitive to numerical issues.

5.7 OPTCOUPLE — Growth Coupling Potential

OptCouple maximizes the Growth Coupling Potential (GCP):

GCP = max_v { inner(v) } − max_v { inner(v) | prod = 0 }

The two terms are the max growth with and without production. The objective is their difference.

Construction:

  1. Build primal for the “with production” LP → (A_ineq_p, ..., c_p).

  2. Build primal for the “baseline” LP (no production constraint) → (A_ineq_b, ..., c_b).

  3. Dualize the baseline LP → c_dual_b (dual objective coefficients).

  4. Assemble combined block with strong duality for both LPs.

  5. Final objective: c_p - c_b (growth with production minus baseline).

min_gcp parameter sets a lower bound on GCP in the MILP.

5.9 Big-M Bounding

The quality of big-M values is critical:

  • Too small → valid solutions may be cut off (incorrect results).

  • Too large → numerical instability, especially for GLPK.

StrainDesign computes per-constraint tight bounds via LP. The LP maximizes A[i] x subject to the static (non-knockable) constraints, giving the tightest valid M_i for each knockable constraint.

For GLPK (which has no indicator constraint support), M defaults to the COBRApy bound threshold (default: 1000). This is conservative and may cause numerical issues on models with large flux ranges. Setting M manually via the M parameter allows tuning.

Round-up: computed Ms are ceiled to 5 significant decimal digits.

5.10 Indicator Constraints vs. Big-M

Feature

Indicator Constraints

Big-M

Solver support

CPLEX, Gurobi, SCIP

All (required for GLPK)

Numerical stability

Excellent (no large coefficients)

Can be poor with large M

Solver speed

Typically faster (native branching)

Depends on M tightness

When used

Unbounded constraints (M = ∞)

Bounded constraints (finite M)

Forced by user

M=some_value forces big-M everywhere

Default for GLPK

The IndicatorConstraints class (indicatorConstraints.py) stores:

  • binv: indices of binary variables that trigger each constraint

  • A: coefficient matrix for the triggered constraints

  • b: RHS

  • sense: 'L' (≤), 'E' (=), 'G' (≥)

  • indicval: which value of z (0 or 1) activates the constraint

Each solver backend translates IndicatorConstraints into its native format. For GLPK, the solver interface translates them to big-M constraints using the provided M.

Sign convention in z-maps (subtle!):

  • z_map_constr_ineq[i, j] = +1: z_i = 1 → constraint j removed (knockout)

  • z_map_constr_ineq[i, j] = −1: z_i = 1 → constraint j active (knock-in)

This sign is inverted when KIs vs. KOs are assigned (in z_kos_kis diagonal matrix at end of SDProblem.__init__), so the final indicator constraint indicval correctly maps to 0 (KO: constraint active when z=0) or 1 (KI: constraint active when z=1).


6. Solver Backends

6.1 MILP_LP Unified Interface

File: solver_interface.py Class: MILP_LP

MILP_LP is a factory/dispatcher: its __init__ detects the requested solver and instantiates the appropriate backend class (Cplex_MILP_LP, Gurobi_MILP_LP, etc.). All backends expose the same interface:

class MILP_LP:
    solve()                       # → (x, opt, status)
    slim_solve()                  # → float (objective value only, fast)
    populate(n)                   # → (x_list, opt, status)  [CPLEX/Gurobi only]
    set_objective(c)
    set_ub(ub_list)               # [(idx, val), ...]
    set_time_limit(t)
    add_ineq_constraints(A, b)
    add_eq_constraints(A, b)
    set_ineq_constraint(row, c, b) # Overwrite a specific row
    clear_objective()
    set_objective_idx(idx_val_list)

slim_solve() is used heavily for verification — it runs the LP and returns just the objective value (or nan for infeasible). This avoids extracting the full solution vector.

6.2 CPLEX

File: cplex_interface.py Class: Cplex_MILP_LP

  • Native indicator constraint support (cplex.indicator_constraints.add)

  • Solution pool (populate_solution_pool) for enumerating multiple solutions

  • Thread control via cplex.parameters.threads

  • Seed via cplex.parameters.randomseed

  • Status mapped from CPLEX codes to StrainDesign names (OPTIMAL, INFEASIBLE, etc.)

  • Advantages: Best solution pool support, fastest for large MILPs in practice, population-based enumeration.

  • Limitation: Licensed (academic/commercial). No indicator constraints → automatic big-M fallback.

6.3 Gurobi

File: gurobi_interface.py Class: Gurobi_MILP_LP

  • Native indicator constraint support (gp.Model.addGenConstrIndicator)

  • Solution pool via Gurobi SolCount / PoolSolutions

  • Seed via Seed parameter

  • Status mapped from Gurobi Status codes

  • Advantages: Often fastest solver, excellent solution pool. MIPFocus parameter available for tuning.

  • Note: Requires a Gurobi license (free for academic use).

6.4 SCIP

File: scip_interface.py Class: SCIP_MILP_LP

  • Indicator constraints via pyscipopt addConsIndicator

  • No native solution pool → falls back to iterative compute_optimal

  • Seed via setSeed

  • Advantages: Open-source, no license needed, handles large models well.

  • Limitation: No populate support; POPULATE mode falls back to BEST.

6.5 GLPK

File: glpk_interface.py Class: GLPK_MILP_LP

  • No indicator constraint support — all indicator constraints are translated to big-M at construction time.

  • Uses swiglpk (bundled with COBRApy).

  • No solution pool → iterative enumeration only.

  • Advantages: Always available without additional installation.

  • Limitation: Poorest performance on large MILPs. Sensitive to M value choice. Default M = 1000 (COBRApy bound threshold).


7. Computation Approaches — SDMILP

File: strainDesignMILP.py Class: SDMILP (inherits from SDProblem and MILP_LP)

SDMILP.__init__ calls SDProblem.__init__ to build the MILP matrices, then MILP_LP.__init__ to set up the solver backend.

7.1 ANY

Method: SDMILP.compute(**kwargs)

Finds feasible solutions without enforcing global optimality:

  1. Solve MILP as-is (no cost-minimization objective, just feasibility).

  2. Verify each solution.

  3. Add exclusion constraint.

  4. Repeat until max_solutions or infeasibility.

For MCS problems, z_sum max_cost and z_sum 0 are the only cost constraints. Results may have suboptimal (higher) cardinality compared to BEST.

Warning: For OptKnock/RobustKnock, ANY may return the wild-type (z=0) immediately if that satisfies the outer objective. Use BEST for these methods.

7.2 BEST

Method: SDMILP.compute_optimal(**kwargs)

Finds globally optimal solutions:

  1. Solve MILP with full objective (minimize cost for MCS; maximize outer for OptKnock, etc.).

  2. Verify solution.

  3. Add exclusion constraint (excludes solution and supersets for MCS; only the exact solution for OptKnock via add_exclusion_constraints_ineq).

  4. Fix objective: fixObjective(c_bu, opt) — constrain next solution to have at least the same objective value.

  5. Among solutions with equal objective, minimize intervention cost: setMinIntvCostObjective().

  6. Repeat until max_solutions, infeasibility, or time limit.

For MCS computations, the MILP first checks if the wild-type already satisfies all modules (via verify_sd(sparse.csr_matrix((1, num_z)))).

7.3 POPULATE

Method: SDMILP.enumerate(**kwargs)

Uses the solver’s native solution pool (CPLEX/Gurobi):

  1. Call populateZ(max_solutions) to generate a pool.

  2. Verify all solutions in the pool.

  3. Add exclusion constraints for invalid solutions.

  4. Repeat.

Falls back to compute_optimal for SCIP and GLPK (which lack native pool support).

Gurobi pool notes: Pool quality/diversity can be tuned via Gurobi’s PoolSearchMode and PoolGap parameters, but this is not currently exposed in the interface.

7.4 Exclusion Constraints

Methods: add_exclusion_constraints(z), add_exclusion_constraints_ineq(z)

For a solution with binary vector z:

Case 1 — z is all zeros (wild-type): Add sum(z_i) 1 → forces at least one intervention next time. Implemented as 1 −1 (infeasible if all z = 0).

Case 2 — z has exactly one non-zero entry (single intervention): Set ub[idx] = 0 → permanently disable that intervention.

Case 3 — z has multiple non-zero entries: Integer cut: sum(z_i : i active) sum(z_i_current) 1 This excludes the exact solution and all its supersets (for MCS: if removing r1 and r2 works, we don’t want r1+r2+r3 as a “new” solution since it’s a non-minimal superset).

add_exclusion_constraints_ineq uses a different cut that excludes only the exact solution, not its supersets (used for OptKnock where we want all optima).

7.5 Solution Verification

Method: SDMILP.verify_sd(sols)

After each MILP solution, the binary vector z is verified by building and solving a continuous LP with the corresponding constraint/variable removal applied:

  1. From cont_MILP (saved before link_z): the original LP matrices with z-map information.

  2. For each z_i in the solution: determine which variables (z_map_vars) and constraints (z_map_constr_ineq, z_map_constr_eq) are deactivated.

  3. Build the restricted LP: remove deactivated constraints/variables.

  4. Solve via slim_solve().

  5. If slim_solve() returns nan → infeasible → solution is valid (for SUPPRESS) or invalid (for PROTECT).

This catches numerical artifacts where the MILP returns a slightly non-integer solution that passes rounding but does not actually satisfy the biological constraint.

cont_MILP is the continuous pre-link_z version of the problem. It preserves the structural relationship between z-variables and constraints/variables before big-M coefficients obscure it. This is why the continuous validation is numerically robust even if M values were imprecise.


8. Post-processing & Result Container

8.1 Solution Decompression

Function: expand_sd(sd, cmp_mapReac) in networktools.py

After solving on the compressed model, solutions refer to compressed reaction IDs. expand_sd maps each compressed intervention back to the set of original reactions it represents (using cmp_mapReac).

  • A single compressed KO may expand to multiple original reaction KOs (coupled/parallel reactions).

  • The expanded solution preserves the cost structure.

Function: filter_sd_maxcost(sd, max_cost, uncmp_ko_cost, uncmp_ki_cost) After expansion, the cost of a solution in original space may differ from compressed space (due to parallel reactions with non-unit costs). Solutions exceeding max_cost in original space are filtered out.

8.2 GPR Translation

SDSolutions translates gene interventions to reaction phenotypes using gpr_eval.

For each gene-level solution gene_sd:

  1. Collect all affected reactions from the model’s GPR rules.

  2. Evaluate each reaction’s GPR with the gene states using gpr_eval.

  3. Reactions whose GPR evaluates to False (given knockouts) are collected as reaction knockouts.

  4. Reactions requiring knocked-in genes that are True (given additions) are collected as reaction additions.

get_gene_reac_sd_assoc() returns three parallel lists:

  • reac_sd: reaction-level intervention dicts

  • assoc: association dict showing which genes caused which reaction change

  • gene_sd: the original gene interventions

8.3 SDSolutions

File: strainDesignSolutions.py

SDSolutions stores:

  • reaction_sd: list of dicts {reac_id: -1 | 0 | +1} for each solution

  • gene_sd: list of gene-level dicts (if gene interventions were used)

  • status: solver status string

  • sd_setup: full setup dict (model_id, modules, costs, solver, etc.) — can be serialized to JSON and reloaded

Serialization: sd_setup is JSON-compatible, allowing CNApy to save .sd files that can be reloaded into compute_strain_designs via the sd_setup dict parameter.


9. Constraint & Expression Parsing

File: parse_constr.py

Handles user-supplied constraints and objectives in multiple formats:

Input formats (all equivalent):

# String
"r1 + 3*r2 = 0.3, -5*r3 - r4 <= -0.5"

# List of strings
["r1 + 3*r2 = 0.3", "-5*r3 - r4 <= -0.5"]

# List of structured constraints
[[{'r1': 1.0, 'r2': 3.0}, '=', 0.3], [{'r3': -5.0, 'r4': -1.0}, '<=', -0.5]]

Key functions:

Function

Input → Output

parse_constraints(constr, reaction_ids)

Any format → [[dict, sense, rhs], ...]

parse_linexpr(expr, reaction_ids)

Any format → [dict, ...]

lineq2mat(equations, reaction_ids)

String list → (A_ineq, b_ineq, A_eq, b_eq) sparse matrices

lineqlist2mat(eqlist, reaction_ids)

Structured list → sparse matrices

linexpr2dict(expr, reaction_ids)

String → {reac_id: coeff}

linexprdict2mat(expr_dict, reaction_ids)

Dict → sparse row vector

The parse_constraints function uses regex splitting on ',' and '\n' to separate multiple constraints in a single string, then dispatches to lineq2list which uses further regex for coefficient/identifier/operator extraction.


10. Supporting Utilities

lptools.py

  • fva(model, variables, constraints, solver): Flux variability analysis. Parallelized via SDPool using fva_worker_init/fva_worker_compute. Returns a pandas.DataFrame with min/max flux values.

  • select_solver(solver, model): Selects solver based on availability, preference, and model settings. Priority: CPLEX > Gurobi > SCIP > GLPK.

  • plot_flux_space(model, ...): 2D/3D convex hull visualization using scipy.spatial.ConvexHull and Matplotlib.

  • DisableLogger: Context manager to suppress logging during FVA.

pool.py — SDPool

Windows-compatible multiprocessing.Pool subclass. On Windows, the initializer function is pickled to a temporary file and loaded by worker processes, avoiding a performance regression from passing the initializer directly (COBRApy issue #997). Uses 'spawn' context universally to avoid fork-related instability.

names.py

All string constants. Importing with from straindesign.names import * is used throughout the codebase. Constants include module type strings, solver names, status codes, and parameter keys. This single-file approach ensures consistent naming.

Notable: PROTECT = 'mcs_lin' and SUPPRESS = 'mcs_bilvl' are defined first (legacy names), then immediately overwritten with 'protect' and 'suppress'. This is vestigial code — the legacy names are not used and could be removed.

__init__.py

Detects available solvers at import time:

avail_solvers = set()
try: import cplex; avail_solvers.add('cplex')
try: import gurobipy; avail_solvers.add('gurobi')
try: import pyscipopt; avail_solvers.add('scip')
try: import swiglpk; avail_solvers.add('glpk')

Exports: all public classes, functions, and the avail_solvers set.


11. Known Issues, Urgent Actions & Future Work

11.1 Urgent Action Items

⚠ Farkas special case (homogeneous equality): In farkas_dualize, there is a documented comment noting the unimplemented edge case:

“In the case of (1) A x = b, (2) x ∈ ℝ, (3) b ≈ 0, Farkas’ lemma is special, because b'y 0 is required…”

This could cause silent incorrect results for SUPPRESS modules with equality constraints where b 0 (e.g., free reactions with zero RHS). Impact: depends on whether the compressed, preprocessed models trigger this case. A defensive check + explicit splitting into two inequalities would fix it.

⚠ Big-M for GLPK with large flux ranges: GLPK uses M = cobra_bound_threshold = 1000 by default. For models with non-standard bounds (e.g., ATPM = 8.39, but oxygen intake up to 10,000), this can produce incorrect strain designs. Models should always have remove_dummy_bounds applied, but verifying this is enforced is important.

add_exclusion_constraints_ineq has a bug:

A_ineq = [1.0 if z[j, i] else -1.0 for i in self.idx_z]
A_ineq.resize((1, self.A_ineq.shape[1]))  # ← list.resize doesn't exist

list.resize is a NumPy array method, not a list method. This would raise an AttributeError at runtime. This method is not used in the current main computation paths (OptKnock uses add_exclusion_constraints), but should be fixed. The fix: convert to np.array before calling .resize.

⚠ Legacy name double-assignment in names.py:

PROTECT = 'mcs_lin'
SUPPRESS = 'mcs_bilvl'
PROTECT = 'protect'      # overwrites silently
SUPPRESS = 'suppress'    # overwrites silently

The first assignments are dead code and should be removed to avoid confusion.

11.2 Low-Hanging Fruits

Solution space plotting issues (reported): lptools.plot_flux_space uses scipy.spatial.ConvexHull and Matplotlib. Known issues include:

  • 3D plots with degenerate hulls (lower-dimensional polytopes) cause QhullError.

  • Color scaling in plot_flux_space_colored may fail for single-point solutions.

  • Fix: add degeneracy checks and fallback to scatter plots for ≤ 2 unique points.

Populate mode for SCIP/GLPK: Currently silently falls back to BEST. Should log a clear warning and document that SCIP/GLPK do not support POPULATE.

Gurobi pool tuning: The populate method in gurobi_interface.py could expose PoolSearchMode=2 (focused enumeration) for better solution diversity. Currently uses default settings.

RobustKnock + POPULATE: RobustKnock with POPULATE is not well-tested. The three-level dual structure is particularly sensitive to solver pool heuristics. Add explicit test cases.

Parallel FVA compression for large models: For genome-scale models (> 2000 reactions), compression can be slow. The RREF step in RationalMatrix has O(n²) row operations. Potential improvement: use block-sparse RREF or exploit model structure (e.g., subsystems).

Compression for FVA bounds: Currently, FVA is run on the full model before compression. Running FVA after compression (on the smaller model) would be faster. The challenge is that some blocked reactions are only detectable post-compression.

Essential KI detection: essential_kis is computed but may not always be passed correctly through the compression/GPR pipeline. Verify that the set persists through compress_ki_ko_cost.

11.3 Future Development Directions

Nullspace-based primal/dual formalization: The referenced paper (DOI: 10.1093/bioinformatics/btz393, von Kamp & Klamt 2019) presents a nullspace-based approach to computing minimal cut sets. Integrating this formalization could:

  • Avoid explicit dualization for certain model types.

  • Give a more compact MILP representation.

  • Improve numerical conditioning.

A key idea: the primal flux space has a right nullspace (internal cycles). The Farkas dual operates in the left nullspace. Exploiting this duality directly, rather than constructing full dual matrices, could reduce MILP size significantly.

Variable splitting: Reversible reactions split into forward/reverse can sometimes improve MILP performance (tighter LP relaxation). This is a pre-processing step that should be optional and tested for impact on different model types.

Tighter big-M bounds via duality: Currently, per-constraint big-M values are computed by solving one LP per knockable constraint. This is parallelized but still expensive for large models. An alternative: use dual variable magnitudes from the full LP relaxation to derive component-wise M bounds analytically, without solving per-constraint LPs.

MILP warm-starting: When BEST generates multiple solutions iteratively, each call starts cold. Warm-starting from the previous solution’s LP relaxation (available in CPLEX/Gurobi) could speed up subsequent solves significantly.

Improving RobustKnock numerics: The three-level dual construction in ROBUSTKNOCK creates very large matrices with potential for numerical blow-up. Applying scaling (row/column normalization) to the LP before and after dualization could help. Alternatively, use iterative refinement at the inner LP level.

Integration of regulatory networks beyond simple T/F interventions: The current regulatory intervention model (active/inactive) is binary. Future work could integrate thermodynamic constraints or kinetic feasibility checks.

Improved CNApy integration: The sd_setup dict format is designed for CNApy interoperability. Ensuring full round-trip serialization (JSON → compute_strain_designs → JSON) with all parameter types (modules, costs, constraints) is an ongoing compatibility concern.


12. Testing

Tests are in tests/. They are run with:

pytest tests -v --log-cli-level=INFO --junit-xml=test-results.xml

CI matrix (.github/workflows/CI-test.yml):

  • OS: ubuntu-latest, windows-latest

  • Python: 3.10, 3.11, 3.12, 3.13

  • Package managers: pip, conda

  • CPLEX excluded for Python 3.13 (max supported: 3.12)

  • Segfault on Ubuntu (JPype JVM shutdown) handled via JUnit XML exit-code check

Key test areas to verify after changes:

  1. Constraint parsing — any change to parse_constr.py must be verified against all input formats.

  2. MILP construction — after changes to link_z, build_primal_from_cbm, or dualization functions: run with a small toy model and verify solutions against known correct answers.

  3. Compression — changes to compression.py or networktools.py must verify that expand_sd correctly reconstructs original-space solutions.

  4. Solver backends — changes to any *_interface.py file require testing with the specific solver installed.

  5. GPR translation — changes to networktools.extend_model_gpr or SDSolutions.gpr_eval must be tested with models that have AND/OR GPR logic (e.g., iJO1366 for E. coli).

Adding new tests:

  • Use a small toy model (3-5 reactions) for unit tests of MILP construction.

  • Use e_coli_core for integration tests of the full pipeline.

  • Use iJO1366 for performance regression tests (track solve time).

  • Always test with at least GLPK (always available) and one commercial solver if possible.

Performance profiling:

import cProfile
cProfile.run("compute_strain_designs(model, sd_modules=[...], solver='glpk')", 'profile_out')
import pstats
pstats.Stats('profile_out').sort_stats('cumulative').print_stats(30)

The hot spots are typically: link_z (LP bounding), fva (preprocessing), and the solver’s solve() loop.


Quick Reference: Data Flow Through Key Functions

compute_strain_designs(model, **kwargs)
    ├── remove_ext_mets(model)
    ├── remove_dummy_bounds(model)
    ├── bound_blocked_or_irrevers_fva(model, ...)
    ├── [if gene_kos] rename_genes(model, ...)
    ├── [if gene_kos] remove_irrelevant_genes(model, essential_reacs, gkis, gkos)
    ├── [if gene_kos] extend_model_gpr(cmp_model, ...) → reac_map
    ├── [if reg]      extend_model_regulatory(cmp_model, ...) → reg_map
    ├── compress_model(cmp_model, ...) → cmp_mapReac
    ├── compress_modules(sd_modules, cmp_mapReac)
    ├── compress_ki_ko_cost(ko_cost, ki_cost, cmp_mapReac)
    ├── SDMILP(cmp_model, sd_modules, ko_cost, ki_cost, solver, M, ...)
    │       ├── SDProblem.__init__(...)
    │       │       ├── [for each module] addModule(sd_module)
    │       │       │       ├── build_primal_from_cbm(model, V_ineq, v_ineq, ...)
    │       │       │       ├── [SUPPRESS] farkas_dualize(...)
    │       │       │       ├── [PROTECT]  reassign_lb_ub_from_ineq(...)
    │       │       │       ├── [OPTKNOCK] LP_dualize(...) + block_diag assembly
    │       │       │       ├── [ROBUSTKNOCK] LP_dualize x2
    │       │       │       └── [OPTCOUPLE] LP_dualize + GCP objective
    │       │       └── link_z()
    │       │               ├── split equalities → inequalities
    │       │               ├── variable KOs → inequality KOs
    │       │               ├── parallel LP bounding → big-M values
    │       │               ├── insert big-M into A_ineq
    │       │               ├── lump duplicate pairs → equalities
    │       │               └── remaining → IndicatorConstraints
    │       └── MILP_LP.__init__(c, A_ineq, ..., solver=solver)
    │               └── instantiate Cplex/Gurobi/SCIP/GLPK backend
    ├── [BEST]     sdmilp.compute_optimal(max_solutions, time_limit)
    ├── [ANY]      sdmilp.compute(max_solutions, time_limit)
    ├── [POPULATE] sdmilp.enumerate(max_solutions, time_limit)
    │       each iteration:
    │       ├── solveZ() / populateZ(n)
    │       ├── verify_sd(z) → valid[]
    │       └── add_exclusion_constraints(z)
    ├── expand_sd(sd, cmp_mapReac)
    ├── filter_sd_maxcost(sd, max_cost, ...)
    ├── postprocess_reg_sd(sd, ...)
    └── SDSolutions(model, sd, status, sd_setup)
            └── [if gene_kos] translate gene_sd → reaction_sd via gpr_eval