Navigating Experimental Failures: A Practical Guide to Robust Bayesian Optimization in Biomedical Research

Robert West | Dec 02, 2025



Abstract

Bayesian optimization (BO) is a powerful, sample-efficient method for guiding expensive experiments, but its real-world application is often hampered by a pervasive issue: experimental failures. These failures, arising from failed syntheses, unstable compounds, or equipment issues, create missing data that can derail standard BO. This article provides a comprehensive guide for researchers and drug development professionals on handling these unknown constraints and failures. We explore the foundational causes of failures in scientific domains, detail state-of-the-art methodological solutions like feasibility-aware acquisition functions and the 'floor padding trick,' and troubleshoot common pitfalls such as model misspecification and boundary oversampling. Through validation against real-world benchmarks from materials science and drug discovery, we demonstrate how robust BO strategies can accelerate the search for optimal conditions while safely navigating infeasible regions, ultimately enhancing the reliability and efficiency of autonomous experimentation in biomedical research.

The Inevitability of Failure: Understanding Unknown Constraints in Scientific Experimentation

In the application of Bayesian optimization (BO) to experimental science, researchers frequently encounter a critical roadblock: experimental failure. Unlike optimization in purely computational domains where every parameter combination yields a result, physical experiments can fail catastrophically, providing no useful data about the objective function. Within the context of BO, an experimental failure is specifically defined as an evaluation attempt for a parameter set x that does not yield a measurable objective function value y, preventing its use in updating the regression surrogate model [1] [2]. These failures arise from a priori unknown constraints—regions in parameter space that violate unmodeled physical, chemical, or technical limitations of the experimental system [2]. Handling these failures is not merely a technical inconvenience but a fundamental requirement for efficient autonomous experimentation, as they provide critical information about the boundaries of feasible parameter space.

Classification and Characteristics of Experimental Failures

Experimental failures in BO can be categorized through a formal taxonomy based on the nature of the constraint function, c(x), which defines the feasible region X ⊆ Ω where the objective function can be evaluated [2]. The most pertinent category for experimental sciences comprises unknown constraints, characterized by several key properties.

  • Non-Quantifiable: The experiment returns only binary information (success/failure) without indicating how close the parameters were to the feasibility boundary [2].
  • Unrelaxable: The constraint must be satisfied for an objective measurement to be obtained. A failed experiment means y is simply unavailable [2].
  • Simulation (or Measurement) Constraint: Evaluating the constraint incurs a non-negligible cost, similar to the objective function itself, as it requires executing the (often costly) experimental procedure [2].
  • Hidden: The constraint is not explicitly known to the researcher before commencing the optimization campaign [2].

Table 1: Properties of Common Experimental Failure Types in Scientific Applications

| Failure Mode | Constraint Type | Impact on Objective Evaluation | Example from Literature |
| --- | --- | --- | --- |
| Failed Synthesis/Reaction | Unknown, Unrelaxable | No property measurement possible | SrRuO3 thin film phase not formed during ML-MBE; molecule synthesis fails in drug discovery [1] [2]. |
| Equipment/Instrument Limitation | Unknown, Unrelaxable | Measurement cannot be performed or is invalid | Sensor fault in Organic Rankine Cycle systems; instrument sensitivity limits [2] [3]. |
| Material Property Violation | Unknown, Unrelaxable | Property measurement is precluded | Material is too fragile for characterization; insufficient photoluminescence for analysis [2]. |
| Safety/Operational Boundary | Unknown or Known, Unrelaxable | Experiment is aborted or produces dangerous outcome | Charge delivery curve in neuromodulation causing adverse effects; unstable process conditions [4] [2]. |

Quantifying the Impact: Failure Statistics and Optimization Performance

The prevalence of experimental failures significantly impacts the sample efficiency and success of BO campaigns. Data from real-world applications demonstrate that failures are not edge cases but common occurrences.

Table 2: Documented Experimental Failure Rates in Bayesian Optimization Studies

| Application Domain | Reported Failure Rate | Primary Cause of Failure | Impact on BO Efficiency |
| --- | --- | --- | --- |
| Materials Growth (SrRuO3) | Handled explicitly in algorithm | Target phase not formed | Addressed via "floor padding trick"; successful optimization in 35 runs [1]. |
| Polymer Compound Development | Implied by complex feasibility | Opposition between Young's Modulus and Impact Strength | Over-complication with expert knowledge initially impaired BO performance [5]. |
| Neuromodulation (Simulated) | Implied by safety boundaries | Parameter combinations near safety/charge limits | Standard BO prone to oversampling boundaries; required mitigation strategies [4]. |

Simulation studies further reveal performance degradation of standard BO algorithms as the proportion of infeasible space increases. Naive strategies, such as ignoring failure data or assigning a constant penalty, can lead to suboptimal performance, including excessive sampling of infeasible regions or convergence to local optima [1] [2]. The performance of failure-handling algorithms is often measured by the number of valid experiments required to find a feasible optimum and the best objective value achieved over the course of the optimization [2].

Detailed Experimental Protocols for Failure Handling

Protocol 1: The Floor Padding Trick with Gaussian Process BO

This protocol is designed for material growth and synthesis optimization where failures are common [1].

  • 1. Objective: To optimize an experimental objective (e.g., residual resistivity ratio) while handling synthesis failures that prevent measurement.
  • 2. Materials and Reagents:
    • Molecular Beam Epitaxy (MBE) system or analogous synthesis apparatus.
    • Precursor materials (e.g., Sr, Ru, O₂ for SrRuO3).
    • Characterization equipment (e.g., X-ray diffractometer, electrical transport measurement system).
  • 3. Procedure:
    • Step 1 - Initialization: Select a small number (e.g., 5) of initial parameter points X = {x₁, ..., xₙ} via random sampling or space-filling design.
    • Step 2 - Experiment and Evaluation:
      • For each proposed point xₙ, execute the synthesis procedure.
      • IF synthesis is successful and the target material is formed:
        • Measure the objective value yₙ = S(xₙ) + ε.
      • ELSE (Experimental Failure):
        • Assign yₙ = min(Y), where Y is the set of all successfully measured objective values obtained so far. This is the "floor padding" [1].
    • Step 3 - Model Update: Update the Gaussian Process (GP) surrogate model with the new data point (xₙ, yₙ), regardless of whether yₙ is a genuine measurement or a padded value.
    • Step 4 - Acquisition and Suggestion: Using the updated GP, maximize an acquisition function (e.g., Expected Improvement) to propose the next experiment xₙ₊₁.
    • Step 5 - Iteration: Repeat steps 2-4 until a termination criterion is met (e.g., budget exhaustion, performance target reached).
  • 4. Analysis and Notes:
    • This method provides an adaptive, worst-case penalty that helps the algorithm avoid regions near failures.
    • It leverages all experimental outcomes, including failures, by updating the GP model, guiding subsequent exploration.
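
The following sketch illustrates Protocol 1 in Python. It is a minimal illustration rather than the published implementation, and it assumes a scikit-learn Gaussian process surrogate, a fixed pool of candidate parameter vectors, a closed-form Expected Improvement acquisition, and a hypothetical `run_synthesis(x)` that returns the measured objective (e.g., RRR) on success and `None` on failure.

```python
# Minimal sketch of Protocol 1 (floor padding), not the published implementation.
# Assumes: `candidates` is an (m, d) NumPy array of parameter vectors, and
# `run_synthesis(x)` is a hypothetical function returning the measured objective
# (e.g., RRR) on success and None on experimental failure.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, y_best, xi=0.01):
    """Closed-form Expected Improvement for maximization."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def floor_padding_bo(run_synthesis, candidates, n_init=5, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    X, y, measured = [], [], []                  # `measured` holds successful values only
    for x in candidates[rng.choice(len(candidates), size=n_init, replace=False)]:
        obs = run_synthesis(x)                   # Step 1: initial design
        if obs is None:
            obs = min(measured) if measured else 0.0   # placeholder floor if no success yet
        else:
            measured.append(obs)
        X.append(x)
        y.append(obs)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(np.asarray(X), np.asarray(y))     # Step 3: update GP with real and padded values
        mu, sigma = gp.predict(candidates, return_std=True)
        y_best = max(measured) if measured else max(y)
        x_next = candidates[int(np.argmax(expected_improvement(mu, sigma, y_best)))]
        obs = run_synthesis(x_next)              # Steps 4-5: run the proposed experiment
        if obs is None:
            obs = min(measured) if measured else 0.0   # Step 2 (failure): floor padding
        else:
            measured.append(obs)
        X.append(x_next)
        y.append(obs)
    return np.asarray(X), np.asarray(y), (max(measured) if measured else None)
```

Because the floor is recomputed from the successful measurements at every iteration, the penalty assigned to failures adapts automatically as better results accumulate.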

Protocol 2: Feasibility-Aware BO with Variational GP Classification

This protocol uses a classifier to explicitly model the probability of failure, suitable for applications with significant infeasible regions like molecule design [2].

  • 1. Objective: To find parameters x that optimize an objective f(x) while respecting an a priori unknown constraint c(x) ≥ 0.
  • 2. Materials and Reagents:
    • Automated synthesis platform (e.g., for organic molecules).
    • Analytical equipment for property verification (e.g., HPLC, mass spectrometer).
    • Reagents and starting materials for synthesis.
  • 3. Procedure:
    • Step 1 - Initialization: Define parameter space and acquire initial data set D = {(xᵢ, yᵢ, sᵢ)} for i = 1,...,N, where sᵢ is a binary label (1 for feasible, 0 for infeasible).
    • Step 2 - Surrogate Modeling:
      • Train a regression GP model on the feasible data (where sᵢ = 1) to model the objective f(x).
      • Train a separate Variational Gaussian Process (VGP) classifier on the entire dataset D to model the probability of feasibility, p(s = 1 | x) [2].
    • Step 3 - Feasibility-Aware Acquisition:
      • Calculate a feasibility-weighted acquisition function, such as the Expected Improvement with Constraint (EIC):
        • EIC(x) = EI(x) × p(s = 1 | x) [2].
      • Here, EI(x) is the standard Expected Improvement from the regression GP.
    • Step 4 - Experiment and Evaluation:
      • Propose the next point xₙ₊₁ by maximizing EIC(x).
      • Execute the experiment at xₙ₊₁.
      • Record both the feasibility label sₙ₊₁ and, if feasible, the objective value yₙ₊₁.
    • Step 5 - Iteration: Update the datasets and both surrogate models. Repeat steps 2-4 until termination.
  • 4. Analysis and Notes:
    • This method explicitly balances the pursuit of high-performance points with the avoidance of likely failures.
    • It is particularly effective when the feasible region is small and complexly shaped.
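
A minimal sketch of the feasibility-weighted acquisition in Protocol 2 follows. It uses scikit-learn's Gaussian process classifier as a stand-in for the variational GP classifier described above, and assumes hypothetical arrays `X_feasible`/`y_feasible` (successful experiments), `X_all`/`labels` (all experiments with binary feasibility labels), and a candidate pool.

```python
# Minimal sketch of the EIC acquisition in Protocol 2, with scikit-learn's
# GaussianProcessClassifier standing in for the variational GP classifier.
# `X_feasible`, `y_feasible`, `X_all`, `labels`, and `candidates` are assumed arrays.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_surrogates(X_feasible, y_feasible, X_all, labels):
    """Regression GP on feasible data only; classifier on all data (label 1 = feasible)."""
    reg = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    reg.fit(X_feasible, y_feasible)
    clf = GaussianProcessClassifier(kernel=Matern(nu=2.5))
    clf.fit(X_all, labels)
    return reg, clf

def eic(candidates, reg, clf, y_best):
    """Expected Improvement with Constraint: EIC(x) = EI(x) * p(s = 1 | x)."""
    mu, sigma = reg.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
    p_feasible = clf.predict_proba(candidates)[:, 1]   # column for label 1 (feasible)
    return ei * p_feasible

# Usage sketch (Steps 3-4): propose the next experiment from a candidate pool.
# reg, clf = fit_surrogates(X_feasible, y_feasible, X_all, labels)
# x_next = candidates[np.argmax(eic(candidates, reg, clf, y_feasible.max()))]
```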

Visualization of Failure-Handling Bayesian Optimization Workflows

The following diagrams illustrate the core logical structure of BO workflows that incorporate experimental failure handling.

[Workflow diagram] Start BO campaign → Initialize with initial design → Propose next experiment x_n → Perform experiment at x_n → Was the experiment successful? (Yes: measure objective y_n; No: handle failure by assigning penalty y_n) → Update surrogate model with (x_n, y_n) → Termination criteria met? (No: return to the proposal step; Yes: report best result).

Diagram 1: General workflow for BO with experimental failure handling, illustrating the critical decision point after each experiment and the two paths for successful and failed trials.

[Taxonomy diagram] Experimental failure in BO arises from unknown constraints, where c(x) defines the feasible region X. Defining characteristics: non-quantifiable (only success/failure information), unrelaxable (no y without satisfying c(x)), simulation/measurement (costly to evaluate c(x)), and hidden (not known before optimization). Examples: failed synthesis (molecule, material phase), equipment limits (sensor fault, instrument), and safety boundaries (adverse effects, instability).

Diagram 2: A taxonomy of 'experimental failure' in BO, showing its relationship to unknown constraints, its defining characteristics, and common real-world examples.

The Scientist's Toolkit: Key Reagents and Materials

Table 3: Essential Research Reagents and Materials for Featured BO Experiments

| Item Name | Function/Application | Example Use Case |
| --- | --- | --- |
| Virgin & Recycled Polymers | Base materials for compound formulation with variable properties. | Optimizing polymer compound properties (MFR, Young's modulus) [5]. |
| Impact Modifier & Filler | Additives to modify specific mechanical properties of a polymer compound. | Balancing impact strength and stiffness in recycled plastic compounds [5]. |
| Molecular Beam Epitaxy (MBE) System | High-precision thin film deposition system for materials growth. | Growing high-quality SrRuO3 thin films for electrode applications [1]. |
| Metalorganic Precursors (Sr, Ru, O₂) | Source materials for the growth of oxide thin films in an MBE system. | Forming the perovskite crystal structure of SrRuO3 [1]. |
| Organic Synthesis Platform | Automated system for performing chemical reactions and synthesizing molecules. | High-throughput synthesis of drug candidates (e.g., BCR-Abl kinase inhibitors) [2]. |
| Cyclopentane Working Fluid | Organic fluid used in the Rankine cycle for waste heat recovery. | Serving as the working medium in an ORC system for sensor fault diagnosis studies [3]. |

In scientific domains such as materials science and drug development, optimizing processes via experimental campaigns is fundamentally hampered by experimental failures. These failures manifest when a suggested experiment cannot be evaluated, yielding no useful data for the objective function. Within the framework of Bayesian optimization (BO)—a sample-efficient, sequential global optimization strategy—such occurrences present a significant challenge, as they can stall the optimization loop and waste precious resources [1]. This article establishes a taxonomy of these failures, categorizing them primarily into synthetic inaccessibility and measurement limitations, and provides structured protocols for handling them within a BO campaign, leveraging the latest research in the field.

A Taxonomy of Experimental Failures

Experimental failures in optimization campaigns can be systematically classified. The following table outlines the core categories and their characteristics.

Table 1: Taxonomy of Experimental Failures in Bayesian Optimization

| Failure Category | Description | Common Examples in Research | Impact on BO |
| --- | --- | --- | --- |
| Synthetic Inaccessibility / Unknown Feasibility Constraints | The proposed experimental parameters lie in a region of the search space where the target material cannot be synthesized, the chemical reaction fails, or the target molecule is unstable or unsynthesizable [1] [6]. | Failed thin-film growth in molecular beam epitaxy (MBE); unstable hybrid organic-inorganic halide perovskites; unsynthesizable molecular structures in drug design [1] [6]. | Results in a "missing" or invalid data point. The algorithm must learn to avoid this infeasible region. |
| Measurement Limitations | The experiment is conducted, but a technical fault prevents a valid measurement of the property of interest from being obtained. | Equipment malfunction; sample degradation during measurement; software errors in data acquisition [7]. | Wastes an experimental cycle without yielding an objective function value. |
| System-Level Failures (IoT/Automated Labs) | Failures arising from the complex, distributed hardware and software systems that operate a self-driving laboratory (SDL). These are particularly relevant to integrated, automated workflows [7]. | A single component (e.g., a robotic arm, sensor, or software controller) in an IoT-based lab fails, causing a cascade that aborts the experiment [7]. | Halts the entire automated workflow until the failure is diagnosed and rectified. |

Bayesian Optimization Algorithms for Handling Failures

The core challenge is to adapt the BO procedure to learn from failures, not just successes. The surrogate model must be updated, and the acquisition function must balance the exploration of promising regions with the avoidance of known failures. Several key strategies have been developed.

The Floor Padding Trick

This method, introduced for high-throughput materials growth, is a simple yet powerful data imputation technique [1]. When an experimental trial for parameter vector x_n results in a failure, the evaluation y_n is complemented with the worst value observed so far in the campaign: y_n = min_{1 ≤ i < n} y_i [1].

  • Rationale: It provides the BO algorithm with a strong, adaptive signal that the attempted parameter set performed poorly, encouraging exploration away from that region without requiring manual tuning of a penalty constant [1].
  • Protocol:
    • Maintain a running list of all successfully measured objective function values.
    • Upon encountering an experimental failure, identify the minimum value from the list of successful observations.
    • Assign this minimum value to the failed experiment's output.
    • Proceed with the standard BO update of the Gaussian Process surrogate model using this imputed data point.
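
As a concrete illustration of this protocol, a small helper that performs the imputation over a chronological list of trials; the `trials` structure and the `None` convention for failures are assumptions for this sketch.

```python
def pad_failures(trials):
    """Replace missing outcomes with the worst successful value observed so far.

    trials: chronological list of (x, y) pairs, with y = None marking a failed experiment.
    Assumes at least one successful observation precedes the first failure.
    Returns a completed list suitable for updating the GP surrogate.
    """
    successes, padded = [], []
    for x, y in trials:
        if y is None:
            y = min(successes)        # floor padding: worst value measured to date
        else:
            successes.append(y)
        padded.append((x, y))
    return padded
```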

Feasibility-Aware BO with Binary Classification

A more sophisticated approach involves explicitly modeling the probability of failure using a classifier, often a variational Gaussian process classifier, learned on-the-fly [6]. This model predicts whether a given parameter set x will lead to a feasible (successful) experiment.

  • Rationale: It directly learns the unknown feasibility constraint boundary, allowing the acquisition function to actively avoid regions predicted to be infeasible [6].
  • Protocol:
    • Data Collection: Build a dataset of parameters x and their corresponding feasibility labels (success or failure).
    • Model Training: Train a Gaussian process classifier on this dataset to estimate the probability of feasibility, p(feasible | x).
    • Feasibility-Aware Acquisition: Modify a standard acquisition function, α(x), to incorporate the feasibility prediction. A common method is the Product of Expectations: α_feas(x) = p(feasible | x) · E[α(x) | feasible]. This function balances high expected performance with a high probability of success [6].
    • Sequential Update: With each new experiment (success or failure), update both the regression surrogate model for the objective and the classification model for feasibility.

Integrated Workflow

The following diagram illustrates the logical workflow of a BO loop that integrates both the floor padding trick and a feasibility classifier to handle experimental failures.

[Workflow diagram] Start BO cycle → Suggest next experiment using a feasibility-aware acquisition function → Execute experiment → Experiment successful? (Yes: measure objective value; No: classify as failure and apply the floor padding trick, assigning the worst observed value) → Update surrogate model (GP regression) and feasibility classifier (GP classification) → Next iteration.

Experimental Protocols

Protocol 1: Optimizing Materials Synthesis with Unknown Stability Constraints

This protocol is adapted from studies on optimizing the growth of SrRuO3 thin films and the inverse design of hybrid perovskites [1] [6].

Objective: To find the growth parameters x (e.g., temperature, pressure, flux ratios) that maximize a target property y (e.g., Residual Resistivity Ratio, RRR) while handling failed growth runs.

Materials and Reagents:

  • Molecular Beam Epitaxy (MBE) system or other relevant thin-film deposition system.
  • Precursor materials for the target material (e.g., Sr, Ru, O2 for SrRuO3).
  • Single-crystal substrates.

Procedure:

  • Initialization:
    • Define a wide, multi-dimensional parameter space based on literature and expert knowledge.
    • Initialize the BO campaign using a space-filling design (e.g., Latin Hypercube Sampling) or a more advanced method like HIPE [8] for 5-10 initial experiments.
  • Sequential BO Loop:
    • a. Suggestion: Use a feasibility-aware acquisition function (see Section 3.2) to suggest the next parameter set x_n.
    • b. Execution: Attempt to grow the thin film using the suggested parameters x_n in the MBE system.
    • c. Evaluation:
      • Success: If a coherent, single-phase film is confirmed (e.g., via in-situ reflection high-energy electron diffraction), proceed to measure the target property y_n (e.g., RRR).
      • Failure: If the film is not formed or is polycrystalline/amorphous, classify the run as a failure. Apply the floor padding trick, setting y_n to the worst RRR value recorded from successful runs so far [1].
    • d. Update: Update the Gaussian Process regression model with the new data point (x_n, y_n). Simultaneously, update the binary feasibility classifier with the new feasibility label for x_n.

  • Termination: Continue until a predefined performance threshold is met, a maximum number of experiments is reached, or the system converges.

Protocol 2: Drug Design with Synthetic Accessibility Constraints

This protocol is informed by benchmarks involving the design of BCR-Abl kinase inhibitors with unknown synthetic accessibility constraints [6].

Objective: To find a molecular structure x that maximizes a desired property (e.g., binding affinity, selectivity) while being synthetically accessible.

Materials and Reagents:

  • Virtual chemical library or generative molecular model.
  • Software for predicting molecular properties (e.g., docking software for binding affinity).
  • (For validation) Chemical reagents and laboratory equipment for organic synthesis.

Procedure:

  • Problem Formulation:
    • Represent the molecular design space via a continuous latent space (e.g., from a variational autoencoder) or using molecular descriptors.
    • Define the objective function f(x) as a composite score incorporating predicted bioactivity and other ADMET properties.
  • Sequential BO Loop:
    • a. Suggestion: The acquisition function suggests a candidate molecule x_n.
    • b. Feasibility Check: A synthetic accessibility (SA) predictor, a binary classifier updated in real time, evaluates x_n. If p(synthesizable | x_n) falls below a threshold, the acquisition function is penalized and the molecule may be rejected.
    • c. Evaluation:
      • Success (Virtual): If deemed synthesizable, the molecule's properties are evaluated via computational prediction (e.g., docking score).
      • Failure (Virtual): If the SA predictor flags the molecule as unsynthesizable, it is recorded as a failure and a penalty (e.g., a very low objective value or floor-padded value) is assigned [6].
    • d. Update: The regression model for the objective and the SA classifier are updated with the new outcome (a minimal code sketch follows this protocol).

  • Validation:

    • The top-performing, synthetically accessible candidates identified by the BO campaign are subjected to actual laboratory synthesis and experimental testing to validate the predictions.
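
A minimal sketch of one iteration of this virtual screening loop, under stated assumptions: `reg` is a GP regressor over the composite objective, `sa_classifier` is the on-line synthetic-accessibility classifier (class 1 = predicted synthesizable), `predict_affinity` is a hypothetical property predictor such as a docking-score function, and `candidates` is a pool of latent or descriptor vectors.

```python
# Minimal sketch of one virtual-screening iteration; all names are stand-ins.
import numpy as np

def screen_step(candidates, reg, sa_classifier, predict_affinity,
                threshold=0.3, kappa=2.0):
    """Suggest and evaluate one candidate molecule with a synthetic-accessibility (SA) gate.

    reg              : GP regressor over the composite objective, fit on evaluated molecules.
    sa_classifier    : probabilistic classifier; column 1 = p(synthesizable | x).
    predict_affinity : hypothetical property predictor (e.g., a docking-score function).
    Returns the chosen vector, its objective (None on virtual failure), and a feasibility label.
    """
    mu, sigma = reg.predict(candidates, return_std=True)
    score = mu + kappa * sigma                            # UCB-style acquisition
    p_synth = sa_classifier.predict_proba(candidates)[:, 1]
    score[p_synth < threshold] = -np.inf                  # penalize likely-unsynthesizable candidates
    best = int(np.argmax(score))
    x_next = candidates[best]
    if p_synth[best] >= threshold:                        # virtual "success": evaluate properties
        return x_next, predict_affinity(x_next), 1
    return x_next, None, 0                                # virtual failure: penalize or pad downstream
```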

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Failure-Aware Bayesian Optimization

| Item | Function in the Context of Failure Handling |
| --- | --- |
| Gaussian Process (GP) Regression Library (e.g., GPyTorch, scikit-learn) | Serves as the core surrogate model for modeling the objective function. It is updated with imputed values from failures via the floor padding trick [1]. |
| Variational Gaussian Process (VGP) Classifier | Used to model the unknown feasibility constraint function (probability of failure) on-the-fly, a key component of feasibility-aware BO as in the Anubis framework [6]. |
| Bayesian Optimization Suite (e.g., BoTorch, Ax, Atlas) | Provides the infrastructure for defining the optimization problem, combining regression and classification models, and implementing custom acquisition functions like HIPE or feasibility-aware EI [6] [8]. |
| Automated Laboratory Equipment / Self-Driving Lab (SDL) | The physical (or virtual) platform where experiments are executed. Its reliability is crucial; system-level failures can be analyzed using integrated frameworks like Model-Based Systems Engineering (MBSE) and Fault Tree Analysis (FTA) [7]. |
| Fault Tree Analysis (FTA) & Bayesian Network (BN) Models | Used for quantitative failure analysis of the integrated IoT systems within an automated lab, helping to identify and prioritize the weakest links in the experimental hardware/software pipeline [7]. |

In real-world experimental sciences, from materials growth to drug development, the optimization of complex processes is frequently hampered by experimental failures. These failures result in missing data, a problem that severely impedes traditional Bayesian optimization (BO) frameworks. Standard BO algorithms operate under the assumption that every suggested parameter configuration can be evaluated and will yield a meaningful quantitative result. However, in practice, many experiments fail entirely—synthesis reactions yield no target compound, thin films fail to crystallize properly, or biological assays produce inconclusive results. These scenarios create fundamental challenges for the Gaussian process (GP) surrogate models at the heart of BO, which require complete datasets to build accurate representations of the underlying objective function. When experimental failures are treated as simple omissions, the surrogate model's uncertainty estimates become miscalibrated, and the acquisition function begins to suggest suboptimal or repeatedly failing parameters. This article examines the mechanistic reasons for standard BO's failure in the presence of missing data and presents advanced methodological adaptations that transform this challenge into a tractable problem.

The Mechanistic Breakdown: How Missing Data Compromises Standard BO

The Surrogate Model's Dependency on Complete Data

The Gaussian process surrogate model functions by establishing a covariance structure across the entire parameter space based on observed data points. Each successful evaluation informs the model about the objective function's behavior in its vicinity. Missing data creates "holes" in this structure—regions where the model lacks direct evidence about whether parameters yield good results or simply fail. Consequently, the model's posterior mean and variance in these regions become poorly calibrated. The GP may extrapolate inappropriately across failure zones, leading to misguided predictions.

The Acquisition Function's Misguided Exploration-Exploitation Balance

Acquisition functions like Expected Improvement (EI) and Upper Confidence Bound (UCB) rely on the surrogate model's predictions to balance exploring uncertain regions with exploiting promising ones. When experimental failures are treated as missing observations:

  • Over-exploration of failure regions: If failures are simply omitted from the dataset, the surrogate model maintains high uncertainty in these regions. Acquisition functions like UCB, which explicitly favor high-uncertainty areas, may repeatedly sample from parameter configurations that consistently fail [4].
  • Failure to recognize constraint boundaries: Missing data often occurs at parameter boundaries where conditions become physically unrealizable. Standard BO lacks mechanisms to infer that these regions should be avoided, wasting experimental resources on invalid configurations.

Table 1: Impact of Missing Data on Standard BO Components

| BO Component | Function in Standard BO | Impact of Missing Data |
| --- | --- | --- |
| Gaussian Process Surrogate | Models the objective function across parameter space | Creates inaccurate posterior distributions with poor extrapolation across failure zones |
| Acquisition Function | Balances exploration and exploitation to select next parameters | Suggests points in failure regions due to improperly high uncertainty estimates |
| Experimental Iteration Loop | Sequentially improves model with new data | Wastes resources on failed experiments, slowing convergence |

Advanced Methodologies for Handling Missing Data in BO

The Floor Padding Trick: A Simple Yet Effective Approach

A straightforward but powerful method for handling experimental failures is the "floor padding trick" [1]. This approach assigns a penalty value to failed experiments that actively discourages the algorithm from sampling nearby regions. The implementation is refreshingly simple: when an experiment fails, the missing evaluation is imputed with the worst observed value obtained from successful experiments up to that point.

Mechanism and Workflow:

  • Conduct experiments at suggested parameters x_n.
  • For successful experiments, record the objective function value y_n.
  • For failed experiments, assign y_n = min_{1 ≤ i < n} y_i (the worst value observed so far).
  • Update the surrogate model with this completed dataset.

This method provides two critical benefits: it supplies the surrogate model with information that the attempted parameters performed poorly, and it creates a gradient that steers future sampling away from failure regions. The approach is adaptive and automatic, requiring no predetermined penalty values that might require delicate tuning [1].

Binary Classification for Failure Prediction

A more sophisticated approach involves training a binary classifier alongside the regression surrogate model to explicitly predict the probability of experimental failure for any given parameter set [1].

Implementation Protocol:

  • Data Collection: Maintain a dataset D = {(xᵢ, sᵢ, yᵢ)} where sᵢ is a binary success/failure indicator.
  • Model Training:
    • Train a GP classifier on {(xᵢ, sᵢ)} to predict failure probability p(fail | x).
    • Train a GP regressor only on successful experiments {(xᵢ, yᵢ) | sᵢ = success} to model the objective function.
  • Acquisition Modification: Modify the acquisition function α(x) to incorporate failure probability:
    • α_modified(x) = α(x) · (1 − p(fail | x))
    • This naturally discourages sampling in high-risk regions.
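
A minimal sketch of this modified acquisition, here using an Upper Confidence Bound in place of a generic α(x); `reg` is assumed to be a GP regressor fit only on successful trials and `clf` a probabilistic classifier fit on all trials.

```python
# Minimal sketch of the modified acquisition, using UCB as the base alpha(x).
import numpy as np

def failure_aware_ucb(candidates, reg, clf, kappa=2.0):
    """alpha_modified(x) = UCB(x) * (1 - p(fail | x)).

    reg : GP regressor fit on successful trials only.
    clf : probabilistic classifier fit on all trials; assumes class 1 encodes failure.
    Assumes a non-negative objective; otherwise shift UCB before weighting.
    """
    mu, sigma = reg.predict(candidates, return_std=True)
    ucb = mu + kappa * sigma                       # optimistic estimate of the objective
    p_fail = clf.predict_proba(candidates)[:, 1]   # predicted failure probability per candidate
    return ucb * (1.0 - p_fail)

# Usage sketch: x_next = candidates[np.argmax(failure_aware_ucb(candidates, reg, clf))]
```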

Constrained Bayesian Optimization with Known Boundaries

When the boundaries between viable and failing parameter regions can be explicitly defined, constrained BO methods excel. These approaches directly incorporate known experimental constraints into the optimization process [9] [10].

Algorithmic Framework:

  • Constraint Specification: Define constraint functions c_j(x) ≤ 0 that delineate feasible regions.
  • Feasibility Modeling: Model each constraint function with a separate GP surrogate.
  • Feasible Acquisition: Modify the acquisition function to favor points with high probability of satisfying all constraints:
    • α_feasible(x) = α(x) · ∏_j p(c_j(x) ≤ 0)

This approach is particularly valuable in chemistry and materials science where physical laws or synthetic accessibility constraints can be formally encoded [9].
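
A minimal sketch of the feasibility weighting under these assumptions: each constraint c_j(x) ≤ 0 is modeled by its own GP with an approximately Gaussian posterior, so p(c_j(x) ≤ 0) = Φ(−μ_j(x)/σ_j(x)), and `constraint_gps` is a list of fitted scikit-learn-style regressors.

```python
# Minimal sketch: probability-of-feasibility weighting with one GP per constraint.
import numpy as np
from scipy.stats import norm

def probability_of_feasibility(candidates, constraint_gps):
    """prod_j P(c_j(x) <= 0), with each c_j modeled by a GP with Gaussian posterior."""
    p = np.ones(len(candidates))
    for gp in constraint_gps:                      # one fitted regressor per constraint c_j
        mu, sigma = gp.predict(candidates, return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        p *= norm.cdf(-mu / sigma)                 # P(c_j(x) <= 0) = Phi(-mu_j / sigma_j)
    return p

def constrained_acquisition(candidates, acquisition_values, constraint_gps):
    """alpha_feasible(x) = alpha(x) * prod_j P(c_j(x) <= 0)."""
    return acquisition_values * probability_of_feasibility(candidates, constraint_gps)
```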

Table 2: Comparison of Advanced Methods for Handling Missing Data in BO

| Method | Key Mechanism | Advantages | Limitations |
| --- | --- | --- | --- |
| Floor Padding Trick | Imputes failures with worst observed value | Simple, automatic, requires no tuning | May over-penalize near feasible boundaries |
| Binary Classifier | Predicts failure probability explicitly | Actively avoids failure regions | Requires sufficient data to train classifier |
| Constrained BO | Incorporates known constraint functions | Optimal for problems with defined boundaries | Requires explicit constraint formulation |

Experimental Protocols and Validation

Protocol: Bayesian Optimization with Floor Padding for Materials Growth

This protocol adapts the methodology successfully employed in optimizing the growth of SrRuO₃ thin films via molecular beam epitaxy (MBE), which achieved record residual resistivity ratio in only 35 growth runs [1].

Materials and Equipment:

  • Molecular beam epitaxy system
  • SrRuO₃ source materials
  • Appropriate substrates
  • Structural characterization (X-ray diffraction)
  • Electrical transport measurement system

Procedure:

  • Define Parameter Space: Identify critical growth parameters (e.g., temperature, flux ratios, pressure) and their feasible ranges.
  • Initialize with Random Sampling: Conduct 5-10 initial growth experiments with randomly selected parameters to establish baseline.
  • Implement BO Loop:
    • a. Characterize each grown film and calculate the evaluation metric (e.g., residual resistivity ratio).
    • b. For failed growths (no film formation, wrong phase), apply floor padding: identify the worst successful RRR value in the current dataset and assign it to the failed experiment.
    • c. Update the GP surrogate model with the completed dataset (successful values + padded failures).
    • d. Use the Expected Improvement acquisition function to select the next growth parameters.
    • e. Repeat until convergence or resource exhaustion.
  • Validation: Characterize optimal material properties against literature standards.

Protocol: BO with Failure Classification for Neuromodulation Parameters

This protocol addresses the challenges of optimizing neuromodulation parameters where effect sizes are small and safety constraints are critical, as demonstrated in deep brain stimulation studies [4].

Materials and Equipment:

  • Neuromodulation device with programmable parameters
  • Physiological and behavioral monitoring equipment
  • Safety monitoring systems

Procedure:

  • Parameter Definition: Define stimulation parameters (amplitude, frequency, pulse width) and their safe operating ranges.
  • Outcome Measures: Establish primary outcome metric (e.g., reaction time improvement) and safety thresholds.
  • Dual-Model Implementation:
    • a. For each parameter set, administer stimulation and measure outcomes.
    • b. Record binary success (measurable effect within safety bounds) or failure (no effect or adverse events).
    • c. Train a GP classifier on the entire dataset to predict failure probability.
    • d. Train a GP regressor only on successful trials to model the outcome function.
    • e. Implement the modified acquisition function α_modified(x) = EI(x) · (1 − p(fail | x)).
    • f. Select the next parameters by maximizing the modified acquisition function.
  • Safety Monitoring: Implement additional boundary avoidance techniques to prevent parameter selection near safety limits [4].

Visualization: Workflow Diagrams

[Workflow diagram] Start Bayesian optimization → Initialize with random sampling → Suggest next parameters via acquisition function → Conduct experiment → Experiment successful? (Yes: record quantitative result; No: apply failure handling method, floor padding or classification) → Update surrogate model with completed dataset → Convergence reached? (No: suggest next parameters; Yes: return optimal parameters).

Diagram 1: Bayesian Optimization Workflow with Experimental Failure Handling

[Diagram] When an experimental failure occurs, one of two handling methods is selected. Floor padding method: identify the worst successful value → assign this value to the failure → update the surrogate model. Binary classifier method: add the outcome to the failure/success dataset → update the failure-probability model → modify the acquisition function.

Diagram 2: Experimental Failure Handling Methods

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Key Research Reagents and Computational Tools for BO with Missing Data

| Item | Function/Application | Implementation Notes |
| --- | --- | --- |
| Gaussian Process Framework | Core surrogate model for the objective function | Use a Matern kernel for realistic experimental responses; implement in Python with GPyTorch or scikit-learn |
| Binary Classifier Model | Predicts probability of experimental failure | Gaussian Process classifier or Random Forest for mixed parameter types |
| Acquisition Functions | Balance exploration and exploitation | Expected Improvement (EI) or Upper Confidence Bound (UCB), modified for constraints |
| Constraint Handling Toolkit | Encodes known experimental boundaries | PHOENICS or GRYFFIN algorithms for chemistry applications [9] |
| Boundary Avoidance Methods | Prevent oversampling at parameter edges | Iterated Brownian-bridge kernel for low effect-size problems [4] |

The challenge of missing data due to experimental failures represents a critical limitation of standard Bayesian optimization in practical scientific applications. The breakdown occurs fundamentally in the surrogate model's inability to distinguish between genuinely promising but unexplored regions and parameter spaces that lead to experimental failure. Through methodical approaches like the floor padding trick, binary failure classification, and constrained optimization, researchers can transform this limitation into a manageable aspect of experimental design. The protocols and methodologies presented here provide a roadmap for implementing these advanced BO techniques across diverse domains, from materials science to neuromodulation therapy development. By formally addressing the reality of experimental failures, these approaches enable more efficient resource utilization and accelerate scientific discovery in high-dimensional, constrained parameter spaces.

Autonomous experimentation represents a paradigm shift in materials science, leveraging machine learning to navigate high-dimensional parameter spaces efficiently. A critical challenge in this endeavor, particularly for molecular synthesis and materials growth, is the frequent occurrence of experimental failures. These are trials where targeted materials are not formed, yielding no useful property data and creating a "missing data" problem that can stall optimization pipelines [1]. Bayesian optimization (BO) has emerged as a powerful, sample-efficient approach for global optimization, but its standard implementations often fail in these real-world scenarios where a significant portion of experiments does not yield a quantifiable result [1] [11].

This application note details practical strategies for adapting BO to handle experimental failures, enabling robust optimization in the face of incomplete data. We present domain-specific case studies and detailed protocols that frame failure not as a setback, but as an informative guide for subsequent experimentation.

Case Study 1: Failure-Handling in Oxide Thin Film Synthesis

Experimental Background and Objective

The first case study involves the optimization of molecular beam epitaxy (MBE) growth parameters for high-quality strontium ruthenate (SrRuO3) thin films. SrRuO3 is a metallic perovskite oxide critically used as an electrode in oxide electronics. The goal was to maximize the Residual Resistivity Ratio (RRR), an indicator of sample purity and crystallinity, by searching a wide three-dimensional parameter space. A key challenge was that many parameter combinations, being far from optimal, resulted in failed growth runs where the target phase did not form, leading to missing RRR data [1].

Bayesian Optimization Protocol with Failure Handling

1. Problem Formulation:

  • Input Parameters (x): A 3D vector representing key MBE growth conditions (e.g., substrate temperature, ruthenium flux, strontium flux).
  • Objective Function (S(x)): The measured RRR of the resulting SrRuO3 film. This function is unknown a priori.
  • Constraint: Experimental failure occurs when the designated phase is not formed, resulting in a missing measurement for S(x).

2. Algorithm: Bayesian Optimization with Floor Padding Trick

The core innovation was the "floor padding trick" to handle missing data from failed experiments [1].

  • Surrogate Model: A Gaussian Process (GP) model is used to approximate the unknown objective function S(x) based on all previous successful observations.
  • Acquisition Function: An Expected Improvement (EI) function guides the selection of the next experiment by balancing exploration (high uncertainty) and exploitation (high predicted mean).
  • Failure Handling - Floor Padding: When a parameter set x_n leads to an experimental failure, instead of discarding the point, it is assigned the worst observed RRR value from all successful experiments conducted up to that point (y_n = min(y_i) for i < n). This complemented dataset is used to update the GP model for the next iteration.

3. Experimental Workflow:

  • Initialization: Start with a small set of initial growth runs based on domain knowledge or a space-filling design.
  • Iterative Loop:
    • Characterization: Measure the RRR of the successfully grown film. If the growth failed, note the failure.
    • Data Imputation: Apply the floor padding trick to any failed experiments from the last round.
    • Model Update: Update the GP surrogate model with the complete dataset (successful measurements and padded failure points).
    • Next Experiment Selection: Choose the next growth parameters x_n+1 by maximizing the acquisition function.
  • Termination: The process concludes after a fixed number of runs or when the RRR value converges.

The failure-handling BO algorithm successfully navigated the parameter space, avoiding regions that led to failed synthesis. In just 35 MBE growth runs, it discovered a SrRuO3 film with an RRR of 80.1, the highest value ever reported for a tensile-strained SrRuO3 film [1]. The floor padding trick was crucial for maintaining a stable search trajectory despite a substantial rate of experimental failure.

Table 1: Key Experimental Data from SrRuO3 Thin Film Optimization

| Metric | Result | Significance |
| --- | --- | --- |
| Optimal RRR Achieved | 80.1 | Highest reported for tensile-strained SrRuO3 films [1] |
| Total Number of Growth Runs | 35 | Demonstrates high sample efficiency |
| Search Space Dimensionality | 3-dimensional | Includes substrate temperature, Ru flux, Sr flux |
| Core Failure-Handling Method | Floor Padding Trick | Enabled efficient search despite missing data |

Workflow Visualization

The following diagram illustrates the closed-loop autonomous experimentation system that integrates the Bayesian optimization algorithm with material synthesis and characterization.

[Workflow diagram] Initialize with prior knowledge → Plan experiment (BO with acquisition function) → Execute experiment (MBE growth run) → Analyze result (measure RRR or log failure) → Update knowledge base (GP model update + floor padding) → Optimal RRR achieved? (No: plan next experiment; Yes: conclude optimization).

Case Study 2: Multi-Objective Optimization in Additive Manufacturing

Experimental Background and Objective

The second case study shifts focus to additive manufacturing (AM), where the goal was to optimize the printing of a test specimen using a syringe extrusion system. The challenge here was multi-objective optimization, aiming to simultaneously maximize the geometric similarity between the printed object and its target while also maximizing the homogeneity of the printed layers. This is a non-trivial problem as these objectives are often interdependent and competing [12].

Multi-Objective Bayesian Optimization (MOBO) Protocol

1. Problem Formulation:

  • Input Parameters (x): A set of 5 or more AM process parameters (e.g., print speed, extrusion pressure, nozzle height).
  • Objective Functions: Two functions, f1(x) (geometric accuracy) and f2(x) (layer homogeneity), to be maximized simultaneously.

2. Algorithm: Multi-Objective Bayesian Optimization with EHVI

The optimization was performed using the Expected Hypervolume Improvement (EHVI) algorithm [12].

  • Surrogate Models: Separate GP models are built for each objective function (f1, f2) based on observed data.
  • Acquisition Function: EHVI calculates the expected improvement of a new point with respect to the entire Pareto front—the set of optimal solutions where no objective can be improved without worsening another. Maximizing EHVI expands the hypervolume dominated by the Pareto front.
  • Failure Consideration: While not explicitly detailed in the source, failures in this context (e.g., catastrophic print failure) could be integrated using the floor padding trick for each objective.
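
EHVI itself requires integrating over the GP posteriors, but the two quantities it builds on are easy to state for the two-objective case here (both objectives maximized): the current Pareto front and the hypervolume it dominates relative to a reference point. The sketch below is illustrative plain Python under those assumptions, not the AM-ARES implementation.

```python
# Illustrative plain-Python helpers for the two-objective (maximization) case.
import numpy as np

def pareto_front(Y):
    """Indices of non-dominated rows in an (n, 2) array of objective values."""
    keep = []
    for i, yi in enumerate(Y):
        dominated = np.any(np.all(Y >= yi, axis=1) & np.any(Y > yi, axis=1))
        if not dominated:
            keep.append(i)
    return np.asarray(keep, dtype=int)

def hypervolume_2d(front, ref_point):
    """Area dominated by a non-dominated 2-objective front relative to ref_point.

    Assumes every front point dominates ref_point (lies above it in both objectives).
    """
    front = front[np.argsort(-front[:, 0])]        # sweep from largest f1 to smallest
    hv, prev_f2 = 0.0, ref_point[1]
    for f1, f2 in front:
        hv += (f1 - ref_point[0]) * (f2 - prev_f2)
        prev_f2 = f2
    return hv

# Usage sketch with hypothetical scores (geometric accuracy, layer homogeneity):
# Y = np.array([[0.9, 0.4], [0.7, 0.7], [0.5, 0.8], [0.6, 0.3]])
# hv = hypervolume_2d(Y[pareto_front(Y)], ref_point=np.array([0.0, 0.0]))
```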

3. Experimental Workflow via AM-ARES: The Additive Manufacturing Autonomous Research System (AM-ARES) executes the following closed-loop workflow [12]:

  • Plan: The MOBO planner (EHVI) suggests a new set of print parameters based on the current Pareto front and surrogate models.
  • Experiment: AM-ARES uses the parameters to generate machine code and print the specimen. An onboard machine vision system captures an image.
  • Analyze: The system analyzes the image to compute scores for both geometric accuracy and layer homogeneity.
  • Update: The knowledge base is updated with the new parameters and their multi-objective scores. The MOBO planner is updated for the next iteration.

The MOBO approach successfully identified a set of optimal solutions (the Pareto front) that captured the trade-offs between geometric accuracy and layer homogeneity. This allowed researchers to select printer parameters based on their preferred balance of objectives, a significant advantage over single-objective optimization. The study demonstrated that autonomous experimentation could efficiently handle complex, multi-parameter optimization problems in additive manufacturing [12].

Table 2: Research Reagent Solutions for Autonomous Experimentation

| Item / Reagent | Function in Experimental Protocol |
| --- | --- |
| Molecular Beam Epitaxy (MBE) System | High-precision thin film deposition tool for the SrRuO3 case study [1]. |
| Strontium (Sr) and Ruthenium (Ru) Sources | Metallic precursors for the synthesis of SrRuO3 perovskite films [1]. |
| Syringe Extrusion System (AM-ARES) | Additive manufacturing tool for depositing materials layer-by-layer in the AM case study [12]. |
| Machine Vision System | Integrated camera system for in-situ characterization of printed specimens, enabling automated analysis of geometry and homogeneity [12]. |
| Gaussian Process (GP) Model | Core statistical model serving as the surrogate for the objective function(s) in Bayesian optimization [1] [12] [11]. |

Multi-Objective Optimization Visualization

The following diagram illustrates the core concept of multi-objective optimization and the Pareto front, which is central to the additive manufacturing case study.

[Diagram] Multi-objective optimization space: solutions A, B, C, and D lie on the Pareto front, while the remaining solutions are dominated.

The Scientist's Toolkit: Essential Methods and Software

Implementing failure-resilient Bayesian optimization requires a suite of computational and experimental tools. Below is a summary of key software packages identified in the research.

Table 3: Essential Software Tools for Bayesian Optimization

| Software Package | Key Features | License | Ref. |
| --- | --- | --- | --- |
| BoTorch | Built on PyTorch, supports multi-objective and high-throughput BO. | MIT | [11] |
| Ax | Modular, adaptable framework for general-purpose optimization. | MIT | [11] |
| Dragonfly | Comprehensive package with multi-fidelity and constrained BO. | Apache | [11] |
| COMBO | Efficient for problems with multiple categorical parameters. | MIT | [11] |

These case studies demonstrate that experimental failure is not a terminal obstacle but an integral part of the learning process in autonomous materials development. The floor padding trick provides a simple yet powerful data imputation method for handling failed syntheses in single-objective optimization, as proven by the rapid discovery of high-RRR SrRuO3 films. For more complex goals, Multi-Objective Bayesian Optimization (MOBO) techniques like EHVI can effectively manage trade-offs between competing objectives, as shown in the additive manufacturing workflow. By adopting these protocols and integrating them with robust autonomous research systems, scientists and engineers can significantly accelerate the development of new molecules and advanced materials, transforming failure from a roadblock into a guidepost.

Building Resilient Systems: Key Algorithms and Strategies for Handling Failures

Within the framework of advanced research into Bayesian optimization (BO) with experimental failure handling, the management of missing data presents a significant challenge. In high-throughput experimental domains, such as materials growth or drug development, experimental failures are not merely inconveniences; they are inherent sources of missing data that can critically impede the optimization process if not handled appropriately [1]. Traditional methods like listwise deletion or simple mean imputation can introduce bias or fail to utilize the informational value of a failure. The floor padding trick emerges as a simple, yet potent, heuristic designed to integrate these failures directly into the BO framework, thereby turning failed experiments from data liabilities into valuable algorithmic guides [1].

This technique is particularly crucial when searching wide, multi-dimensional parameter spaces where the optimal region is unknown a priori. Restricting the search to a small, "safe" space based on prior experience risks missing the global optimum. The floor padding trick enables a more aggressive and comprehensive search strategy by providing a principled way to learn from failure [1].

Theoretical Foundation and Mechanism

Definition and Principle

The floor padding trick is an imputation strategy for handling missing data resulting from experimental failures. Its core operation is straightforward: when an experiment for a parameter vector x_n fails to yield a measurable outcome, the missing evaluation y_n is imputed with the worst observed value recorded up to that point in the optimization run [1].

Formally, given a sequence of observations (x_1, y_1), ..., (x_{n-1}, y_{n-1}), if the experiment at x_n fails, the complemented value is: y_n = min{ y_i | 1 ≤ i < n }

This heuristic is founded on two key rationales:

  • Algorithmic Guidance: It provides the BO algorithm with a strong, negative signal that the parameter x_n is undesirable. This discourages the surrogate model from subsequently recommending parameters in the vicinity of x_n, thereby fulfilling the requirement to avoid regions of experimental failure [1].
  • Model Update: It ensures that the failure information is incorporated into the Gaussian Process (GP) surrogate model. This update improves the model's representation of the underlying response surface, making its predictions and uncertainty estimates more accurate across the entire parameter space [1].
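
A small illustration of the second point on a hypothetical one-dimensional toy problem: fitting the same scikit-learn GP with and without the padded failure point shows how the imputed value pulls the posterior mean down toward the floor (and collapses the posterior uncertainty) at the failed parameter.

```python
# Toy 1-D illustration (hypothetical values): the padded point anchors the GP posterior.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X_success = np.array([[0.1], [0.4], [0.6]])     # three successful measurements
y_success = np.array([2.0, 3.5, 3.0])
x_fail = np.array([[0.8]])                      # parameter where the experiment failed
y_floor = y_success.min()                       # floor padding: worst successful value (2.0)

gp_without = GaussianProcessRegressor(kernel=Matern(nu=2.5)).fit(X_success, y_success)
gp_with = GaussianProcessRegressor(kernel=Matern(nu=2.5)).fit(
    np.vstack([X_success, x_fail]), np.append(y_success, y_floor))

mu_wo, sd_wo = gp_without.predict(x_fail, return_std=True)
mu_w, sd_w = gp_with.predict(x_fail, return_std=True)
print(f"at failed x: without padding mu={mu_wo[0]:.2f}, sd={sd_wo[0]:.2f}; "
      f"with padding mu={mu_w[0]:.2f}, sd={sd_w[0]:.2f}")
```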

Comparison with Alternative Failure-Handling Methods

The following table summarizes the floor padding trick against other common approaches to handling experimental failures in optimization.

Table 1: Comparison of Methods for Handling Experimental Failures in Bayesian Optimization

| Method | Mechanism | Advantages | Limitations |
| --- | --- | --- | --- |
| Floor Padding Trick [1] | Imputes with the worst value observed so far. | Adaptive, requires no tuning; provides a strong signal to avoid failure regions; updates the surrogate model. | The negative signal's intensity is dependent on the history of observations. |
| Constant Padding [1] | Imputes with a pre-defined constant value (e.g., 0 or -1). | Simple to implement. | Performance is highly sensitive to the chosen constant; requires careful prior tuning. |
| Binary Classifier [1] | Uses a separate model (e.g., GP classifier) to predict the probability of failure. | Explicitly models failure regions, helping to avoid them. | Does not inherently update the evaluation surrogate model with failure information; often combined with padding. |
| Data Deletion | Simply discards the failed trial from the dataset. | Simple. | Wastes experimental resources; the model gains no knowledge from the failure. |

Performance and Quantitative Evaluation

The efficacy of the floor padding trick has been demonstrated in both simulation studies and real-world experimental optimization. The key performance metric in such sequential learning tasks is the best evaluation value achieved as a function of the number of experimental observations. A method that reaches a high value with fewer observations is considered more sample-efficient.

In simulation studies using artificially constructed functions with embedded failure regions, the floor padding trick (denoted as method 'F') showed a rapid improvement in evaluation value in the early stages of the optimization process compared to other methods [1]. This indicates its high sample efficiency, a critical property when each observation corresponds to an expensive, time-consuming experiment like a materials growth run.

The table below quantifies the outcomes of a real-world application in materials science, optimizing the growth of SrRuO₃ thin films via Machine-Learning-Assisted Molecular Beam Epitaxy (ML-MBE).

Table 2: Experimental Outcomes from ML-MBE Optimization Using the Floor Padding Trick

| Optimization Aspect | Outcome with Floor Padding Trick | Significance |
| --- | --- | --- |
| Parameter Space Searched | Wide 3-dimensional space | Enabled exploration beyond empirically "safe" regions. |
| Total Growth Runs | 35 | Demonstrates high sample-efficiency. |
| Achieved Residual Resistivity Ratio (RRR) | 80.1 | The highest value ever reported among tensile-strained SrRuO₃ films. |
| Handling of Failed Growths | Successfully complemented and leveraged | Failures informed the model and guided the search away from unstable parameter regions. |

Experimental Protocol: Implementation in Bayesian Optimization

This section provides a detailed, step-by-step protocol for integrating the floor padding trick into a standard Bayesian optimization loop. The example context is the optimization of a physical property (e.g., RRR) in a materials growth experiment.

Workflow and Signaling Logic

The following diagram illustrates the integrated Bayesian optimization workflow with the floor padding trick, highlighting the critical decision point at the experimental failure check.

[Workflow diagram] Start Bayesian optimization run → Initialize with initial dataset → Fit/update Gaussian process model → Select next parameters via acquisition function → Perform experiment → Experimental failure? (Yes: impute with the floor padding trick; No: record the measurement) → Update dataset and return to the model-fitting step.

Step-by-Step Protocol

Step 1: Initialization

  • Action: Collect a small initial dataset D_0 = {(x_1, y_1), ..., (x_k, y_k)} through a space-filling design (e.g., Latin Hypercube Sampling) or based on prior literature.
  • Reagents & Materials:
    • Parameter Ranges: Define the minimum and maximum values for each growth parameter (e.g., temperature, flux ratios) to be optimized.
    • Bayesian Optimization Software: Utilize a framework such as BoTorch, GPyOpt, or Ax.
    • Experimental Apparatus: The automated or semi-automated system for conducting experiments (e.g., Molecular Beam Epitaxy system).

Step 2: Model Fitting

  • Action: Fit a Gaussian Process (GP) surrogate model to the current dataset D_n. The GP will model the mean and uncertainty of the evaluation function S(x) across the parameter space.
  • Protocol Notes: Standard kernel functions like the Matérn kernel are a robust default choice.

Step 3: Candidate Selection

  • Action: Using the GP posterior, maximize an acquisition function a(x) (e.g., Expected Improvement, Upper Confidence Bound) to select the next parameter set x_{n+1} to evaluate.
  • Protocol Notes: This step balances exploration (high uncertainty) and exploitation (high predicted mean).

Step 4: Experimental Execution and Failure Assessment

  • Action: Conduct the experiment at x_{n+1}.
  • Critical Check: Determine if the experiment was successful and produced a valid, quantifiable measurement y_{n+1}.
    • Success Criteria: Must be predefined. For materials growth, this could be the successful formation of the target crystalline phase confirmed by in-situ characterization.
    • Failure Criteria: Failure to form the target material, formation of an incorrect phase, or equipment error.

Step 5: Data Imputation via Floor Padding (upon Failure)

  • Action: If the experiment at x_{n+1} fails, apply the floor padding trick.
    • Calculate the worst observed value so far: y_{floor} = min( y_i for all i ≤ n ).
    • Set y_{n+1} = y_{floor}.
    • Append (x_{n+1}, y_{floor}) to the dataset: D_{n+1} = D_n ∪ (x_{n+1}, y_{floor}).
  • Protocol Notes: This step is the core of the method. The worst value y_floor is recalculated after each iteration, making the heuristic adaptive.

Step 6: Iteration

  • Action: Return to Step 2 and repeat the loop until a convergence criterion is met (e.g., a performance threshold is reached, a maximum number of experiments is conducted, or the acquisition function value falls below a threshold).
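
The sketch below ties Steps 2 through 6 together in a minimal Python loop, assuming scikit-learn's Gaussian process regressor as the surrogate and Expected Improvement as the acquisition function; the toy objective, bounds, and failure test are hypothetical stand-ins for the real experiment, and at least one initial run is assumed to succeed.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
bounds = np.array([[0.0, 1.0], [0.0, 1.0]])        # hypothetical 2-D parameter space

def run_experiment(x):
    """Stand-in for the real experiment: returns a measurement, or None on failure."""
    if x[0] + x[1] > 1.6:                          # hypothetical infeasible region
        return None
    return -np.sum((x - 0.3) ** 2) + 0.05 * rng.normal()

X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(5, 2))   # initial design (Step 1)
y = np.array([run_experiment(x) for x in X], dtype=float)  # None becomes NaN
y[np.isnan(y)] = np.nanmin(y)                      # pad any initial failures

for _ in range(30):
    # Step 2: fit the GP surrogate on the (possibly padded) dataset
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

    # Step 3: Expected Improvement over a random candidate pool (maximization)
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2000, 2))
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]

    # Step 4: run the experiment; Step 5: floor padding on failure
    y_next = run_experiment(x_next)
    if y_next is None:
        y_next = y.min()                           # worst value observed so far
    X = np.vstack([X, x_next])
    y = np.append(y, y_next)

print("best parameters:", X[np.argmax(y)], "best value:", y.max())
```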

The Scientist's Toolkit

The following table details key computational and experimental components required to implement the described protocol.

Table 3: Research Reagent Solutions for Bayesian Optimization with Floor Padding

Item | Function/Description | Example Tools / Values
Gaussian Process Model | Serves as the probabilistic surrogate model to approximate the unknown objective function and quantify uncertainty. | BoTorch (PyTorch), GPyOpt, scikit-learn's GaussianProcessRegressor.
Acquisition Function | Guides the search by quantifying the utility of evaluating a new point, balancing exploration and exploitation. | Expected Improvement (EI), Upper Confidence Bound (UCB).
Bayesian Optimization Framework | Provides a high-level API for managing the optimization loop, models, and data. | Ax, BoTorch, GPyOpt.
Experimental Failure Criteria | Predefined, quantifiable conditions that determine if an experimental run is considered a failure and triggers the floor padding trick. | In-situ reflection high-energy electron diffraction (RHEED) pattern loss; X-ray diffraction peak absence.
Floor Padding Function | The algorithmic component that computes the worst observed value and performs the imputation upon failure. | Custom script: y_floor = current_data['y'].min().

Advanced Implementation: Synergy with a Binary Classifier

For enhanced performance, the floor padding trick can be combined with a binary classifier that predicts the probability of failure for a given parameter set. This hybrid approach, referred to as 'FB' in the literature, uses two models [1].

The logical relationship between these components is as follows:

Diagram summary: a candidate parameter x is scored by both a GP regressor (predicted performance y) and a GP classifier (predicted failure probability); if the failure probability exceeds a threshold, the candidate's acquisition score is penalized, otherwise it proceeds unmodified, and the acquisition function combines both signals.

Protocol for the FB Hybrid Method:

  • Action: In parallel to the GP regressor for the evaluation metric, fit a GP classifier (or another probabilistic classifier) to the binary success/failure data.
  • Action: During the candidate selection step (Step 3 in the main protocol), modify the acquisition function to heavily penalize or discard points that the classifier predicts with high probability will lead to failure.
  • Action: If a failure nonetheless occurs (indicating a classifier error), the floor padding trick is applied as before to update the GP regressor. The failure data point is also used to update the binary classifier.

This combined strategy actively avoids predicted failure regions while still learning from mispredicted failures, making the overall optimization process more robust and efficient [1].
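
A minimal sketch of the candidate-selection step in the FB hybrid, assuming scikit-learn models stand in for the two Gaussian processes and using a hypothetical threshold of 0.5 on the predicted failure probability; candidates the classifier flags as likely failures are zeroed out before the acquisition function is maximized.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def select_next(X, y, failed, candidates, threshold=0.5):
    """FB-style selection: EI from the regressor, gated by a failure classifier.

    X, y       : all evaluated points and their (possibly floor-padded) values
    failed     : boolean array, True where the experiment failed
                 (assumes both successes and failures have been observed)
    candidates : pool of candidate parameter sets to score
    """
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    clf = GaussianProcessClassifier(kernel=Matern(nu=2.5)).fit(X, failed.astype(int))

    p_fail = clf.predict_proba(candidates)[:, 1]    # probability of the "failed" class
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    ei[p_fail > threshold] = 0.0                    # penalize likely failures
    return candidates[np.argmax(ei)]
```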

Within the broader research on Bayesian optimization (BO) with experimental failure handling, a significant challenge is managing a priori unknown feasibility constraints. In scientific domains like drug development and materials science, many experiments fail due to reasons that cannot be perfectly predicted beforehand, such as failed syntheses, unstable compounds, or equipment issues. These failures create nonquantifiable, unrelaxable, hidden constraints on the experimental parameter space [2]. This application note details how binary classifiers can learn these unknown constraint functions on-the-fly, making autonomous scientific experimentation more efficient and robust by avoiding infeasible regions.

Core Concept: The Feasibility-Aware Bayesian Optimization Framework

The standard BO loop is extended by integrating a probabilistic classifier that actively learns the boundary between feasible and infeasible experimental conditions. The core problem is formalized as finding an optimum x* such that

x* = argmin_{x ∈ S} f(x),  where S = { x ∈ X | c(x) = 1 }.

Here, c(x) is a binary constraint function that returns 1 if an experiment at point x is feasible (yielding a measurement of the objective f(x)) and 0 if it is infeasible (a failure) [2]. This function is initially unknown and is learned sequentially.

Table 1: Key Characteristics of Unknown Constraints in Experimental Optimization

Characteristic | Description | Experimental Example
Nonquantifiable | Only binary (pass/fail) information is available, not the degree of violation. | A synthesis either succeeds or fails; no intermediate "ease of synthesis" score is provided [2].
Unrelaxable | The constraint must be satisfied to obtain an objective function measurement. | A compound's bioactivity cannot be measured if its synthesis yields insufficient material [2].
Hidden | The constraint is not known to the researcher before the experimental campaign. | The precise stability region for a new perovskite material is unknown before experimentation [2].
Simulation | Evaluating the constraint involves a costly procedure (e.g., an attempted synthesis). | The "synthetic accessibility" constraint is evaluated via the costly process of attempted synthesis [2].

The Role of the Binary Classifier

A binary classifier, typically a Gaussian Process Classifier (GPC) or other probabilistic model, is trained on all data points evaluated so far. For each point x, it estimates the probability of feasibility, p(c(x) = 1). This probability is then integrated into the acquisition function of the BO to balance the exploration/exploitation of the objective with the avoidance of likely-infeasible regions [2] [13].

Quantitative Benchmarking of Feasibility-Aware Strategies

Several acquisition functions have been proposed to handle unknown constraints. Benchmarks on synthetic and real-world problems reveal their relative performance.

Table 2: Performance Comparison of Feasibility-Aware BO Strategies

Strategy | Core Principle | Average Performance (Valid Exps.) | Convergence Speed | Best-Suited Scenario
Expected Feasibility | Explicitly maximizes the probability of being feasible and optimal. | High [2] | Fast [2] | General purpose, balanced risk.
Floor Padding Trick | Assigns the worst observed objective value to failed experiments. | Competitive [1] | Fast initial improvement [1] | Simple baseline; tasks with smaller infeasible regions [2] [1].
Binary Classifier (B) | Uses a separate classifier to predict and avoid failures. | Good [1] | Can be slower initially [1] | When active failure avoidance is a high priority.
Entropy-Based Search | Actively queries points with high uncertainty about feasibility. | High [13] | Efficient for learning boundaries [13] | For actively mapping complex constraint boundaries.

Experimental Protocols

The following protocols outline the implementation of feasibility-aware BO in scientific experimentation.

Protocol: Inverse Design of Materials with Stability Constraints

This protocol is adapted from the benchmark on hybrid organic-inorganic halide perovskite materials [2].

1. Objective Definition:

  • Primary Objective (f): Maximize a target property (e.g., photovoltaic efficiency).
  • Unknown Constraint (c): Material stability under specific conditions (feasible = stable, infeasible = unstable).

2. Initialization:

  • Design Space (X): Define the multi-dimensional parameter space (e.g., composition ratios, processing temperatures).
  • Initial Dataset: Conduct a small, space-filling set of experiments (e.g., 5-10 points) to collect initial data on both property and stability.

3. Model Configuration:

  • Objective Surrogate: Gaussian Process (GP) regression model.
  • Constraint Surrogate: Variational Gaussian Process (VGP) classifier.
  • Acquisition Function: Expected Improvement with feasibility probability (EI-CF) or similar.

4. Autonomous Loop Execution:

  • Suggestion: Optimize the acquisition function to select the next most promising and likely feasible experiment.
  • Making/Measurement: Execute the experiment.
    • IF the material is stable, measure its efficiency.
    • IF the material is unstable, record it as a failure.
    • Update: Augment the dataset and update both the GP regressor and the VGP classifier (a minimal sketch follows this list).
  • Iteration: Repeat until a performance target is met or the budget is exhausted.
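
To make the Update step concrete, the sketch below refits both surrogates after each run: the regressor sees only the feasible (stable) experiments, while the classifier sees every attempt with its feasibility label. scikit-learn models are used here as stand-ins for the GP/VGP surrogates named above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def update_surrogates(X, y, feasible):
    """Refit both surrogate models after each experiment.

    X        : (n, d) array of all attempted parameter sets
    y        : (n,) array of objective values (NaN where the experiment failed)
    feasible : (n,) boolean array, True where a measurement was obtained
               (assumes at least one feasible and one infeasible point)
    """
    # Objective surrogate: trained only on feasible (successful) experiments
    gp_obj = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp_obj.fit(X[feasible], y[feasible])

    # Constraint surrogate: trained on every attempt with its feasibility label
    gp_con = GaussianProcessClassifier(kernel=Matern(nu=2.5))
    gp_con.fit(X, feasible.astype(int))
    return gp_obj, gp_con
```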

Protocol: Drug Discovery with Synthetic Accessibility Constraints

This protocol is adapted from the benchmark on designing BCR-Abl kinase inhibitors [2] and principles of handling class imbalance [14].

1. Objective Definition:

  • Primary Objective (f): Maximize inhibitory activity (e.g., pIC50).
  • Unknown Constraint (c): Synthetic accessibility (feasible = synthesizable, infeasible = failed synthesis).

2. Initialization:

  • Design Space (X): A molecular space (e.g., defined by SELFIES or a chemical descriptor space).
  • Initial Dataset: Select a diverse set of molecules from a virtual library for initial synthesis attempts.

3. Model Configuration:

  • Objective Surrogate: GP regression on molecular fingerprints.
  • Constraint Surrogate: Bernoulli distribution-based GPC, potentially using a Complement Naive Bayes variant to handle the inherent imbalance between feasible and infeasible molecules [15] [14].
  • Acquisition Function: A feasibility-aware function like EI-CF.

4. Autonomous Loop Execution:

  • Suggestion: Propose a molecule predicted to be active and synthesizable.
  • Making/Measurement:
    • Attempt the synthesis.
    • IF synthesis is successful, proceed to measure bioactivity.
    • IF synthesis fails, record the molecule as a failure.
  • Update: Update both surrogate models. The classifier learns from the failed synthesis to improve its prediction of synthetic accessibility.
  • Iteration: Continue the loop to discover highly active and synthesizable drug candidates.

The Scientist's Toolkit: Essential Research Reagents

This table lists key computational tools and algorithms required to implement the described framework.

Table 3: Key Research Reagents for Feasibility-Aware BO

Item / Algorithm | Function / Purpose | Example Implementation / Notes
Gaussian Process (GP) Regressor | Models the unknown objective function f(x) and provides uncertainty estimates. | Use a Matérn kernel for modeling scientific functions.
Gaussian Process Classifier (GPC) | Models the unknown binary constraint function c(x) and provides feasibility probabilities. | A Variational GPC is recommended for robustness [2].
Expected Improvement (EI) | Standard acquisition function for optimizing the objective. | Suggests points with high potential improvement.
Feasibility-Weighted EI (EI-CF) | Modifies EI to account for the probability of feasibility. | The workhorse acquisition function for constrained BO [2].
Atlas | Open-source Python library for Bayesian optimization. | Includes implementations of the strategies discussed here [2].
GRYFFIN | Another BO package with some constraint handling capabilities. | Can be used for known constraint problems [13].

Workflow and System Diagrams

The following diagrams illustrate the logical workflow of the feasibility-aware BO system and the integration of the classifier.

Feasibility-Aware BO Workflow

Workflow (diagram): initialize with a small dataset → train a GP regressor on the objective f and a GP classifier on the constraint c → optimize the feasibility-aware acquisition function → execute the experiment → check feasibility → if feasible, measure the objective f(x); if infeasible, record a failure (no f(x) measurement) → update the dataset → if the optimum has not been found, retrain both models and repeat; otherwise report results.

Classifier Integration Logic

Diagram summary: a candidate point x is passed to the GP classifier, which returns the probability of feasibility p(c(x) = 1), and to the GP regressor, which returns the predicted objective μ(x) and variance σ²(x); both feed the acquisition function, which produces the final suggestion of a promising and feasible x.

In the broader context of research on Bayesian optimization (BO) with experimental failure handling, feasibility-aware acquisition functions represent a critical advancement. These functions transform BO from a mere optimizer into a robust decision-making framework for autonomous experimentation. They address a pervasive challenge in scientific domains, including drug development: many optimization algorithms fail to intelligently manage a priori unknown constraints, which lead to experimental failures and wasted resources [2].

Such unknown constraints are ubiquitous, stemming from failed syntheses, unstable molecules, or equipment failures in chemical processes [2] [16]. In materials science, a target material phase may not form, preventing property measurement [1]. In drug discovery, a molecule might be synthetically inaccessible, halting its progression [2]. Naive BO strategies, which only optimize for performance, sample these infeasible regions repeatedly, depleting precious experimental budgets.

Feasibility-aware acquisition functions tackle this by integrating an online-learned probabilistic model of the constraint function directly into the sampling decision. They systematically balance the pursuit of high performance with the avoidance of likely constraint violations, enabling more efficient navigation of complex, constrained experimental landscapes. This document details their quantitative comparison, provides protocols for their implementation, and contextualizes their role within a modern experimental workflow.

Quantitative Analysis of Feasibility-Aware Methods

A comparative analysis of different strategies reveals significant performance variations. The following table synthesizes findings from benchmark studies on synthetic and real-world problems, highlighting the efficiency of different approaches in handling unknown constraints.

Table 1: Comparison of Strategies for Handling Unknown Constraints in Bayesian Optimization

Strategy Category | Specific Method | Key Mechanism | Average Performance (vs. Naive) | Best-Suited Constraint Scenarios
Naive (Baseline) | Constant Penalty [1] | Assigns a fixed, poor objective value (e.g., 0 or -1) to failures. | Highly sensitive to penalty choice; can be suboptimal or lead to slow improvement. | Small, easily avoided infeasible regions.
Adaptive Imputation | Floor Padding [1] | Assigns the worst observed objective value from successful experiments to failures. | Quick initial improvement; robust without parameter tuning; final performance can be slightly suboptimal. | General-purpose use when constraint boundaries are unknown.
Classification-Guided | Binary Classifier [1] | Uses a separate model (e.g., GP classifier) to predict failure probability and avoids those regions. | Slower initial improvement but better avoidance of failures; less sensitive to penalty value. | Problems where feasibility is a primary concern and can be learned.
Integrated Acquisition | Feasibility-Aware EI [2] | Multiplies standard Expected Improvement by the probability of feasibility. | On average, outperforms naive strategies, producing more valid experiments and finding optima at least as fast. | Most scenarios, especially with balanced risk.
Feasibility-Driven Search | FuRBO [17] | Uses inspector sampling and a trust region to aggressively guide the search toward feasible regions. | Ties or outperforms alternatives, with superior performance in high-dimensional problems where feasible regions are narrow and hard to find. | High-dimensional problems (dozens of variables) with small, irregular feasible regions.

Key insights from benchmark studies include:

  • Balanced strategies outperform naive ones: Comprehensive testing, as in the Anubis benchmark, shows that feasibility-aware acquisition functions with balanced risk (like Feasibility-Aware EI) on average outperform commonly adopted naive strategies [2]. They produce a higher proportion of valid experiments and locate the global optimum at least as fast.
  • Scenario-dependent performance: In tasks where the region of infeasibility is small, naive strategies can be competitive. However, their performance deteriorates significantly as the complexity and size of the infeasible region grow [2].
  • The cost of initial learning: Methods that rely on learning a separate classifier, such as the Binary Classifier approach, can exhibit slower initial improvement as the model requires data to learn the constraint boundary [1].

Experimental Protocols

This section provides a detailed, actionable protocol for implementing a feasibility-aware BO campaign, using the discovery of a BCR-Abl kinase inhibitor with unknown synthetic accessibility constraints as a representative example [2].

Protocol: Drug Discovery with Unknown Synthetic Accessibility

1. Problem Formulation and Goal

  • Objective: To find a molecule x* with high inhibitory activity (e.g., low IC50) against BCR-Abl kinase.
  • Unknown Constraint: The synthetic accessibility of proposed molecular candidates. A molecule x is infeasible if its synthesis fails or yield is insufficient for assay.
  • Success Metric: Find the molecule with the best activity within a fixed budget of 50-100 total experiments (including both successful and failed syntheses).

2. Experimental Setup and Reagent Solutions

Table 2: Research Reagent Solutions for the Drug Discovery Benchmark

Reagent / Resource | Function in the Experiment
Virtual Chemical Library | A large set of purchasable or easily enumerable molecules (e.g., from the ZINC database) serves as the search space.
Retrosynthesis Software (e.g., ASKCOS, IBM RXN) | Provides a coarse-grained proxy for evaluating synthetic difficulty during pre-screening (optional).
Automated Synthesis Platform | Enables high-throughput execution of chemical synthesis protocols for proposed molecules.
LC-MS / NMR Equipment | Used to confirm successful synthesis and purify the compound for biological testing.
Bioassay Kit for BCR-Abl Kinase | Measures the primary objective (e.g., IC50) of successfully synthesized molecules.

3. Bayesian Optimization Workflow

  • Step 1 - Initial Design: Select 5-10 initial molecules using a space-filling design (e.g., Latin Hypercube Sampling) or based on expert knowledge to ensure some diversity.
  • Step 2 - "Make" Phase: Attempt to synthesize the proposed molecule using the automated platform.
  • Step 3 - "Measure" and Classify:
    • If synthesis is successful: Proceed to measure the IC50 value. Record y = IC50 and the feasibility label c = 1 (feasible).
    • If synthesis fails: Record no IC50 value. Record c = 0 (infeasible).
  • Step 4 - Model Update:
    • Update a Gaussian Process (GP) regressor with all data points (x, y) where synthesis was successful.
    • Update a Variational Gaussian Process (VGP) classifier with all data points (x, c), where c is the feasibility label (0 or 1). This model learns the probability of feasibility, p(feasible | x) [2].
  • Step 5 - Suggestion via Acquisition Function:
    • Calculate a feasibility-aware acquisition function, such as Constrained Expected Improvement (CEI): α_CEI(x) = α_EI(x) × p(feasible | x), where α_EI(x) is the standard Expected Improvement from the GP regressor [2].
    • Optimize α_CEI(x) over the chemical search space to propose the next molecule x_next for synthesis (see the sketch after this workflow).
  • Step 6 - Iterate: Return to Step 2 and repeat until the experimental budget is exhausted.
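
A minimal sketch of the Constrained Expected Improvement computation from Step 5, assuming the regressor and classifier from Step 4 expose a scikit-learn-style predict/predict_proba interface; because IC50 is minimized, improvement is measured below the best (lowest) value observed so far.

```python
import numpy as np
from scipy.stats import norm

def constrained_ei(candidates, gp_regressor, gp_classifier, best_ic50):
    """alpha_CEI(x) = alpha_EI(x) * p(feasible | x), for a minimized objective (IC50)."""
    mu, sigma = gp_regressor.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)

    # Expected Improvement below the best (lowest) IC50 measured so far
    z = (best_ic50 - mu) / sigma
    ei = (best_ic50 - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Probability of feasibility from the classifier (column of the label-1 "feasible" class)
    p_feasible = gp_classifier.predict_proba(candidates)[:, 1]
    return ei * p_feasible

# Usage sketch: x_next = candidates[np.argmax(constrained_ei(candidates, gp, clf, y_feasible.min()))]
```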

4. Critical Steps and Troubleshooting

  • Kernel Selection: Use a molecular kernel (e.g., Tanimoto kernel for fingerprints; see the sketch after this list) for both the GP regressor and VGP classifier to capture similarity between molecules effectively.
  • Handling Imbalanced Data: Early in the optimization, failed experiments may dominate. Using a VGP classifier is more robust to such class imbalance compared to a standard GP classifier.
  • Validation: Run multiple optimization trials with different initializations to ensure the robustness of the result. The best-performing molecule from the final iteration should be re-synthesized and re-tested to confirm activity.
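
A minimal NumPy sketch of the Tanimoto kernel over binary fingerprints mentioned under Kernel Selection; in practice this function would be wrapped as a custom kernel object for whichever GP library is in use, and the example fingerprints are hypothetical.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto (Jaccard) similarity between binary fingerprint matrices.

    A : (n, p) array of 0/1 fingerprints
    B : (m, p) array of 0/1 fingerprints
    Returns an (n, m) kernel matrix with entries |a AND b| / |a OR b|.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    intersection = A @ B.T
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - intersection
    return intersection / np.maximum(union, 1e-12)

# Example: similarity between two hypothetical 8-bit fingerprints
fp = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
               [1, 0, 0, 1, 0, 1, 1, 0]])
print(tanimoto_kernel(fp, fp))
```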

Workflow Visualization

The following diagram illustrates the core closed-loop workflow of a feasibility-aware Bayesian optimization, as described in the protocol.

Workflow (diagram): start with the initial dataset → suggest a candidate via the feasibility-aware acquisition function → attempt the synthesis (make) → if synthesis succeeds, measure the objective (e.g., IC50); if it fails, record a failure with no objective data → update the models → if the budget is not exhausted, suggest the next candidate; otherwise return the best result.

Feasibility-Aware BO Workflow - This diagram shows the suggest-make-measure cycle that handles experimental failure. The key difference from a standard BO loop is the decision point after "Make," which routes failed experiments to update only the constraint model.

The Scientist's Toolkit

Table 3: Essential Software and Computational Tools for Feasibility-Aware BO

Tool / Library | Function | Key Feature for Feasibility
BoTorch [17] | A flexible library for Bayesian optimization research and implementation. | Provides built-in acquisition functions like Constrained Expected Improvement (qECI) for handling unknown constraints.
Atlas [2] | An open-source BO package designed for autonomous scientific experimentation. | Implements several feasibility-aware strategies, including the VGP classifier for constraint learning, as used in the Anubis benchmark.
Summit [16] | A Python toolkit for chemical reaction optimization and analysis. | Offers multiple optimization strategies, including TSEMO, which can be adapted for constraint handling in reaction spaces.
BioKernel [18] | A no-code BO framework designed for biological experiment optimization. | Features modular kernel architecture and heteroscedastic noise modeling, which are beneficial for modeling complex biological constraints.

Integrating feasibility-aware acquisition functions into Bayesian optimization frameworks is a cornerstone for developing robust and truly autonomous research systems. As benchmarks have demonstrated, moving beyond naive strategies like constant penalty allows researchers to navigate complex experimental landscapes with greater sample efficiency and a higher rate of return on investment [1] [2].

The continued development of algorithms like FuRBO for high-dimensional problems [17] and target-oriented BO for specific property values [19] shows the field's trajectory toward ever-more specialized and powerful optimization tools. For researchers in drug development and materials science, adopting these feasibility-aware protocols is no longer a speculative advantage but a necessary step in accelerating the pace of discovery while effectively managing the inherent risks and costs of experimental failure.

Model-based optimization strategies, particularly Bayesian optimization (BO), have become a cornerstone of autonomous scientific experimentation due to their sample efficiency and flexibility. When combined with automated laboratory equipment in a closed-loop system, they form the core of a self-driving laboratory (SDL), a next-generation technology for accelerating scientific discovery [20] [6]. A pervasive challenge in real-world scientific experimentation, especially in fields like chemistry, materials science, and drug development, is handling unexpected experimental failures. These failures arise from a priori unknown feasibility constraints in the parameter space, stemming from issues such as failed syntheses, unstable materials, unexpected equipment failures, or inaccessible drug dose combinations [20] [1] [21]. Traditional BO algorithms, which assume every suggested parameter combination can be evaluated, often perform poorly when a significant portion of the parameter space is infeasible. This application note details modern computational frameworks, led by Anubis, that are specifically designed to handle such unknown constraints, thereby advancing the reliability and efficiency of autonomous experimentation.

Several sophisticated frameworks have been developed to navigate the problem of unknown constraints and experimental failures. The table below summarizes the core approaches of three key implementations.

Table 1: Key Frameworks for Bayesian Optimization with Unknown Constraints

Framework Name | Core Problem Addressed | Primary Strategy | Reported Application Domain
Anubis [20] [6] | Unknown feasibility constraints | Learns a constraint function on-the-fly using a variational Gaussian process classifier, combined with feasibility-aware acquisition functions. | Materials design (e.g., perovskite stability), drug design (e.g., synthetic accessibility)
Floor Padding with Classifier [1] | Experimental failures in high-throughput materials growth | Imputes failed experiments with the worst observed value ("floor padding") and uses a binary classifier to predict failure probability. | Thin film growth via molecular beam epitaxy (MBE)
BATCHIE [21] | Intractable scale of combination drug screens | Uses Bayesian active learning (Probabilistic Diameter-based Active Learning) to design maximally informative sequential experiment batches. | Large-scale combination drug screening on cancer cell lines

Detailed Experimental Protocols

Protocol: Implementing the Anubis Framework for a Self-Driving Laboratory

The Anubis framework is designed for autonomous experimentation where feasibility constraints are unknown at the outset.

I. Research Reagent Solutions & Computational Tools

Table 2: Essential Components for an Anubis-driven SDL

Item Name | Function / Explanation | Example/Note
Automated Laboratory Hardware | Executes the "make" step of the closed loop; could be a synthesizer, printer, or bioreactor. | Critical for the "suggest-make-measure" cycle.
Characterization Tools | Executes the "measure" step; could be an HPLC, spectrometer, or scanner. | Provides the objective and constraint data.
Atlas Python Library | The open-source software library containing the Anubis implementation. | Hosts the feasibility-aware BO algorithms [20].
Variational Gaussian Process (GP) Classifier | The surrogate model that learns the probability of a parameter set being feasible from experimental data. | Models the unknown constraint function [6].
Gaussian Process Regressor | The standard surrogate model that learns the relationship between parameters and the primary objective. | Models the performance metric to be optimized.
Feasibility-Aware Acquisition Function | Balances high performance and feasibility when suggesting new experiments. | Examples: Expected Feasible Improvement [20].

II. Step-by-Step Methodology

  • Initial Experimental Design:

    • Select an initial set of experiments (e.g., via Latin Hypercube Sampling) to gain a coarse understanding of the parameter space.
    • The number of initial points should be a small multiple of the parameter space dimensionality.
  • Closed-Loop Experimentation Cycle:

    • Suggest: The Anubis algorithm suggests the next experiment(s) by optimizing a feasibility-aware acquisition function. This function uses predictions from both the GP regressor (for performance) and the GP classifier (for feasibility).
    • Make: The automated laboratory hardware executes the suggested experiment.
    • Measure: The characterization tools measure the primary objective (e.g., material property, drug efficacy) and, crucially, record whether the experiment was a success or a failure (the feasibility label).
    • Update: The dataset is updated with the new result. Both the GP regressor (with successful experiments) and the GP classifier (with all experiments, labeled as success/failure) are retrained on the expanded dataset.
  • Termination:

    • The loop continues until a performance target is met, a computational budget is exhausted, or the model uncertainty is sufficiently reduced.

The following workflow diagram illustrates the core closed-loop process of the Anubis framework:

Workflow (diagram): start with the initial dataset → train the surrogate models (GP regressor for performance, GP classifier for feasibility) → optimize the feasibility-aware acquisition function → suggest the next experiment → execute it (make) → measure the outcome and feasibility (measure) → update the dataset → if the stopping criteria are not met, retrain the models and repeat; otherwise report the optimal, feasible solution.

Protocol: Bayesian Optimization with Floor Padding for Materials Growth

This protocol is adapted from a method successfully used to optimize thin-film growth via Molecular Beam Epitaxy (MBE) [1].

I. Research Reagent Solutions & Computational Tools

  • High-Throughput Synthesis Platform: e.g., an automated MBE system.
  • Characterization Tool: e.g., a four-point probe for measuring electrical properties.
  • Gaussian Process Regressor: To model the objective function.
  • Binary Classifier: A separate model (e.g., another GP classifier) to predict the probability of experimental failure.

II. Step-by-Step Methodology

  • Initialize: Begin with a small set of initial experiments.
  • Evaluate and Label: For each experiment, measure the objective (e.g., Residual Resistivity Ratio). If the experiment fails to produce a measurable target material, label it as a failure.
  • Impute Missing Data (Floor Padding): When updating the GP regressor's dataset, assign the worst observed objective value from all successful experiments to any failed experiment. This actively discourages the algorithm from sampling near regions of failure.
  • Train Predictive Models:
    • Train the GP regressor on the dataset containing both successful experiments (real values) and failed experiments (imputed values).
    • Train a binary classifier on all experiments (success/failure labels) to model the probability of failure.
  • Suggest Next Experiment: The next parameter set is selected by optimizing a standard acquisition function (e.g., Expected Improvement), but the optimization is constrained to regions where the binary classifier predicts a low probability of failure.
  • Iterate: Repeat steps 2-5 until the optimization budget is consumed.

Table 3: Performance Comparison of Failure-Handling Strategies on a Simulated Benchmark [1]

Strategy | Binary Classifier | Initial Improvement | Final Performance | Robustness to Padding Choice
Floor Padding (F) | No | Fast | Suboptimal | High (automatic adaptation)
Constant Padding @-1 | No | Slow | High | Low (requires tuning)
Constant Padding @0 | No | Fast | Medium | Low (requires tuning)
Floor Padding + Classifier (FB) | Yes | Medium | High | High

Advanced Application: Large-Scale Combination Drug Screening with BATCHIE

The BATCHIE framework addresses the immense scale of combination drug screens, where the number of possible experiments (drug-dose-cell line combinations) is intractable [21].

I. Step-by-Step Methodology

  • Initial Batch Design: Use a space-filling design (e.g., Latin Hypercube) to select a diverse initial batch of combinations covering the drug and cell line space.
  • Model Training: Run the initial batch and use the results (e.g., cell viability measurements) to train a Bayesian model. BATCHIE uses a hierarchical Bayesian tensor factorization model that decomposes a combination's effect into individual drug effects and interaction terms.
  • Active Learning Loop:
    • Simulate: Use the model's posterior distribution to simulate plausible outcomes for candidate experiments in the next batch.
    • Score Informativeness: Calculate how much each candidate experiment is expected to reduce the model's posterior uncertainty using the Probabilistic Diameter-based Active Learning (PDBAL) criterion.
    • Select Batch: Select the batch of experiments that, as a set, is maximally informative.
    • Run and Update: Execute the new batch of experiments and update the Bayesian model with the results.
  • Validation: Once the budget is exhausted, use the final trained model to prioritize top combination hits for rigorous experimental validation.

The following diagram illustrates BATCHIE's adaptive screening workflow, which efficiently narrows down optimal combinations from a vast experimental space.

Workflow (diagram): large drug and cell line library → design the initial batch (space-filling design) → run experiments and measure responses → train the Bayesian tensor factorization model → use the model posterior to design the next batch (via PDBAL) → select the maximally informative combinations and run them → when the budget is exhausted, prioritize top hits for validation.

Performance and Outlook

Benchmarking studies demonstrate that feasibility-aware strategies consistently outperform naive approaches. The Anubis framework showed that on average, it produces more valid experiments and finds optima at least as fast as methods that do not properly handle constraints [20] [6]. In a prospective screen of a 206-drug library across 16 cancer cell lines, BATCHIE accurately predicted unseen combinations and detected synergies after exploring only 4% of the 1.4 million possible experiments, identifying a clinically relevant hit [21]. The "floor padding" method enabled the discovery of a high-quality SrRuO3 film in just 35 growth runs [1]. These frameworks, readily available in open-source libraries like Atlas and BATCHIE, are proving to be indispensable tools for making autonomous experimentation a practical and powerful reality across the natural sciences.

Drug discovery and development is a long, costly, and high-risk process, with 90% of clinical drug development failures occurring after entry into clinical trials [22]. Analyses of clinical trial data reveal that 40-50% of these failures stem from lack of clinical efficacy, while approximately 30% result from unmanageable toxicity [22]. This high failure rate persists despite implementation of successful strategies in target validation, high-throughput screening, and drug optimization, raising critical questions about whether certain aspects of target validation and drug optimization are being overlooked [22].

Current drug optimization paradigms heavily emphasize potency and specificity using structure-activity-relationship (SAR) but often overlook tissue exposure and selectivity using structure-tissue exposure/selectivity-relationship (STR) [22]. This imbalance can mislead drug candidate selection and negatively impact the balance of clinical dose, efficacy, and toxicity. Bayesian optimization (BO) with experimental failure handling represents a promising framework to address these challenges by efficiently navigating complex, multi-dimensional parameter spaces while learning from both successful experiments and failures.

Bayesian Optimization with Experimental Failure Handling

Core Principles of Failure-Tolerant BO

Bayesian optimization provides a sample-efficient approach for global optimization of expensive black-box functions, making it particularly suitable for drug candidate optimization where experimental resources are limited [1] [23]. The standard BO framework consists of two main components: a probabilistic surrogate model (typically Gaussian Processes) that models the objective function, and an acquisition function that guides the search by balancing exploration and exploitation [23].

The crucial innovation for drug development applications is the extension of BO to handle experimental failures - cases where certain parameter combinations (e.g., drug formulations, synthesis conditions) fail to produce viable candidates or measurable outcomes [1]. These failures represent missing data points that traditional BO algorithms cannot effectively utilize, limiting their efficiency in real-world drug optimization scenarios.

Technical Approaches for Failure Handling

Diagram summary: an experimental failure is handled by the floor padding trick and/or a binary classifier, both of which feed an enhanced surrogate model; the updated acquisition function then selects the next parameters, closing the loop back to the next (possibly failing) experiment.

Figure 1: Bayesian Optimization with Experimental Failure Handling

Two primary technical approaches have been developed for handling experimental failures in BO:

Floor Padding Trick

This method equalizes the evaluation value for missing data due to experimental failures to the worst evaluation value observed at that time [1]. When an experiment at parameter xₙ fails, the floor padding trick complements the missing yₙ value with min₁≤ᵢ<ₙ yᵢ. This approach provides the search algorithm with information that the attempted parameter worked negatively while being adaptive and automatic [1]. Unlike naïve constant padding, which requires careful tuning of the padding constant, the floor padding trick dynamically adjusts based on observed data.

Binary Classifier Integration

This approach employs a separate binary classifier to predict whether a given parameter will lead to experimental failure [1]. The classifier, typically based on Gaussian Processes, is trained on historical failure data and combined with the surrogate model for evaluation prediction. The binary classifier helps avoid subsequent failures but may not fully update the evaluation prediction model when employed as a distinct model [1].

Advanced Framework: Reasoning BO

Recent advances have integrated large language models (LLMs) with BO to create more robust optimization frameworks. Reasoning BO leverages LLMs' reasoning capabilities to guide the sampling process while incorporating multi-agent systems and knowledge graphs for online knowledge accumulation [23]. This framework addresses three fundamental limitations of traditional BO:

  • Ineffective utilization of domain-specific prior knowledge
  • Lack of interpretability in mathematical optimization
  • Weak cross-domain adaptability [23]

The framework operates through three core technical components: (1) reasoning-enhanced BO that incorporates natural language specifications and domain knowledge, (2) multi-agent knowledge management for dynamic information extraction and storage, and (3) post-training strategies for model enhancement [23].

Application to Drug Candidate Optimization

STAR: Structure-Tissue Exposure/Selectivity-Activity Relationship

The integration of failure-tolerant BO enables the implementation of a comprehensive Structure-Tissue Exposure/Selectivity-Activity Relationship (STAR) framework for drug optimization [22]. STAR classifies drug candidates based on three key parameters:

Table 1: STAR Drug Classification Framework

Class | Specificity/Potency | Tissue Exposure/Selectivity | Dose Requirement | Clinical Outcome
Class I | High | High | Low | Superior efficacy/safety with high success rate
Class II | High | Low | High | Moderate efficacy with high toxicity
Class III | Adequate | High | Low | Good efficacy with manageable toxicity
Class IV | Low | Low | High | Inadequate efficacy/safety - early termination

This classification provides a systematic approach to drug candidate selection that balances the critical factors of potency, tissue exposure, and selectivity - addressing the current overemphasis on potency/specificity alone [22].
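
As an illustration of how the STAR classes in Table 1 might be assigned programmatically, the sketch below maps potency/specificity and tissue exposure/selectivity flags to the four classes; the flag definitions (and any thresholds used to set them) are hypothetical and would be calibrated per program.

```python
def star_class(high_potency: bool, high_tissue_selectivity: bool, adequate_potency: bool = True) -> str:
    """Assign a STAR class from potency/specificity and tissue exposure/selectivity flags."""
    if high_potency and high_tissue_selectivity:
        return "Class I"    # low dose; superior efficacy/safety
    if high_potency and not high_tissue_selectivity:
        return "Class II"   # high dose; moderate efficacy with high toxicity
    if adequate_potency and high_tissue_selectivity:
        return "Class III"  # low dose; good efficacy with manageable toxicity
    return "Class IV"       # inadequate efficacy/safety; terminate early
```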

Quantitative Performance Assessment

Table 2: Bayesian Optimization Performance Comparison

Method | Key Features | Application Results | Limitations
Standard BO with Floor Padding | Adaptive failure imputation using worst observed value | Quick initial improvement in optimization; 80.1 RRR in SrRuO₃ films in 35 runs [1] | Final evaluation may be suboptimal compared to carefully tuned constants
BO with Binary Classifier | Predicts failure probability for parameters | Reduces sensitivity to padding constant choice [1] | Slower initial improvement; may not fully update evaluation model
Reasoning BO | LLM-guided sampling with knowledge graphs | 60.7% yield in Direct Arylation vs 25.2% with traditional BO [23] | Potential hallucinations in LLM suggestions; computational complexity

Experimental Protocols

Protocol 1: Implementing Failure-Tolerant BO for Preclinical Optimization

Materials and Reagents

Table 3: Research Reagent Solutions for BO Implementation

Reagent/Resource | Function | Specifications
Gaussian Process Framework | Surrogate modeling | RBF kernel; zero prior mean function
Acquisition Function | Guide parameter selection | Expected Improvement (EI) or Upper Confidence Bound (UCB)
Binary Classifier Model | Predict experimental failure probability | Gaussian Process classifier or Random Forest
Knowledge Graph | Domain knowledge representation | Structured database of drug properties, toxicity data
Human Disease Models | Preclinical validation | Organoids, bioengineered tissue models, organs-on-chips [24]
Procedure
  • Initial Experimental Design

    • Define multi-dimensional parameter space (e.g., chemical structure properties, formulation parameters)
    • Select 5-10 initial points using Latin Hypercube Sampling for space-filling design
    • Establish evaluation metrics (e.g., binding affinity, solubility, metabolic stability)
  • Iterative Optimization Loop

    • Execute experiments at suggested parameters
    • Classify outcomes as successful measurements or experimental failures
    • Apply floor padding trick: impute failed experiments with min₁≤ᵢ<ₙ yᵢ
    • Update binary classifier with failure/success data
    • Retrain Gaussian Process surrogate model on augmented dataset
    • Optimize acquisition function to select next parameters
    • Integrate domain knowledge through knowledge graph queries
  • Termination Criteria

    • Convergence criteria met (e.g., <1% improvement over 5 iterations; see the helper sketch after this procedure)
    • Maximum iteration count reached (typically 50-100 iterations)
    • Identification of candidate meeting all target criteria
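
A small helper illustrating the relative-improvement stopping rule quoted above ("<1% improvement over 5 iterations"); the window and tolerance are parameters, and the function is a sketch rather than part of any cited framework.

```python
def converged(best_values, window=5, rel_tol=0.01):
    """Stop when the best observed value has improved by less than rel_tol
    over the last `window` iterations.

    best_values : running list of the best objective value after each iteration
    """
    if len(best_values) <= window:
        return False
    old, new = best_values[-window - 1], best_values[-1]
    improvement = abs(new - old) / max(abs(old), 1e-12)
    return improvement < rel_tol
```
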
Data Analysis
  • Compute confidence intervals for optimal parameters using bootstrap resampling of surrogate model
  • Perform sensitivity analysis to identify critical parameters
  • Validate predicted optimal candidates through independent experiments

Protocol 2: STAR-Based Drug Candidate Classification

Diagram summary: drug candidate parameters feed a potency/specificity assessment and a tissue exposure/selectivity profile; both feed the STAR classification, which routes candidates to Class I (proceed to development), Class II (optimize tissue exposure), Class III (optimize potency), or Class IV (terminate development).

Figure 2: STAR-Based Drug Candidate Classification Workflow

Materials
  • In vitro assay systems for target binding affinity measurement
  • Tissue distribution study platforms (e.g., organ-on-chip models [24])
  • ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) screening assays
  • Analytical instruments for drug concentration quantification (LC-MS/MS)
Procedure
  • Potency/Specificity Profiling

    • Determine IC₅₀/Kᵢ values for primary target
    • Assess selectivity against related targets (e.g., kinase panels)
    • Evaluate cellular potency in disease-relevant models
  • Tissue Exposure/Selectivity Assessment

    • Measure tissue-plasma partition coefficients
    • Determine accumulation in target vs. off-target tissues
    • Assess penetration at disease sites using advanced disease models [24]
  • STAR Classification Implementation

    • Calculate tissue selectivity indices (target vs. vital organs)
    • Classify candidates according to STAR framework (Table 1)
    • Apply machine learning models to predict clinical dose requirements
Data Analysis
  • Construct quantitative structure-property relationship (QSPR) models
  • Perform multivariate analysis to identify critical determinants of tissue selectivity
  • Develop physiologically-based pharmacokinetic (PBPK) models for human dose prediction

Implementation Considerations for Drug Development

Addressing Low Effect Size Challenges

Bayesian optimization in biomedical applications must address the challenge of low effect sizes typical in neuro-psychiatric outcome measures and other biological systems [4]. Standard BO methods may fail for effect sizes below Cohen's d of 0.3, primarily due to over-sampling of parameter space boundaries as variance becomes disproportionately large [4].

Mitigation strategies include:

  • Input warping to normalize parameter distributions
  • Boundary-avoiding Iterated Brownian-bridge kernel to reduce edge variance
  • Structured acquisition functions that account for measurement noise characteristics

Integration with Preclinical Models

The transition from animal models to human disease models represents a critical opportunity for failure-tolerant BO [24]. Bioengineered human disease models, including organoids, bioengineered tissue models, and organs-on-chips, offer improved clinical biomimicry and predictability [24]. BO can optimize parameters for these complex model systems while handling inevitable experimental failures through the described methodologies.

The implementation of failure-tolerant Bayesian optimization represents a paradigm shift in drug candidate optimization, addressing the critical challenge of experimental failures that plague conventional approaches. By integrating the floor padding trick, binary classifiers, and reasoning systems with domain knowledge, researchers can efficiently navigate complex, multi-dimensional parameter spaces while learning from both successes and failures. The STAR framework provides a systematic approach to balance potency, tissue exposure, and selectivity - addressing key factors in the persistent high failure rate of clinical drug development. As human disease models continue to advance, failure-tolerant BO offers a robust computational framework to accelerate the identification of viable drug candidates with optimal efficacy and safety profiles.

When Robust BO Fails: Diagnosing Pitfalls and Optimizing Performance

Within the framework of research on Bayesian optimization (BO) with experimental failure handling, understanding and mitigating model misspecification is paramount. Model misspecification occurs when the surrogate model or prior beliefs fundamentally misrepresent the underlying system under study. In high-stakes fields like drug development, where experiments are costly and failures are consequential, such misspecification can lead to the systematic selection of suboptimal experiments. This document details how these incorrect assumptions induce linear regret—a cumulative performance loss that grows linearly with the number of experiments—and provides protocols to diagnose, prevent, and overcome these perils.

The Misspecification-Regret Nexus: A Quantitative Analysis

Linear regret, denoted in its canonical form as R(T) = O(T) over T trials, signifies that the average performance gap does not diminish with experimentation. In clinical development, this translates to prolonged trials, increased patient exposure to inferior treatments, and substantial financial losses. The core mathematical breakdown reveals that misspecification introduces a persistent bias that cannot be averaged out, causing the optimization process to become trapped in a suboptimal region of the experimental design space [25].

The following table synthesizes key quantitative findings from analyses of misspecification in biological and clinical contexts.

Table 1: Quantitative Impacts of Model Misspecification in Experimental Settings

Experimental Context | Misspecification Source | Impact on Parameter Estimation | Performance Metric Degradation
Cell Proliferation Assay [26] | Assuming logistic growth (β=1) for a true Richards' growth process (β=2) | Strong, non-physiological dependence between growth rate r and initial cell density u₀ | Inaccurate inference of physiological differences; precise but biased estimates
General BOED [25] | Incorrect surrogate model under covariate shift | Amplification of generalization error via error (de-)amplification | Linear growth of cumulative regret, R(T) ∝ T
Seamless Clinical Trial [27] | Overly simplistic hierarchical model for patient subgroups | Failure to identify heterogeneous treatment effects | Inefficient resource allocation, missed therapeutic signals

The phenomenon of error amplification under covariate shift has been identified as a critical contributor to this regret, distinct from the shift itself [25]. This explains why standard BO methods, which often rely on Gaussian Process (GP) priors with stationary kernels, fail in complex, non-stationary reward landscapes commonly encountered in biological systems [28].

Experimental Protocols for Diagnosing Misspecification

Protocol: Residual Analysis for Dynamic Model Calibration

This protocol is designed to detect misspecification in models calibrated to time-series data, such as cell growth or protein kinetic studies.

  • Model Calibration: Calibrate the candidate model (e.g., logistic growth, du/dt = r u (1 − u/K)) to the observed data u_obs(t) using a standard Bayesian inference approach. Assume independent additive Gaussian noise: u_obs(t) = u(t) + σε, where ε ~ N(0, 1) [26].
  • Posterior Sampling: Draw a representative sample of parameter sets (e.g., {r, K, σ}) from the posterior distribution using Markov Chain Monte Carlo (MCMC) sampling.
  • Residual Calculation: For each parameter sample, simulate the model to obtain the predicted trajectory u(t). Compute the residuals ε(t) = u_obs(t) − u(t) across the time series.
  • Pattern Analysis: Plot the residuals against time and against the model-predicted value u(t). The presence of systematic patterns (e.g., runs of positive or negative residuals, curvature) is a strong indicator of model misspecification, even if the overall R² appears high [26] (a minimal sketch of this diagnostic follows).
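
A minimal sketch of the residual diagnostic in Steps 3 and 4, assuming a logistic model with point estimates r, K, u0 and observations u_obs at times t; a full implementation would repeat the calculation over posterior samples rather than a single parameter set.

```python
import numpy as np
from scipy.integrate import solve_ivp

def logistic_residuals(t, u_obs, r, K, u0):
    """Residuals between observations and a logistic-growth trajectory."""
    sol = solve_ivp(lambda _t, u: r * u * (1 - u / K), (t[0], t[-1]), [u0], t_eval=t)
    u_pred = sol.y[0]
    return u_obs - u_pred

def longest_run(residuals):
    """Length of the longest run of same-sign residuals; long runs suggest misspecification."""
    signs = np.sign(residuals)
    best = run = 1
    for a, b in zip(signs[:-1], signs[1:]):
        run = run + 1 if a == b and a != 0 else 1
        best = max(best, run)
    return best
```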

Protocol: Semi-Parametric Gaussian Process for Robust Inference

This methodology propagates uncertainty in model structure to parameter estimates, reducing bias from misspecified functional forms.

  • Problem Formulation: Identify a key term in the differential equation with uncertain parametric form. For generalized logistic growth, this is the crowding function f in du/dt = r u f(u/K) [26].
  • GP Prior Placement: Place a Gaussian Process prior on the unknown function f(·), encoding prior beliefs (e.g., f(0) = 1, f(1) = 0) through the mean function.
  • Joint Posterior Inference: Employ a Bayesian sampling framework (e.g., Hamiltonian Monte Carlo) to infer the joint posterior distribution of the structural parameters of interest (e.g., the low-density growth rate r) and the non-parametric function f.
  • Uncertainty Quantification: Analyze the posterior distribution of r. A well-specified model will show a posterior that is robust to different initializations and prior assumptions on f. Compare the credible intervals for r with those from a misspecified parametric model to assess the reduction in bias.

Visualization of Workflows and Relationships

Workflow for Misspecification-Aware Bayesian Optimization

The following diagram illustrates a robust BO workflow that integrates model checking and adaptation to mitigate linear regret.

Workflow (diagram): start with an initial experimental design → run the experiment and collect data → calibrate the model(s) → run diagnostic checks (residual analysis, predictive posterior checks) → if misspecification is detected, adapt the model by enriching the function class (e.g., a semi-parametric GP); otherwise proceed with the existing model → compute the acquisition function under uncertainty → select the next experiment → repeat until convergence → issue the final recommendation.

Logical Relationship: Misspecification to Linear Regret

This diagram deconstructs the causal pathway from an incorrect prior to the outcome of linear regret.

Causal chain (diagram): incorrect prior / misspecified model → biased posterior inference → suboptimal experimental design selection → covariate shift and error amplification → persistent performance gap per round → linear cumulative regret R(T) = O(T).

The Scientist's Toolkit: Research Reagent Solutions

The following table catalogues key computational and statistical reagents essential for implementing the protocols and combating model misspecification.

Table 2: Essential Research Reagents for Misspecification-Robust Bayesian Optimization

Reagent / Tool | Function / Application | Relevance to Misspecification & Regret
Semi-Parametric Gaussian Process [26] | Replaces a potentially misspecified term in a differential equation with a non-parametric function. | Propagates structural uncertainty to parameter estimates, preventing overconfident and biased inference.
∞-Gaussian Process (∞-GP) [28] | A surrogate model that quantifies both value and model uncertainty via a spatial Dirichlet Process mixture. | Enables principled exploration in complex, non-stationary, heavy-tailed reward landscapes where classic GPs fail.
Bayesian Optimal Experimental Design (BOED) [25] | A paradigm for selecting maximally informative designs under constraints. | A novel acquisition function that considers representativeness can mitigate error amplification from covariate shift.
Hamiltonian Monte Carlo (HMC) | An MCMC method for efficiently sampling from high-dimensional posterior distributions. | Crucial for performing inference in complex models with semi-parametric components or hierarchical structures.
Residual Analysis & Posterior Predictive Checks [26] | A set of diagnostic procedures for comparing model predictions to actual data. | The primary method for detecting the presence and pattern of model misspecification after model calibration.

The peril of model misspecification presents a formidable challenge in scientific domains where experimentation is expensive and failures carry significant cost. The direct link between incorrect priors and linear regret underscores that statistical precision is not a substitute for model accuracy. By integrating the diagnostic protocols and robust modeling tools outlined in these application notes—such as semi-parametric Gaussian Processes and rigorous diagnostic checks—researchers can build more resilient Bayesian optimization systems. This approach is essential for advancing a research agenda in experimental failure handling, ultimately leading to more efficient and reliable scientific discovery in drug development and beyond.

Boundary oversampling represents a significant failure mode in Bayesian optimization (BO), a sample-efficient global optimization method widely used in applications with costly experimental evaluations, such as materials science and drug development [4] [29]. This phenomenon occurs when optimization algorithms disproportionately sample parameter space boundaries because predictive variance is systematically higher in these regions than in the interior of the space [4]. In practical applications involving experimental failure, such as failed materials synthesis or toxic drug compounds, this behavior leads to wasted resources and reduced optimization efficiency.

The problem is particularly pronounced in real-world applications where the underlying response surface exhibits low signal-to-noise ratio, a common characteristic in neurological, psychiatric, and biological measurements [4]. When effect sizes fall below a Cohen's d of 0.3, standard Bayesian optimization methods frequently fail to identify optimal parameters, primarily due to this boundary oversampling behavior [4]. Understanding and addressing this failure mode is therefore crucial for researchers applying BO to experimental domains with high noise or experimental failure rates.

Mechanisms and Impact of Boundary Oversampling

Underlying Causes

Boundary oversampling emerges from fundamental properties of Gaussian process (GP) models, the most common surrogate model used in BO. In regions with limited data points, such as parameter space boundaries, GP predictive variance naturally increases. Standard acquisition functions, which balance exploration (high uncertainty) and exploitation (high predicted value), become biased toward these high-variance boundary regions [4] [29].

The problem is exacerbated in high-dimensional spaces and when optimizing problems with complex safety constraints. As noted in industrial materials science applications, this behavior leads to suboptimal performance where "algorithms disproportionately sample parameter space boundaries, leading to suboptimal exploration" [29]. In essence, the algorithm becomes trapped in a cycle of sampling boundaries to reduce uncertainty rather than focusing on regions likely containing the true optimum.

Quantitative Impact on Optimization Performance

Table 1: Performance Degradation Due to Boundary Oversampling

Effect Size (Cohen's d) Standard BO Success Rate Primary Failure Manifestation Typical Application Domains
> 0.5 High Minimal boundary attraction Robotics, materials synthesis
0.3 - 0.5 Moderate Occasional boundary convergence Pharmaceutical screening
< 0.3 Low Consistent boundary oversampling Neuromodulation, psychiatric drug development

Research demonstrates that for effect sizes below Cohen's d of 0.3, standard Bayesian optimization methods fail to consistently identify optimal parameters [4]. This performance degradation is particularly problematic in neuro-psychiatric applications where effect sizes are typically small but clinically meaningful, such as the DBS study demonstrating a highly significant (p < 1.33 × 10⁻¹⁷) but small effect (Cohen's d = 0.185) [4].

Mitigation Strategies and Protocols

Technical Approaches

Several technical solutions have demonstrated efficacy in addressing boundary oversampling:

  • Boundary-Avoiding Iterated Brownian-Bridge Kernel: This specialized kernel directly addresses the variance imbalance by reducing predictive variance at parameter space boundaries. Implementation results show robust BO performance for problems with effect sizes as low as Cohen's d of 0.1, significantly improving upon standard methods [4].

  • Input Warping: Transforming input parameters using warping functions can normalize variance distribution across the parameter space, preventing disproportionate uncertainty at boundaries [4].

  • Combined Kernel and Warping Approach: Using input warp transformation together with the boundary-avoiding kernel has demonstrated particularly strong performance, successfully addressing both the variance imbalance and the sampling bias [4].

  • Knowledge-Informed Feature Selection: Counterintuitively, industrial case studies revealed that incorporating excessive expert knowledge through additional features can exacerbate boundary issues by creating high-dimensional optimization problems. Strategic simplification of the feature space improved BO performance in recycled plastic compound development [29].

Experimental Protocol for Boundary Avoidance Implementation

Table 2: Protocol for Implementing Boundary Avoidance in Bayesian Optimization

Step Procedure Technical Specifications Validation Metrics
1. Problem Assessment Evaluate expected effect size and parameter space boundaries Compute Cohen's d from pilot data or literature Effect size > 0.3 enables standard BO
2. Kernel Selection Implement boundary-avoiding Iterated Brownian-bridge kernel Replace standard Matérn or RBF kernel Reduced predictive variance at boundaries
3. Input Warping Apply transformation to normalize parameter distributions Use Beta cumulative distribution function Uniform variance distribution across space
4. Acquisition Function Tuning Adjust exploration-exploitation balance Modify ξ parameter in EI or UCB Balanced sampling between interior and boundaries
5. Validation Compare sampling distribution with standard BO Quantify boundary vs. interior sampling ratio Significant reduction in boundary sampling

The following workflow diagram illustrates the complete experimental protocol for addressing boundary oversampling:

[Workflow diagram — Boundary-avoidance protocol: Problem Assessment → Kernel Selection → Input Warping → Acquisition Tuning → Optimization Loop → Performance Validation, with iterative refinement feeding back to Kernel Selection]

Handling Experimental Failures

Experimental failures represent a common challenge in drug development and materials science applications. When parameters lead to failed experiments (e.g., insoluble compounds, toxic reactions, or failed synthesis), specific handling strategies are required:

  • Floor Padding Trick: Assign the worst observed value to failed evaluations, providing the algorithm with negative feedback about unsuccessful parameters while maintaining model updating capability [1].

  • Binary Classifier Integration: Train a separate classifier to predict failure probability, allowing proactive avoidance of parameters likely to result in experimental failure [1].

  • Adaptive Boundary Adjustment: Dynamically adjust parameter space boundaries based on observed failures, effectively constraining the search space to regions with higher success probability [1].

Case Studies and Experimental Validation

Neuromodulation Optimization

In precision neuromodulation, where parameters such as stimulation amplitude and pulse width must be optimized for individual patients, boundary oversampling posed significant challenges. Standard BO methods failed to consistently identify optimal parameters for effect sizes below Cohen's d of 0.3, which represents the majority of applications in neurology and psychiatry [4].

Implementation of the boundary-avoiding Iterated Brownian-bridge kernel combined with input warping demonstrated robust performance even for effect sizes as low as 0.1, successfully addressing the boundary variance problem. This approach enabled reliable optimization of stimulation parameters despite substantial measurement noise characteristic of neural systems [4].

Materials Science Application

In industrial recycled plastic compound development, Bayesian optimization initially performed worse than traditional Design of Experiments methodologies due to boundary oversampling and other failure modes [29]. The compounding problem involved optimizing four raw material proportions to achieve target melt flow rate, Young's modulus, and impact strength values.

Analysis revealed that incorporating excessive expert knowledge through additional features transformed the optimization into a high-dimensional problem exacerbating boundary issues. By simplifying the problem formulation and addressing boundary oversampling specifically, researchers achieved satisfactory results, highlighting the importance of balanced feature engineering in practical BO applications [29].

Research Reagent Solutions

Table 3: Essential Research Materials for Boundary Oversampling Investigation

Reagent/Software Specifications Application Function
BoTorch Framework Python library for Bayesian optimization Implementation of surrogate models and acquisition functions
Ax Platform Adaptive experimentation platform End-to-end Bayesian optimization with failure handling
Gaussian Process Models Probabilistic surrogate models Response surface modeling with uncertainty quantification
Boundary-Avoiding Kernel Iterated Brownian-bridge implementation Reducing predictive variance at parameter space edges
Input Warping Functions Beta cumulative distribution transforms Normalizing variance distribution across parameter space

Implementation Guidelines

Diagnostic Protocol

Researchers should implement the following diagnostic protocol to identify boundary oversampling in their optimization problems:

  • Visualization of Sampling Patterns: Plot the distribution of sampled points across the parameter space, specifically checking for clustering near boundaries.

  • Variance Analysis: Compare predictive variance between boundary and interior regions using the GP model.

  • Effect Size Calculation: Compute Cohen's d from preliminary data to assess potential vulnerability to boundary oversampling.

  • Performance Benchmarking: Compare optimization progress between standard and boundary-aware methods.

The following diagnostic decision tree guides researchers through identifying and addressing boundary oversampling:

[Decision tree — Suspected Boundary Oversampling: Check Sampling Distribution (even sampling → standard BO sufficient; boundary clustering detected → Analyze Boundary Variance). If variance is balanced → standard BO sufficient; if boundary variance is high → Calculate Effect Size (Cohen's d ≥ 0.3 → standard BO sufficient; Cohen's d < 0.3 → Implement Mitigation Strategy)]
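
To make steps 1–3 of the diagnostic protocol concrete, the following sketch assumes inputs rescaled to the unit hypercube and a scikit-learn GP surrogate; it computes the fraction of samples near a boundary, the boundary-to-interior predictive-variance ratio, and Cohen's d from pilot data. The helper names and the 0.1 boundary margin are illustrative choices, not values from the cited studies.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def boundary_fraction(X, margin=0.1):
        """Fraction of sampled points lying within `margin` of any edge of [0, 1]^d."""
        near_edge = np.any((X < margin) | (X > 1.0 - margin), axis=1)
        return near_edge.mean()

    def variance_ratio(gp, d, margin=0.1, n=2000, seed=0):
        """Mean predictive std near the boundary divided by mean std in the interior."""
        rng = np.random.default_rng(seed)
        Xq = rng.uniform(size=(n, d))
        _, std = gp.predict(Xq, return_std=True)
        edge = np.any((Xq < margin) | (Xq > 1.0 - margin), axis=1)
        return std[edge].mean() / std[~edge].mean()

    def cohens_d(a, b):
        """Effect size between two pilot measurement groups (step 3 of the protocol)."""
        pooled = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
        return (np.mean(a) - np.mean(b)) / pooled

    # Example usage with synthetic data standing in for a real campaign
    X = np.random.default_rng(1).uniform(size=(40, 3))           # sampled parameters
    y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(40)          # noisy responses
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    print(boundary_fraction(X), variance_ratio(gp, d=3))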

Integration with Experimental Failure Handling

For comprehensive experimental optimization, boundary oversampling mitigation should be integrated with broader experimental failure handling approaches:

  • Failure-Adaptive Kernels: Modify kernel structures to incorporate knowledge from experimental failures, reducing sampling in regions adjacent to failed parameters.

  • Constrained Optimization Formulations: Explicitly incorporate safety constraints and failure boundaries into the optimization problem.

  • Multi-Fidelity Modeling: Utilize low-fidelity screening assays to identify failure-prone regions before committing high-value resources.

This integrated approach ensures robust optimization performance despite the dual challenges of boundary oversampling and experimental failures common in drug development and materials science applications.

Bayesian Optimization (BO) is renowned for its sample efficiency in optimizing expensive-to-evaluate black-box functions, making it a powerful tool for applications from materials science to pharmaceutical development where experiments are costly and time-consuming [30]. A key to its efficiency is the intelligent use of a probabilistic surrogate model, typically a Gaussian Process, to balance exploration and exploitation during the search for an optimum [5]. The incorporation of expert knowledge and historical data into this surrogate model is intuitively appealing, as it promises to guide the optimization more effectively and accelerate convergence. However, this practice can inadvertently trigger the "curse of dimensionality," a phenomenon where the performance of algorithms severely degrades as the number of input dimensions increases.

This article explores the critical pitfall of introducing expert knowledge without careful consideration for the resulting dimensionality. Through a detailed case study and supporting evidence, we will illustrate how transforming a relatively low-dimensional problem into a high-dimensional one can impair BO's performance, making it worse than traditional experimental designs. We provide structured protocols and actionable strategies to help researchers diagnose, prevent, and mitigate these issues, ensuring that the integration of expert knowledge remains a boon rather than a burden.

Case Study: The Failed Optimization of a Plastic Compound

A compelling real-world example from industrial materials science perfectly encapsulates the core problem [5]. The goal was to optimize a compound made from four raw materials (virgin polypropylene, recycled plastics, a filler, and an impact modifier) to meet three target quality metrics: Melt Flow Rate (MFR), Young's modulus, and impact strength.

The Initial, Lower-Dimensional Problem

The fundamental mixture problem was defined by just four input parameters—the proportions of the four ingredients—subject to mixture constraints (summing to 100%) [5]. This constituted a manageable search space for BO.

To improve the model, engineers provided extensive historical data and data sheets. Features were generated based on the expected impact of each main component on the quality metrics. This well-intentioned act of incorporating expert knowledge inflated the problem from 4 dimensions to 11 dimensions [5]. The surrogate model was now tasked with learning in an 11-dimensional space, but was trained on only 50 historical instances. This combination of high dimensionality and limited data likely led to an inaccurate model that failed to guide the optimization effectively.

Quantitative Performance Failure

The BO approach, despite its theoretical advantages, was benchmarked against a traditional Design of Experiments (DoE) methodology conducted by experienced engineers. The results were stark [5]:

  • BO Performance: Failed to match the performance of the expert-led DoE.
  • Root Cause: The incorporation of expert knowledge through additional features transformed the optimization into a high-dimensional space, compromising BO's sample efficiency.

Table 1: Performance Comparison of DoE vs. Failed BO in Plastic Compound Case Study

Method Number of Experiments Dimensionality Performance Outcome
Expert DoE 25 (in batches) 4 Successfully identified a feasible compound [5]
Initial BO with Expert Features 25 (in batches) 11 Worse than expert DoE; failed to efficiently converge [5]

This case underscores a critical lesson: additional knowledge is only beneficial if it does not unduly complicate the underlying optimization goal [5].

Protocols for Diagnosing and Mitigating Dimensionality Issues

When BO underperforms expectations, the following structured protocols can help identify if the curse of dimensionality is the culprit and provide pathways to resolution.

Protocol 1: Diagnostic Checklist for BO Failure

Use this checklist to assess the health of your BO run:

  • Dimensionality Assessment: Compare the number of input dimensions (d) to the number of observations (N). A high d/N ratio is a primary risk factor.
  • Model Validation: Check the predictive accuracy of the surrogate model on a held-out validation set. An inaccurate surrogate cannot guide the optimization effectively [5].
  • Visualization of the Acquisition Function: Plot the acquisition function over time. A "jittery" or noisy acquisition function that fails to stabilize suggests the model is uncertain and struggling to identify promising regions.
  • Boundary Sampling: Check if a disproportionate number of suggested samples lie on the boundaries of the parameter space. This is a known failure mode of BO in high dimensions [5].
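
A minimal sketch of the first two checklist items, assuming observations stored in arrays X (N × d) and y; the d/N warning threshold and the use of 5-fold cross-validated R² are illustrative heuristics rather than prescriptions from the case study.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern
    from sklearn.model_selection import cross_val_score

    def diagnose_surrogate(X, y, dn_threshold=0.1):
        """Flag a high dimension-to-data ratio and poor held-out surrogate accuracy."""
        n, d = X.shape
        dn_ratio = d / n
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        r2 = cross_val_score(gp, X, y, cv=5, scoring="r2").mean()
        print(f"d/N ratio: {dn_ratio:.2f}  (risk flag: {dn_ratio > dn_threshold})")
        print(f"5-fold CV R^2 of GP surrogate: {r2:.2f}")
        return dn_ratio, r2

    # Example: 50 historical runs in an 11-dimensional feature space, as in the case study
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(50, 11))
    y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(50)
    diagnose_surrogate(X, y)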

Protocol 2: Dimensionality Mitigation Strategies

If diagnostics point to dimensionality issues, employ the following mitigation strategies:

  • Problem Simplification:

    • Action: Critically re-evaluate the necessity of all input features. Can any be removed without significant loss of information? Revert to the original, lower-dimensional parameterization if possible.
    • Case Study Application: The failed plastic compound optimization was ultimately fixed by abandoning the complex 11-dimensional model and returning to a simplified 4-dimensional problem, which then performed successfully [5].
  • Structured Dimensionality Reduction:

    • Group Testing Bayesian Optimization (GTBO): This method identifies active variables by systematically testing groups of variables. It extends group testing theory to continuous ranges, effectively focusing the optimization on the most influential dimensions [31].
    • SCORE Reparameterization: This technique uses a 1D reparameterization trick to break BO's curse of dimensionality, sustaining linear time complexity even in high-dimensional landscapes. It is effective for "needle-in-a-haystack" type problems [32].
  • Iterative Knowledge Incorporation:

    • Action: Instead of incorporating all expert knowledge at the start, use an iterative approach. Begin with a simple model and allow the BO process itself to discover which parameters are important. Expert knowledge can then be used to interpret results or guide later stages.

Table 2: Comparison of Dimensionality Mitigation Techniques for BO

Technique Underlying Principle Applicable Context Key Advantage
Problem Simplification Manual feature selection to reduce parameter count Problems with redundant or low-impact features Simple, highly interpretable, directly reduces complexity [5]
Group Testing (GTBO) Statistical group testing to identify active variables High-dimensional problems with an axis-aligned subspace (few active variables) [31] Systematically discovers active set; enhances problem understanding
SCORE Reparameterization 1D reparameterization of the high-dimensional space Complex, high-dimensional landscapes where standard BO fails [32] Fast, scalable; avoids high computational costs

The Scientist's Toolkit: Essential Reagents for Robust Bayesian Optimization

The following tools and concepts are essential for implementing BO that is resilient to the curse of dimensionality.

Table 3: Research Reagent Solutions for Bayesian Optimization

Item / Concept Function in the BO Pipeline Application Notes
Gaussian Process (GP) Serves as the probabilistic surrogate model to emulate the objective function. Performance degrades with high dimensions. Requires careful choice of kernel [5].
Expected Improvement (EI) An acquisition function that recommends the next sample point by balancing exploration and exploitation. A standard, effective choice. Can become noisy if the GP model is inaccurate [30].
Thompson Sampling (TS) An alternative acquisition strategy based on sampling from the posterior of the GP. Useful for batch optimization and can be more robust in some scenarios [30].
Ax/Botorch Frameworks Flexible, open-source Python libraries for adaptive experimentation and BO. Provide state-of-the-art implementations of GP models, acquisition functions, and optimization algorithms [5].
BayBE (Bayesian Back-End) A specialized tool for designing and implementing BO in an industrial context. Handles multi-objective optimization and experimental constraints [5].
Group Testing (GTBO) A pre-optimization phase to identify active parameters. Use when suspecting that only a subset of parameters drives the objective function [31].

Visualizing Workflows: From Problem to Solution

The following diagrams illustrate the core concepts and protocols discussed.

The Dimensionality Trap in BO

[Diagram — The dimensionality trap: Start with a Low-Dimensional Problem → Add Expert Knowledge → Problem Inflated to High Dimensions → Sparse Data & Model Inaccuracy → BO Performance Fails]

Diagnostic & Mitigation Protocol

[Diagram — Diagnostic & mitigation protocol: BO Underperforms → Diagnose via Checklist → either Simplify the Problem (if redundant features) or apply Dimensionality Reduction such as GTBO (if an active subspace is suspected) → Robust & Efficient BO]

The curse of dimensionality presents a significant and often underestimated challenge in the practical application of Bayesian Optimization. The intuitive step of incorporating rich expert knowledge can be a double-edged sword, potentially transforming a tractable problem into an intractable one. As demonstrated in the plastic compound case study, this can lead to performance worse than traditional experimental design.

The path to robust BO lies in a disciplined, diagnostic-driven approach. Researchers must be vigilant about the ratio of dimensions to data points and be prepared to employ strategies ranging from simple problem simplification to advanced structured methods like group testing or reparameterization. By recognizing the "dimensionality trap" and arming themselves with the protocols and tools outlined in this article, scientists and engineers can harness the full, sample-efficient power of Bayesian Optimization without falling victim to its curses.

Bayesian optimization (BO) is a powerful, sample-efficient technique for the global optimization of expensive-to-evaluate black-box functions. Its application spans numerous scientific and industrial domains, including materials science, drug discovery, and neuromodulation [33] [1] [2]. However, standard BO methodologies, which predominantly rely on Gaussian processes (GPs) with stationary kernels, can exhibit significant performance degradation or outright failure when confronted with real-world experimental challenges. These challenges include complex, non-stationary systems, high noise levels leading to small effect sizes, and a priori unknown constraints that result in experimental failures [4] [1] [29].

This article details three advanced mitigation strategies designed to enhance the robustness and applicability of BO in such demanding environments. We focus on ProBO for leveraging complex probabilistic models, Boundary-Avoiding Kernels to prevent pathological over-exploration of parameter space edges, and Input Warping to handle non-stationary functions effectively. The subsequent sections provide a detailed exposition of each strategy's principles, protocols for implementation, and visual guides to their operational workflows.

ProBO: Versatile Bayesian Optimization with Probabilistic Programming

Core Concept and Application Scope

ProBO is a BO framework that generalizes the surrogate modeling process by leveraging the modeling flexibility of Probabilistic Programming Languages (PPLs). Unlike standard BO, which is largely restricted to Gaussian processes, ProBO allows a user to "drop in" any Bayesian model defined in an arbitrary PPL and use it directly for optimization [33] [34]. This is particularly valuable for capturing complex system characteristics such as intricate noise structures, multiple interrelated observation types, and hierarchical relationships, which are often difficult to model with standard GPs. By using a more accurate model of the system, ProBO can potentially reduce the number of expensive queries required to find the optimum.

The framework operates on an abstraction built on three standard PPL operations: inf(D) for performing inference on dataset D, post(s) for sampling from the posterior given a seed s, and gen(x, z, s) for sampling from the generative distribution of observations y given input x and latent variable z [33].

Experimental Protocol and Workflow

The following protocol outlines the steps for implementing and executing a ProBO optimization campaign.

Protocol 1: ProBO Implementation

  • Model Definition: Define a generative model for the system using a PPL of your choice (e.g., Pyro, Stan, NumPyro). The model's joint probability density is p(D, z) = p(z) * p(D|z), where z are latent variables and D is the dataset of input-output pairs.
  • Initialization: Select an initial set of points X₀ (e.g., via Latin Hypercube Sampling) and evaluate the expensive objective function to obtain dataset D₀.
  • Inference Step: Run the PPL's inference algorithm, inf(D), on the current dataset D to obtain a posterior representation, post.
  • Acquisition Optimization: a. For a set of candidate points, use multiple draws from post(s) and gen(x, z, s) to compute the empirical expectation of the acquisition function (e.g., Expected Improvement). b. Optimize the acquisition function to select the next query point: xₙ = argmaxₓ a(x).
  • Evaluation and Update: Query the expensive black-box function at xₙ to obtain yₙ. Augment the dataset: D = D ∪ {(xₙ, yₙ)}.
  • Iteration: Repeat steps 3-5 until a convergence criterion (e.g., budget exhaustion or performance plateau) is met.

The workflow of this protocol is visualized in the diagram below.

[Workflow diagram — ProBO loop: 1. Define Model in PPL → 2. Initial Sampling & Evaluation → 3. Run Inference inf(D) → Obtain Posterior Representation → 4. Optimize Acquisition Function via post(s) and gen(x, z, s) → 5. Query Expensive Function and Update Dataset → Convergence Met? (No: return to inference; Yes: End)]
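
The published ProBO framework plugs arbitrary PPL models into this loop; as a self-contained illustration of the inf / post / gen abstraction and the Monte Carlo acquisition step (4a–4b), the sketch below substitutes a toy conjugate Bayesian linear model for the PPL. It is a schematic illustration, not the authors' implementation, and all class and function names are placeholders.

    import numpy as np

    class ToyBayesLinearModel:
        """Stand-in for a PPL model exposing inf / post / gen (Bayesian linear regression)."""
        def __init__(self, noise=0.1, prior_var=1.0):
            self.noise, self.prior_var = noise, prior_var
        def inf(self, X, y):                      # inference: Gaussian posterior over weights z
            d = X.shape[1]
            prec = np.eye(d) / self.prior_var + X.T @ X / self.noise**2
            self.cov = np.linalg.inv(prec)
            self.mean = self.cov @ X.T @ y / self.noise**2
        def post(self, seed):                     # draw one posterior sample of z
            return np.random.default_rng(seed).multivariate_normal(self.mean, self.cov)
        def gen(self, x, z, seed):                # sample an observation y given x and latent z
            return x @ z + np.random.default_rng(seed).normal(0.0, self.noise)

    def mc_expected_improvement(model, X_cand, y_best, n_samples=64):
        """Empirical EI computed from posterior/generative samples (maximization convention)."""
        imp = np.zeros(len(X_cand))
        for s in range(n_samples):
            z = model.post(seed=s)
            ys = np.array([model.gen(x, z, seed=1000 + s) for x in X_cand])
            imp += np.maximum(ys - y_best, 0.0)
        return imp / n_samples

    # One acquisition step on toy data
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(10, 2))
    y = X @ np.array([1.0, -0.5]) + 0.05 * rng.standard_normal(10)
    model = ToyBayesLinearModel(); model.inf(X, y)
    X_cand = rng.uniform(size=(200, 2))
    x_next = X_cand[np.argmax(mc_expected_improvement(model, X_cand, y.max()))]
    print(x_next)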

Key Research Reagents

Table 1: Essential Components for a ProBO Framework

Component Function Example Tools
Probabilistic Programming Language (PPL) Provides the flexible backbone for model definition, prior specification, and automatic inference. Pyro (Python), Stan (C++), NumPyro (Python)
Inference Algorithm Computes the posterior distribution of the model's latent variables given observed data. MCMC, SMC, Variational Inference
Acquisition Function Balances exploration and exploitation to select the next query point. Expected Improvement (EI), Upper Confidence Bound (UCB)
Optimizer for Acquisition Function Solves the inner-loop optimization problem to find the point that maximizes the acquisition function. L-BFGS, DIRECT, Multi-start Gradient Descent

Boundary-Avoiding Kernels for Noisy, Low-Effect-Size Problems

Core Concept and Application Scope

Standard BO is prone to a specific failure mode in high-noise environments with small effect sizes, such as those common in neuromodulation and psychiatric applications: it disproportionately oversamples the boundaries of the parameter space [4] [35]. This occurs because the GP surrogate's predictive variance is naturally larger near the boundaries, making these regions appear artificially attractive to an exploration-oriented acquisition function. In low signal-to-noise ratio (SNR) settings, this tendency is exacerbated and can cause the algorithm to converge to a local optimum on the boundary instead of the true, often interior, global optimum [4].

The mitigation is to replace the standard kernel (e.g., RBF) with a Boundary-Avoiding Kernel, such as the Iterated Brownian-bridge kernel. This kernel construction actively reduces the predictive variance near the boundaries, steering the optimization toward the interior of the search space and achieving robust performance even for problems with very low Cohen's d effect sizes (as low as 0.1) [4] [35].

Experimental Protocol and Workflow

Protocol 2: Implementing Boundary-Avoiding BO

  • Problem Assessment: Determine if the problem is susceptible to boundary oversampling. Key indicators are a low estimated effect size (e.g., Cohen's d < 0.3) and high observational noise [4].
  • Kernel Selection: Replace the standard stationary kernel in your GP surrogate model with a boundary-avoiding kernel. The Iterated Brownian-bridge kernel is a recommended choice.
  • Model Configuration: Construct the GP model using the new kernel. The rest of the BO setup (acquisition function, optimizer) remains unchanged.
  • Standard BO Loop: Run the standard BO iterative process (e.g., fit GP model, optimize acquisition function, evaluate candidate, update data).
  • Validation: Compare the optimization path with one from a standard kernel. A successful implementation will show less clustering of samples at the boundaries and better convergence to the true optimum.

The diagram below illustrates the comparative workflow and the critical step of kernel replacement.

[Workflow diagram — Kernel selection: Assess Problem (low effect size and high noise?) → Kernel Selection (standard SNR: standard kernel such as RBF; low SNR: boundary-avoiding kernel such as the Iterated Brownian-bridge) → Configure GP Model → Run Standard BO Loop → Validate reduced boundary sampling]
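
The iterated Brownian-bridge kernel of [4] is a specific construction; the sketch below implements only the basic (single) Brownian-bridge kernel in GPyTorch to illustrate the key mechanism, namely a prior variance that vanishes at the box boundaries. The class names BrownianBridgeKernel and BoundaryAvoidingGP are placeholders rather than names from the cited work.

    import torch
    import gpytorch

    class BrownianBridgeKernel(gpytorch.kernels.Kernel):
        """Illustrative Brownian-bridge kernel on [0, 1]^d:
        k(s, t) = prod_i (min(s_i, t_i) - s_i * t_i), so the prior variance is zero on the box boundary."""
        has_lengthscale = False

        def forward(self, x1, x2, diag=False, **params):
            a = x1.unsqueeze(-2)                                   # (..., n, 1, d)
            b = x2.unsqueeze(-3)                                   # (..., 1, m, d)
            k = (torch.minimum(a, b) - a * b).prod(dim=-1)         # (..., n, m)
            return k.diagonal(dim1=-2, dim2=-1) if diag else k

    class BoundaryAvoidingGP(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood):
            super().__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ZeroMean()
            self.covar_module = gpytorch.kernels.ScaleKernel(BrownianBridgeKernel())

        def forward(self, x):
            return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

    # Fit on a handful of noisy observations and compare predictive variance at an edge vs. the interior
    train_x = torch.rand(8, 2)
    train_y = torch.sin(3 * train_x[:, 0]) + 0.05 * torch.randn(8)
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = BoundaryAvoidingGP(train_x, train_y, likelihood)
    model.eval(); likelihood.eval()
    with torch.no_grad():
        test_x = torch.tensor([[0.0, 0.5], [0.5, 0.5]])            # boundary point vs. interior point
        print(model(test_x).variance)                              # near zero at the boundary, larger inside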

Performance Data and Reagents

Table 2: Mitigation Performance for Different Effect Sizes [4]

Effect Size (Cohen's d) Standard BO Performance BO + Boundary-Avoiding Kernel
d < 0.3 Fails to consistently identify optimal parameters; high boundary oversampling. Robust performance; reliably finds optimum.
d ≈ 0.1 Converges to local, suboptimal boundary solutions. Achieves robust optimization.

Table 3: Reagents for Boundary-Avoiding BO

Component Function Notes
Iterated Brownian-bridge Kernel Reduces predictive variance at parameter space boundaries to mitigate pathological oversampling. A specific kernel design for this purpose.
GP Framework with Custom Kernels Allows implementation and integration of non-standard kernels. GPyTorch, scikit-learn (custom).
Effect Size Estimator To pre-screen the problem and decide if mitigation is necessary. Cohen's d from pilot data or literature.

Input Warping for Non-Stationary Functions

Core Concept and Application Scope

Many real-world functions exhibit non-stationarity, meaning their smoothness (length-scale) varies across the input space. For example, the performance of a machine learning model might be highly sensitive to hyperparameter changes in one region of the space and very flat in another. Standard GP kernels with a single, stationary length-scale struggle to model such functions, leading to poor surrogate fits and inefficient optimization [36].

Input Warping mitigates this by learning a bijective transformation that maps the original input space x to a warped space x'. The function is modeled as f(x) = f'(w(x)), where f' is a function that is stationary in the warped space. This allows a standard stationary kernel to be used effectively [36] [37]. The Kumaraswamy CDF is a popular choice for the warping function due to its flexibility, closed-form CDF, and differentiability, enabling gradient-based optimization of its parameters a and b jointly with the GP hyperparameters [37].

Experimental Protocol and Workflow

Protocol 3: BO with Input Warping

  • Define Warping Function: Select a parameterized, invertible warping function. The Kumaraswamy CDF, K_cdf(x) = 1 - (1 - x^a)^b, is a suitable default choice for inputs in [0, 1].
  • Construct the Surrogate Model: Build a GP model that incorporates the warping transformation as an input transform. The kernel applied is a standard stationary kernel (e.g., RBF) operating on the warped inputs.
  • Define Priors and MLL: Place appropriate priors on the warping parameters (e.g., Log-Normal with a median of 1.0). The training objective for the model is the marginal log likelihood (MLL).
  • Joint Optimization: For each update of the GP model, optimize all parameters (kernel hyperparameters and warping function parameters a, b) jointly by maximizing the MLL.
  • Acquisition and BO Loop: Use the trained warped GP to construct an acquisition function (e.g., qLogNEI). Optimize the acquisition function in the original input space to select the next point. Continue the standard BO loop.

This process is summarized in the following workflow.

[Workflow diagram — Input warping: 1. Define Warping Function (e.g., Kumaraswamy CDF) → 2. Construct GP with Input Transform → 3. Set Priors on Warp Parameters → 4. Jointly Optimize GP and Warp Parameters via MLL → 5. Standard BO Loop (acquisition optimized in the raw space)]
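
BoTorch ships a Warp input transform (see Table 4 below), but its constructor details vary between releases; to keep the example self-contained, the following sketch implements the Kumaraswamy-CDF warp directly in GPyTorch and jointly optimizes the warp and kernel parameters by maximizing the MLL. The softplus parameterization and attribute names are illustrative choices, and the Log-Normal priors from step 3 are omitted for brevity.

    import torch
    import gpytorch

    class WarpedGP(gpytorch.models.ExactGP):
        """GP whose inputs in [0, 1]^d pass through a learnable Kumaraswamy CDF,
        w(x) = 1 - (1 - x^a)^b, so a stationary RBF kernel acts in the warped space."""
        def __init__(self, train_x, train_y, likelihood):
            super().__init__(train_x, train_y, likelihood)
            d = train_x.shape[-1]
            self.raw_a = torch.nn.Parameter(torch.zeros(d))        # a = softplus(raw_a) + eps
            self.raw_b = torch.nn.Parameter(torch.zeros(d))
            self.mean_module = gpytorch.means.ConstantMean()
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=d))

        def warp(self, x, eps=1e-6):
            a = torch.nn.functional.softplus(self.raw_a) + eps
            b = torch.nn.functional.softplus(self.raw_b) + eps
            return 1.0 - (1.0 - x.clamp(eps, 1 - eps) ** a) ** b

        def forward(self, x):
            xw = self.warp(x)
            return gpytorch.distributions.MultivariateNormal(self.mean_module(xw), self.covar_module(xw))

    # Joint optimization of kernel and warping parameters via the marginal log likelihood
    train_x = torch.rand(30, 2)
    train_y = torch.sin(10 * train_x[:, 0] ** 3) + 0.05 * torch.randn(30)   # non-stationary toy target
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = WarpedGP(train_x, train_y, likelihood)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
    model.train(); likelihood.train()
    for _ in range(200):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    print(float(loss))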

Key Research Reagents

Table 4: Essential Components for Input Warping

Component Function Example/Notes
Warping Function Maps the raw, non-stationary input space to a warped, stationary space. Kumaraswamy CDF, Beta CDF.
Stationary Kernel Models the function in the warped space where stationarity holds. RBF (SE), Matern.
Differentiable PPL/GP Framework Enables joint gradient-based optimization of all parameters. BoTorch (with Warp transform), GPyTorch.
Priors for Warp Parameters Regularizes the warping function, preventing overfitting and ensuring identifiability. LogNormalPrior(0.0, 0.75^0.5).

The mitigation strategies detailed herein—ProBO, Boundary-Avoiding Kernels, and Input Warping—significantly expand the frontier of problems addressable by Bayesian optimization. ProBO provides the flexibility to embed intricate domain knowledge and complex data relationships directly into the surrogate model. Boundary-Avoiding Kernels offer a targeted solution to a pervasive failure mode in high-noise, low-effect-size scenarios common in biomedical applications. Input Warping effectively handles the non-stationarity inherent in many engineering and machine learning tuning problems.

When integrated into a research workflow, these strategies transform BO from a powerful but sometimes brittle tool into a robust and versatile engine for autonomous scientific discovery. Their adoption is crucial for tackling the complex, noisy, and constrained optimization problems that define modern scientific challenges in fields from precision medicine to advanced materials design.

Benchmarks and Real-World Validation: Measuring What Works

Synthetic benchmarking provides a controlled, cost-effective framework for evaluating the performance of optimization algorithms before their deployment in real-world experimental campaigns. Within Bayesian optimization (BO) research, specifically in contexts involving experimental failure handling, synthetic benchmarks are indispensable for stress-testing algorithms against known challenges like a priori unknown constraints, without incurring the high costs of physical experiments [2]. By using carefully designed test surfaces, researchers can quantitatively compare how different BO strategies balance the exploration-exploitation trade-off while avoiding infeasible regions, leading to more robust and efficient autonomous scientific discovery systems [38] [2].

Key Synthetic Benchmarking Strategies

The design of a synthetic benchmark is critical for producing meaningful, generalizable conclusions about algorithm performance. The following strategies are foundational:

  • Pool-Based Active Learning Simulation: This approach uses existing experimental datasets to simulate an optimization campaign [38]. The complete dataset forms a discrete representation of the ground truth. The algorithm is initialized with a randomly selected subset of data points and then iteratively selects subsequent experiments from the remaining "pool." Its performance is evaluated based on how quickly it can identify the optimal conditions within this fixed set of possibilities, providing a realistic approximation of a resource-constrained experimental campaign [38].
  • Unknown Feasibility Constraints: To evaluate an algorithm's ability to handle experimental failures, benchmarks incorporate a priori unknown constraint functions [2]. These constraints are characterized as non-quantifiable, unrelaxable, simulation, and hidden. The algorithm must learn the feasible region on-the-fly using a probabilistic classifier (e.g., a variational Gaussian process classifier) and integrate these predictions with the surrogate model to guide the search away from infeasible areas [2].
  • Performance Metrics for Optimization: Standard metrics are used to facilitate direct comparison between algorithms. The acceleration factor measures how many fewer experiments an algorithm requires to find the optimum compared to a baseline like random sampling. The enhancement factor quantifies the improvement in the final objective value achieved by the algorithm relative to the baseline [38]. These metrics provide a clear, quantitative measure of an algorithm's sample efficiency and effectiveness.
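
One common way to compute these two metrics from best-so-far optimization traces is sketched below (maximization convention); exact definitions vary between benchmark studies, so the formulation here should be treated as illustrative rather than as the definition used in the cited work.

    import numpy as np

    def acceleration_factor(trace_baseline, trace_bo, target):
        """Ratio of experiments the baseline needs to reach `target` vs. the BO strategy
        (best-so-far traces, maximization). Returns np.inf if the baseline never reaches it."""
        def first_hit(trace):
            hits = np.nonzero(np.asarray(trace) >= target)[0]
            return hits[0] + 1 if hits.size else np.inf
        return first_hit(trace_baseline) / first_hit(trace_bo)

    def enhancement_factor(trace_baseline, trace_bo, budget):
        """Ratio of the best objective value found by BO vs. the baseline after `budget` experiments."""
        return max(trace_bo[:budget]) / max(trace_baseline[:budget])

    # Example with two illustrative best-so-far traces
    random_trace = [0.2, 0.3, 0.3, 0.4, 0.45, 0.5, 0.5, 0.55, 0.6, 0.6]
    bo_trace     = [0.2, 0.4, 0.55, 0.6, 0.7, 0.72, 0.72, 0.75, 0.75, 0.76]
    print(acceleration_factor(random_trace, bo_trace, target=0.6))   # 9 / 4 = 2.25
    print(enhancement_factor(random_trace, bo_trace, budget=10))     # 0.76 / 0.6 ≈ 1.27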

Performance Benchmarks of Bayesian Optimization Methods

Extensive benchmarking across diverse synthetic and experimentally-derived datasets reveals critical insights into the performance of various BO algorithms, particularly when faced with unknown constraints.

Table 1: Benchmarking of Surrogate Models in Bayesian Optimization [38]

Surrogate Model Key Characteristics Performance Summary Computational Considerations
Gaussian Process (GP) with Isotropic Kernel Uses a single length-scale parameter for all input features. Commonly used but often outperformed by more sophisticated models. Simpler but less adaptable to complex, high-dimensional landscapes.
GP with Anisotropic Kernel (ARD) Employs automatic relevance detection (ARD) with individual length-scales per feature. Demonstrates superior robustness and performance; effectively identifies feature sensitivity. Higher computational cost, but justified by performance gains.
Random Forest (RF) Non-parametric, makes no distributional assumptions. Performance is comparable to GP with ARD; a strong alternative. Lower time complexity; requires less hyperparameter tuning effort.

Table 2: Performance of Feasibility-Aware Acquisition Functions for Unknown Constraints [2]

Strategy Type Example Mechanism Performance Findings
Naïve Strategies Simple Expected Improvement (EI) Ignores constraint predictions; re-samples upon failure. Competitive in tasks with small infeasible regions; inefficient otherwise.
Feasibility-Aware Variants of EI, UCB, etc. Integrates constraint probability to balance performance and feasibility. On average, outperforms naïve strategies; produces more valid experiments and finds optima faster.
Balanced-Risk Specific acquisition functions from research. Balances sampling promising regions with avoiding predicted infeasibility. Best average performance; effectively manages the exploration-feasibility trade-off.

Experimental Protocols for Synthetic Benchmarking

Protocol 1: Pool-Based Benchmarking of Surrogate Models

This protocol evaluates the core efficiency of different surrogate models within the BO framework using historical datasets [38].

  • Dataset Curation: Select a historical dataset from a completed experimental campaign (e.g., materials synthesis, drug design) containing input parameters and a corresponding objective value. The dataset should have 3-20 input features and dozens to hundreds of data points [38].
  • Benchmark Setup: Define the optimization objective (e.g., minimization) and the pool of available data points. Set the iteration limit for the optimization campaign.
  • Algorithm Initialization: Randomly select a small subset of data points (e.g., 5-10% of the pool) to form the initial training data for the surrogate model.
  • Iterative Optimization Loop: For each iteration: a. Model Training: Train the surrogate model (e.g., GP with ARD, Random Forest) on all data collected so far. b. Acquisition Maximization: Using an acquisition function (e.g., Expected Improvement), select the next data point from the pool that is predicted to be most valuable. c. "Measurement": Retrieve the target value for the selected point from the historical dataset and add it to the training data.
  • Performance Tracking: Record the best objective value found as a function of the number of iterations performed. Calculate the acceleration and enhancement factors relative to a random sampling baseline [38].
  • Analysis: Repeat the process with multiple random seeds to account for variability. Compare the performance curves of different surrogate model and acquisition function pairs.
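
A compact simulation of this protocol, assuming a minimization objective, a scikit-learn GP with an anisotropic (ARD) Matérn kernel, and a synthetic stand-in for the historical dataset; the function names and budget values are illustrative.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def expected_improvement(mu, std, y_best, xi=0.01):
        """EI over pooled candidates (minimization)."""
        std = np.maximum(std, 1e-9)
        z = (y_best - mu - xi) / std
        return (y_best - mu - xi) * norm.cdf(z) + std * norm.pdf(z)

    def pool_based_bo(X_pool, y_pool, n_init=5, n_iter=30, seed=0):
        """Simulate a campaign by 'measuring' points from a fixed historical pool."""
        rng = np.random.default_rng(seed)
        d = X_pool.shape[1]
        idx = list(rng.choice(len(X_pool), size=n_init, replace=False))
        trace = []
        for _ in range(n_iter):
            gp = GaussianProcessRegressor(kernel=Matern(nu=2.5, length_scale=np.ones(d)),
                                          normalize_y=True).fit(X_pool[idx], y_pool[idx])
            remaining = np.setdiff1d(np.arange(len(X_pool)), idx)
            mu, std = gp.predict(X_pool[remaining], return_std=True)
            pick = remaining[np.argmax(expected_improvement(mu, std, y_pool[idx].min()))]
            idx.append(int(pick))                    # "measure" the chosen historical point
            trace.append(y_pool[idx].min())          # best-so-far objective
        return trace

    # Synthetic stand-in for a historical materials dataset (4 features, 200 points)
    rng = np.random.default_rng(1)
    X_pool = rng.uniform(size=(200, 4))
    y_pool = ((X_pool - 0.3) ** 2).sum(axis=1) + 0.05 * rng.standard_normal(200)
    print(pool_based_bo(X_pool, y_pool)[-1])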

Protocol 2: Evaluating Failure-Handling with Unknown Constraints

This protocol assesses an algorithm's capability to navigate an optimization landscape where some regions lead to experimental failure [2].

  • Test Surface & Constraint Definition: Select a synthetic benchmark function (e.g., from optimization literature) and define a feasible region within its domain using an explicit constraint function, c(x). Points where c(x) <= 0 are considered infeasible and return no objective value.
  • Algorithm Configuration: Configure the BO algorithm with two models: a regression surrogate (e.g., GP) for the objective and a classification surrogate (e.g., variational GP classifier) for the constraint function c(x) [2].
  • Feasibility-Aware Optimization Loop: For each iteration: a. Joint Modeling: Update the posterior of both the objective and constraint models with all previous observations. b. Informed Acquisition: Use a feasibility-aware acquisition function (e.g., Expected Improvement with constraint probability) to propose the next point, x_next [2]. c. Constraint Evaluation: Evaluate the constraint function at x_next. If feasible, evaluate the objective function; if not, record a failure and assign no objective value.
  • Metrics Calculation: Track the number of failed experiments and the best objective value found among feasible points over iterations. The primary metric is the number of iterations (or total cost) required to find the global feasible optimum.
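
A minimal synthetic test surface for this protocol, pairing the standard Branin function with a hidden circular infeasible region; the constraint geometry is an illustrative choice, and the random-sampling baseline shown simply counts failures and tracks the best feasible value.

    import numpy as np

    def branin(x):
        """Standard 2D Branin test function on [-5, 10] x [0, 15] (to be minimized)."""
        a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5.0 / np.pi
        r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
        return a * (x[1] - b * x[0]**2 + c * x[0] - r)**2 + s * (1 - t) * np.cos(x[0]) + s

    def constraint(x, center=(2.5, 7.5), radius=4.0):
        """Hidden feasibility constraint: points inside the circle fail (illustrative choice)."""
        return np.hypot(x[0] - center[0], x[1] - center[1]) > radius

    def evaluate(x):
        """Benchmark oracle: returns the objective if feasible, otherwise None (failed experiment)."""
        return branin(x) if constraint(x) else None

    # Failure rate and best feasible value for a random-sampling baseline
    rng = np.random.default_rng(0)
    xs = np.column_stack([rng.uniform(-5, 10, 100), rng.uniform(0, 15, 100)])
    results = [evaluate(x) for x in xs]
    feasible = [r for r in results if r is not None]
    print(f"failures: {results.count(None)} / 100, best feasible value: {min(feasible):.3f}")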

[Workflow diagram — Benchmarking BO with unknown constraints: Define Test Surface & Constraint Function c(x) → Configure BO Algorithm (objective + constraint model) → Sample Initial Feasible Points → Optimization Loop: Update Surrogate Models (objective & constraint posteriors) → Propose Next Point via Feasibility-Aware Acquisition Function → Evaluate Constraint c(x) (c(x) > 0: point is feasible, evaluate objective f(x); c(x) ≤ 0: point is infeasible, record failure) → Update Dataset → Stopping Criteria Met? (No: loop; Yes: Analyze Performance)]

Benchmarking BO with Unknown Constraints

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools for BO Benchmarking

Tool / Resource Type Function in Benchmarking
Gaussian Process (GP) with ARD Surrogate Model Models the objective function and infers feature sensitivity; robust for complex surfaces [38].
Random Forest (RF) Surrogate Model Provides a non-parametric alternative to GP; fast and effective for high-dimensional data [38].
Variational Gaussian Process Classifier Constraint Model Learns the unknown feasibility constraint from binary success/failure data during optimization [2].
Expected Improvement (EI) Acquisition Function Balances exploration and exploitation by prioritizing points with high expected improvement over the current best [38].
Atlas Python Library Software Framework An open-source BO package that implements strategies for handling known and unknown constraints [2].
Materials Datasets (e.g., P3HT/CNT, AgNP) Benchmark Data Experimental datasets used for pool-based benchmarking, providing realistic optimization landscapes [38].

Advanced Workflow: Integrating Reasoning and Knowledge Management

Emerging frameworks are enhancing BO by incorporating large language models (LLMs) for improved reasoning and knowledge retention, which is particularly valuable for complex domains like drug development [23].

  • Natural Language Problem Formulation: The user defines the experimental objective and search space using an "Experiment Compass" in natural language (e.g., "optimize reaction yield for direct arylation") [23].
  • Hybrid Suggestion Generation: The BO algorithm proposes candidate points, which are then evaluated by an LLM. The LLM leverages domain knowledge, historical data, and structured knowledge graphs to generate scientific hypotheses and assign confidence scores to each candidate, filtering out scientifically implausible suggestions [23].
  • Dynamic Knowledge Accumulation: A multi-agent system extracts structured notes and insights from each optimization cycle. These are stored in a knowledge graph and vector database, allowing the system to learn from past experiments and avoid repeating failures [23].
  • Informed Optimization: The refined and validated candidate points are evaluated, and the results are fed back into the BO loop, updating the surrogate models and the knowledge base simultaneously.

[Workflow diagram — Reasoning-enhanced BO: User Input via Experiment Compass → BO Algorithm Proposes Candidates → LLM Reasoner Generates Hypotheses & Confidence Scores → Filter Implausible Suggestions (low-confidence candidates return to the BO algorithm; high-confidence candidates proceed) → Evaluate Objective Function → Update BO Surrogate Models; in parallel, a Multi-Agent System Extracts Structured Insights → Update Knowledge Graph & Vector Database, which the LLM retrieves as context]

Reasoning-Enhanced BO Workflow

The optimization of experimental conditions is a cornerstone of scientific discovery and development, particularly in fields like materials science and drug discovery where experiments are costly and time-consuming. Bayesian optimization (BO) has emerged as a powerful, sample-efficient strategy for guiding these experiments. However, a pervasive challenge in real-world laboratory settings is the occurrence of experimental failures—runs where the target material cannot be synthesized, a molecule proves unstable, or equipment fails, resulting in missing data points for the objective function. Traditional BO approaches lack inherent mechanisms to handle these failures, which can severely hamper the optimization process. This Application Note provides a structured, evidence-based comparison of three distinct strategies for managing experimental failures within a BO framework: the Floor Padding Trick, Binary Classifiers, and Naive Strategies.

Our analysis, derived from recent benchmark studies, concludes that no single strategy is universally superior. The optimal choice is highly dependent on the specific experimental context, particularly the nature and extent of the failure-prone regions within the parameter space. For most scenarios involving unknown feasibility constraints, feasibility-aware BO using a binary classifier provides the most robust and sample-efficient performance. However, the Floor Padding trick offers a remarkably simple and effective alternative, especially in the early stages of an optimization campaign or when computational simplicity is desired. Naive strategies, while easy to implement, are generally not recommended due to their sensitivity and unpredictable performance.

Quantitative Performance Comparison

The following tables synthesize performance data from simulated and real-world benchmarks, comparing the key characteristics of each failure-handling method.

Table 1: Overall Strategy Comparison and Recommendations

Strategy Core Mechanism Key Advantages Key Limitations Ideal Use Case
Floor Padding Assigns failed points the worst observed value [1] Simple, automatic, no tuning; quick early-stage improvement [1] Suboptimal final performance in some simulations; can be overly pessimistic [1] Initial wide-space exploration; rapid prototyping; low-computation environments
Binary Classifier Uses a classifier (e.g., GP) to model failure probability and avoid infeasible regions [2] Actively avoids failures; high sample efficiency; handles explicit constraints [2] Increased model complexity; requires more computation per step [1] Problems with large, complex infeasible regions; high-cost experiments
Naive (Constant Padding) Assigns failed points a fixed, pre-defined low value [1] Extremely simple to implement Performance highly sensitive to chosen constant; requires expert tuning; can mislead search [1] Not generally recommended; only for well-understood failure modes with obvious penalty values

Table 2: Simulated Performance Benchmarks

Strategy Best-Found Evaluation (Circle Function) Best-Found Evaluation (Hole Function) Remarks on Performance
Floor Padding (F) High (quick initial rise) [1] High (quick initial rise) [1] Excellent initial improvement, but may not reach the global optimum as efficiently as tuned methods [1]
Binary Classifier (B) Slower initial rise [1] Slower initial rise [1] Suppresses sensitivity to padding value; focuses on feasible regions, potentially slowing early discovery [1]
Naive @-1 High final value [1] Not reported Can achieve good final results but is highly dependent on a correctly tuned penalty value [1]
Naive @0 Quick initial rise, poorer final value [1] Not reported Fast start but often plateaus at suboptimal levels due to misleading rewards [1]

Detailed Experimental Protocols

This section provides step-by-step protocols for implementing the core failure-handling strategies in a Bayesian optimization loop.

Protocol 1: Bayesian Optimization with the Floor Padding Trick

The Floor Padding method is an adaptive imputation strategy that integrates seamlessly into a standard BO workflow.

Workflow Diagram: Floor Padding Protocol

[Workflow diagram — Floor padding protocol: Suggest Next Parameter x_n via Acquisition Function → Run Experiment → Experiment Successful? (No: record failure and assign y_n = min of observed y; Yes: record the measured y_n) → Update GP Surrogate Model with (x_n, y_n) → Stopping Criteria Met? (No: loop; Yes: End Optimization)]

Step-by-Step Procedure:

  • Initialization: Begin with a small initial dataset of parameters and their corresponding evaluation values, D_n = {(x_1, y_1), ..., (x_n, y_n)}.
  • Model Fitting: Train a Gaussian Process (GP) regression model on D_n to create a surrogate of the objective function.
  • Suggestion: Using an acquisition function (e.g., Expected Improvement) based on the GP, select the next parameter x_{n+1} to evaluate.
  • Experimental Execution & Failure Handling:
    • Run the experiment with parameter x_{n+1}.
    • If the experiment is successful: Measure the evaluation value y_{n+1}.
    • If the experiment fails: Impute the value y_{n+1} = min(y_1, ..., y_n). This automatically sets the failed point to the worst value observed so far in the campaign [1].
  • Data Augmentation: Append the new data point (x_{n+1}, y_{n+1}) to the dataset, D_{n+1} = D_n ∪ {(x_{n+1}, y_{n+1})}.
  • Iteration: Repeat steps 2-5 until a stopping criterion is met (e.g., budget exhaustion, performance convergence).
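
A minimal end-to-end sketch of this protocol (maximization convention) using scikit-learn; the simulated experiment, its hidden failure region, and the candidate-sampling scheme are illustrative stand-ins, while the imputation line implements y_{n+1} = min(y_1, ..., y_n) exactly as in step 4.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def run_experiment(x):
        """Illustrative stand-in: fails (returns None) in a hidden region, else returns a noisy response."""
        if x[0] > 0.8:                                        # hidden infeasible region
            return None
        return float(np.exp(-8 * ((x - 0.3) ** 2).sum()) + 0.02 * np.random.randn())

    rng = np.random.default_rng(0)
    X, y = [], []
    while len(y) < 4:                                         # step 1: initial successful observations
        x0 = rng.uniform(size=2)
        y0 = run_experiment(x0)
        if y0 is not None:
            X.append(x0); y.append(y0)

    for _ in range(25):                                       # steps 2-6: BO loop (maximization)
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(np.array(X), np.array(y))
        cand = rng.uniform(size=(500, 2))
        mu, std = gp.predict(cand, return_std=True)
        std = np.maximum(std, 1e-9)
        z = (mu - max(y)) / std
        ei = (mu - max(y)) * norm.cdf(z) + std * norm.pdf(z)  # Expected Improvement
        x_next = cand[np.argmax(ei)]
        y_next = run_experiment(x_next)
        if y_next is None:                                    # floor padding: impute the worst value seen so far
            y_next = min(y)
        X.append(x_next); y.append(y_next)

    print(f"best parameters: {X[int(np.argmax(y))]}, best value: {max(y):.3f}")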

Protocol 2: Feasibility-Aware BO with a Binary Classifier

This method explicitly models the probability of failure using a classification model, allowing the algorithm to actively avoid infeasible regions.

Workflow Diagram: Binary Classifier Protocol

[Workflow diagram — Binary classifier protocol: Update Binary Classifier with Success/Failure History → Update GP Regressor with Successful Data Only → Suggest Next Parameter x_n via Feasibility-Aware Acquisition Function → Run Experiment → Experiment Successful? (No: record failure (x_n, 'Fail'); Yes: record success (x_n, y_n)) → Stopping Criteria Met? (No: loop; Yes: End Optimization)]

Step-by-Step Procedure:

  • Initialization: Start with an initial dataset that includes both successful and failed experiments. The regression dataset is D_reg = {(x_i, y_i)} for successful points, and the classification dataset is D_class = {(x_j, c_j)}, where c_j = 1 for success and c_j = 0 for failure.
  • Model Fitting:
    • Regression Model: Train a GP regression model exclusively on successful data D_reg to model the objective function.
    • Classification Model: Train a binary classifier (e.g., a Gaussian Process Classifier or Variational GP Classifier) on D_class to model the probability of success, p(c = 1 | x) [2].
  • Feasibility-Aware Suggestion: Use a feasibility-aware acquisition function that combines the regression and classification models. A common approach is to modify the standard Expected Improvement (EI): α_FEI(x) = EI(x) × p(c = 1 | x). This function favors points with high expected improvement that are also likely to be feasible [2].
  • Experimental Execution: Run the experiment with the suggested parameter x_{n+1}.
  • Data Augmentation:
    • If successful: Add (x_{n+1}, y_{n+1}) to D_reg and (x_{n+1}, 1) to D_class.
    • If failed: Add (x_{n+1}, 0) to D_class. The regression dataset remains unchanged.
  • Iteration: Repeat steps 2-5 until convergence.
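
A matching sketch of this protocol using scikit-learn's GaussianProcessClassifier as the feasibility model and the acquisition α_FEI(x) = EI(x) × p(c = 1 | x) from step 3; as before, the simulated experiment and its failure region are illustrative stand-ins rather than a real assay.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def run_experiment(x):
        """Illustrative stand-in: returns None (failure) in a hidden region, else a noisy response."""
        if x[0] + x[1] > 1.5:                                 # hidden infeasible region
            return None
        return float(np.exp(-8 * ((x - 0.3) ** 2).sum()) + 0.02 * np.random.randn())

    rng = np.random.default_rng(0)
    X_all, c_all = [], []                                     # every attempted point with its success label
    X_ok, y_ok = [], []                                       # successful points only (regression data)
    while len(y_ok) < 4:                                      # step 1: initial design
        x0 = rng.uniform(size=2)
        y0 = run_experiment(x0)
        X_all.append(x0); c_all.append(int(y0 is not None))
        if y0 is not None:
            X_ok.append(x0); y_ok.append(y0)

    for _ in range(25):                                       # steps 2-6
        reg = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(np.array(X_ok), np.array(y_ok))
        cand = rng.uniform(size=(500, 2))
        mu, std = reg.predict(cand, return_std=True)
        std = np.maximum(std, 1e-9)
        z = (mu - max(y_ok)) / std
        ei = (mu - max(y_ok)) * norm.cdf(z) + std * norm.pdf(z)
        if len(set(c_all)) > 1:                               # classifier needs both classes observed
            clf = GaussianProcessClassifier().fit(np.array(X_all), np.array(c_all))
            p_ok = clf.predict_proba(cand)[:, list(clf.classes_).index(1)]
        else:
            p_ok = np.ones(len(cand))                         # no failures seen yet: assume everywhere feasible
        x_next = cand[np.argmax(ei * p_ok)]                   # feasibility-weighted Expected Improvement
        y_next = run_experiment(x_next)
        X_all.append(x_next); c_all.append(int(y_next is not None))
        if y_next is not None:
            X_ok.append(x_next); y_ok.append(y_next)

    print(f"failures: {len(c_all) - sum(c_all)}, best feasible value: {max(y_ok):.3f}")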

The Scientist's Toolkit: Essential Research Reagents & Computational Solutions

Successful implementation of these advanced BO strategies requires both software libraries and a conceptual understanding of key components.

Table 3: Key Research Reagents and Computational Solutions

Item / Solution Type Function / Application
Gaussian Process (GP) Regressor Computational Model Core surrogate model for approximating the unknown objective function; provides mean prediction and uncertainty estimate for any parameter set [1] [39].
Gaussian Process (GP) Classifier Computational Model A probabilistic classifier used to model the probability of experimental success/failure given parameters, crucial for the feasibility-aware approach [2].
Variational GP Classifier Computational Model A scalable variant of the GP classifier, often implemented in tools like GPyTorch, suitable for larger datasets [2].
Expected Improvement (EI) Acquisition Function A standard criterion for selecting the next experiment by balancing high mean performance (exploitation) and high uncertainty (exploration) [1].
Feasibility-Weighted EI Acquisition Function An augmented EI function that multiplies the standard EI by the predicted probability of success, directing the search away from likely failures [2].
Atlas Library Software An open-source Python library that includes implementations of feasibility-aware BO strategies, such as those benchmarked in the Anubis study [2].
Botorch / Ax Software PyTorch-based frameworks for Bayesian optimization, providing state-of-the-art GP models and acquisition functions for implementing these protocols [5].

The effective handling of experimental failure is not merely a technical detail but a critical factor in the practical success of autonomous and high-throughput experimentation campaigns. Based on the presented comparison:

  • For researchers beginning an optimization campaign in a new, poorly understood parameter space, the Floor Padding trick provides an excellent starting point due to its simplicity and effectiveness in early-stage learning.
  • For optimizing systems with known significant feasibility constraints (e.g., synthetic accessibility in drug design, stability regions in materials science), investing in a feasibility-aware BO strategy with a binary classifier is the most robust and sample-efficient long-term strategy [2].
  • Naive constant padding strategies should be avoided unless there is absolute certainty about the appropriate penalty value, as improper tuning can severely misdirect the optimization process [1].

The future of failure-aware optimization lies in the development of more integrated and adaptive algorithms. Promising directions include strategies that automatically learn the cost of failure and dynamically balance the exploration of uncertain regions with the avoidance of failures, further closing the gap between theoretical BO and the complex realities of scientific experimentation.

This application note details the implementation and validation of a machine-learning-assisted molecular beam epitaxy (ML-MBE) framework for optimizing the growth of high-quality SrRuO3 thin films. The core innovation lies in integrating Bayesian optimization (BO) with specific failure-handling techniques, enabling efficient navigation of complex growth parameter spaces while managing experimental failures. The methodology achieved record-high structural and electrical properties in tensile-strained SrRuO3 films within dramatically reduced experimental cycles, validating BO as a powerful tool for accelerating materials research and development.

The optimization of thin-film growth parameters presents a significant challenge due to high-dimensional, non-linear parameter spaces and the resource-intensive nature of experiments. Furthermore, experimental runs can often result in complete failure (e.g., no film formation or incorrect phase) where no meaningful evaluation data is obtained, creating a "missing data" problem that traditional optimization methods struggle to handle.

Bayesian optimization addresses this by building a probabilistic surrogate model (typically a Gaussian Process) of the objective function (e.g., film quality metric) based on past observations. It then uses an acquisition function to intelligently select the next experimental parameters by balancing exploration (probing uncertain regions) and exploitation (refining known promising regions) [1] [40].

The critical advancement documented here is the extension of BO to handle experimental failures. Two primary approaches were investigated and implemented:

  • The Floor Padding Trick: When an experiment fails, its evaluation value is imputed with the worst value observed so far (e.g., ( y_n = \min_{1 \leq i < n} y_i )). This simple, adaptive method informs the model that the parameter set was detrimental, guiding subsequent searches away from that region and updating the surrogate model without requiring manual tuning of a penalty value [1].
  • Binary Classifier of Failures: A separate classifier model predicts the probability that a given parameter set will lead to a failure. This model can be used to steer the acquisition function away from high-risk parameters [1].

Materials and Experimental Protocols

Research Reagent Solutions and Essential Materials

Table 1: Key materials and reagents used in the ML-MBE growth of SrRuO3 thin films.

Item Function / Role in the Experiment
SrRuO3 Thin Films Target material system; a conductive ferromagnetic oxide used as a model system for demonstrating the ML-MBE optimization protocol [41] [42].
Molecular Beam Epitaxy (MBE) System Ultra-high vacuum deposition system used for the epitaxial, layer-by-layer growth of thin films with precise control over composition and structure [1] [40].
Sr, Ru Metallic Sources Effusion cells or sources providing the elemental Sr and Ru beams for film growth. The Ru flux rate was a key optimized parameter [40].
Ozone (O3) Source Reactive gas source for providing the oxygen species necessary for oxide film formation. The O3-nozzle-to-substrate distance was a key optimized parameter [40].
Single-crystal SrTiO3 Substrates Substrates for the epitaxial growth of SrRuO3 thin films. The substrate temperature during growth was a key optimized parameter [40].

Detailed Growth and Optimization Methodology

1. System Setup and Initialization:

  • Equipment: Conduct growth in a customized MBE chamber equipped with effusion cells for Sr and Ru, an ozone source, and in-situ reflection high-energy electron diffraction (RHEED) [40].
  • Parameter Definition: Define the multi-dimensional parameter space to be optimized. For SrRuO3, the critical parameters were:
    • Ru flux rate (a.u.)
    • Substrate growth temperature (°C)
    • O3-nozzle-to-substrate distance (mm) [1] [40]
  • Evaluation Metric: Define a quantitative metric for film quality. The Residual Resistivity Ratio (RRR), defined as ( \rho(300K) / \rho(0K) ), was used as the primary figure of merit, where a higher RRR indicates better crystalline quality and fewer defects [1] [40].

2. Bayesian Optimization Loop with Failure Handling: The core experimental protocol is an iterative loop, as visualized below.

[Workflow diagram: initialize with random experiments → update the Gaussian process model → calculate the acquisition function (accounting for the failure classifier) → select the next parameters (maximizing the acquisition) → perform the MBE growth run → if growth is successful, measure the film property (RRR) and assign the measured value; if it fails, record an experimental failure and apply the floor padding trick (assign the worst observed RRR) → add the result to the dataset → if convergence is not reached, repeat; otherwise identify the optimal parameters.]

Diagram 1: ML-MBE optimization workflow with failure handling.

  • Step 1: Initial Data Collection: Perform a small number (e.g., 5) of initial MBE growth runs with parameters chosen randomly or via a space-filling design across the defined parameter space [1].
  • Step 2: Evaluation and Failure Logging: For each growth run:
    • If a coherent SrRuO3 film is formed, characterize it ex-situ to measure its RRR.
    • If the growth fails (e.g., no film, wrong phase), log the parameters as a failure.
  • Step 3: Model Update:
    • Update the primary Gaussian Process model. For successful experiments, use the measured RRR. For failures, use the floor padding trick (i.e., the worst RRR from previous successful runs) [1].
    • Optionally, update a separate binary classifier that predicts the probability of failure for a given parameter set.
  • Step 4: Next-Parameter Selection: Using an acquisition function (e.g., Expected Improvement) that incorporates the GP's prediction and uncertainty—and can be modified by the failure classifier's output—select the parameter set for the next MBE run.
  • Step 5: Iteration: Repeat steps 2-4 until a convergence criterion is met (e.g., a target RRR is achieved, or a maximum number of runs is completed).

Key Experimental Results and Data

The ML-MBE approach was validated through both simulation and physical experimentation, demonstrating its efficacy and sample efficiency.

Simulation-Based Validation

Table 2: Summary of performance of different BO failure-handling methods on simulated data [1].

Method Floor Padding (F) Binary Classifier (B) Key Finding on Simulated "Circle" Function
Baseline (@0) No No Quick initial improvement, but sensitive to constant choice; suboptimal final performance.
Baseline (@-1) No No Slower initial improvement, but better final performance than @0. Highly sensitive to choice of constant.
F Yes No Quick initial improvement comparable to @0, but adaptive and automatic; final performance fell short of @-1.
B@0 No Yes Suppressed sensitivity to constant, but initial and final improvements were inferior.
B@-1 No Yes Suppressed sensitivity to constant, but initial and final improvements were inferior.
FB Yes Yes Slower improvement; combination did not yield best performance in simulation.

Real-World ML-MBE Optimization of SrRuO3

The methodology was successfully applied to the growth of SrRuO3 films, a benchmark complex oxide material.

Table 3: Quantitative results from the optimization of SrRuO3 thin films using ML-MBE [1] [40].

Optimized Parameter Search Space Key Outcome Metric Result Benchmark Context
Ru Flux Rate, Growth Temperature, O3-distance 3-dimensional parameter space Residual Resistivity Ratio (RRR) 80.1 Highest reported among tensile-strained SrRuO3 films [1].
Ru Flux Rate, Growth Temperature, O3-distance Single parameter optimized sequentially Residual Resistivity Ratio (RRR) > 50 High-quality film with strong perpendicular magnetic anisotropy [40].
Number of MBE Growth Runs N/A Experimental Efficiency (to achieve target) 35 runs (for RRR=80.1) Drastic reduction compared to traditional iterative trial-and-error [1].
Number of MBE Growth Runs N/A Experimental Efficiency (to achieve target) 24 runs (for RRR>50) Demonstrates sample efficiency of the BO approach [40].

Visualization of the Optimization Process

The following diagram illustrates the conceptual relationship between the surrogate model, the acquisition function, and the handling of failed experiments, which is central to this protocol.

[Conceptual diagram: past experimental data (successful and failed runs) feeds a Gaussian process model that predicts the performance metric and a failure classifier model that predicts failure probability; the GP supplies its prediction and uncertainty to the acquisition function (e.g., Expected Improvement), while the classifier penalizes high-risk areas, and the acquisition function selects the next experiment's parameters. Handling a failed run: a failed experiment triggers the floor padding trick, the GP model is updated with the imputed 'bad' value, and the result returns to the pool of past data.]

Diagram 2: BO process integrating failure handling.

The real-world validation of ML-MBE for SrRuO3 thin film growth conclusively demonstrates that Bayesian optimization, particularly when enhanced with robust failure-handling methods like the floor padding trick, is a powerful paradigm for accelerating materials synthesis.

  • Efficiency: The achievement of record-setting film quality in only 24-35 experimental runs represents a dramatic reduction in time and resource investment compared to traditional methods [1] [40].
  • Robustness: The ability to search a wide, multi-dimensional parameter space without being derailed by experimental failures makes the process less dependent on prior intuition and more likely to discover truly optimal, non-intuitive growth conditions [1].
  • Practical Implementation: The floor padding trick proved to be a simple yet highly effective strategy, providing an adaptive penalty for failures without requiring manual tuning, making it a recommended first-choice approach for practitioners [1].

This protocol provides a validated blueprint for applying Bayesian optimization with experimental failure handling to the high-throughput development of complex materials, paving the way for fully autonomous materials synthesis platforms.

Within drug discovery, kinase inhibitors represent a critical class of therapeutics for oncology and other diseases. However, their development is often hampered by synthetic accessibility constraints, where promising candidate molecules are theoretically efficacious but practically impossible or prohibitively expensive to synthesize. This challenge aligns with a broader research thesis on Bayesian optimization (BO) with experimental failure handling, which focuses on developing algorithms that can intelligently navigate parameter spaces where unknown constraints cause experiments to fail. Traditional optimization methods often treat such failures as wasted trials, whereas advanced BO strategies learn from these failures to avoid infeasible regions proactively.

The Anubis framework provides a specialized BO approach to handle such a priori unknown constraints, using a variational Gaussian process classifier to learn the constraint function on-the-fly. This method balances sampling promising regions with avoiding areas predicted to be infeasible, significantly improving the efficiency of autonomous scientific experimentation [6] [20]. This application note details the protocol for applying this feasibility-aware BO to the design of BCR-Abl kinase inhibitors, a well-established oncology target, while incorporating critical synthetic chemistry constraints.

Key Research Reagent Solutions

The following table catalogues essential materials and computational tools required for implementing the described Bayesian optimization workflow for kinase inhibitor design.

Table 1: Essential Research Reagents and Tools for BO-Driven Inhibitor Design

Item Name Function/Description Example/Note
Atlas Python Library Open-source platform implementing feasibility-aware BO strategies, including the Anubis framework. Provides the core optimization logic and acquisition functions [6].
Bayesian Optimization Software General-purpose BO frameworks for building surrogate models and calculating acquisition functions. Frameworks supporting Gaussian Processes (GP) and Expected Improvement (EI) are essential [16].
Variational Gaussian Process Classifier A specific type of surrogate model that learns and predicts the probability of synthetic feasibility. Key component of the Anubis framework for modeling unknown constraints [6] [20].
Chemical Feature Descriptors Numerical representations of molecular structures that serve as input for the objective and constraint models. Examples include molecular weight, cLogP, topological torsion, and atom-pair fingerprints [16].
Synthetic Feasibility Oracle A function (computational or expert-based) that evaluates whether a proposed molecule can be synthesized. Used to provide "ground truth" data for training the constraint model in a closed loop [6].
High-Performance Computing (HPC) Cluster Computational infrastructure for running complex Gaussian Process models and molecular simulations. Accelerates the suggest-measure-analysis cycle in self-driving laboratories [6] [16].

Experimental Protocol & Workflow

This protocol outlines the steps for applying feasibility-aware Bayesian optimization to the design of synthetically accessible BCR-Abl kinase inhibitors.

  • Primary Objective: To identify BCR-Abl kinase inhibitor candidates that balance high potency (e.g., low IC50) with high synthetic accessibility.
  • BO Strategy: Use a dual-surrogate model system, with one GP regressor for the objective (e.g., binding affinity) and one variational GP classifier for the constraint (synthetic feasibility) [6].
  • Acquisition Function: Utilize a feasibility-aware acquisition function, such as Expected Improvement with constraint handling, to guide the selection of subsequent experiments.

Step-by-Step Procedure

  • Problem Formulation:

    • Define Search Space (X): Establish the chemical space of interest. This could be a continuous space defined by molecular descriptors or a combinatorial space of molecular scaffolds and substituents.
    • Define Objective Function (f(x)): Formulate the primary goal as a maximization problem. Example: predicted binding affinity (negative ΔG) or negative IC50 for BCR-Abl kinase.
    • Define Constraint Function (c(x)): Formulate the synthetic accessibility constraint. A proposed molecule is considered feasible (c(x) = 1) if it can be synthesized in fewer than a defined number of steps or with commercially available starting materials; otherwise, it is infeasible (c(x) = 0).
  • Initial Experimental Design:

    • Perform a small, space-filling initial set of experiments (e.g., 5-10 molecules) to seed the BO algorithm. This can be done via Latin Hypercube Sampling or by selecting a diverse set of known molecules.
    • For each initial molecule, measure/calculate both the objective (f(x)) and the constraint (c(x)).
  • Model Initialization:

    • Initialize a Gaussian Process (GP) regressor model using the initial data on the objective function.
    • Initialize a Variational Gaussian Process (VGP) classifier using the initial data on the feasibility constraint [6].
  • Bayesian Optimization Loop: Iterate the following steps until a predefined budget (number of experiments) is exhausted or a performance target is met:
    • a. Model Update: Update the GP regressor and VGP classifier with all available data.
    • b. Acquisition Optimization: Identify the next candidate molecule (x_next) by optimizing the feasibility-aware acquisition function, which combines the GP's prediction for performance and the VGP classifier's probability of feasibility.
    • c. Experiment Execution: "Measure" the candidate x_next. In a computational setting, this involves running simulations to compute f(x_next) and querying the synthetic feasibility oracle for c(x_next) (a minimal sketch of such an oracle follows this procedure). In a physical self-driving laboratory, this would trigger automated synthesis and testing [6] [16].
    • d. Data Augmentation: Append the new data point {x_next, f(x_next), c(x_next)} to the dataset.

  • Termination and Analysis:

    • After the loop terminates, analyze the collected data to identify the best-performing molecule that also satisfies the synthetic accessibility constraint (i.e., the best feasible solution).
    • If multi-objective optimization was performed, the final output is the Pareto frontier balancing potency, selectivity, and synthetic feasibility.
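
To make the constraint definition from the problem-formulation step concrete, the sketch below wraps a computational synthetic-feasibility oracle for use in the loop above. The estimate_synthesis_steps helper and the six-step threshold are hypothetical placeholders; in practice this role would be filled by retrosynthesis software or expert review, as noted in Table 1.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeasibilityOracle:
    """Binary constraint c(x): returns 1 if a molecule is deemed synthesizable, else 0 (illustrative)."""
    estimate_synthesis_steps: Callable[[str], int]  # hypothetical retrosynthesis step estimator
    max_steps: int = 6                              # assumed synthetic-accessibility threshold

    def __call__(self, smiles: str) -> int:
        try:
            steps = self.estimate_synthesis_steps(smiles)
        except ValueError:
            # A molecule the estimator cannot process is treated as an experimental failure.
            return 0
        return 1 if steps <= self.max_steps else 0

# Example usage with a stand-in estimator (a real campaign would call a retrosynthesis tool).
oracle = FeasibilityOracle(estimate_synthesis_steps=lambda smi: len(smi) // 10)
c_value = oracle("CC1=CC=C(C=C1)NC(=O)C2=CC=CC=C2")  # 1 = feasible, 0 = infeasible
```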

Workflow Visualization

The following diagram illustrates the logical flow and feedback loop of the described experimental protocol.

[Workflow diagram: define the problem (search space, objective, constraint) → initial experimental design (small, space-filling set) → initialize models (GP regressor, VGP classifier) → BO iteration loop: update the surrogate models with all data, optimize the feasibility-aware acquisition function, suggest the next candidate molecule (x_next), execute the 'experiment' to compute f(x) and c(x), and add the new data; once the budget is met, terminate and analyze (identify the best feasible solution).]

Quantitative Benchmarking Data

The performance of the Anubis framework was benchmarked against naive strategies (e.g., ignoring failures or simply re-sampling after a failure) in the context of kinase inhibitor design. The key quantitative results are summarized below.

Table 2: Benchmarking Results of Feasibility-Aware vs. Naive BO Strategies for Kinase Inhibitor Design [6]

Optimization Strategy Key Characteristic Average Performance: Valid Experiments Generated Average Performance: Iterations to Find Optima
Anubis (Feasibility-Aware) Actively learns and avoids infeasible regions using a VGP classifier. Higher As fast or faster than naive methods
Naive Strategy (Ignore Failure) Proceeds with optimization, ignoring failed experiments. Lower Slower, due to wasted evaluations on infeasible candidates
Naive Strategy (Resample) Re-samples randomly after a failure occurs. Moderate Competitive only in tasks with very small infeasible regions

The data demonstrates that feasibility-aware strategies, on average, outperform naive ones by producing more valid experiments and finding the optimal synthetically accessible inhibitors at least as fast [6]. This directly addresses the high failure rates in drug development, where 40-50% of failures are attributed to a lack of clinical efficacy, partly stemming from suboptimal candidate selection during preclinical optimization [22].

Integrating synthetic accessibility constraints directly into the Bayesian optimization workflow via the Anubis framework provides a robust and efficient methodology for drug candidate design. By learning from failed synthetic proposals, the algorithm systematically guides the search toward chemically tractable and potent kinase inhibitors. This approach significantly enhances the practicality and sample efficiency of autonomous experimentation in self-driving laboratories, mitigating a major risk factor in preclinical drug development and contributing substantially to the overarching thesis of developing robust BO systems capable of handling real-world experimental failures.

Bayesian optimization (BO) has established itself as a powerful, sample-efficient framework for guiding autonomous and high-throughput experiments in domains where function evaluations are expensive, such as materials science and drug development [38]. A crucial challenge in real-world experimental campaigns is the pervasive issue of experimental failures, where an attempted experiment does not yield a measurable result for the objective function due to synthesis failure, equipment error, or the formation of an undesired phase [1] [2]. These failures represent a priori unknown feasibility constraints, creating a dual objective for the BO algorithm: to find the global optimum of the expensive black-box function while simultaneously learning the boundaries of the feasible parameter space on-the-fly [2]. This application note provides a detailed analysis of the performance metrics used to evaluate BO algorithms adept at handling such experimental failures, focusing on convergence speed, sample efficiency, and failure avoidance. We further present structured protocols for implementing and benchmarking these algorithms in real-world experimental settings, with a particular emphasis on applications in scientific discovery.

Performance Metrics and Quantitative Analysis

Evaluating the performance of BO algorithms, especially those that handle failures, requires metrics that capture not just the final outcome but the efficiency of the search process. The table below summarizes the key performance metrics used in recent literature.

Table 1: Key Performance Metrics for Bayesian Optimization with Experimental Failure

Metric Category Specific Metric Definition and Interpretation Key Findings from Literature
Convergence Speed Best Objective vs. Iteration [1] [38] The best-found objective value plotted as a function of the number of experimental iterations. A curve that rises quickly indicates fast convergence. In materials growth, a method using the "floor padding trick" demonstrated quick initial improvement, reaching a high-performance material in only 35 growth runs [1].
Sample Efficiency Acceleration/Enhancement Factor [38] The performance of BO (e.g., best objective found after ( n ) iterations) compared to a baseline like random search. An acceleration factor >1 indicates BO requires fewer experiments. Benchmarking across five materials datasets showed that BO with anisotropic Gaussian Processes or Random Forest surrogates consistently outperforms random sampling, providing significant acceleration [38].
Failure Avoidance Number of Failures Incurred [2] The total count of failed experiments during an optimization campaign. A lower number indicates better avoidance of infeasible regions. Feasibility-aware BO strategies were shown to produce more valid experiments and find optima at least as fast as naïve approaches, while actively reducing the number of failures [2].
Overall Effectiveness Valid Performance at Budget [2] The best objective value achieved, considering only feasible experiments, within a given experimental budget (total number of runs). For tasks with large infeasible regions, feasibility-aware strategies with balanced risk significantly outperform naïve strategies [2].

The choice of surrogate model within the BO framework significantly impacts these metrics. Studies have demonstrated that Gaussian Processes (GP) with anisotropic kernels (automatic relevance determination) and Random Forest (RF) models exhibit comparable and robust performance, both outperforming the commonly used GP with isotropic kernels [38]. While GP with anisotropic kernels is considered the most robust, RF is a compelling alternative due to its smaller time complexity and less sensitive hyperparameter tuning [38].
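
As an illustration of the anisotropic-kernel recommendation, the snippet below fits a GP surrogate with a separate length scale per input dimension (ARD) using scikit-learn. The kernel choice and toy data are assumptions for demonstration and do not reproduce the benchmark setup of [38].

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 3))                         # three parameters, scaled to [0, 1]
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)   # toy objective (e.g., a quality metric)

# One length scale per dimension makes the kernel anisotropic (automatic relevance determination).
kernel = ConstantKernel(1.0) * Matern(length_scale=[1.0, 1.0, 1.0], nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=5)
gp.fit(X, y)

print(gp.kernel_.k2.length_scale)                     # fitted per-dimension relevance
mu, sigma = gp.predict(rng.uniform(size=(5, 3)), return_std=True)
```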

Experimental Protocols

Protocol 1: Handling Missing Data with the Floor Padding Trick

This protocol is designed for optimization campaigns where experiments can fail completely, yielding no objective function value.

  • Primary Citation: Wakabayashi et al., npj Computational Materials 8, 180 (2022) [1].
  • Principle: Experimental failures are complemented by imputing the worst observation value seen so far. This adaptive method provides the algorithm with negative feedback from failures without requiring pre-defined penalty constants.
  • Detailed Workflow:
    • Initialization: Begin with an initial dataset ( \mathcal{D}_n = \{ (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \} ) collected from ( n ) initial runs.
    • Sequential Experimentation: Repeat the following steps for each subsequent iteration ( n, n+1, \ldots, N ):
      • a. Update Surrogate Model: Train a surrogate model (e.g., Gaussian Process) on the current dataset ( \mathcal{D}_n ).
      • b. Maximize Acquisition Function: Propose the next experiment by finding ( \mathbf{x}_{n+1} = \arg\max_{\mathbf{x}} \text{EI}(\mathbf{x}) ), where EI is the Expected Improvement acquisition function.
      • c. Run Experiment and Evaluate: Attempt the experiment at ( \mathbf{x}_{n+1} ). If the experiment is successful, record the measured objective ( y_{n+1} ). If the experiment fails, set ( y_{n+1} = \min\{ y_1, \ldots, y_n \} ) (the "floor" value).
      • d. Augment Dataset: Update the dataset ( \mathcal{D}_{n+1} = \mathcal{D}_n \cup \{ (\mathbf{x}_{n+1}, y_{n+1}) \} ).
  • Key Reagents and Solutions:
    • Software: BO packages with custom acquisition function support (e.g., custom GPyTorch or BoTorch implementations).
    • Data Logging System: A robust system to track both successful and failed experimental attempts, including the reason for failure if known.

[Workflow diagram: start with the initial dataset D_n → update the surrogate model (GP) → propose the next experiment x_{n+1} = argmax EI(x) → run the experiment at x_{n+1} → if successful, record the measured objective y_{n+1}; if not, impute the worst value y_{n+1} = min(Y) → augment the dataset D_{n+1} = D_n ∪ {(x_{n+1}, y_{n+1})} → if convergence is not met, repeat; otherwise the optimization is complete.]

Figure 1: Workflow for the Floor Padding Trick Protocol
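
The following is a minimal Python sketch of Protocol 1, assuming a scikit-learn GP surrogate, the closed-form Expected Improvement, and a random candidate grid; the run_experiment function is a stand-in for the actual measurement and returns None on failure.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best_y):
    """Closed-form EI for maximization; sigma is clipped to avoid division by zero."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def run_experiment(x):
    """Placeholder for a real experiment; returns an objective value or None on failure."""
    if x[0] < 0.15:                                   # hypothetical infeasible region
        return None
    return float(np.sin(5 * x[0]) * np.cos(3 * x[1]))

rng = np.random.default_rng(1)
candidates = rng.uniform(size=(500, 2))               # candidate parameter grid
X = [list(c) for c in candidates[:5]]                 # small initial design
y = [run_experiment(np.array(x)) for x in X]
y = [v if v is not None else 0.0 for v in y]          # crude fallback if an initial run fails

for _ in range(30):                                   # sequential BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, max(y)))]
    result = run_experiment(x_next)
    if result is None:
        result = min(y)                               # floor padding: impute the worst value seen so far
    X.append(list(x_next))
    y.append(result)
```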

Protocol 2: Feasibility-Aware Bayesian Optimization for Unknown Constraints

This protocol uses a classifier to explicitly model the probability of constraint violation, making it suitable for problems with large infeasible regions.

  • Primary Citation: Hickman et al., Digital Discovery, 2025, 4, 2104-2122 [2].
  • Principle: A variational Gaussian process classifier is trained on binary success/failure data to learn the unknown constraint function. Its predictions are integrated into a feasibility-aware acquisition function that balances seeking high performance with avoiding likely failures.
  • Detailed Workflow:
    • Initialization: Start with an initial dataset containing parameters and their feasibility labels ( \mathcal{D}_1 = \{ (\mathbf{x}_1, y_1, s_1), \ldots \} ), where ( s_i \in \{ \text{success}, \text{failure} \} ).
    • Sequential Experimentation:
      • a. Update Surrogate Models: Update the objective model (GP regression) using data from successful experiments, and update the constraint model (GP classifier) using all feasibility labels.
      • b. Construct Feasibility-Aware Acquisition Function: Use a function such as Expected Improvement with Constraints (EIC) that incorporates the probability of feasibility ( p(\text{feasible} \mid \mathbf{x}) ): ( \text{EIC}(\mathbf{x}) = \text{EI}(\mathbf{x}) \times p(\text{feasible} \mid \mathbf{x}) ).
      • c. Maximize Acquisition Function: Propose the next experiment ( \mathbf{x}_{n+1} = \arg\max_{\mathbf{x}} \text{EIC}(\mathbf{x}) ).
      • d. Run Experiment and Classify: Execute the experiment at ( \mathbf{x}_{n+1} ) and label it as a success (record ( y_{n+1} )) or a failure.
      • e. Augment Dataset: Update the dataset with the new observation and its feasibility label.
  • Key Reagents and Solutions:
    • Software: The Atlas Python library implements the strategies described in this protocol [2].
    • Feasibility Labeling System: A standardized process for consistently categorizing experimental outcomes as successes or failures.

[Workflow diagram: start with the initial dataset (including feasibility labels) → update the objective model (GP; successful experiments only) and the constraint model (GP classifier; all feasibility labels) → construct the feasibility-aware acquisition function, e.g., EIC(x) → propose the next experiment x_{n+1} = argmax EIC(x) → run the experiment at x_{n+1} → classify the outcome as success or failure → record the objective (if successful) and the feasibility label → augment the dataset → if convergence is not met, repeat; otherwise the optimization is complete.]

Figure 2: Workflow for Feasibility-Aware BO Protocol
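
A minimal sketch of the EIC weighting used in Protocol 2 is given below, pairing a scikit-learn GP regressor for the objective with a GP classifier for the feasibility labels. This is an illustrative stand-in for the variational classifier used by Anubis/Atlas, not their actual implementation, and the feasibility rule and toy objective are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessClassifier, GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def eic(candidates, gp_obj, gp_feas, best_y):
    """Expected Improvement with Constraints: EI(x) * p(feasible | x)."""
    mu, sigma = gp_obj.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    ei = (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)
    p_feasible = gp_feas.predict_proba(candidates)[:, 1]   # probability of the "success" class
    return ei * p_feasible

# Toy history: parameters, binary feasibility labels, and objectives for the successful runs.
rng = np.random.default_rng(0)
X_all = rng.uniform(size=(40, 2))
s = (X_all[:, 0] > 0.2).astype(int)                         # placeholder feasibility labels
X_ok = X_all[s == 1]
y_ok = np.sin(4 * X_ok[:, 0]) + 0.05 * rng.normal(size=len(X_ok))

gp_obj = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_ok, y_ok)
gp_feas = GaussianProcessClassifier(kernel=Matern(nu=2.5)).fit(X_all, s)

candidates = rng.uniform(size=(1000, 2))
x_next = candidates[np.argmax(eic(candidates, gp_obj, gp_feas, y_ok.max()))]
```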

The Scientist's Toolkit

Implementing the above protocols requires a combination of software tools and methodological components.

Table 2: Essential Research Reagent Solutions for Failure-Aware BO

Item Name Function/Purpose Example Implementations & Notes
Gaussian Process Surrogate Probabilistic model that serves as a surrogate for the expensive black-box objective function, providing mean and uncertainty predictions. Kernels with Automatic Relevance Determination (ARD) are recommended for their robustness across diverse materials design spaces [38].
Feasibility-Aware Acquisition Function Guides the selection of the next experiment by balancing the search for high performance with the avoidance of predicted failures. Expected Improvement with Constraints (EIC) [2]. Other variants include Predictive Entropy Search with Constraints.
Constraint Model A classifier that learns the boundary between feasible and infeasible regions of the parameter space from binary success/failure data. Variational Gaussian Process Classifier, as implemented in the Anubis framework [2].
Convergence Monitor Automatically determines when to terminate the optimization campaign based on the stability of the search process. Exponentially Weighted Moving Average (EWMA) control charts applied to the Expected Log-normal Approximation of Improvement (ELAI) provide statistical convergence assessment [43].
Structured Sampling Strategy Defines the initial set of experiments to ensure good coverage of the parameter space before sequential learning begins. Latin Hypercube Sampling (LHS) and fractional factorial design can significantly enhance BO's initial performance and lead to more robust outcomes [44].
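
For the "Structured Sampling Strategy" entry, a space-filling initial design can be generated with SciPy's quasi-Monte Carlo module; the parameter bounds below are illustrative placeholders, not values from the cited studies.

```python
import numpy as np
from scipy.stats import qmc

# Illustrative bounds for three growth parameters (flux, temperature, nozzle distance).
lower = np.array([0.5, 600.0, 10.0])
upper = np.array([2.0, 750.0, 50.0])

sampler = qmc.LatinHypercube(d=3, seed=0)
unit_samples = sampler.random(n=8)                  # eight space-filling points in [0, 1]^3
initial_design = qmc.scale(unit_samples, lower, upper)
print(initial_design)
```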

The integration of robust failure-handling mechanisms is no longer an optional enhancement but a core requirement for deploying Bayesian optimization in real-world experimental domains like materials science and drug development. The protocols and metrics detailed in this application note provide a framework for researchers to systematically evaluate and implement BO strategies that are both sample-efficient and resilient to experimental failures. The "floor padding trick" offers a simple yet powerful heuristic for incorporating failure feedback, while more sophisticated feasibility-aware methods explicitly model constraint boundaries for superior performance in complex search spaces. By adopting these advanced BO techniques, researchers can significantly accelerate their discovery cycles, minimize resource waste on failed experiments, and enhance the overall reliability of autonomous scientific platforms.

Conclusion

Effectively handling experimental failures is not merely an add-on but a fundamental requirement for deploying Bayesian optimization in real-world biomedical and clinical research. The key takeaway is that simple strategies like the floor padding trick can offer robust baselines, while more sophisticated, feasibility-aware methods that actively learn unknown constraints provide superior sample efficiency and safety for complex problems. The benchmarking studies consistently show that these advanced strategies outperform naive approaches, finding optimal conditions faster while conducting fewer invalid experiments. Looking forward, the integration of robust, failure-tolerant BO into self-driving laboratories promises to significantly accelerate the discovery of new therapeutic molecules and biomaterials. Future work should focus on developing even more sample-efficient algorithms for high-dimensional problems common in drug development and creating standardized frameworks for incorporating rich biological and chemical domain knowledge to preemptively avoid known failure regions, thereby making autonomous scientific discovery more reliable and impactful.

References