Cluster Dynamics ML¶
clusterdynamicsml is the predictive companion to clusterdynamics. It can be
launched directly from the clusterdynamicsml application entry point or from
the main SAXSShell UI. It combines:
- time-binned cluster dynamics from extracted XYZ or PDB frame folders
- observed reference structure ensembles organized by stoichiometry label
- a lightweight regularized regression model for extrapolating larger clusters
- geometry statistics learned from the observed structures
The result is a ranked set of predicted larger-cluster candidates, predicted structure files, combined histogram views, and an optional SAXS mixture model that includes both observed and predicted structures.
What the application does¶
At a high level, Cluster Dynamics ML answers this question:
Given the smaller clusters that are observed in a trajectory and the reference structures available for those smaller clusters, what larger clusters are plausible, how much population should they carry, and what representative structures should they have?
It is not a black-box generative model. The current implementation is a feature-engineered predictive model with explicit physical constraints and geometry rules.
Inputs¶
Cluster Dynamics ML expects all of the inputs required by clusterdynamics,
plus a structure library for the observed smaller clusters.
Required inputs¶
- an extracted frames folder from mdtrajectory
- atom-type definitions that identify node, linker, and optional shell atoms
- pair-cutoff definitions or a default cutoff
- a smaller-cluster structures folder organized by stoichiometry label
Optional inputs¶
- experimental SAXS data for fitting and comparing the predicted mixture model
- a CP2K .ener file for the lower subplot inherited from clusterdynamics
- an active SAXSShell project directory so datasets, reports, and history are saved into the project
- periodic boundary conditions, shell-growth levels, and shell-sharing options
Typical workflow¶
- Load the extracted XYZ or PDB frame folder.
- Set the atom-type definitions and pair cutoffs.
- Point the tool at the folder containing the observed smaller-cluster structure files.
- Set the target node-count range to extrapolate.
- Set the number of candidate stoichiometries to keep per target size.
- Set the predicted share threshold used to prune tiny candidate populations.
- Optionally load experimental SAXS data.
- Run Analyze and Predict Larger Clusters.
- Review the Summary, Lifetimes, Debye-Waller, Histograms, and SAXS tabs.
- Save the dataset, CSV exports, or a detailed PowerPoint report if needed.
Training data assembled by the workflow¶
The workflow first runs the standard cluster-dynamics analysis and then joins that kinetic summary with the observed structure library.
For each observed stoichiometry label, the training row combines:
- stoichiometry and node count
- observed mean count per frame
- occupancy fraction
- association and dissociation rates
- completed and window-truncated lifetime counts
- mean and standard-deviation lifetime
- mean atom count
- mean radius of gyration
- mean maximum radius
- mean semiaxis lengths
- representative structure path and motif directories
These rows are represented internally by
ClusterDynamicsMLTrainingObservation.
What the model actually learns¶
The current implementation fits separate regularized linear models for each predicted scalar quantity. It does not use a neural network, graph neural network, random forest, or diffusion model.
Feature vector¶
For a candidate stoichiometry with node elements \(N\) and non-node elements \(X_1, X_2, \dots, X_k\), the feature vector is built from the node count, the total atom count, and the non-node-to-node count ratios.
In code, this is the _candidate_feature_vector helper.
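As a rough illustration of such a composition-based feature vector, the sketch below builds one from a stoichiometry dictionary. The function name `candidate_features`, the leading bias term, and the ordering of the entries are assumptions made for this example; they are not taken from the actual `_candidate_feature_vector` helper.

```python
def candidate_features(stoichiometry, node_elements, non_node_elements):
    """Hypothetical feature builder (not the real helper).

    Returns [bias, node count, total atom count, ratio_1, ..., ratio_k],
    where each ratio is a non-node element count divided by the node count.
    """
    node_count = sum(stoichiometry.get(el, 0) for el in node_elements)
    if node_count <= 0:
        raise ValueError("a candidate needs at least one node atom")
    total_atoms = sum(stoichiometry.values())
    # A fixed element order keeps the columns aligned across candidates.
    ratios = [stoichiometry.get(el, 0) / node_count for el in non_node_elements]
    return [1.0, float(node_count), float(total_atoms)] + ratios


features = candidate_features({"Pb": 2, "I": 5, "O": 1}, ("Pb",), ("I", "O"))
```

Passing the non-node element order explicitly means every candidate produces a vector of the same length, which is what a shared regression matrix requires.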
Properties predicted by regression¶
Separate models are fit for:
- mean count per frame
- occupancy fraction
- mean lifetime
- association rate
- dissociation rate
- radius of gyration
- maximum radius
- semiaxis a
- semiaxis b
- semiaxis c
- each non-node element count
Regression form¶
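The models are weighted ridge regressions with a small diagonal penalty:
\[
\hat{\boldsymbol{\beta}} = \left( X^{\top} W X + \lambda I \right)^{-1} X^{\top} W \mathbf{y}
\]
where:
- \(\mathbf{y}\) is the vector of observed target values and \(\hat{\boldsymbol{\beta}}\) is the vector of fitted coefficients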
- \(X\) is the feature matrix
- \(W\) is the diagonal matrix of per-observation stability weights
- \(\lambda = 10^{-6}\) in the current implementation
Several positive-valued targets are fit in log1p space and transformed back
after prediction so the model is smoother for counts, rates, radii, and
lifetimes.
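A minimal NumPy sketch of a weighted ridge fit with an optional log1p target transform is shown below. The function names `fit_weighted_ridge` and `predict_target` are hypothetical, and the sketch only illustrates the general technique; it is not the application's own fitting code.

```python
import numpy as np


def fit_weighted_ridge(X, y, weights, lam=1e-6, log_target=True):
    """Solve (X^T W X + lam * I) beta = X^T W t, with t = log1p(y) if requested."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    target = np.log1p(y) if log_target else y
    xtw = X.T * w  # same as X.T @ diag(w), without building the diagonal matrix
    lhs = xtw @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(lhs, xtw @ target)


def predict_target(beta, x, log_target=True):
    """Evaluate the linear model and undo the log1p transform if it was used."""
    value = float(np.asarray(x, dtype=float) @ beta)
    return float(np.expm1(value)) if log_target else value


# Toy data: log1p(y) is exactly linear in the node count n.
n = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(n), n])
y = np.expm1(0.5 + 0.3 * n)
beta = fit_weighted_ridge(X, y, weights=np.ones_like(n))
extrapolated = predict_target(beta, [1.0, 6.0])
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the usual numerically safer choice for this kind of small linear system.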
Stability weighting¶
Training rows are not weighted equally. Each row receives a larger weight when it is supported by more structural examples, more completed lifetimes, larger mean count per frame, larger occupancy, and longer lifetime. This biases the fit toward better-supported observed clusters rather than treating all labels as equally reliable.
How candidate stoichiometries are generated¶
The workflow does not directly regress from one observed label to one predicted label. It first proposes a small set of candidate compositions for each target node count and then scores them.
Two candidate-generation routes are used:
- Trend extrapolation: Uses the weighted average node-element fractions plus the non-node element count regressors to build a new composition from the target node count.
- Composition scaled from observed <label>: Starts from each observed label and scales its non-node counts in proportion to the requested target node count.
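The composition-scaling route can be sketched in a few lines. The function name `scale_composition` and the round-half-up rule are assumptions for illustration; the source does not state how fractional counts are rounded.

```python
import math


def scale_composition(observed, node_elements, target_node_count):
    """Scale every element count by target_node_count / observed node count.

    Hypothetical helper: rounds half up and drops elements that scale to zero.
    """
    node_count = sum(observed.get(el, 0) for el in node_elements)
    if node_count <= 0:
        raise ValueError("observed label has no node atoms")
    factor = target_node_count / node_count
    scaled = {}
    for element, count in observed.items():
        new_count = math.floor(count * factor + 0.5)
        if new_count > 0:
            scaled[element] = new_count
    return scaled


candidate = scale_composition({"Pb": 2, "I": 5}, ("Pb",), target_node_count=4)
```

With more than one node element, independent rounding may not reproduce the target node count exactly, so a real implementation would need an extra correction step.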
The candidate list is then filtered by explicit support rules.
Stoichiometry support constraints¶
Candidates are removed when they violate any of the following checks:
- Required linker floors: If a linker element is present in every observed multi-node cluster, the larger predicted cluster must also include that linker.
- Pure-node support: A candidate with no non-node atoms is only kept when the training data show that pure-node clusters are already plausible at nearby sizes.
- Deduplication: Duplicate stoichiometries are merged after normalization.
These rules are why the workflow can reject obviously confusing candidates such as iodide-free larger clusters when all observed multi-node references include iodide.
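The linker-floor and deduplication checks lend themselves to a short sketch. The helper names below are hypothetical, and the pure-node support rule is left out because the source does not define what counts as a nearby size.

```python
def required_linkers(observed_labels, node_elements, linker_elements):
    """Linker elements that appear in every observed multi-node cluster."""
    multi_node = [
        label for label in observed_labels
        if sum(label.get(el, 0) for el in node_elements) > 1
    ]
    if not multi_node:
        return set()
    return {
        el for el in linker_elements
        if all(label.get(el, 0) > 0 for label in multi_node)
    }


def filter_candidates(candidates, observed_labels, node_elements, linker_elements):
    """Drop candidates missing a required linker, then merge duplicates."""
    required = required_linkers(observed_labels, node_elements, linker_elements)
    seen, kept = set(), []
    for candidate in candidates:
        if any(candidate.get(el, 0) <= 0 for el in required):
            continue
        # Normalize: ignore zero counts and element order before comparing.
        key = tuple(sorted((el, n) for el, n in candidate.items() if n > 0))
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


observed = [{"Pb": 1, "O": 3}, {"Pb": 2, "I": 5}, {"Pb": 3, "I": 8}]
proposed = [{"Pb": 4}, {"Pb": 4, "I": 10}, {"I": 10, "Pb": 4, "O": 0}]
kept = filter_candidates(proposed, observed, ("Pb",), ("I",))
```

In this toy case the iodide-free candidate is rejected because both observed multi-node labels contain iodide, and the third proposal is merged into the second after normalization.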
Geometry statistics extracted from the observed structures¶
After the training rows are assembled, the workflow scans the observed structure files and learns empirical geometry summaries from them.
Quantities measured¶
- node-node bond lengths
- node-linker and node-shell bond lengths
- nearest-pair contact distances for all tracked element pairs
- contact distances grouped by atom type (node, linker, shell)
- node-centered bond angles
- node coordination medians by neighbor type
- non-node coordination to one or more node atoms
- non-node coordination medians to other non-node atom types
In practical terms, the learned statistics include the kinds of values users care about when judging the predicted structures:
- bond lengths
- coordination numbers
- bond angles
- relative atom positions around node atoms
- linker-linker distances
- linker-shell distances
- other non-node contact distances
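One of these summaries, the median nearest-pair contact distance for an element pair, can be sketched as follows. The function names and the `(elements, coordinates)` structure layout are assumptions for this example, not the application's data model.

```python
import numpy as np


def nearest_pair_distances(elements, coords, elem_a, elem_b):
    """For each atom of elem_a, the distance to its nearest elem_b atom."""
    coords = np.asarray(coords, dtype=float)
    idx_a = [i for i, el in enumerate(elements) if el == elem_a]
    idx_b = [j for j, el in enumerate(elements) if el == elem_b]
    distances = []
    for i in idx_a:
        candidates = [np.linalg.norm(coords[i] - coords[j]) for j in idx_b if j != i]
        if candidates:
            distances.append(float(min(candidates)))
    return distances


def median_contact(structures, elem_a, elem_b):
    """Pool nearest-pair distances over all structures and take the median."""
    pooled = []
    for elements, coords in structures:
        pooled.extend(nearest_pair_distances(elements, coords, elem_a, elem_b))
    return float(np.median(pooled)) if pooled else None


structures = [
    (["Pb", "I", "I"], [[0, 0, 0], [3.0, 0, 0], [0, 3.2, 0]]),
    (["Pb", "I"], [[0, 0, 0], [3.4, 0, 0]]),
]
pb_i_median = median_contact(structures, "Pb", "I")
```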
How the predicted structure file is built¶
The output structure is generated in stages. It is not copied from a single template file, and it is not generated by directly sampling a force field.
1. Seed the node scaffold¶
If a representative observed structure is available, the workflow tries to reuse its node geometry as an initial seed. Otherwise it starts from a minimal node seed.
2. Grow the larger node network¶
Additional node atoms are placed one at a time using:
- the learned median node-node bond length
- node-node connectivity inferred from observed scaffolds
- node-centered angle preferences
- collision penalties that discourage unrealistic crowding
3. Place linker and shell atoms¶
Non-node atoms are placed after the node scaffold is built. Their placement order is determined by the learned coordination behavior:
- atoms that usually bridge multiple node atoms are placed first
- terminal atoms are attached afterward
For each atom, the workflow evaluates candidate positions using:
- target bond lengths to attached node atoms
- node-centered bond angles
- learned non-node contact distances
- learned non-node coordination counts
- penalties for short contacts and over-coordination
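A heavily simplified version of this scoring idea is sketched below. It covers only the bond-length and short-contact terms; angle and coordination terms are omitted, and the quadratic penalty form, the function names, and the parameter names are all assumptions rather than the application's actual scoring function.

```python
import numpy as np


def placement_score(position, attached_nodes, other_atoms, target_bond, min_contact):
    """Lower is better: bond-length mismatch plus short-contact penalty."""
    position = np.asarray(position, dtype=float)
    score = 0.0
    for node in attached_nodes:
        distance = np.linalg.norm(position - np.asarray(node, dtype=float))
        score += (distance - target_bond) ** 2
    for atom in other_atoms:
        distance = np.linalg.norm(position - np.asarray(atom, dtype=float))
        if distance < min_contact:
            score += (min_contact - distance) ** 2
    return float(score)


def best_position(candidates, **kwargs):
    """Pick the candidate position with the lowest score."""
    return min(candidates, key=lambda p: placement_score(p, **kwargs))


chosen = best_position(
    [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
    attached_nodes=[(0.0, 0.0, 0.0)],
    other_atoms=[(3.0, 0.5, 0.0)],
    target_bond=3.0,
    min_contact=2.0,
)
```

Both candidate positions satisfy the target bond length, so the short-contact penalty alone decides between them.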
This is why linker-linker and linker-shell behavior now influences the final predicted structures instead of only satisfying node-centered coordination.
4. Preserve the geometry-guided local distances¶
The code still carries the predicted maximum radius as a learned descriptor, but the final global rescaling step is currently a no-op. In practice this means the output structure keeps the local bond lengths and angles generated by the geometry-guided placement stage instead of being stretched afterward.
5. Write predicted structure files¶
Each retained predicted candidate is written as its own XYZ structure file. The export includes node and non-node atoms that belong to the predicted cluster definition.
Debye scattering with pairwise Debye-Waller damping¶
Cluster Dynamics ML now distinguishes between two SAXS-component cases:
- Averaged component: When a SAXS component is already averaged over many structure files, the averaging itself captures thermal disorder and motif variability.
- Single-structure component: When the component is computed directly from one XYZ or PDB structure, the trace is missing that ensemble broadening unless a disorder model is added explicitly.
This second case is exactly where the Debye-Waller-aware Debye equation is used. In practice, that means it applies to predicted-structure SAXS traces and to any observed component that must fall back to a single representative structure instead of an averaged project component.
Classical Debye equation¶
For atomic coordinates \(\mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_N\), the current Debye scattering calculation is:
\[
I(q) = \sum_{i=1}^{N} \sum_{j=1}^{N} f_i(q)\, f_j(q)\, \frac{\sin(q r_{ij})}{q r_{ij}}
\]
where:
- \(q\) is the magnitude of the scattering vector
- \(f_i(q)\) is the X-ray form factor of atom \(i\)
- \(r_{ij} = \lVert \mathbf{r}_i - \mathbf{r}_j \rVert\)
In the code this is implemented with the normalized sinc form, \(\operatorname{sinc}(q r_{ij} / \pi)\), which is mathematically equivalent to \(\sin(q r_{ij}) / (q r_{ij})\).
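The sinc identity can be checked with a short NumPy sketch of the Debye double sum. The function name `debye_intensity` and the `(n_atoms, n_q)` form-factor array layout are assumptions for this example.

```python
import numpy as np


def debye_intensity(q, coords, form_factors):
    """Debye double sum; form_factors has shape (n_atoms, n_q)."""
    q = np.asarray(q, dtype=float)
    coords = np.asarray(coords, dtype=float)
    f = np.asarray(form_factors, dtype=float)
    r = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # np.sinc(x) is sin(pi x) / (pi x), so sinc(q r / pi) = sin(q r) / (q r).
    # It also returns 1 at r = 0, which handles the i = j terms cleanly.
    kernel = np.sinc(q[None, None, :] * r[:, :, None] / np.pi)
    return np.einsum("iq,jq,ijq->q", f, f, kernel)


# Two unit scatterers 2 apart, evaluated at q = 0 and q = pi / 2.
intensity = debye_intensity(
    [0.0, np.pi / 2],
    [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    np.ones((2, 2)),
)
```

At \(q = 0\) every pair contributes fully, giving 4, while at \(q r_{ij} = \pi\) the interference terms vanish and only the two self terms remain.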
Debye-Waller-extended single-structure equation¶
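For a single representative structure, Cluster Dynamics ML applies a pairwise Debye-Waller damping factor to the off-diagonal pair contributions of the Debye sum. Each interference term between atoms \(i\) and \(j\) is attenuated by a Gaussian factor that depends on \(q\) and \(\sigma_{\alpha(i)\beta(j)}\),
where: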
- \(\alpha(i)\) is the element of atom \(i\)
- \(\beta(j)\) is the element of atom \(j\)
- \(\sigma_{\alpha\beta}\) is the pair-specific thermal displacement parameter for the element pair \((\alpha, \beta)\)
The diagonal self-scattering terms \(i=j\) are left undamped. Only the interference terms between distinct atoms are attenuated.
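The sketch below shows one way to add pairwise damping to the Debye sum. It assumes the common Gaussian form \(\exp(-\tfrac{1}{2} q^2 \sigma^2)\) for the damping factor; the source does not confirm the exact expression, and the function name and the `frozenset`-keyed `sigma` dictionary are hypothetical.

```python
import numpy as np


def debye_waller_intensity(q, coords, form_factors, elements, sigma):
    """Debye sum with pairwise damping exp(-0.5 * q^2 * sigma^2) (assumed form).

    sigma maps frozenset({element_a, element_b}) to a displacement width.
    """
    q = np.asarray(q, dtype=float)
    coords = np.asarray(coords, dtype=float)
    f = np.asarray(form_factors, dtype=float)
    n_atoms = len(elements)
    r = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    widths = np.zeros((n_atoms, n_atoms))  # the diagonal stays 0: no damping
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i != j:
                widths[i, j] = sigma[frozenset((elements[i], elements[j]))]
    kernel = np.sinc(q[None, None, :] * r[:, :, None] / np.pi)
    damping = np.exp(-0.5 * (q[None, None, :] * widths[:, :, None]) ** 2)
    return np.einsum("iq,jq,ijq->q", f, f, kernel * damping)


damped = debye_waller_intensity(
    [0.0, 1.0],
    [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    np.ones((2, 2)),
    ["Pb", "I"],
    {frozenset(("Pb", "I")): 0.5},
)
```

Because the width matrix is zero on the diagonal, the self-scattering terms pass through undamped, matching the behaviour described above.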
Relation between \(\sigma\) and \(B\)¶
Cluster Dynamics ML reports both the Gaussian displacement width \(\sigma_{\alpha\beta}\) and the equivalent Debye-Waller \(B\) coefficient, so the same damping factor can also be written in terms of \(B\).
This is the form that will be useful later if these coefficients are exposed to main-model refinement.
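For orientation, the block below writes out the conversion under the standard crystallographic convention \(B = 8\pi^2 \sigma^2\) together with a Gaussian damping factor \(\exp(-\tfrac{1}{2} q^2 \sigma^2)\). Both are assumptions about what the tool uses, and the ångström units in the numerical example are also assumed.

```latex
\[
B_{\alpha\beta} = 8\pi^{2}\,\sigma_{\alpha\beta}^{2},
\qquad
\exp\!\left(-\tfrac{1}{2}\, q^{2} \sigma_{\alpha\beta}^{2}\right)
= \exp\!\left(-\frac{B_{\alpha\beta}\, q^{2}}{16\pi^{2}}\right)
\]
\[
\sigma_{\alpha\beta} = 0.1\ \text{\AA}
\;\Rightarrow\;
B_{\alpha\beta} = 8\pi^{2} \times 0.01\ \text{\AA}^{2} \approx 0.79\ \text{\AA}^{2}
\]
```

The two exponents agree because \(B / (16\pi^2) = \sigma^2 / 2\) under this convention.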
How Debye-Waller coefficients are estimated¶
Cluster Dynamics ML estimates pairwise disorder from the observed structure ensembles before it predicts values for the larger clusters.
Observed-cluster ensemble estimate¶
For one observed stoichiometry label \(L\) and one element pair \((\alpha, \beta)\), each structure file \(s\) contributes all pair distances of that element type.
Those distances are sorted within each structure and aligned by rank across the ensemble, up to the smallest available pair count.
For each aligned rank \(k\), the workflow measures the ensemble spread of the \(k\)-th sorted distance and then aggregates those rankwise spreads into one label-level pair estimate \(\sigma_{L,\alpha\beta}\).
Finally, that estimate is converted to the equivalent \(B\) coefficient.
This gives one pairwise \(\sigma\) and \(B\) estimate per observed cluster type and per element pair type whenever the structure ensemble is large enough to measure a spread.
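The rank-alignment idea can be sketched as below. The choice of the sample standard deviation as the spread and the root-mean-square as the aggregation are assumptions, since the source does not state which statistics are used; the function name `pair_sigma` is hypothetical.

```python
import numpy as np


def pair_sigma(distance_sets):
    """Rank-aligned ensemble spread for one label and one element pair.

    distance_sets holds one list of pair distances per structure file.
    Returns None when the ensemble is too small to measure a spread.
    """
    if len(distance_sets) < 2:
        return None
    n_ranks = min(len(distances) for distances in distance_sets)
    if n_ranks == 0:
        return None
    # Sort within each structure, then truncate to the shared rank count.
    aligned = np.array([sorted(distances)[:n_ranks] for distances in distance_sets])
    rank_spread = aligned.std(axis=0, ddof=1)  # spread across structures, per rank
    return float(np.sqrt(np.mean(rank_spread ** 2)))


sigma_estimate = pair_sigma([[5.0, 3.0], [3.2, 5.0, 6.0]])
```

Truncating to the smallest pair count discards the longest distances of the larger structures, so the estimate is dominated by the short-range pairs that every structure shares.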
Predicted-cluster estimate¶
The larger predicted clusters do not have their own ensembles yet, so Cluster Dynamics ML fits a separate weighted ridge-regression model for each element pair type using the observed \(\sigma_{L,\alpha\beta}\) values as the training targets.
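For a candidate feature vector \(\mathbf{x}\), the predicted disorder value is \(\hat{\sigma}_{\alpha\beta}(\mathbf{x}) = \operatorname{expm1}\!\left(\mathbf{x}^{\top} \hat{\boldsymbol{\beta}}_{\alpha\beta}\right)\) when the target is fit in log1p space, where \(\hat{\boldsymbol{\beta}}_{\alpha\beta}\) are the fitted ridge coefficients for that pair type. This is followed again by the conversion from \(\sigma\) to \(B\).
The feature vector is the same one already used for the other Cluster Dynamics ML properties.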
This keeps the Debye-Waller prediction consistent with the rest of the population, size, and geometry prediction workflow.
How predicted populations are assigned¶
Each candidate receives:
- a predicted mean count per frame
- a predicted occupancy fraction
- a predicted mean lifetime
- a derived stability score used for ranking
The predicted population share is normalized from these predicted quantities. If the direct SAXS-style weight collapses to zero, the code falls back to occupancy and then to lifetime divided by the frame timestep. This avoids a pathological zero-share result for a candidate that is still physically plausible.
When predicted structures are mixed with the observed structures, the total predicted mass is anchored to the observed size tail rather than letting a single extrapolated candidate dominate the whole distribution.
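The fallback chain for the population share can be sketched as follows. The sketch assumes the direct weight is the predicted mean count per frame, which the source does not state explicitly; the dictionary keys and the function name `predicted_shares` are hypothetical, and the anchoring to the observed size tail is not modelled.

```python
def predicted_shares(candidates, timestep):
    """Normalize candidate weights, falling back when the direct weight is zero.

    Fallback order: mean count per frame (assumed direct weight),
    then occupancy, then mean lifetime divided by the frame timestep.
    """
    raw = []
    for candidate in candidates:
        weight = candidate["mean_count"]
        if weight <= 0:
            weight = candidate["occupancy"]
        if weight <= 0:
            weight = candidate["mean_lifetime"] / timestep
        raw.append(max(weight, 0.0))
    total = sum(raw)
    return [weight / total if total > 0 else 0.0 for weight in raw]


shares = predicted_shares(
    [
        {"mean_count": 3.0, "occupancy": 0.9, "mean_lifetime": 4.0},
        {"mean_count": 0.0, "occupancy": 1.0, "mean_lifetime": 2.0},
        {"mean_count": 0.0, "occupancy": 0.0, "mean_lifetime": 2.0},
    ],
    timestep=0.5,
)
```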
Outputs¶
Cluster Dynamics ML can produce:
- the standard time-binned colormap from clusterdynamics
- a combined lifetime table containing observed and predicted rows
- a Debye-Waller table listing the resolved \(\sigma\) and \(B\) values for each observed and predicted element pair
- histogram views for observed-only and observed-plus-predicted populations
- SAXS traces for observed-only and observed-plus-predicted models
- one predicted structure file per retained predicted candidate
- reloadable JSON datasets
- CSV exports for the colormap and lifetime table
- a detailed PowerPoint report
If prediction history is enabled and a project folder is active, each run is also cached in the project so different parameter settings can be compared later.
Parameters that directly influence the prediction¶
The most important user-controlled parameters are:
- Clusters folder: The observed reference structures that define the training ensemble.
- Predict from node count / Predict through node count: The target size range for extrapolation.
- Candidates / size: The number of ranked stoichiometry candidates retained per target size.
- Share threshold: The minimum predicted share used to prune the low-population tail.
- Atom type definitions: Which elements are treated as nodes, linkers, and shell atoms.
- Pair cutoffs: The structural neighborhood rules used in cluster extraction and geometry statistics.
- Shell options: Whether shell atoms are counted in stoichiometry labels and how shell growth is handled.
- Experimental data: Whether a fitted SAXS comparison is built for the predicted mixture.
Important constraints and limitations¶
The current algorithm is intentionally conservative. Its main constraints are:
- Small-data regime: The model assumes that only a modest number of observed cluster labels are available. It is designed to work in a data-sparse extrapolation setting.
- Extrapolation by composition trends: The regression model only sees node count, total atom count, and non-node to node ratios. It does not learn a latent representation from raw coordinates.
- Median-based geometry summaries: Geometry is driven by empirical medians rather than full probabilistic distributions.
- Single representative structure per predicted candidate: The output is a representative structure, not an ensemble of conformers. The Debye-Waller extension partially restores thermal broadening in the SAXS trace but does not replace an explicit conformer ensemble.
- No explicit energy minimization: The placement routine is geometry-guided and penalty-based; it is not a molecular mechanics or DFT relaxation.
- No atom-identity tracking: The kinetics come from count changes over time bins, not persistent atomwise trajectories of individual cluster instances.
- Not a graph neural network: The current implementation does not learn directly from the full graph or 3D coordinate tensor of each structure.
These constraints are deliberate. They keep the workflow transparent and stable in the low-data extrapolation regime, but they also limit how expressive the model can be.
Similar machine-learning algorithms¶
The current method is closest to a regularized, feature-engineered predictive model. Related model families include:
- Ridge regression: The direct ancestor of the current scalar prediction model. [1]
- Lasso and elastic net: Useful when feature selection or stronger sparsity is desired. [2, 3]
- Gaussian process regression: A flexible predictive model with uncertainty estimates, often useful in small-data scientific settings. [4]
- Message passing neural networks: A much more expressive graph-based alternative for molecular property prediction when larger training sets are available. [5]
Why the current algorithm was chosen¶
Cluster Dynamics ML is trying to extrapolate from a small observed size series to larger unobserved clusters. In that setting, a simple regularized model has practical advantages:
- easier to inspect and debug
- less likely to overfit a tiny observed set
- easier to constrain with chemical support rules
- easier to couple to explicit geometry heuristics
- easier to explain when a predicted stoichiometry or structure looks wrong
That tradeoff is the main reason the implementation currently favors weighted ridge-style regression plus geometry rules over a higher-capacity learned model.
TODO¶
The current Debye-Waller workflow is intentionally scoped to Cluster Dynamics ML result inspection and to the single-structure component traces built inside that tool. A later extension may expose these pairwise \(B\) or \(\sigma\) coefficients to the main SAXS prefit and DREAM refinement templates as optional refinable parameters, but that is not yet part of the default SAXSShell model workflow.
References¶
1. Hoerl, A. E., and Kennard, R. W. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics 12, no. 1 (1970): 55-67. https://doi.org/10.1080/00401706.1970.10488634
2. Tibshirani, R. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58, no. 1 (1996): 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
3. Zou, H., and Hastie, T. "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, no. 2 (2005): 301-320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
4. Rasmussen, C. E., and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. https://gaussianprocess.org/gpml
5. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. "Neural Message Passing for Quantum Chemistry." Proceedings of the 34th International Conference on Machine Learning 70 (2017): 1263-1272. https://proceedings.mlr.press/v70/gilmer17a.html