SindbadML Module
SindbadML
The SindbadML package provides the core functionality for integrating machine learning (ML) and hybrid modeling capabilities into the SINDBAD framework. It enables the use of neural networks and other ML models alongside process-based models for parameter learning, hybrid modeling, and advanced optimization.
Purpose
This package brings together all components required for hybrid (process-based + ML) modeling in SINDBAD, including data preparation, model construction, training routines, gradient computation, and optimizer management. It supports flexible configuration, cross-validation, and seamless integration with SINDBAD's process-based modeling workflows.
Dependencies
- Distributed: Parallel and distributed computing utilities (nworkers, pmap, workers, nprocs, CachingPool).
- Sindbad, SindbadTEM, SindbadSetup: Core SINDBAD modules for process-based modeling and setup.
- SindbadData.YAXArrays, SindbadData.Zarr, SindbadData.AxisKeys, SindbadData: Data handling, array, and cube utilities.
- SindbadMetrics: Metrics for model performance/loss evaluation.
- Enzyme, Zygote, ForwardDiff, FiniteDiff, FiniteDifferences, PolyesterForwardDiff: Automatic and numerical differentiation libraries for gradient-based learning.
- Flux: Neural network layers and training utilities for ML models.
- Optimisers: Optimizers for training neural networks.
- Statistics: Statistical utilities.
- ProgressMeter: Progress bars for ML training and evaluation (@showprogress, Progress, next!, progress_pmap, progress_map).
- PreallocationTools: Tools for efficient memory allocation.
- Base.Iterators: Iterators for batching and repetition (repeated, partition).
- Random: Random number utilities.
- JLD2: Saving and loading model checkpoints and fold indices.
Included Files
- utilsML.jl: Utility functions for ML workflows.
- diffCaches.jl: Caching utilities for differentiation.
- activationFunctions.jl: Implements various activation functions, including custom and Flux-provided activations.
- mlModels.jl: Constructors and utilities for building neural network models and other ML architectures.
- mlOptimizers.jl: Functions for creating and configuring optimizers for ML training.
- loss.jl: Loss functions and utilities for evaluating model performance and computing gradients.
- prepHybrid.jl: Prepares all data structures, loss functions, and ML components required for hybrid modeling, including data splits and feature extraction.
- mlGradient.jl: Routines for computing gradients using different libraries and methods, supporting both automatic and finite-difference differentiation.
- mlTrain.jl: Training routines for ML and hybrid models, including batching, checkpointing, and evaluation.
- neuralNetwork.jl: Neural network utilities and architectures.
- siteLosses.jl: Site-specific loss calculation utilities.
- oneHots.jl: One-hot encoding utilities.
- loadCovariates.jl: Functions for loading and handling covariate data.
Notes
The package is modular and extensible, allowing users to add new ML models, optimizers, activation functions, and training methods.
It is tightly integrated with the SINDBAD ecosystem, ensuring consistent data handling and reproducibility across hybrid and process-based modeling workflows.
Exported
SindbadML.JoinDenseNN Method
JoinDenseNN(models::Tuple)
Arguments:
- models :: a tuple of models, e.g. (m1, m2)
Returns:
- all parameters as a vector or matrix (multiple samples)
Example
using SindbadML
using Flux
using Random
Random.seed!(123)
m_big = Chain(Dense(4 => 5, relu), Dense(5 => 3), Flux.sigmoid)
m_eta = Dense(1 => 1, Flux.sigmoid)
x_big_a = rand(Float32, 4, 10)    # 4 predictors, 10 samples
x_small_a = rand(Float32, 1, 10)  # 1 predictor, 10 samples
model = JoinDenseNN((m_big, m_eta))
model((x_big_a, x_small_a))       # joined outputs for all 10 samples
SindbadML.activationFunction Function
activationFunction(model_options, act::AbstractActivation)
Return the activation function corresponding to the specified activation type and model options.
This function dispatches on the activation type to provide the appropriate activation function for use in neural network layers. For custom activation types, relevant parameters can be passed via model_options.
Arguments
- model_options: A struct or NamedTuple containing model options, including parameters for custom activation functions (e.g., k_σ for CustomSigmoid).
- act: An activation type specifying the desired activation function. Supported types include:
  - FluxRelu: Rectified Linear Unit (ReLU) activation.
  - FluxTanh: Hyperbolic tangent (tanh) activation.
  - FluxSigmoid: Sigmoid activation.
  - CustomSigmoid: Custom sigmoid activation with steepness parameter k_σ.
Returns
- A callable activation function suitable for use in neural network layers.
Example
act_fn = activationFunction(model_options, FluxRelu())
y = act_fn(x)
SindbadML.denseNN Method
denseNN(in_dim::Int, n_neurons::Int, out_dim::Int; extra_hlayers=0, activation_hidden=Flux.relu, activation_out= Flux.sigmoid, seed=1618)
Arguments
- in_dim: input dimension
- n_neurons: number of neurons in each hidden layer
- out_dim: output dimension
- extra_hlayers=0: number of extra hidden layers (default: 0)
- activation_hidden=Flux.relu: activation function for the hidden layers (default: ReLU)
- activation_out=Flux.sigmoid: activation function of the output layer (default: sigmoid)
- seed=1618: random seed (the default, 1618, echoes the golden ratio (1+√5)/2 ≈ 1.618)
Returns a Flux.Chain neural network.
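Example
A minimal usage sketch; the argument values here are illustrative, not defaults:
using SindbadML
nn = denseNN(10, 32, 5; extra_hlayers=1)  # 10 inputs, 5 outputs; extra_hlayers adds hidden layers beyond the first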
SindbadML.destructureNN Method
destructureNN(model; nn_opt=Optimisers.Adam())
Given a model, returns a flat vector with all its weights (flat), a reconstruction function (re), and the current optimizer state (opt_state).
Arguments
- model: a Flux.Chain neural network.
- nn_opt: optimiser; the default is Optimisers.Adam().
Returns:
- flat: a flat vector with all network weights
- re: an object containing the model structure, used later to reconstruct the neural network
- opt_state: the state of the optimiser
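Example
A minimal sketch, assuming Flux is available as in the examples above:
using SindbadML
using Flux
m = Chain(Dense(4 => 5, relu), Dense(5 => 3))
flat, re, opt_state = destructureNN(m)
m_rebuilt = re(flat)  # rebuild the network from the flat weight vector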
SindbadML.epochLossComponents Method
epochLossComponents(loss_functions::F, loss_array_sites, loss_array_components, epoch_number, scaled_params, sites_list) where {F}
Compute and store the loss metrics and loss components for each site in parallel for a given training epoch.
This function evaluates the provided loss functions for each site using the current scaled parameters, and stores the resulting scalar loss metrics and loss component vectors in the corresponding arrays for the specified epoch. Parallel execution is used to accelerate computation across sites.
Arguments
- loss_functions::F: An array or KeyedArray of loss functions, one per site (where F is a subtype of AbstractArray{<:Function}).
- loss_array_sites: A matrix to store the scalar loss metric for each site and epoch (dimensions: site × epoch).
- loss_array_components: A 3D tensor to store the loss components for each site, component, and epoch (dimensions: site × component × epoch).
- epoch_number: The current epoch number (integer).
- scaled_params: A callable or array providing the scaled parameters for each site (e.g., scaled_params(site=site_name)).
- sites_list: List or array of site identifiers to process.
Notes
- The function uses Julia's threading (Threads.@spawn) to compute losses for multiple sites in parallel.
- Each site's loss metric and components are stored at the corresponding index for the current epoch.
- Designed for use within training loops to track loss evolution over epochs.
Example
epochLossComponents(loss_functions, loss_array_sites, loss_array_components, epoch, scaled_params, sites)
SindbadML.getCacheFromOutput Function
getCacheFromOutput(loc_output, ::MLGradType)
getCacheFromOutput(loc_output, ::ForwardDiffGrad)
getCacheFromOutput(loc_output, ::PolyesterForwardDiffGrad)
Returns the appropriate Cache type based on the automatic differentiation or finite differences package being used.
Arguments
- loc_output: The local output
- Second argument specifies the differentiation method:
  - ForwardDiffGrad: Uses ForwardDiff.jl for automatic differentiation
  - PolyesterForwardDiffGrad: Uses PolyesterForwardDiff.jl for automatic differentiation
  - MLGradType: All other libraries (e.g., FiniteDiff.jl, FiniteDifferences.jl) for gradient calculations
SindbadML.getIndicesSplit Function
getIndicesSplit(info, sites, fold_type)
Determine the indices for training, validation, and testing site splits for hybrid (ML) modeling in SINDBAD.
This function dispatches on the fold_type
argument to either load precomputed folds from file or to compute the splits on-the-fly based on the provided split ratios and number of folds.
Arguments
- info: The SINDBAD experiment info structure, containing hybrid modeling configuration.
- sites: Array of site identifiers (e.g., site names or indices).
- fold_type: Determines the splitting strategy. Use LoadFoldFromFile() to load folds from file, or CalcFoldFromSplit() to compute splits dynamically.
Returns
- indices_training: Indices of sites assigned to the training set.
- indices_validation: Indices of sites assigned to the validation set.
- indices_testing: Indices of sites assigned to the testing set.
Notes
- When using LoadFoldFromFile, the function loads fold indices from the file specified in info.hybrid.fold.fold_path.
- When using CalcFoldFromSplit, the function splits the sites according to the ratios and number of folds specified in info.hybrid.ml_training.options.
- Ensures reproducibility by using the random seed from info.hybrid.random_seed when shuffling sites.
Example
indices_train, indices_val, indices_test = getIndicesSplit(info, sites, info.hybrid.fold.fold_type)
SindbadML.getInnerArgs Method
getInnerArgs(idx, grads_lib, scaled_params_batch, parameter_scaling_type, selected_models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, loc_land, tem_info, parameter_to_index, parameter_scaling_type, space_observations, cost_options, constraint_method, indices_batch, sites_batch)
Function to get inner arguments for the loss function.
Arguments
- idx: index batch value
- grads_lib: gradient library
- scaled_params_batch: scaled parameters batch
- selected_models: selected models
- space_forcing: forcing data location
- space_spinup_forcing: spinup forcing data location
- loc_forcing_t: forcing data time for one time step
- space_output: output data location
- loc_land: initial land state
- tem_info: model information
- parameter_to_index: parameter to index
- parameter_scaling_type: type determining parameter scaling
- loc_observations: observation data location
- cost_options: cost options
- constraint_method: constraint method
- indices_batch: indices batch
- sites_batch: sites batch
SindbadML.getLossForSites Method
getLossForSites(gradient_lib, loss_function::F, loss_array_sites, loss_array_split, epoch_number, scaled_params, sites_list, indices_sites, models, space_forcing, space_spinup_forcing, loc_forcing_t, space_output, loc_land, tem_info, parameter_to_index, parameter_scaling_type, space_observations, cost_options, constraint_method) where {F}
Calculates the loss for all sites. The loss is calculated using the loss_function
function. The loss_array_sites
and loss_array_split
arrays are updated with the loss values. The loss_array_sites
array stores the loss values for each site and epoch, while the loss_array_split
array stores the loss values for each model output and epoch.
Arguments
- gradient_lib: gradient library
- loss_function: loss function
- loss_array_sites: array to store the loss values for each site and epoch
- loss_array_split: array to store the loss values for each model output and epoch
- epoch_number: epoch number
- scaled_params: scaled parameters
- sites_list: list of sites
- indices_sites: indices of sites
- models: list of models
- space_forcing: forcing data location
- space_spinup_forcing: spinup forcing data location
- loc_forcing_t: forcing data time for one time step
- space_output: output data location
- loc_land: initial land state
- tem_info: model information
- parameter_to_index: parameter to index
- space_observations: observation data location
- cost_options: cost options
- constraint_method: constraint method
SindbadML.getLossFunctionHandles Method
getLossFunctionHandles(info, run_helpers, sites)
Construct loss function handles for each site for use in hybrid (ML) modeling in SINDBAD.
This function generates callable loss functions and loss component functions for each site, encapsulating all necessary arguments and configuration from the experiment info
and runtime helpers. These handles are used during training and evaluation to compute the loss and its components for each site efficiently.
Arguments
- info: The SINDBAD experiment info structure, containing model, optimization, and hybrid configuration.
- run_helpers: Helper object returned by prepTEM, containing prepared model, forcing, observation, and output structures.
- sites: Array of site indices or identifiers for which to build loss functions.
Returns
- loss_functions: A KeyedArray of callable loss functions, one per site. Each function takes model parameters as input and returns the scalar loss for that site.
- loss_component_functions: A KeyedArray of callable functions, one per site, that return the vector of loss components (e.g., for multi-objective or constraint-based loss).
Notes
Each loss function is closed over all required data and options for its site, including model structure, parameter indices, scaling, forcing, observations, output cache, cost options, and hybrid/optimization settings.
The returned arrays are keyed by site for convenient lookup and iteration.
Example
loss_functions, loss_component_functions = getLossFunctionHandles(info, run_helpers, sites)
site_loss = loss_functions[site_index](params)
site_loss_components = loss_component_functions[site_index](params)
SindbadML.getOutputFromCache Function
getOutputFromCache(loc_output, _, ::MLGradType)
getOutputFromCache(loc_output, new_params, ::ForwardDiffGrad)
getOutputFromCache(loc_output, new_params, ::PolyesterForwardDiffGrad)
Retrieves output values from Cache
based on the differentiation method being used.
Arguments
- loc_output: The cached output values
- _ or new_params: Additional parameters (only used with ForwardDiff)
- Third argument specifies the differentiation method:
  - MLGradType: Returns the cached output directly (used with other libraries, e.g., FiniteDiff.jl, FiniteDifferences.jl)
  - ForwardDiffGrad: Processes the cached output with the new parameters when using ForwardDiff.jl; returns get_tmp.(loc_output, (new_params,))
  - PolyesterForwardDiffGrad: Calls the cached output with the new parameters using ForwardDiff.jl
SindbadML.getParamsAct Method
getParamsAct(x, parameter_table)
Scales x values in the [0,1] interval to the lower and upper bounds given in parameter_table.
Arguments
- x: vector array
- parameter_table: a table with fields default, lower, and upper that match the x vector.
Returns a vector array with the values scaled into the new interval [lower, upper].
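Example
A minimal sketch; parameter_table here stands for the experiment's parameter table with lower and upper columns:
x = rand(Float32, length(parameter_table.lower))  # values in [0, 1], e.g. from a sigmoid output layer
params = getParamsAct(x, parameter_table)         # values scaled to [lower, upper]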
SindbadML.getPullback Function
getPullback(flat, re, features::AbstractArray)
getPullback(flat, re, features::Tuple)
Arguments:
- flat: weight parameters.
- re: model structure (vanilla Chain of Dense layers).
- features: n predictors and s samples, given as one of:
  - a vector of predictors
  - a matrix of predictors: (p_n x s)
  - a tuple of vectors of predictors: (p1, p2)
  - a tuple of matrices of predictors: [(p1_n x s), (p2_n x s)]
Returns:
- new parameters and pullback function
Example
This example uses a single input feature vector or matrix.
using SindbadML
using Flux
# model
m = Chain(Dense(4 => 5, relu), Dense(5 => 3), Flux.sigmoid)
# features
_feat = rand(Float32, 4)
# apply
flat, re = destructureNN(m)
# Zygote
new_params, pullback_func = getPullback(flat, re, _feat)
# or, with multiple samples
_feat_ns = rand(Float32, 4, 3) # `n` predictors and `s` samples.
new_params, pullback_func = getPullback(flat, re, _feat_ns)
Example
This example uses multiple input feature vectors or matrices.
using SindbadML
using Flux
# model
m1 = Chain(Dense(4 => 5, relu), Dense(5 => 3), Flux.sigmoid)
m2 = Dense(2=>1, Flux.sigmoid)
combo_ms = JoinDenseNN((m1, m2))
# features
_feat1 = rand(Float32, 4)
_feat2 = rand(Float32, 2)
# apply
flat, re = destructureNN(combo_ms)
# Zygote
new_params, pullback_func = getPullback(flat, re, (_feat1, _feat2))
# or with multiple samples
_feat1_ns = rand(Float32, 4, 3) # `n` predictors and `s` samples.
_feat2_ns = rand(Float32, 2, 3) # `n` predictors and `s` samples.
new_params, pullback_func = getPullback(flat, re, (_feat1_ns, _feat2_ns))
SindbadML.gradientBatch! Function
gradientBatch!(grads_lib, grads_batch, chunk_size::Int, loss_f::Function, get_inner_args::Function, input_args...; showprog=false)
gradientBatch!(grads_lib, grads_batch, gradient_options::NamedTuple, loss_functions, scaled_params_batch, sites_batch; showprog=false)
Compute gradients for a batch of samples in hybrid (ML) modeling in SINDBAD.
This function computes the gradients of the loss function with respect to model parameters for a batch of sites or samples, using the specified gradient library. It supports both distributed and multi-threaded execution, and can handle different gradient computation backends (e.g., PolyesterForwardDiff
, ForwardDiff
, FiniteDiff
, etc.).
Arguments
- grads_lib: Gradient computation library or method. Supported types include:
  - PolyesterForwardDiffGrad: Uses PolyesterForwardDiff.jl for multi-threaded chunked gradients.
  - Other MLGradType subtypes: Use their respective backend.
- grads_batch: Pre-allocated array for storing batched gradients (size: n_parameters × n_samples).
- chunk_size: (Optional) Chunk size for threaded gradient computation (used by PolyesterForwardDiffGrad).
- gradient_options: (Optional) NamedTuple of gradient options (e.g., chunk size).
- loss_f: Loss function to be applied (for all samples).
- get_inner_args: Function to obtain inner arguments for the loss function.
- input_args: Global input arguments for the batch.
- loss_functions: Array or KeyedArray of loss functions, one per site.
- scaled_params_batch: Callable or array providing scaled parameters for each site.
- sites_batch: List or array of site identifiers for the batch.
- showprog: (Optional) If true, display a progress bar during computation (default: false).
Returns
- Updates grads_batch in-place with the computed gradients for each sample in the batch.
Notes
- The function automatically selects between distributed (pmap) and multi-threaded (Threads.@spawn) execution depending on the backend and arguments.
- Designed for use within training loops for efficient batch gradient computation.
Example
gradientBatch!(grads_lib, grads_batch, (chunk_size=4,), loss_functions, scaled_params_batch, sites_batch; showprog=true)
SindbadML.gradientSite Function
gradientSite(grads_lib, x_vals, chunk_size::Int, loss_f::Function, args...)
gradientSite(grads_lib, x_vals, gradient_options::NamedTuple, loss_f::Function)
gradientSite(grads_lib, x_vals::AbstractArray, gradient_options::NamedTuple, loss_f::Function)
Compute gradients of the loss function with respect to model parameters for a single site using the specified gradient library.
This function dispatches on the type of grads_lib
to select the appropriate differentiation backend (e.g., PolyesterForwardDiff
, ForwardDiff
, FiniteDiff
, FiniteDifferences
, Zygote
, or Enzyme
). It supports both threaded and single-threaded computation, as well as chunked evaluation for memory and speed trade-offs.
Arguments
- grads_lib: Gradient computation library or method. Supported types include:
  - PolyesterForwardDiffGrad: Uses PolyesterForwardDiff.jl for multi-threaded chunked gradients.
  - ForwardDiffGrad: Uses ForwardDiff.jl for automatic differentiation.
  - FiniteDiffGrad: Uses FiniteDiff.jl for finite difference gradients.
  - FiniteDifferencesGrad: Uses FiniteDifferences.jl for finite difference gradients.
  - ZygoteGrad: Uses Zygote.jl for reverse-mode automatic differentiation.
  - EnzymeGrad: Uses Enzyme.jl for AD (experimental).
- x_vals: Parameter values for which to compute gradients.
- chunk_size: (Optional) Chunk size for threaded gradient computation (used by PolyesterForwardDiffGrad).
- gradient_options: (Optional) NamedTuple of gradient options (e.g., chunk size).
- loss_f: Loss function to be differentiated.
- args...: Additional arguments passed to the loss function.
Returns
- ∇x: Array of gradients of the loss function with respect to x_vals.
Notes
- On Apple M1 systems, PolyesterForwardDiffGrad falls back to single-threaded ForwardDiff due to closure issues.
- The function is used internally for both site-level and batch-level gradient computation in hybrid ML training.
Example
grads = gradientSite(ForwardDiffGrad(), x_vals, (chunk_size=4,), loss_f)
SindbadML.gradsNaNCheck! Method
gradsNaNCheck!(grads_batch, _params_batch, sites_batch, parameter_table; show_params_for_nan=false)
Utility function that checks whether any calculated gradients are NaN and replaces them with 0.0f0. If NaNs are found, double-check your approach.
Arguments
- grads_batch: gradients array.
- _params_batch: parameter values.
- sites_batch: site names.
- parameter_table: parameters table.
- show_params_for_nan=false: if true, show the parameters that caused the NaNs.
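Example
A minimal sketch, assuming the batch arrays have already been built by the training loop:
gradsNaNCheck!(grads_batch, params_batch, sites_batch, parameter_table; show_params_for_nan=true)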
SindbadML.lcKAoneHotbatch Method
lcKAoneHotbatch(lc_data, up_bound, lc_name, ka_labels)
Arguments
- lc_data: Vector array
- up_bound: last index class; the range goes from 1:up_bound, and any case not in that range uses the up_bound value. For PFT use 17 and for KG 32.
- lc_name: land cover approach, either KG or PFT.
- ka_labels: KeyedArray labels, i.e. site names
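Example
A minimal sketch; the class codes and site names are illustrative, and lc_name is assumed here to be passed as the string "PFT":
lc_data = [1, 4, 23, 17]                              # raw PFT codes; 23 is outside 1:17 and maps to 17
site_names = ["DE-Hai", "US-Ha1", "FI-Hyy", "AU-Tum"]
onehot = lcKAoneHotbatch(lc_data, 17, "PFT", site_names)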
SindbadML.loadCovariates Method
loadCovariates(sites_forcing; kind="all")
Use the kind argument to select different sets of covariates.
Arguments
- sites_forcing: names of forcing sites
- kind: defaults to "all"
Other options for kind:
- PFT
- KG
- KG_PFT
- PFT_ABCNOPSWB
- KG_ABCNOPSWB
- ABCNOPSWB
- veg_all
- veg
- KG_veg
- veg_ABCNOPSWB
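Example
A minimal sketch, assuming the option names listed above are passed as strings:
covariates = loadCovariates(sites_forcing; kind="KG_PFT")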
SindbadML.loss Method
loss(params, models, parameter_to_index, parameter_scaling_type, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, land_init, tem_info, loc_obs, cost_options, constraint_method, gradient_lib, ::LossModelObsML)
Calculates the scalar loss for a given site in hybrid (ML) modeling in SINDBAD.
This function computes the loss value for a given site by first calling lossVector
to obtain the vector of loss components, and then combining them into a scalar loss using the combineMetric
function and the specified constraint method.
Arguments
- params: Model parameters (typically output from an ML model).
- models: List of process-based models.
- parameter_to_index: Mapping from parameter names to indices.
- parameter_scaling_type: Parameter scaling configuration.
- loc_forcing: Forcing data for the site.
- loc_spinup_forcing: Spinup forcing data for the site.
- loc_forcing_t: Forcing data for a single time step.
- loc_output: Output data structure for the site.
- land_init: Initial land state.
- tem_info: Model information and configuration.
- loc_obs: Observation data for the site.
- cost_options: Cost function and metric configuration.
- constraint_method: Constraint method for combining metrics.
- gradient_lib: Gradient computation library or method.
- ::LossModelObsML: Type dispatch for loss model with observations and machine learning.
Returns
- t_loss: Scalar loss value for the site.
Notes
This function is used internally by higher-level training and evaluation routines.
The loss is computed by aggregating the loss vector using the specified constraint method.
Example
t_loss = loss(params, models, parameter_to_index, parameter_scaling_type, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, land_init, tem_info, loc_obs, cost_options, constraint_method, gradient_lib, LossModelObsML())
SindbadML.lossSite Method
lossSite(new_params, gradient_lib, models, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, land_init, tem_info, parameter_to_index, parameter_scaling_type, loc_obs, cost_options, constraint_method; optim_mode=true)
Function to calculate the loss for a given site. It is used during optimization, so the optim_mode argument is set to true by default. A gradient library must be specified, along with the new parameters used to update the models.
Arguments
- new_params: new parameters
- gradient_lib: gradient library
- models: list of models
- loc_forcing: forcing data location
- loc_spinup_forcing: spinup forcing data location
- loc_forcing_t: forcing data time for one time step
- loc_output: output data location
- land_init: initial land state
- tem_info: model information
- parameter_to_index: parameter to index
- loc_obs: observation data location
- cost_options: cost options
- constraint_method: constraint method
SindbadML.lossVector Method
lossVector(params, models, parameter_to_index, parameter_scaling_type, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, land_init, tem_info, loc_obs, cost_options, constraint_method, gradient_lib, ::LossModelObsML)
Calculate the loss vector for a given site in hybrid (ML) modeling in SINDBAD.
This function runs the core TEM model with the provided parameters, forcing data, initial land state, and model information, then computes the loss vector using the specified cost options and metrics. It is typically used for site-level loss evaluation during training and validation.
Arguments
- params: Model parameters (in this case, output from an ML model).
- models: List of process-based models.
- parameter_to_index: Mapping from parameter names to indices.
- parameter_scaling_type: Parameter scaling configuration.
- loc_forcing: Forcing data for the site.
- loc_spinup_forcing: Spinup forcing data for the site.
- loc_forcing_t: Forcing data for a single time step.
- loc_output: Output data structure for the site.
- land_init: Initial land state.
- tem_info: Model information and configuration.
- loc_obs: Observation data for the site.
- cost_options: Cost function and metric configuration.
- constraint_method: Constraint method for combining metrics.
- gradient_lib: Gradient computation library or method.
- ::LossModelObsML: Type dispatch for loss model with observations and machine learning.
Returns
- loss_vector: Vector of loss components for the site.
- loss_indices: Indices corresponding to each loss component.
Notes
- This function is used internally by higher-level loss and training routines.
- The loss vector is typically combined into a scalar loss using combineMetric.
Example
loss_vec, loss_idx = lossVector(params, models, parameter_to_index, parameter_scaling_type, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, land_init, tem_info, loc_obs, cost_options, constraint_method, gradient_lib, LossModelObsML())
SindbadML.mixedGradientTraining Method
mixedGradientTraining(grads_lib, nn_model, train_refs, test_val_refs, loss_fargs, forward_args; n_epochs=3, optimizer=Optimisers.Adam(), path_experiment="/")
Training function that computes model parameters using a neural network, which are then used by process-based models (PBMs) to estimate parameter gradients. Neural network weights are updated using the product of these gradients with the neural network's Jacobian.
Arguments
- grads_lib: Library used to compute the PBM parameter gradients.
- nn_model: A Flux.Chain neural network.
- train_refs: training data features.
- test_val_refs: test and validation data features.
- loss_fargs: functions used to calculate the loss.
- forward_args: arguments to evaluate the PBMs.
- path_experiment="/": path where the model is saved.
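Example
A minimal sketch of a call, assuming train_refs, test_val_refs, loss_fargs, and forward_args have been prepared beforehand and FiniteDiffGrad() is used as the gradient library:
mixedGradientTraining(FiniteDiffGrad(), nn_model, train_refs, test_val_refs, loss_fargs, forward_args;
    n_epochs=10, optimizer=Optimisers.Adam(0.01), path_experiment="./ml_experiment/")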
SindbadML.mlModel Function
mlModel(info, n_features, ::MLModelType)
Builds a Flux dense neural network model. This function initializes a neural network model based on the provided info
and n_features
.
Arguments
- info: The experiment information containing model options and parameters.
- n_features: The number of features in the input data.
- ::MLModelType: Type dispatch for the machine learning model type. Supported types:
  - ::FluxDenseNN: A simple dense neural network model implemented in Flux.jl.
Returns
The initialized machine learning model.
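Example
A minimal sketch; n_features would typically come from the loaded covariates:
ml_model = mlModel(info, n_features, FluxDenseNN())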
SindbadML.mlOptimizer Function
mlOptimizer(optimizer_options, ::MLOptimizerType)
Create an ML optimizer of the given type; the options are passed to the optimizer's constructor.
Arguments:
- optimizer_options: A dictionary or NamedTuple containing options for the optimizer.
- ::MLOptimizerType: The type used to determine which optimizer to create. Supported types include:
  - OptimisersAdam: For the Adam optimizer.
  - OptimisersDescent: For the Descent optimizer.
Returns:
- An ML optimizer object that can be used to optimize machine learning models.
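Example
A minimal sketch; the contents of optimizer_options (such as the learning rate) depend on the experiment configuration:
optimizer = mlOptimizer(optimizer_options, OptimisersAdam())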
SindbadML.oneHotPFT Method
oneHotPFT(pft, up_bound, veg_class)
Arguments
- pft: Plant Functional Type. Any entry not in 1:17 is set to the last index; this includes NaN. The last index is water/NaN.
- up_bound: last index class; the range goes from 1:up_bound, and any case not in that range uses the up_bound value. For PFT use 17.
- veg_class: true or false.
Returns a vector.
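Example
A minimal sketch; the PFT code is illustrative:
onehot = oneHotPFT(5, 17, false)  # expected: a one-hot vector with a 1 at index 5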
SindbadML.partitionBatches Method
partitionBatches(n; batch_size=32)
Return an Iterator partitioning a dataset into batches.
Arguments
- n: number of samples
- batch_size: batch size
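Example
A minimal sketch, assuming the iterator yields groups of sample indices:
for batch in partitionBatches(100; batch_size=32)
    # batch holds up to 32 sample indices
end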
SindbadML.prepHybrid Method
prepHybrid(forcing, observations, info, ::MLTrainingType)
Prepare all data structures, loss functions, and machine learning components required for hybrid (process-based + machine learning) modeling in SINDBAD.
This function orchestrates the setup for hybrid modeling by:
- Initializing model helpers and runtime structures.
- Building loss function handles for each site.
- Splitting sites into training, validation, and testing sets according to the hybrid configuration.
- Loading covariate features for all sites.
- Building the machine learning model as specified in the configuration.
- Preparing arrays for storing losses and loss components during training and evaluation.
- Initializing the optimizer for ML training.
- Collecting all relevant metadata and configuration into a single hybrid_helpers NamedTuple for downstream training routines.
Arguments
- forcing: Forcing data structure as required by the process-based model.
- observations: Observational data structure.
- info: The SINDBAD experiment info structure, containing all configuration and runtime options.
- ::MLTrainingType: Type specifying the ML training method to use (e.g., MixedGradient).
Returns
- hybrid_helpers: A NamedTuple containing all prepared data, models, loss functions, indices, features, optimizers, and arrays needed for hybrid ML training and evaluation.
Fields of hybrid_helpers
- run_helpers: Output of prepTEM, containing prepared model, forcing, observation, and output structures.
- sites: NamedTuple with training, validation, and testing site arrays.
- indices: NamedTuple with indices for training, validation, and testing sites.
- features: NamedTuple with n_features and data (covariate features for all sites).
- ml_model: The machine learning model instance (e.g., a Flux neural network).
- options: The info.hybrid configuration NamedTuple.
- checkpoint_path: Path for saving checkpoints during training.
- parameter_table: Parameter table from info.optimization.
- loss_functions: KeyedArray of callable loss functions, one per site.
- loss_component_functions: KeyedArray of callable loss component functions, one per site.
- training_optimizer: The optimizer object for ML training.
- loss_array: NamedTuple of arrays to store scalar losses for training, validation, and testing.
- loss_array_components: NamedTuple of arrays to store loss components for training, validation, and testing.
- metadata_global: Global metadata from the output configuration.
Notes
- This function is typically called once at the start of a hybrid modeling experiment to set up all necessary components.
- The returned hybrid_helpers is designed to be passed directly to training routines such as trainML.
Example
hybrid_helpers = prepHybrid(forcing, observations, info, MixedGradient())
trainML(hybrid_helpers, MixedGradient())
SindbadML.shuffleBatches Method
shuffleBatches(list, bs; seed=1)
Arguments
bs
: Batch sizelist
: an array of samplesseed
: Int
Returns shuffled partitioned batches.
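Example
A minimal sketch with site names as samples:
batches = shuffleBatches(["DE-Hai", "US-Ha1", "FI-Hyy", "AU-Tum"], 2; seed=7)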
SindbadML.shuffleList Method
shuffleList(list; seed=123)
Arguments
list
: an array of samplesseed
: Int
SindbadML.siteNameToID Method
siteNameToID(site_name, sites_list)
Returns the index of site_name in sites_list.
Arguments
- site_name: site name
- sites_list: list of site names
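Example
A minimal sketch:
idx = siteNameToID("US-Ha1", ["DE-Hai", "US-Ha1", "FI-Hyy"])  # expected: 2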
SindbadML.toClass Method
toClass(x::Number; vegetation_rules)
Arguments
x
: a key(Number)
fromvegetation_rules
vegetation_rules
SindbadML.trainML Method
trainML(hybrid_helpers, ::MLTrainingType)
Train a machine learning (ML) or hybrid model in SINDBAD using the specified training method.
This function performs the training loop for the ML model, handling batching, gradient computation, optimizer updates, loss calculation, and checkpointing. It supports hybrid modeling workflows where ML-derived parameters are used in process-based models, and is designed to work with the data structures prepared by prepHybrid
.
Arguments
- hybrid_helpers: NamedTuple containing all prepared data, models, loss functions, indices, features, optimizers, and arrays needed for ML training and evaluation (as returned by prepHybrid).
- ::MLTrainingType: Type specifying the ML training method to use (e.g., MixedGradient).
Workflow
- Iterates over epochs and batches of training sites.
- For each batch:
  - Extracts features and computes model parameters.
  - Computes gradients using the specified gradient method.
  - Checks for NaNs in gradients and replaces them if needed.
  - Updates model parameters using the optimizer.
- After each epoch:
  - Computes and stores losses and loss components for the training, validation, and testing sets.
  - Saves model checkpoints and loss arrays to disk if a checkpoint path is specified.
Notes
- The function is extensible to support different training strategies via dispatch on MLTrainingType.
- Designed for use with hybrid modeling, where ML models provide parameters to process-based models.
- Checkpointing enables resuming or analyzing training progress.
Example
hybrid_helpers = prepHybrid(forcing, observations, info, MixedGradient())
trainML(hybrid_helpers, MixedGradient())
SindbadML.vegKAoneHotbatch Method
vegKAoneHotbatch(pft_data, ka_labels)
Arguments
- pft_data: Vector array
- ka_labels: KeyedArray labels, i.e. site names
SindbadML.vegOneHot Method
vegOneHot(v_class; vegetation_labels)
Arguments
- v_class: obtained via toClass(x; vegetation_rules).
- vegetation_labels: see them by typing vegetation_labels.
SindbadML.vegOneHotbatch Method
vegOneHotbatch(veg_classes; vegetation_labels)
Arguments
- veg_classes: obtained via toClass.([x1, x2, ...])
- vegetation_labels: see them by typing vegetation_labels
Internal
SindbadML.batchShuffler Method
batchShuffler(x_forcings, ids_forcings, batch_size; bs_seed=1456)
Shuffles the batches of forcings and their corresponding indices.
SindbadML.getLoss Method
getLoss(models, loc_forcing, loc_spinup_forcing, loc_forcing_t, loc_output, land_init, tem_info, loc_obs, cost_options, constraint_method; optim_mode=true)
Calculates the loss for a given site. At this stage, the model parameters should already have been set. The loss is calculated using the metricVector and combineMetric functions: metricVector calculates the loss for each model output, and combineMetric combines the losses into a single value.
Arguments
- models: list of models
- loc_forcing: forcing data location
- loc_spinup_forcing: spinup forcing data location
- loc_forcing_t: forcing data time for one time step
- loc_output: output data location
- land_init: initial land state
- tem_info: model information
- loc_obs: observation data location
- cost_options: cost options
- constraint_method: constraint method
The keyword argument optim_mode returns only the loss value when set to true; otherwise, the function returns the loss value, the loss vector, and the loss indices.
SindbadML.getNFolds Method
getNFolds(sites, train_ratio, val_ratio, test_ratio, n_folds, batch_size; seed=1234)
Partition a list of sites into training, validation, and testing sets for k-fold cross-validation in hybrid (ML) modeling.
This function shuffles the input sites
array using the provided random seed
for reproducibility, then splits the sites into n_folds
folds. It computes the number of sites for each partition based on the provided ratios, ensuring the training set size is a multiple of batch_size
. The function returns the indices for training, validation, and testing sets, as well as the full list of folds.
Arguments
- sites: Array of site identifiers (e.g., site names or indices).
- train_ratio: Fraction of sites to assign to the training set.
- val_ratio: Fraction of sites to assign to the validation set.
- test_ratio: Fraction of sites to assign to the testing set.
- n_folds: Number of folds for cross-validation.
- batch_size: Batch size for training; the training set size will be rounded down to a multiple of this value.
- seed: (Optional) Random seed for reproducibility (default: 1234).
Returns
- train_indices: Array of sites assigned to the training set.
- val_indices: Array of sites assigned to the validation set.
- test_indices: Array of sites assigned to the testing set.
- folds: Vector of arrays, each containing the sites for one fold.
Notes
- The sum of train_ratio, val_ratio, and test_ratio must be approximately 1.0.
- The returned folds can be used for further cross-validation or analysis.
Example
train_indices, val_indices, test_indices, folds = getNFolds(sites, 0.7, 0.15, 0.15, 5, 32; seed=42)
SindbadML.scaleToBounds Method
scaleToBounds(x, lo_b, up_b)
Scales values in the [0,1] interval to some given lower lo_b
and upper up_b
bounds.
Arguments
- x: vector array
- lo_b: lower bound
- up_b: upper bound
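Example
A minimal sketch, assuming a linear mapping from [0, 1] to [lo_b, up_b]:
scaleToBounds([0.0f0, 0.5f0, 1.0f0], 2.0f0, 10.0f0)  # expected: [2.0, 6.0, 10.0]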