Customize a Dataset Configuration
Overview
The main task in setting up a training procedure with metatrain is to provide
files for training, validation, and testing datasets. Our system allows flexibility in
parsing data for training. Mandatory sections in the options.yaml file include:
- training_set
- test_set
- validation_set
Each section can follow a similar system, with shorthand methods available to simplify dataset definitions.
Minimal Configuration Example
Below is the simplest form of these sections:
training_set: "dataset.xyz"
test_set: 0.1
validation_set: 0.1
This configuration parses all information from dataset.xyz and randomly sets aside
20% of its structures for testing and validation (10% each). The selected indices for
the training, validation and test subsets will be available in the outputs directory.
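Because each of the three sections can follow a similar system, the test and validation sets can presumably also be given as separate files rather than as fractions. The sketch below assumes that this works as described. The file names train.xyz, test.xyz and valid.xyz are hypothetical.

```yaml
# Sketch: explicit files instead of random fractional splits.
# The file names are placeholders, not files shipped with metatrain.
training_set: "train.xyz"
test_set: "test.xyz"
validation_set: "valid.xyz"
```

With explicit files, no random selection from the training file would be needed, which can help when a fixed benchmark split has to be reproduced.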
Expanded Configuration Format
The train script automatically expands the training_set section into the following
format, which is also valid for initial input:
training_set:
  systems:
    read_from: dataset.xyz
    reader: ase
    length_unit: null
  targets:
    energy:
      quantity: energy
      read_from: dataset.xyz
      reader: ase
      key: energy
      unit: null
      forces:
        read_from: dataset.xyz
        reader: ase
        key: forces
      stress:
        read_from: dataset.xyz
        reader: ase
        key: stress
test_set: 0.1
validation_set: 0.1
Understanding the YAML Block
The training_set is divided into sections systems and targets:
Systems Section
Describes the system data like positions and cell information.
- param read_from:
The file containing system data.
- param reader:
The reader library to use for parsing, guessed from the file extension if
null or not provided.
- param length_unit:
The unit of lengths, optional but highly recommended for running simulations.
A single string in this section automatically expands, using the string as the
read_from parameter.
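As an illustration of the shorthand described above, the following sketch gives systems as a single string. The comments show the expansion one would expect from the rules in this section. The reader value ase is taken from the expanded example earlier on this page and is not independently verified here.

```yaml
training_set:
  # Shorthand: the string is used as the read_from parameter.
  systems: dataset.xyz
  # Expected expansion (a sketch based on the rules above):
  #   systems:
  #     read_from: dataset.xyz
  #     reader: ase          # guessed from the .xyz extension
  #     length_unit: null
  targets:
    energy:
      key: energy
test_set: 0.1
validation_set: 0.1
```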
Note
metatrain does not convert units during training or evaluation. Units are
only required if the model should be used to run MD simulations.
Targets Section
Allows defining multiple target sections, each with a unique name.
Commonly, a section named energy should be defined, which is essential for running molecular dynamics simulations. For the energy section, gradients like forces and stress are enabled by default.
Other target sections can also be defined, as long as they are prefixed by mtt::, for example mtt::free_energy. In general, all targets that are not standard outputs of metatomic (see https://docs.metatensor.org/metatomic/latest/outputs/index.html) should be prefixed by mtt::.
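A minimal sketch of a targets section that combines the standard energy target with a custom, prefixed one. The key my_free_energy and the chosen units are assumptions for illustration. Only the mtt:: naming rule comes from the text above.

```yaml
training_set:
  systems:
    read_from: dataset.xyz
    length_unit: angstrom
  targets:
    energy:                 # standard metatomic output, no prefix needed
      key: energy
      unit: eV
    mtt::free_energy:       # non-standard target, hence the mtt:: prefix
      quantity: energy
      key: my_free_energy   # hypothetical key stored in dataset.xyz
      unit: eV
test_set: 0.1
validation_set: 0.1
```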
Target section parameters include:
- param quantity:
The target's quantity (e.g., energy, dipole). Currently only energy is supported.
- param read_from:
The file for target data, defaults to the systems.read_from file if not provided.
- param reader:
The reader library to use for parsing, guessed from the file extension if
null or not provided.
- param key:
The key for reading from the file, defaulting to the target section's name if not provided.
- param unit:
The unit of the target, optional but highly recommended for running simulations.
- param forces:
Gradient sections. See Gradient Section for parameters.
- param stress:
Gradient sections. See Gradient Section for parameters.
- param virial:
Gradient sections. See Gradient Section for parameters.
A single string in a target section automatically expands, using the string as the
read_from parameter.
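The string shorthand for a target can be sketched as follows. The file names structures.xyz and energies.xyz are hypothetical, and the commented expansion is what the defaults listed above would suggest rather than verified output of the train script.

```yaml
training_set:
  systems: structures.xyz
  targets:
    # Shorthand: the string is used as the read_from parameter.
    energy: energies.xyz
    # Expected expansion (a sketch based on the defaults above):
    #   energy:
    #     read_from: energies.xyz
    #     key: energy        # defaults to the target section's name
test_set: 0.1
validation_set: 0.1
```

Keeping targets in a separate file like this is only useful when the structure file does not already carry the labels. Otherwise read_from can be omitted and defaults to the systems file.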
Gradient Section
Each gradient section (like forces or stress) has similar parameters:
- param read_from:
The file for gradient data.
- param reader:
The reader library to use for parsing, guessed from the file extension if
null or not provided.
- param key:
The key for reading from the file.
A single string in a gradient section automatically expands, using the string as the
read_from parameter.
Sections set to true or on automatically expand with default parameters. A
warning is raised if requisite data for a gradient is missing, but training proceeds
without it.
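The two gradient shorthands described above can be combined in one target, as in this sketch. The file name forces.xyz is hypothetical, and what "default parameters" resolve to for stress is not spelled out here, so the comments only restate the rules from this section.

```yaml
training_set:
  systems: dataset.xyz
  targets:
    energy:
      key: energy
      forces: forces.xyz   # string shorthand, used as read_from
      stress: on           # expands with default parameters
test_set: 0.1
validation_set: 0.1
```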
Note
In all sections, unknown keys are ignored during dataset parsing and are not deleted.
Multiple Datasets
Some applications require more than one dataset for model training. metatrain
supports stacking several datasets together using the YAML list syntax, which
consists of lines at the same indentation level, each starting with "- " (a dash
and a space):
training_set:
  - systems:
      read_from: dataset_0.xyz
      length_unit: angstrom
    targets:
      energy:
        quantity: energy
        key: my_energy_label0
        unit: eV
  - systems:
      read_from: dataset_1.xyz
      length_unit: angstrom
    targets:
      energy:
        quantity: energy
        key: my_energy_label1
        unit: eV
      free-energy:
        quantity: energy
        key: my_free_energy
        unit: hartree
test_set: 0.1
validation_set: 0.1
The required test and validation splits are performed consistently for each
element in training_set.
The length_unit has to be the same for each element of the list. If target section
names are the same for different elements of the list, their unit also has to be the
same. In the example above, the target section energy exists in both list
elements and therefore has the same unit eV. The target section free-energy
only exists in the second element, so its unit is not constrained by the first
element of the list.
Typically, the global atomic types the model is defined for are inferred from the
training and validation datasets. Sometimes, due to shuffling of datasets with low
representation of some types, these datasets may not contain all atomic types that you
want to use in your model. To explicitly control the atomic types the model is defined
for, specify the atomic_types key in the architecture section of the options
file:
architecture:
  name: pet
  model:
    cutoff: 5.0
  training:
    batch_size: 32
    epochs: 100
  atomic_types: [1, 6, 7, 8, 16]  # i.e. for H, C, N, O, S
Warning
Even though parsing several datasets is supported by the library, it may not work with every architecture. Check whether your desired architecture supports multiple datasets.
In the next tutorials we explain and show how to set some advanced global training parameters.
Datasets requiring additional data
Some targets require additional data to be passed to the loss function for training. For example, training a model to predict the polarization for extended systems under periodic boundary conditions might require the quantum of polarization to be provided for each system in the dataset.
metatrain supports passing additional data in the options.yaml file.
For example, if you want to train a polarization model, you can add the following
section to your options.yaml file:
training_set:
  systems:
    read_from: dataset_0.xyz
    length_unit: angstrom
  targets:
    mtt::polarization:
      read_from: polarization.mts
  extra_data:
    polarization_quantum:
      read_from: polarization_quantum.mts
Warning
While the extra_data section can always be present, it is only used by specific
loss functions. If the loss function you picked does not support the extra data,
the section is ignored.