Customize a Dataset Configuration¶

Overview¶

The main task in setting up a training procedure with metatrain is to provide files for the training, validation, and test datasets. metatrain is flexible in how it parses this data. The options.yaml file must contain the following sections:

training_set
test_set
validation_set

All three sections follow the same format, and shorthand forms are available to simplify dataset definitions.

Minimal Configuration Example¶

Below is the simplest form of these sections:

training_set: "dataset.xyz"
test_set: 0.1
validation_set: 0.1

This configuration parses all information from dataset.xyz and randomly selects 20% of the structures for testing and validation (10% each); the remaining 80% are used for training. The selected indices for the training, validation, and test subsets are available in the outputs directory.
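Since the test and validation sections accept the same format as the training section, it should also be possible to read them from separate files instead of splitting them off as fractions. A minimal sketch, assuming three hypothetical files that already hold the split data:

```yaml
# Hypothetical file names; each string is treated as the file to read from,
# in the same way as the training_set shorthand.
training_set: "train.xyz"
test_set: "test.xyz"
validation_set: "validation.xyz"
```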

Expanded Configuration Format¶

The train script automatically expands the training_set section into the following format, which can also be written directly in the input file:

training_set:
    systems:
        read_from: dataset.xyz
        reader: ase
        length_unit: null
    targets:
        energy:
            quantity: energy
            read_from: dataset.xyz
            reader: ase
            key: energy
            unit: null
            forces:
                read_from: dataset.xyz
                reader: ase
                key: forces
            stress:
                read_from: dataset.xyz
                reader: ase
                key: stress
test_set: 0.1
validation_set: 0.1

Understanding the YAML Block¶

The training_set is divided into sections systems and targets:

Systems Section¶

Describes the system data, such as positions and cell information.

param read_from:: The file containing system data.
param reader:: The reader library to use for parsing, guessed from the file extension if null or not provided.
param length_unit:: The unit of lengths, optional but highly recommended for running simulations.

A single string in this section automatically expands, using the string as the read_from parameter.
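As an illustration of this shorthand, the sketch below shows a single-string systems section followed by an expanded form of the same section. The file name dataset.xyz and the ase reader are taken from the earlier example; setting length_unit to angstrom is an assumption made here to show the recommended explicit unit. The targets section is omitted for brevity.

```yaml
# Shorthand: the string is used as the read_from parameter.
training_set:
    systems: "dataset.xyz"
---
# Expanded form of the same section, with the length unit set explicitly.
training_set:
    systems:
        read_from: dataset.xyz
        reader: ase  # guessed from the file extension when omitted
        length_unit: angstrom  # assumed unit, not part of the shorthand
```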

Note

metatrain does not convert units during training or evaluation. Units are only required if the model will be used to run MD simulations.

Targets Section¶

Allows defining multiple target sections, each with a unique name.

Commonly, a section named energy should be defined, which is essential for running molecular dynamics simulations. For the energy section, gradients such as forces and stress are enabled by default.
Other target sections can also be defined, as long as they are prefixed by mtt::. For example, mtt::free_energy. In general, all targets that are not standard outputs of metatomic (see https://docs.metatensor.org/metatomic/latest/outputs/index.html) should be prefixed by mtt::.
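For illustration, a targets section that combines the standard energy target with a custom one might look like the sketch below. The labels my_energy and my_free_energy and the eV units are hypothetical and assumed to match what is stored in the dataset file.

```yaml
targets:
    energy:  # standard output name, so no prefix is needed
        quantity: energy
        key: my_energy  # hypothetical label in the dataset file
        unit: eV
    mtt::free_energy:  # non-standard target, so it carries the mtt:: prefix
        quantity: energy
        key: my_free_energy  # hypothetical label in the dataset file
        unit: eV
```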

Target section parameters include:

param quantity:: The target’s quantity (e.g., energy, dipole). Currently only energy is supported.
param read_from:: The file for target data, defaults to the systems.read_from file if not provided.
param reader:: The reader library to use for parsing, guessed from the file extension if null or not provided.
param key:: The key for reading from the file, defaulting to the target section’s name if not provided.
param unit:: The unit of the target, optional but highly recommended for running simulations.
param forces:: Gradient sections. See Gradient Section for parameters.
param stress:: Gradient sections. See Gradient Section for parameters.
param virial:: Gradient sections. See Gradient Section for parameters.

A single string in a target section automatically expands, using the string as the read_from parameter.
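A short sketch of this shorthand, assuming a hypothetical file energies.xyz that stores the energies under the key energy. The second fragment shows roughly what the shorthand stands for according to the defaults listed above; the remaining parameters keep their default values.

```yaml
# Shorthand: the string is used as read_from for the energy target.
targets:
    energy: "energies.xyz"  # hypothetical file name
---
# Roughly equivalent explicit form.
targets:
    energy:
        read_from: energies.xyz
        key: energy  # defaults to the target section's name
```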

Gradient Section¶

Each gradient section (like forces or stress) has similar parameters:

param read_from:: The file for gradient data.
param reader:: The reader library to use for parsing, guessed from the file extension if null or not provided.
param key:: The key for reading from the file.

A single string in a gradient section automatically expands, using the string as the read_from parameter.

Sections set to true or on automatically expand with default parameters. A warning is raised if the data required for a gradient is missing, but training proceeds without it.
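The two gradient shorthands can be sketched as follows. The file name forces.xyz is hypothetical, and the fragment only illustrates the syntax described above rather than a recommended setup.

```yaml
targets:
    energy:
        quantity: energy
        key: energy
        forces: "forces.xyz"  # hypothetical file; the string is used as read_from
        stress: true  # expands with default parameters
```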

Note

In all sections, unknown keys are ignored during dataset parsing and are not deleted.

Multiple Datasets¶

Some applications require more than one dataset for model training. metatrain supports stacking several datasets together using the YAML list syntax, which consists of lines at the same indentation level, each starting with "- " (a dash and a space):

training_set:
    - systems:
          read_from: dataset_0.xyz
          length_unit: angstrom
      targets:
          energy:
              quantity: energy
              key: my_energy_label0
              unit: eV
    - systems:
          read_from: dataset_1.xyz
          length_unit: angstrom
      targets:
          energy:
              quantity: energy
              key: my_energy_label1
              unit: eV
          free-energy:
              quantity: energy
              key: my_free_energy
              unit: hartree
test_set: 0.1
validation_set: 0.1

The required test and validation splits are performed consistently for each element in training_set.

The length_unit has to be the same for each element of the list. If target section names are the same for different elements of the list, their unit also has to be the same. In the example above, the target section energy exists in both list elements and therefore has the same unit, eV. The target section free-energy only exists in the second element, so its unit does not have to match any unit in the first element of the list.

Typically, the global atomic types that the model is defined for are inferred from the training and validation datasets. Sometimes, due to shuffling of datasets with low representation of some types, these datasets may not contain all atomic types that you want to use in your model. To explicitly control the atomic types the model is defined for, specify the atomic_types key in the architecture section of the options file:

architecture:
    name: pet
    model:
        cutoff: 5.0
    training:
        batch_size: 32
        epochs: 100
    atomic_types: [1, 6, 7, 8, 16]  # i.e. for H, C, N, O, S

Warning

Even though parsing several datasets is supported by the library, it may not work with every architecture. Check whether your desired architecture supports multiple datasets.

In the next tutorials we explain and show how to set some advanced global training parameters.

Datasets requiring additional data¶

Some targets require additional data to be passed to the loss function for training. For example, training a model to predict the polarization for extended systems under periodic boundary conditions might require the quantum of polarization to be provided for each system in the dataset.

metatrain supports passing additional data in the options.yaml file. For example, if you want to train a polarization model, you can add the following section to your options.yaml file:

training_set:
    systems:
        read_from: dataset_0.xyz
        length_unit: angstrom
    targets:
        mtt::polarization:
            read_from: polarization.mts
    extra_data:
        polarization_quantum:
            read_from: polarization_quantum.mts

Warning

The extra_data section can always be present, but it is used only by specific loss functions. If the loss function you picked does not support the extra data, it is ignored.