Added configuration management using pydantic #986
base: master
Conversation
add Pydantic configuration
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Please check the Codacy errors: https://app.codacy.com/gh/mlcommons/GaNDLF/pull-requests/986/issues
@szmazurek could you please take a first pass?
Yup, will do in about an hour or tomorrow morning.
save_output: bool = Field(
    default=False, description="Save outputs during validation/testing."
)
in_memory: bool = Field(default=False, description="Pin data to CPU memory.")
What does it mean to "pin data to CPU/GPU memory"? Also, does `in_memory` really enforce a page-lock on the memory storing a given chunk of data, or does it just keep everything in RAM?
I took these default parameters and tried to model them with Pydantic:
GaNDLF/GANDLF/config_manager.py, line 16 in 3bfe133:
parameter_defaults = {
pin_memory is used here:
GaNDLF/GANDLF/data/__init__.py, line 28 in 3bfe133:
pin_memory=False,  # params["pin_memory_dataloader"], # this is going OOM if True - needs investigation
Regarding the comment about OOM with pin_memory (not really related to this PR, just in general): pinning memory may cause OOM when the program already runs at nearly full RAM utilization, because page-locking prevents parts of memory from being swapped out. Still, I would advocate for letting the user try that option too (I did so in the Lightning port).
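Not part of this PR, but as a rough sketch of what exposing that option could look like (the class and field names here are hypothetical, not GaNDLF's actual config): the flag would simply be forwarded to the DataLoader instead of being hard-coded to False.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pydantic import BaseModel, Field


class DataLoaderConfig(BaseModel):  # hypothetical config class
    # page-locks host memory so host-to-GPU copies can be faster
    pin_memory: bool = Field(
        default=False,
        description="Page-lock (pin) host memory in the DataLoader for faster GPU transfers.",
    )


def build_loader(dataset, config: DataLoaderConfig) -> DataLoader:
    # the user-facing flag is forwarded instead of the hard-coded pin_memory=False
    return DataLoader(dataset, batch_size=4, pin_memory=config.pin_memory)


if __name__ == "__main__":
    ds = TensorDataset(torch.randn(16, 3), torch.randint(0, 2, (16,)))
    loader = build_loader(ds, DataLoaderConfig(pin_memory=True))
    next(iter(loader))
```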
data_postprocessing: Union[dict, set] = Field(
    default={}, description="Default data postprocessing configuration."
)
grid_aggregator_overlap: str = Field(
What other options can we have here? I believe this cannot be an arbitrary string, so it should be a Literal of the available values.
I believe it can be one of these: ["crop", "average", "hann"].
Ref: https://torchio.readthedocs.io/patches/patch_inference.html#grid-aggregator
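A minimal sketch of that change, assuming "crop" stays the default (the surrounding class name is made up for illustration):

```python
from typing import Literal
from pydantic import BaseModel, Field


class InferenceConfig(BaseModel):  # hypothetical container class
    # TorchIO's GridAggregator supports exactly these overlap modes
    grid_aggregator_overlap: Literal["crop", "average", "hann"] = Field(
        default="crop",
        description="Overlap handling for TorchIO's GridAggregator.",
    )


InferenceConfig(grid_aggregator_overlap="average")   # ok
# InferenceConfig(grid_aggregator_overlap="median")  # raises ValidationError
```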
model_config = ConfigDict(
    extra="allow"
)  # it allows extra fields in the model dict
dimension: Optional[int] = Field(description="Dimension.")
Is the dimension optional? Also, maybe it should accept only 2 or 3, since no other dimensionalities are supported. And perhaps the description could be more expressive, e.g. 'model input dimension (2D or 3D)'.
If the user doesn't define the dimension, it is calculated in the validate_patch_size(patch_size, dimension)
function in the validators file using the patch size.
Hmm, yes, maybe it should only accept 2 or 3. I will change it and update the description.
@benmalef: where is the validate_patch_size function? I am unable to find it in either master or your branch.
There is a way to calculate the dimension automatically via ITK, though. But my guess is that this should probably be rolled into the overall cohort characteristics and sanity checks (#956).
@sarthakpati, here is the function in my branch: https://github.com/benmalef/GaNDLF/blob/8837b7ccd25b747c7cbe4faaa77226a631febd85/GANDLF/Configuration/Parameters/validators.py#L132
I tried to port this code to Pydantic:
GaNDLF/GANDLF/config_manager.py, line 136 in 3bfe133:
if "patch_size" in params:
Understood. I think we still need to ensure that the dimension check only allows 2 and 3 in the validation step itself, so that we can give a meaningful error to the user.
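As a hedged sketch of how both suggestions could combine: a Literal[2, 3] field plus a model-level validator that falls back to the patch size when dimension is omitted. Class names and the inference rule below are simplified assumptions, not the PR's actual code.

```python
from typing import Literal, Optional, Union
from pydantic import BaseModel, Field, model_validator


class ModelConfig(BaseModel):  # simplified stand-in for the PR's Model class
    dimension: Optional[Literal[2, 3]] = Field(
        default=None, description="Model input dimension (2D or 3D)."
    )


class TopLevelConfig(BaseModel):  # simplified stand-in for the top-level config
    patch_size: list[Union[int, float]]
    model: ModelConfig

    @model_validator(mode="after")
    def infer_dimension_from_patch_size(self) -> "TopLevelConfig":
        # Assumed rule: a 2-element patch size or a trailing singleton
        # (e.g. [128, 128, 1]) indicates a 2D model; otherwise 3D.
        if self.model.dimension is None:
            if len(self.patch_size) == 2 or self.patch_size[-1] == 1:
                self.model.dimension = 2
            else:
                self.model.dimension = 3
        return self


cfg = TopLevelConfig(patch_size=[128, 128, 128], model=ModelConfig())
print(cfg.model.dimension)  # 3; an explicit dimension of e.g. 4 would be rejected
```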
)  # it allows extra fields in the model dict
dimension: Optional[int] = Field(description="Dimension.")
architecture: Union[ARCHITECTURE_OPTIONS, dict] = Field(description="Architecture.")
final_layer: str = Field(description="Final layer.")
Here we are also limited to a certain set of acceptable values, so leveraging a Literal seems like a good option.
Yes, you are right. I changed it. :P
),
default=3,
)  # TODO: check it
type: Optional[str] = Field(description="Type of model.", default="torch")
Should this also be a Literal? The options are probably torch and openvino. @sarthakpati, am I right?
You are right.
default=3,
)  # TODO: check it
type: Optional[str] = Field(description="Type of model.", default="torch")
data_type: str = Field(description="Data type.", default="FP32")
Is it true that we support such a field in the config and that it really influences anything in base GaNDLF? I thought the precision changes only when AMP is enabled.
I found it in the config_manager file and tried to port it to Pydantic:
GaNDLF/GANDLF/config_manager.py, line 594 in 3bfe133:
if not ("data_type" in params["model"]):
I think @szmazurek is right: this only changes when AMP gets enabled. However, this flag was probably used by an earlier version of OpenVINO. Perhaps you can comment those lines out, @benmalef, and see if that makes a difference?
type: Optional[str] = Field(description="Type of model.", default="torch")
data_type: str = Field(description="Data type.", default="FP32")
save_at_every_epoch: bool = Field(default=False, description="Save at every epoch.")
amp: bool = Field(default=False, description="Amplifier.")
amp stands for automatic mixed precision, not amplifier
Yes, you are right. Sorry for this, the auto-complete sometimes messes things up. I changed it.
default=-5,
description="this controls the number of validation data folds to be used for model *selection* during training (not used for back-propagation)",
)
proportional: Optional[bool] = Field(default=None)
What does this parameter do? Also, if it's a boolean, can't we set the default to False?
I don't know what it does, but I found it here:
GaNDLF/GANDLF/config_manager.py, line 638 in 3bfe133:
params["nested_training"]["stratified"] = params["nested_training"].get(
So I handle it in validate_nested_training(self) with the model_validator decorator.
I don't know what the default value should be; we may set the default to False.
Yes, let's set this to False.
description="this will perform stratified k-fold cross-validation but only with offline data splitting", | ||
) | ||
testing: int = Field( | ||
default=-5, |
Open question: are there any limits on the values that can be set in this field? For example, what happens if I set testing to 10? If there are limits, maybe it's worth including the allowed range when defining this field. What do you think, @sarthakpati @benmalef?
Hmm, I don't know. Again, I found this value in the config_manager:
GaNDLF/GANDLF/config_manager.py, line 642 in 3bfe133:
params["nested_training"]["validation"] = params["nested_training"].get(
I believe the only limit is the total number of subjects in the dataset. For example, if we have 100 subjects, we cannot create 101 testing/validation folds, but that is an unrealistic example. Perhaps we can set some constraints; 10 seems appropriate.
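A possible way to encode such a constraint with Field bounds; the bounds below are placeholders for illustration, not agreed values (the current negative defaults remain valid):

```python
from pydantic import BaseModel, Field, ValidationError


class NestedTraining(BaseModel):  # simplified sketch of the nested_training block
    # placeholder bounds: at most 10 folds, negative defaults (-5) stay allowed
    testing: int = Field(default=-5, ge=-10, le=10)
    validation: int = Field(default=-5, ge=-10, le=10)


try:
    NestedTraining(testing=101)  # rejected with a clear message instead of silently accepted
except ValidationError as exc:
    print(exc)
```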
class PatchSampler(BaseModel):
    type: str = Field(default="uniform")
Are there any other options available for type and padding_mode? If so, maybe we should use a Literal here?
Yes, I think this should be a Literal. We only support uniform and label [ref].
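A sketch of that change for the sampler type (padding_mode is left out here since its allowed values weren't confirmed in this thread):

```python
from typing import Literal
from pydantic import BaseModel, Field


class PatchSampler(BaseModel):
    # only 'uniform' and 'label' samplers are supported, so typos fail fast
    type: Literal["uniform", "label"] = Field(default="uniform")
```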
@model_validator(mode="after")
def validate_version(self) -> Self:
    if version_check(self.model_dump(), version_to_check=version("GANDLF")):
        return self
Should we raise an error here if the condition is not met?
Yes, an assertion error is raised inside the version_check function:
GaNDLF/GANDLF/utils/generic.py, line 91 in 3bfe133:
def version_check(version_from_config: Dict[str, str], version_to_check: str) -> bool:
Ah, I see, thanks!
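For reference, a sketch of a validator that raises explicitly instead of silently passing. It assumes GaNDLF is installed (for importlib.metadata.version) and that the packaging library is available; the class is a simplified stand-in for the PR's Version model, not its actual code.

```python
from importlib.metadata import version
from packaging.version import Version as PkgVersion
from pydantic import BaseModel, model_validator


class VersionConfig(BaseModel):  # simplified stand-in for the PR's Version model
    minimum: str
    maximum: str

    @model_validator(mode="after")
    def validate_version(self) -> "VersionConfig":
        installed = PkgVersion(version("GANDLF"))
        # raising here surfaces as a pydantic ValidationError with a clear message,
        # instead of the config silently passing when the check fails
        if not (PkgVersion(self.minimum) <= installed <= PkgVersion(self.maximum)):
            raise ValueError(
                f"installed GaNDLF {installed} is outside [{self.minimum}, {self.maximum}]"
            )
        return self
```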
)
# min_lr: 0.00001, #TODO: this should be defined ??
# max_lr: 1, #TODO: this should be defined ??
step_size: float = Field(description="step_size", default=None)
I think we need to consider separate classes that allow defining parameters per scheduler. For example, if we use a reduce-on-plateau scheduler, we need a field defining the tracked metric. I'm not really sure how to implement that nicely, though. Another approach is to define all possible fields that any scheduler can take and later provide validation logic that handles the conditionality, i.e. if the reduce_on_plateau type is chosen, then we require a monitor field.
Hmm, if we could define a separate class for each scheduler, that would be great. We should think about it.
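A rough sketch of the per-scheduler-class idea using a discriminated union on the type key; the class names, fields, and defaults below are illustrative assumptions, not the actual GaNDLF scheduler options:

```python
from typing import Literal, Union
from pydantic import BaseModel, Field


class TriangleSchedulerConfig(BaseModel):  # illustrative fields only
    type: Literal["triangle", "triangle_modified"] = "triangle_modified"
    min_lr: float = 1e-5
    max_lr: float = 1.0


class ReduceOnPlateauConfig(BaseModel):  # carries the conditional 'monitor' field
    type: Literal["reduce_on_plateau"] = "reduce_on_plateau"
    monitor: str = Field(description="Metric tracked to decide when to reduce the LR.")
    patience: int = 10
    factor: float = 0.1


class TrainingConfig(BaseModel):
    # pydantic selects the right class from the 'type' key, so each scheduler
    # only exposes (and validates) the fields it actually needs
    scheduler: Union[TriangleSchedulerConfig, ReduceOnPlateauConfig] = Field(
        default_factory=TriangleSchedulerConfig, discriminator="type"
    )


# omitting 'monitor' for reduce_on_plateau would raise a ValidationError:
TrainingConfig(scheduler={"type": "reduce_on_plateau", "monitor": "val_loss"})
```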
class UserDefinedParameters(DefaultParameters):
    version: Version = Field(
        default=Version(minimum=version("GANDLF"), maximum=version("GANDLF")),
        description="Whether weighted loss is to be used or not.",
The description is not valid, I believe :P
Hahah... I changed it :P
patch_size: Union[list[Union[int, float]], int, float] = Field(
    description="Patch size."
)
model: Model = Field(..., description="The model to use.")
Shouldn't this be a list of available strings?
I took this list:
GaNDLF/GANDLF/models/__init__.py, line 41 in 3bfe133:
global_models_dict = {
and defined it as a Literal in the model architecture parameter.
description="Scheduler.", default=Scheduler(type="triangle_modified") | ||
) | ||
optimizer: Union[str, Optimizer] = Field( | ||
description="Optimizer.", default=Optimizer(type="adam") |
Question about naming: here I first assumed we were initializing a real torch optimizer. Perhaps the names of the config classes should be suffixed with Config/params?
For the variable names, right? Absolutely - it would make it clear for developers.
What about Optimizer_config?
Yup, it should be something along the lines of ${dict_name}_config for everything, even if it isn't used as a variable later (such as optimizer, model, ...). That makes things clear for devs.
data_postprocessing_after_reverse_one_hot_encoding: dict = Field(
    description="data_postprocessing_after_reverse_one_hot_encoding.", default={}
)
differential_privacy: Any = Field(description="Differential privacy.", default=None)
Can it be just a boolean field?
This can be either a boolean or a dict:
differential_privacy:
  max_grad_norm: 0.015625
  noise_multiplier: 128.0
  physical_batch_size: 64
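One way to express that shape, sketched with a hypothetical nested model for the dict form (the field defaults mirror the example values above, the class names are made up):

```python
from typing import Optional, Union
from pydantic import BaseModel, Field


class DifferentialPrivacyConfig(BaseModel):  # hypothetical nested model for the dict form
    max_grad_norm: float = 0.015625
    noise_multiplier: float = 128.0
    physical_batch_size: int = 64


class UserConfig(BaseModel):  # simplified sketch of the top-level model
    # None = disabled, bool = enabled with defaults, nested model = enabled with overrides
    differential_privacy: Optional[Union[bool, DifferentialPrivacyConfig]] = Field(
        default=None, description="Differential privacy settings."
    )


# a YAML dict such as {'noise_multiplier': 64.0} is coerced into the nested model:
UserConfig(differential_privacy={"noise_multiplier": 64.0})
```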
Field(description="Data preprocessing."), | ||
AfterValidator(validate_data_preprocessing), | ||
] = {} | ||
# TODO: It should be defined with a better way (using a BaseModel class) |
I agree with the comment; it would add a lot of clarity.
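As a sketch of the direction (the step names below are illustrative, not the full GaNDLF preprocessing catalogue), extra="allow" keeps unmodelled steps working during the migration:

```python
from typing import Optional
from pydantic import BaseModel, ConfigDict, Field


class ResampleConfig(BaseModel):  # illustrative step
    resolution: list[float] = Field(description="Target voxel spacing.")


class DataPreprocessingConfig(BaseModel):
    model_config = ConfigDict(extra="allow")  # tolerate steps not modelled yet

    # each modelled step becomes an explicit, documented, validated field
    normalize: Optional[bool] = None
    resample: Optional[ResampleConfig] = None
```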
file.write("\n".join(markdown))


def initialize_key(
Do we need such a utility? Meaning, if there are default parameters to be set, ideally they are defined via Pydantic and populated automatically if the user did not set them explicitly.
Yes, you are right.
The best approach is not to use it, but some parameters have complex logic, and I still need to figure out how to implement them with Pydantic.
So I use it for some parameters, keeping them as they are in the config_manager.
For example, for the data_augmentation parameter I use it in the validate_data_augmentation(value, patch_size) function in the validators file.
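For the simple cases, Pydantic's own defaults already replace initialize_key; a small sketch (the field names are made up for illustration):

```python
from pydantic import BaseModel, Field


class AugmentationConfig(BaseModel):  # illustrative field names
    probability: float = Field(default=1.0, description="Probability of applying the transform.")
    # mutable defaults go through default_factory so each instance gets its own dict
    parameters: dict = Field(default_factory=dict)


AugmentationConfig()                 # probability=1.0, parameters={}
AugmentationConfig(probability=0.5)  # user value overrides the default
```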
setup.py (outdated)
@@ -85,6 +87,7 @@
"openslide-bin",
"openslide-python==1.4.1",
"lion-pytorch==0.2.2",
"pydantic",
I would pin the version; future Pydantic releases might introduce breaking changes (small chance, but I am a little paranoid).
Done. Pinned Pydantic to version 2.10.6.
@check-spelling-bot Report
🔴 Please review. See the 📂 files view, the 📜 action log, or 📝 job summary for details.
Unrecognized words (1): hann
Some files were automatically ignored 🙈; the suggested sample patterns would exclude them, and you should consider adding them to the file exclusions. File matching is via Perl regular expressions; to check these files, more of their words need to be in the dictionary than not.
To accept these unrecognized words as correct and update file exclusions, you could run the following command in a clone of the [email protected]:benmalef/GaNDLF.git repository:
curl -s -S -L 'https://raw.githubusercontent.com/check-spelling/check-spelling/main/apply.pl' | perl - 'https://github.com/mlcommons/GaNDLF/actions/runs/13841289596/attempts/1'
Available 📚 dictionaries could cover words not in the 📘 dictionary. Consider adding them with:
extra_dictionaries: |
  cspell:java/src/java-terms.txt
To stop checking additional dictionaries, add: check_extra_dictionaries: ''
Warnings (1): see the 📂 files view, the 📜 action log, or 📝 job summary for details.
add Pydantic configuration
Fixes #ISSUE_NUMBER
Proposed Changes
Checklist
- CONTRIBUTING guide has been followed.
- typing is used to provide type hints, including and not limited to using Optional if a variable has a pre-defined value.
- (If a new pip install step is needed for the PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].
- logging library is being used and no print statements are left.