Add tokenized example debugging during training #1520
base: main
Conversation
Related to oumi-ai#1369

Add functionality to log tokenized examples for debugging during training.

* Add a `log_tokenized_example` function in `src/oumi/builders/collators.py` to log the raw example, the formatted example, the tokenized example, and the model input.
* Modify `build_data_collator` in `src/oumi/builders/collators.py` to accept a `debug` parameter and pass it to the collators.
* Update `build_collator_from_config` in `src/oumi/builders/collators.py` to pass the `debug` parameter from the config to `build_data_collator`.
* Add a new command-line option `--debug-tokenized-example` in `src/oumi/cli/train.py` to enable logging of tokenized examples during training.
* Pass the `debug` flag to the training configuration in `src/oumi/cli/train.py`.
* Modify `TextCollatorWithPadding` in `src/oumi/core/collators/text_collator_with_padding.py` to accept a `debug` parameter and call `log_tokenized_example` in the `__call__` method if `debug` is `True`.
* Modify `TextCompletionsCollatorWithPadding` in `src/oumi/core/collators/text_completions_collator_with_padding.py` to accept a `debug` parameter and call `log_tokenized_example` in the `__call__` method if `debug` is `True`.
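For context, a minimal sketch of what such a `log_tokenized_example` helper could look like (the actual signature in this PR may differ; the stdlib `logging` setup and argument names here are illustrative):

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def log_tokenized_example(
    raw_example: Any,
    formatted_example: str,
    tokenized_example: list[int],
    model_input: dict[str, Any],
) -> None:
    """Log one example at each stage of the data pipeline (sketch)."""
    logger.info("Raw example: %s", raw_example)
    logger.info("Formatted example: %s", formatted_example)
    logger.info("Tokenized example: %s", tokenized_example)
    logger.info("Model input: %s", model_input)
```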
Looping in Jeremy, who logged the issue, and Panos, who's the current on-call.
@vishwamartur Please auto-format your changes: https://github.com/oumi-ai/oumi/blob/main/CONTRIBUTING.md#pull-request-pr-guidelines
@@ -223,6 +227,15 @@ def __call__(self, batch) -> dict[str, Any]:
         if labels_on:
             combined_batch[_LABELS_KEY] = collated_text_inputs[_LABELS_KEY]

+        if self._debug:
More of a design concern: having a debug example logged in each batch seems excessive.
+1, let's only log this for the first sample of the first batch.
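A sketch of that one-shot guard (a stripped-down illustration, not the PR's actual code; the `_example_logged` flag is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


class TextCollatorWithPadding:
    """Stripped-down sketch: log a debug example only once."""

    def __init__(self, debug: bool = False):
        self._debug = debug
        self._example_logged = False  # one-shot guard (hypothetical)

    def __call__(self, batch: list) -> list:
        if self._debug and not self._example_logged:
            # Log only the first sample of the first batch, then disarm.
            logger.info("First tokenized example: %s", batch[0])
            self._example_logged = True
        return batch  # a real collator would pad and collate here
```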
@@ -223,6 +227,15 @@ def __call__(self, batch) -> dict[str, Any]:
         if labels_on:
             combined_batch[_LABELS_KEY] = collated_text_inputs[_LABELS_KEY]

+        if self._debug:
+            raw_example = batch[0]
+            formatted_example = tokenizer.apply_chat_template(raw_example, tokenize=False)
is it obvious at this point that the tokenizer has a chat_template under all use cases?
+1, there's no guarantee this collator has a chat template.
Also, we've already tokenized the example, so a better choice would be to simply take the input ids of the `combined_batch`'s first element and `tokenizer.decode` them into their respective strings.
It would also be useful to log all the elements of the first batch (i.e., the masks and labels as well). We don't necessarily need to decode the labels or mask, though; just the input ids.
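One way this could look (a sketch assuming a Hugging Face tokenizer and the standard `input_ids`/`attention_mask`/`labels` keys; the helper name and key names are assumptions, not oumi's actual API):

```python
from transformers import AutoTokenizer


def log_first_batch_element(tokenizer, combined_batch: dict) -> None:
    """Sketch: log the first element of an already-collated batch."""
    input_ids = combined_batch["input_ids"][0]
    # Only the input ids need decoding; masks and labels are logged raw.
    print("input_ids     :", input_ids)
    print("decoded       :", tokenizer.decode(input_ids))
    print("attention_mask:", combined_batch.get("attention_mask", [None])[0])
    print("labels        :", combined_batch.get("labels", [None])[0])


tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = tokenizer(["Hello world"], return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()
log_first_batch_element(tokenizer, dict(batch))
```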
@@ -64,4 +69,13 @@ def __call__(self, batch) -> dict[str, Any]:
         # Collate batch prompts.
         collated_text_inputs = self._collate(batch)

+        if self._debug:
+            raw_example = batch[0]
+            formatted_example = self._tokenizer.apply_chat_template(raw_example, tokenize=False)
ditto.
+1
Hi @vishwamartur! Thank you very much for your contribution! I left some quick feedback. At a minimum, the code also needs fixing to pass the currently failing checks.
More broadly (+ @jgreer013, who opened the issue):
- I think logging an example within each batch formed is excessive. What do you think?
- It would be great to create a slightly more generic function wrapper that is called once and acts based on the collator/tokenizer (e.g., does it include a chat template? is it vision+text?) when the data is being prepared; see the sketch after this list.
- Minor: could the explicit `debug` variable (in each/any collator) be used for more debugging purposes in the future? If not, and if it is kept, perhaps make the name more explicit about the action it triggers (e.g., `log_example...`).
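For the second bullet, a rough sketch of such a wrapper (the dispatch conditions, key names, and function name are assumptions for illustration, not oumi APIs):

```python
import logging

logger = logging.getLogger(__name__)


def log_example_for_debugging(tokenizer, batch: list, collated: dict) -> None:
    """Sketch: one entry point that adapts its logging to the tokenizer/collator."""
    if getattr(tokenizer, "chat_template", None) is not None:
        # Conversational data: show the chat-template-formatted text.
        formatted = tokenizer.apply_chat_template(batch[0], tokenize=False)
        logger.info("Formatted example: %s", formatted)
    if "input_ids" in collated:
        # Always recoverable: decode the already-tokenized input ids.
        logger.info("Decoded input ids: %s", tokenizer.decode(collated["input_ids"][0]))
    if "pixel_values" in collated:
        # Vision+text collator: log only the image tensor shape.
        logger.info("pixel_values shape: %s", tuple(collated["pixel_values"].shape))
```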