Add tokenized example debugging during training #1520

Open

vishwamartur wants to merge 1 commit into main
Conversation

vishwamartur (Contributor)

Related to #1369

Add functionality to log tokenized examples for debugging during training.

  • Add a log_tokenized_example function in src/oumi/builders/collators.py to log the raw example, the formatted example, the tokenized example, and the model input (a sketch follows this list).
  • Modify build_data_collator in src/oumi/builders/collators.py to accept a debug parameter and pass it to the collators.
  • Update build_collator_from_config in src/oumi/builders/collators.py to pass the debug parameter from the config to build_data_collator.
  • Add a new command-line option --debug-tokenized-example in src/oumi/cli/train.py to enable logging of tokenized examples during training.
  • Pass the debug flag to the training configuration in src/oumi/cli/train.py.
  • Modify TextCollatorWithPadding in src/oumi/core/collators/text_collator_with_padding.py to accept a debug parameter and call log_tokenized_example in the __call__ method if debug is set to True.
  • Modify TextCompletionsCollatorWithPadding in src/oumi/core/collators/text_completions_collator_with_padding.py to accept a debug parameter and call log_tokenized_example in the __call__ method if debug is set to True.
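
For illustration, here is a minimal sketch of what such a logging helper could look like. This is an assumption, not the PR's actual code: the signature, the logger, and the exact fields logged may all differ.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)  # assumed; the real code would use oumi's project logger


def log_tokenized_example(raw_example: Any, tokenizer, model_input: dict[str, Any]) -> None:
    """Log one example at each preprocessing stage (hypothetical signature)."""
    logger.info("Raw example: %s", raw_example)
    input_ids = model_input["input_ids"]
    logger.info("Token ids: %s", input_ids)
    # Decoding the ids back to text shows exactly what the model will see.
    logger.info("Decoded text: %s", tokenizer.decode(input_ids))
    logger.info("Model input keys: %s", sorted(model_input))
```

With the new flag, enabling this during training would then presumably look like `oumi train -c config.yaml --debug-tokenized-example` (the config path here is a placeholder).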

nikg4 requested review from taenin, oelachqar, nikg4 and wizeng23 on March 7, 2025 at 22:32
wizeng23 requested review from optas and jgreer013 on March 7, 2025 at 23:04
wizeng23 (Contributor) commented Mar 7, 2025

Looping in Jeremy, who logged the issue, and Panos, who's the current on-call.

nikg4 (Collaborator) commented Mar 7, 2025

@vishwamartur Please auto-format your changes: https://github.com/oumi-ai/oumi/blob/main/CONTRIBUTING.md#pull-request-pr-guidelines
It's a precondition for presubmit.

```diff
@@ -223,6 +227,15 @@ def __call__(self, batch) -> dict[str, Any]:
         if labels_on:
             combined_batch[_LABELS_KEY] = collated_text_inputs[_LABELS_KEY]
 
+        if self._debug:
```

Contributor: More of a design concern: having a debug example logged in each batch seems excessive.

Contributor: +1, let's only log this for the first sample of the first batch.

```diff
@@ -223,6 +227,15 @@ def __call__(self, batch) -> dict[str, Any]:
         if labels_on:
             combined_batch[_LABELS_KEY] = collated_text_inputs[_LABELS_KEY]
 
+        if self._debug:
+            raw_example = batch[0]
+            formatted_example = tokenizer.apply_chat_template(raw_example, tokenize=False)
```

Contributor: Is it obvious at this point that the tokenizer has a chat_template under all use cases?

Contributor: +1, there's no guarantee this collator has a chat template.

Also, we've already tokenized the example, so a better choice would be to simply take the input ids of the combined_batch's first element and tokenizer.decode the input ids into their respective strings.

It would also be useful to log all the elements of the first batch (i.e., the masks and labels as well). We don't necessarily need to decode the labels or mask though, just the input ids.
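
A rough sketch of that suggestion inside the collator's __call__ (an assumption, not code from this PR: the one-shot `self._example_logged` flag, the module-level `logger`, and the batch keys are all illustrative):

```python
# After combined_batch has been assembled in __call__:
if self._debug and not self._example_logged:
    # First element of the already-collated batch; no chat template needed.
    first_input_ids = combined_batch["input_ids"][0].tolist()
    logger.info("input_ids: %s", first_input_ids)
    # Decode only the input ids back to text.
    logger.info("decoded: %s", self._tokenizer.decode(first_input_ids))
    # Log masks and labels as-is, without decoding them.
    for key in ("attention_mask", "labels"):
        if key in combined_batch:
            logger.info("%s: %s", key, combined_batch[key][0].tolist())
    self._example_logged = True  # only log the first batch
```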

```diff
@@ -64,4 +69,13 @@ def __call__(self, batch) -> dict[str, Any]:
         # Collate batch prompts.
         collated_text_inputs = self._collate(batch)
 
+        if self._debug:
+            raw_example = batch[0]
+            formatted_example = self._tokenizer.apply_chat_template(raw_example, tokenize=False)
```

Contributor: Ditto.

Contributor: +1

optas (Contributor) left a comment

Hi @vishwamartur! Thank you very much for your contribution! I left some quick feedback. Obviously, the code also needs fixing, at a minimum to pass the currently failing checks.

More broadly (+ @jgreer013, who opened the issue):

  1. I think logging an example within each formed batch is excessive. What do you think?
  2. It would be great to create a slightly more generic function wrapper that is called once and acts based on the collator/tokenizer (e.g., does it include a chat template? is it vision+text? ...) when the data is being prepared; see the sketch after this list.
    - Minor: could the explicit debug variable (in each/any collator) be used for more debugging purposes in the future? If not, and if it is kept, perhaps make the name more explicit about the action (log_example...).
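
One possible shape for such a wrapper, purely as an illustration (the name, the chat-template check, and the logger are assumptions, not part of this PR):

```python
import logging

logger = logging.getLogger(__name__)


def log_debug_example(tokenizer, batch) -> None:
    """Log one debugging example, adapting to what the tokenizer supports (illustrative)."""
    raw_example = batch[0]
    logger.info("Raw example: %s", raw_example)
    # Only format via the chat template when the tokenizer actually defines one.
    if getattr(tokenizer, "chat_template", None):
        formatted = tokenizer.apply_chat_template(raw_example, tokenize=False)
        logger.info("Formatted example: %s", formatted)
```

Each collator could then call this once on its first batch, instead of duplicating template-specific logic in every __call__.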
