
Add text_col param that's required for SFTTrainer #66

Merged (2 commits, Jun 7, 2024)
Conversation

@wizeng23 (Contributor) commented Jun 7, 2024

  • Replaced the usage of data.trainer_kwargs["dataset_text_field"] with data.text_col, since it's a required field for SFTTrainer.
  • Changed the default TrainerType to HF. IMO this is the more logical default; more practically, making text_col a required field for SFTTrainer means a bare TrainingConfig() no longer works in tests.
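The validation described above might look roughly like this. This is a minimal sketch, not the repository's actual code; the class names, enum values, and error message are illustrative:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TrainerType(Enum):
    HF = "hf"            # plain HuggingFace Trainer (the new default)
    TRL_SFT = "trl_sft"  # TRL's SFTTrainer


@dataclass
class DataParams:
    # Name of the dataset column containing the training text.
    # SFTTrainer needs this (it maps to `dataset_text_field`).
    text_col: Optional[str] = None


@dataclass
class TrainingConfig:
    data: DataParams = field(default_factory=DataParams)
    trainer_type: TrainerType = TrainerType.HF

    def __post_init__(self):
        # Fail fast: SFTTrainer cannot run without a text column.
        if self.trainer_type == TrainerType.TRL_SFT and self.data.text_col is None:
            raise ValueError("`text_col` is required when using SFTTrainer.")


# A bare TrainingConfig() still works with the HF default,
# but the SFT variant raises without text_col.
TrainingConfig()
```

With HF as the default, tests can keep constructing `TrainingConfig()` with no arguments, while SFT configs are validated at construction time rather than failing deep inside training.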

@wizeng23 wizeng23 requested review from optas, oelachqar and nikg4 June 7, 2024 07:30
@wizeng23 (Contributor, Author) commented Jun 7, 2024

I feel like my use of `__post_init__` to verify the dataclass values could be a code smell (since there's now a decent amount of logic and it no longer feels like a struct). Happy to take suggestions if others feel the same way!

@oelachqar (Contributor) replied:

> I feel like my use of `__post_init__` to verify the dataclass values could be a code smell (since there's now a decent amount of logic and it no longer feels like a struct). Happy to take suggestions if others feel the same way!

It's not too bad -- dataclasses are more than simple structs, and `__post_init__` is intended to catch user errors early, so what you have is great IMO.

"\n",
"This only needs to be done once!"
"This only needs to be done once! However, you have to wait for the token request to be approved for this to work."
Contributor:

nit: for some reason I don't get any notifications when someone requests a token; could you add a note about Slacking me or Nikolai to have it approved quickly?

"dataset_text_field"
)
if (
existing_dataset_text_field is not None
Collaborator:

nit: consider adding parentheses around the sub-expressions here, so readers don't have to think about `and` vs `not` operator precedence.

Contributor (Author):

Done, though it ends up looking a bit weird due to the formatter.
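The precedence concern behind this nit can be shown with a generic condition (not the actual expression from the diff): `not` binds tighter than `and`, so the parentheses don't change the result, they just make the grouping explicit for readers:

```python
existing = "prompt"  # illustrative values, not from the PR
override = None

# Without parens, readers must recall that `not` binds before `and`.
implicit = existing is not None and not override

# With parens, the grouping is explicit; behavior is identical.
explicit = (existing is not None) and (not override)

assert implicit == explicit
```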

@@ -58,3 +62,32 @@ def test_custom_train():
)

train(config)


def test_pack_train():
Collaborator:

To keep pre-commit costs under control, we should probably mark this test optional.

IIUC, this can be done with a custom pytest marker (pytest.mark.foo). For example, if you add @pytest.mark.e2e, the test will not run by default, only when you pass the e2e marker to pytest.
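One common way to wire up the marker workflow suggested above is a conftest.py hook that skips marked tests unless an opt-in flag is passed. This is a sketch: the `e2e` marker name comes from the comment, but the `--run-e2e` flag and this particular conftest approach are assumptions, not this repository's setup:

```python
# conftest.py -- register an `e2e` marker and skip those tests
# unless pytest is invoked with `--run-e2e` (flag name illustrative).
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-e2e",
        action="store_true",
        default=False,
        help="run slow end-to-end tests",
    )


def pytest_configure(config):
    # Register the marker so `pytest --strict-markers` doesn't complain.
    config.addinivalue_line("markers", "e2e: slow end-to-end test")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-e2e"):
        return  # opt-in flag given: run everything
    skip_e2e = pytest.mark.skip(reason="needs --run-e2e to run")
    for item in items:
        if "e2e" in item.keywords:
            item.add_marker(skip_e2e)


# In a test module:
# @pytest.mark.e2e
# def test_pack_train():
#     ...
```

Alternatively, marked tests can be excluded via `addopts = -m "not e2e"` in pytest.ini and re-enabled with `pytest -m e2e`, which is simpler but inverts the selection rather than skipping.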

Collaborator:

Then we'll have to figure out how to run e2e tests periodically (outside of the pre-commit flow).

Contributor (Author):

Marking it as skip for now since it's way too slow. After I resolve #62, I'll change it to e2e.

@wizeng23 wizeng23 linked an issue Jun 7, 2024 that may be closed by this pull request
@wizeng23 wizeng23 merged commit 03ce26e into main Jun 7, 2024
1 check passed
@wizeng23 wizeng23 deleted the wizeng/pretrain branch June 7, 2024 18:53
Development

Successfully merging this pull request may close these issues.

trainer_kwargs: avoid listing in config.data