Support HuggingFaceM4/Docmatix dataset #1342

Merged (5 commits) on Feb 4, 2025

@vishwamartur (Contributor) commented Feb 2, 2025

Related to #915

Add support for the HuggingFaceM4/Docmatix dataset for Document Visual Question Answering.

* **New Dataset Class**
  - Add `DocmatixDataset` class in `src/oumi/datasets/vision_language/docmatix.py` to handle the HuggingFaceM4/Docmatix dataset.
  - Implement `transform_conversation` method to convert raw data into a `Conversation` object.
  - Register the dataset using the `register_dataset` decorator.

* **Configuration Updates**
  - Update `configs/recipes/vision/llava_7b/sft/train.yaml` to include the HuggingFaceM4/Docmatix dataset under `data.train.datasets`.
  - Update `configs/recipes/vision/phi3/sft/train.yaml` to include the HuggingFaceM4/Docmatix dataset under `data.train.datasets`.
  - Update `configs/recipes/vision/qwen2_vl_2b/sft/train.yaml` to include the HuggingFaceM4/Docmatix dataset under `data.train.datasets`.
  - Update `configs/recipes/vision/smolvlm/sft/train.yaml` to include the HuggingFaceM4/Docmatix dataset under `data.train.datasets`.

Towards OPE-747

@taenin requested review from nikg4 and optas on February 2, 2025 15:47
@nikg4 added the `good first issue` label on Feb 2, 2025
nikg4 and others added 4 commits February 3, 2025 18:09
* **configs/recipes/vision/llava_7b/sft/train.yaml**
  - Add HuggingFaceM4/Docmatix dataset under `data.train.datasets` and comment it out

* **configs/recipes/vision/phi3/sft/train.yaml**
  - Add HuggingFaceM4/Docmatix dataset under `data.train.datasets` and comment it out

* **configs/recipes/vision/qwen2_vl_2b/sft/train.yaml**
  - Add HuggingFaceM4/Docmatix dataset under `data.train.datasets` and comment it out

* **configs/recipes/vision/smolvlm/sft/train.yaml**
  - Add HuggingFaceM4/Docmatix dataset under `data.train.datasets` and comment it out
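
Because the recipe entries are commented out, enabling Docmatix amounts to adding one item to the `data.train.datasets` list. The sketch below models that on a config that has already been loaded into a plain dictionary. The `enable_dataset` helper is hypothetical, and the entry's field names (`dataset_name`, `split`) and the `train` split value are assumptions, so check the recipe files for the exact keys.

```python
import copy

# Assumed entry shape; the keys in the actual recipe files may differ.
DOCMATIX_ENTRY = {"dataset_name": "HuggingFaceM4/Docmatix", "split": "train"}


def enable_dataset(config: dict, entry: dict) -> dict:
    """Return a copy of `config` with `entry` listed under data.train.datasets."""
    updated = copy.deepcopy(config)
    datasets = (
        updated.setdefault("data", {})
        .setdefault("train", {})
        .setdefault("datasets", [])
    )
    # Skip the append if the dataset is already listed, so repeated calls are safe.
    if all(d.get("dataset_name") != entry["dataset_name"] for d in datasets):
        datasets.append(dict(entry))
    return updated
```

The helper copies the config instead of mutating it, so the original recipe dictionary can still be used for a run without Docmatix.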
@nikg4 (Collaborator) commented Feb 4, 2025

Thank you, @vishwamartur!

@nikg4 merged commit 349e47b into oumi-ai:main on Feb 4, 2025
2 checks passed