[references] Recognition - Allow built-in datasets usage #1904

sarjil77 · 2025-03-22T20:37:16Z

We are now able to use the built-in datasets, and it is working fine.

sarjil77 · 2025-03-22T20:44:17Z

@felixdittrich92, I am able to use the built-in datasets here, but I am also facing one issue:

The following error occurs with some of the datasets. It works fine for SVHN, but datasets whose labels contain spaces between words fail:

ValueError: some characters cannot be found in 'vocab'.                          Please check the input string ACUTE CADCOVASCULAR and the vocabulary 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~°£€¥¢฿àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ

The error is caused by the input string ACUTE CADCOVASCULAR: the space between the two words is not in the vocab. When I tried to add the space to the vocab, I got a different error saying that:
the vocabulary size in your model does not match the pre-trained checkpoint.

Please have a look at this. :)
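
For illustration, here is a minimal sketch of how a label containing a space fails a vocab lookup. It assumes a plain character-to-index encoding; `encode_label` and the shortened `VOCAB` are hypothetical and are not docTR's actual implementation.

```python
import string

# Hypothetical, shortened vocab without a whitespace character.
VOCAB = string.digits + string.ascii_letters + string.punctuation


def encode_label(label: str, vocab: str) -> list[int]:
    """Map each character to its index in vocab (illustrative helper only)."""
    missing = sorted({ch for ch in label if ch not in vocab})
    if missing:
        raise ValueError(f"characters {missing!r} cannot be found in 'vocab'")
    return [vocab.index(ch) for ch in label]


# A single-word label encodes without problems.
single_word = encode_label("ACUTE", VOCAB)

# A word-group label fails because the space is not part of VOCAB.
try:
    encode_label("ACUTE CADCOVASCULAR", VOCAB)
    failed = False
except ValueError:
    failed = True

# Appending a space to the vocab makes the same label encodable,
# but it also makes the vocab one character longer.
with_space = encode_label("ACUTE CADCOVASCULAR", VOCAB + " ")
```

Extending the vocab removes the lookup error, at the cost of changing the vocab length.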

codecov · 2025-03-22T21:19:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.55%. Comparing base (18b8db9) to head (f4e9d0a).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1904      +/-   ##
==========================================
- Coverage   96.65%   96.55%   -0.11%     
==========================================
  Files         166      166              
  Lines        7991     8000       +9     
==========================================
  Hits         7724     7724              
- Misses        267      276       +9

Flag	Coverage Δ
unittests	`96.55% <100.00%> (-0.11%)`	⬇️

felixdittrich92 · 2025-03-22T22:27:47Z

Thanks, I will have a look on Monday 👍🏼
I think we should filter the possible choices here, because some datasets like SROIE contain partial word group annotations (including whitespaces) instead of single word annotations.
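
One way to decide which datasets to leave out of the choices is to check whether every label fits the vocab. The sketch below is only an assumption about how such a check could look; `split_by_vocab` and the sample labels are made up for illustration and are not part of this PR.

```python
def split_by_vocab(labels, vocab):
    """Return (kept, dropped): labels fully covered by vocab, and the rest."""
    allowed = set(vocab)
    kept, dropped = [], []
    for label in labels:
        # A label is usable only if all of its characters are in the vocab.
        (kept if set(label) <= allowed else dropped).append(label)
    return kept, dropped


# Hypothetical vocab without whitespace and made-up sample labels.
vocab = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
labels = ["ACUTE", "ACUTE CADCOVASCULAR", "2025"]

kept, dropped = split_by_vocab(labels, vocab)
# A dataset with any dropped labels would not be offered for this vocab.
dataset_is_compatible = not dropped
```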

sarjil77 · 2025-03-23T06:48:49Z

Yes, that sounds good.
The issue is only caused by partial word group annotations (including whitespaces).

Alternatively, we can include the whitespace in the vocab.

Let me know if any further changes are required here.
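
A rough sketch of why adding the whitespace to the vocab conflicts with a pre-trained checkpoint. It assumes the output layer has one unit per vocab character plus a fixed number of extra tokens; `head_size`, `check_checkpoint`, and the tiny vocab are hypothetical and do not mirror docTR's model code.

```python
def head_size(vocab: str, extra_tokens: int = 1) -> int:
    """Number of output units, assuming one per character plus extra tokens."""
    return len(vocab) + extra_tokens


def check_checkpoint(vocab: str, checkpoint_head_size: int) -> None:
    """Raise if the model head for this vocab would not match the checkpoint."""
    expected = head_size(vocab)
    if expected != checkpoint_head_size:
        raise ValueError(
            f"vocabulary size mismatch: model needs {expected} outputs, "
            f"checkpoint has {checkpoint_head_size}"
        )


# Hypothetical tiny vocab the checkpoint was trained with.
pretrained_vocab = "abc"
checkpoint_size = head_size(pretrained_vocab)

check_checkpoint(pretrained_vocab, checkpoint_size)  # same vocab: passes

try:
    check_checkpoint(pretrained_vocab + " ", checkpoint_size)
    mismatch = False
except ValueError:
    mismatch = True  # one extra character means one extra output unit
```

Under this assumption, a vocab with the added space can only be used when training from scratch or with a checkpoint trained on the same vocab.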

felixdittrich92 · 2025-03-24T09:10:13Z

As discussed, some updates are required :)

.github/workflows/docs.yml

doctr/__init__.py

doctr/datasets/synthtext.py

references/recognition/train_pytorch.py

felixdittrich92 · 2025-03-26T11:28:45Z

Related to: #1830

felixdittrich92

One last thing 👍

doctr/__init__.py

felixdittrich92

Thanks 🤗 Looks good now 👍

felixdittrich92 requested changes Mar 25, 2025

felixdittrich92 self-assigned this Mar 26, 2025

felixdittrich92 added this to the 0.12.0 milestone Mar 26, 2025

felixdittrich92 changed the title ~~fixed the #1830 able to use builtin datsset~~ [references] Recognition - Allow built-in datasets usage Mar 26, 2025

Juneja Sarjil and others added 7 commits March 26, 2025 17:18

i trying with doctr

22b2260

mindee#1830 fixed: able to use the builtin dataset

f428cbb

fixed some typos

a59dd35

added the changes for whitespace in multiple files

fdc06d7

adding files with style checked

b82d7c4

further changes

1f896e1

rebase

5d4e489

sarjil77 force-pushed the feature/add-svnh-support branch from 8e4b385 to 5d4e489 Compare March 26, 2025 11:49

felixdittrich92 requested changes Mar 26, 2025

doctr/__init__.py (outdated)

deleting last empty line from version

93dee95

felixdittrich92 mentioned this pull request Mar 26, 2025

Cannot use builtin datasets for detection training #1830

Open

restoring last empty line

f4e9d0a

felixdittrich92 approved these changes Mar 26, 2025

felixdittrich92 merged commit fffcccc into mindee:main Mar 26, 2025
67 of 70 checks passed
