oomol-lab/pdf-craft

pdf-craft

Introduction

PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books. The project has just started; if you encounter any problems or have any suggestions, please open an issue.

About PDF craft

This project reads PDF pages one by one and uses DocLayout-YOLO, combined with an algorithm I wrote, to extract the text from book pages and filter out elements such as headers, footers, footnotes, and page numbers. Where text runs across a page break, the algorithm joins the end of one page to the start of the next, so that the generated text is semantically coherent. Text on the pages is recognized with OnnxOCR, and layoutreader determines a reading order that matches human reading habits.
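To make the cross-page step more concrete, here is a minimal sketch of one possible joining heuristic. It is not the algorithm this project uses; the function name `join_pages` and the punctuation rule are assumptions chosen purely for illustration.

```python
# Illustrative heuristic only, NOT pdf-craft's actual algorithm.
# A paragraph that does not end with one of these is assumed to continue
# on the next page.
SENTENCE_END = (".", "!", "?", "。", "！", "？", '"', "”")

def join_pages(pages: list[list[str]]) -> list[str]:
    """Flatten per-page paragraph lists, merging a paragraph that was
    split by a page break with its continuation on the next page."""
    paragraphs: list[str] = []
    for page in pages:
        first_on_page = True
        for text in page:
            text = text.strip()
            if not text:
                continue  # ignore empty blocks
            if (
                first_on_page
                and paragraphs
                and not paragraphs[-1].endswith(SENTENCE_END)
            ):
                # The previous page ended mid-sentence: append the continuation.
                paragraphs[-1] += " " + text
            else:
                paragraphs.append(text)
            first_on_page = False
    return paragraphs
```

A real implementation would also need layout information (for example, indentation or hyphenated line endings), which this sketch ignores.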

These AI models all run locally (and can be accelerated by a local graphics device), and they are enough on their own to convert PDF files to Markdown. This is suitable for papers or small books.

However, if you want to parse books (generally more than 100 pages), converting them to EPUB is recommended. During the conversion, this library passes the data recognized by the local OCR to an LLM, builds the structure of the book from specific information (such as the table of contents), and finally generates an EPUB file with a table of contents and chapters. During this parsing and construction, the LLM also reads the annotations and citations on each page so that they can be presented in the new format in the EPUB file. In addition, the LLM can correct OCR errors to a certain extent. This step cannot be performed entirely locally: you need to configure an LLM service. DeepSeek is recommended; the prompts in this library were tested against the V3 model.

Installation

You need Python 3.10 or above (3.10.16 is recommended).

pip install pdf-craft

Using CUDA

If you want to use GPU acceleration, make sure a working CUDA environment is set up on your device. Please refer to the PyTorch installation instructions and choose the install command that matches your operating system.
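As a small sketch of choosing the device string at runtime: the helper below (the name `pick_device` is hypothetical and not part of pdf-craft) returns "cuda:0" when PyTorch reports a usable CUDA device and falls back to "cpu" otherwise, including when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda:0" if PyTorch can see a CUDA device, else "cpu"."""
    try:
        import torch  # only needed for the CUDA availability check
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

print(pick_device())
```

The returned value can then be passed as the `device` argument when constructing a PDFPageExtractor.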

Function

Convert PDF to MarkDown

This operation does not call a remote LLM and can be completed with local computing power (CPU or graphics card). The required models are downloaded automatically the first time they are used. Illustrations, tables, and formulas in the document are inserted into the Markdown file as screenshots.

from pdf_craft import PDFPageExtractor, MarkDownWriter

markdown_path = "/path/to/output.md" # Placeholder: where the Markdown file will be written

extractor = PDFPageExtractor(
  device="cpu", # To use CUDA, change this to the device="cuda:0" form.
  model_dir_path="/path/to/model/dir/path", # Folder where the AI models are downloaded and installed
)
with MarkDownWriter(markdown_path, "images", "utf-8") as md:
  for block in extractor.extract(pdf="/path/to/pdf/file"):
    md.write(block)

After execution completes, a *.md file is generated at the specified path. If the original PDF contains illustrations (or tables or formulas), an assets directory is created next to the *.md file to hold the images. The Markdown file references the images in the assets directory by relative path.
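To check which image files a generated Markdown file refers to, a standard-library sketch like the following can help. The function name `image_references` is an assumption for illustration. It only matches the common `![alt](path)` image syntax and makes no claim about the exact output pdf-craft writes.

```python
import re

# Matches inline Markdown images: ![alt text](path "optional title")
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)')

def image_references(markdown_text: str) -> list[str]:
    """Return the target paths of all inline images in a Markdown string."""
    return IMAGE_PATTERN.findall(markdown_text)
```

Reading the *.md file and passing its text to this function gives a list of relative paths that can be compared against the files on disk.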

Convert PDF to EPUB

The first half of this operation is the same as Convert PDF to MarkDown (see the previous section): OCR is used to scan and recognize the text in the PDF. You therefore need to build a PDFPageExtractor object first.

from pdf_craft import PDFPageExtractor

extractor = PDFPageExtractor(
  device="cpu", # To use CUDA, change this to the device="cuda:0" form.
  model_dir_path="/path/to/model/dir/path", # Folder where the AI models are downloaded and installed
)

Next, configure the LLM object. DeepSeek is recommended; the prompts in this library were tested against the V3 model.

from pdf_craft import LLM

llm = LLM(
  key="sk-XXXXX", # key provided by LLM vendor
  url="https://api.deepseek.com", # URL provided by LLM vendor
  model="deepseek-chat", # model provided by LLM vendor
  token_encoding="o200k_base", # encoding used locally to estimate token counts (unrelated to the LLM itself; if unsure, keep "o200k_base")
)
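Hard-coding an API key in source code makes it easy to leak. Below is a minimal sketch of reading it from the environment instead. The environment variable names (`DEEPSEEK_API_KEY`, `LLM_URL`, `LLM_MODEL`) and the helper `llm_settings` are assumptions for illustration; the keyword names simply mirror the LLM(...) call above.

```python
import os

def llm_settings(env=os.environ) -> dict[str, str]:
    """Collect keyword arguments for LLM(...) from environment variables."""
    key = env.get("DEEPSEEK_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("Set DEEPSEEK_API_KEY before running the conversion")
    return {
        "key": key,
        "url": env.get("LLM_URL", "https://api.deepseek.com"),
        "model": env.get("LLM_MODEL", "deepseek-chat"),
        "token_encoding": "o200k_base",
    }

# Usage (requires pdf-craft to be installed):
# from pdf_craft import LLM
# llm = LLM(**llm_settings())
```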

After the above two objects are prepared, you can start scanning and analyzing PDF books.

from pdf_craft import analyse

analyse(
  llm=llm, # LLM configuration prepared in the previous step
  pdf_page_extractor=extractor, # PDFPageExtractor object prepared earlier
  pdf_path="/path/to/pdf/file", # PDF file path
  analysing_dir_path="/path/to/analysing/dir", # analysing directory path
  output_dir_path="/path/to/output/files", # The analysis results will be written to this directory
)

Note the two directory paths in the code above. The first is output_dir_path, the folder where the scan and analysis results (multiple files) are saved. The path should point to an empty directory; if it does not exist, it will be created automatically.
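Since the output directory is expected to be empty, it can be worth checking that before starting a long analysis. A minimal sketch, using a hypothetical helper `ensure_empty_dir` that is not part of pdf-craft:

```python
from pathlib import Path

def ensure_empty_dir(path: str) -> Path:
    """Create the directory if needed and fail if it already has content."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    if any(directory.iterdir()):
        raise FileExistsError(f"{directory} is not empty")
    return directory
```

Failing early here avoids mixing the results of a new run with files left over from an earlier one.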

The second is analysing_dir_path, which stores intermediate state during the analysis. The path should point to a directory; if it does not exist, it will be created automatically. Once scanning and analysis finish successfully, this directory and its files are no longer needed, and you can delete them (from code, if you like). Because the directory records the analysis progress, an interrupted analysis can be resumed: set analysing_dir_path to the analysing folder left by the interrupted run, and the analysis continues from the point where it stopped. Conversely, if you want to start a new task, manually delete or empty the analysing_dir_path directory first, to avoid accidentally triggering this recovery behavior.
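Clearing the analysing directory before a new task can be done with the standard library. The helper name `reset_analysing_dir` below is hypothetical; skip this step when you want to resume an interrupted run.

```python
import shutil
from pathlib import Path

def reset_analysing_dir(path: str) -> Path:
    """Delete any saved progress so the next run starts from scratch."""
    directory = Path(path)
    if directory.exists():
        shutil.rmtree(directory)  # removes the directory and everything in it
    directory.mkdir(parents=True)
    return directory
```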

After the analysis is completed, pass the output_dir_path to the following code as a parameter to finally generate the EPUB file.

from pdf_craft import generate_epub_file

generate_epub_file(
  from_dir_path="/path/to/output/files", # the output_dir_path used in the previous step
  epub_file_path="/path/to/output/epub", # generated EPUB file save path
)

This step divides the EPUB into chapters according to the book structure analyzed earlier and generates a matching table of contents. In addition, the annotations and citations that appeared at the bottom of the book's pages are presented in the EPUB in an appropriate way.
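For a quick sanity check of the result, the sketch below tests the basics of the EPUB container format: an EPUB is a ZIP archive containing a `mimetype` entry that reads `application/epub+zip`. The helper name `looks_like_epub` is hypothetical, and the check assumes the generated file follows the standard EPUB container layout. It does not validate the book's contents.

```python
import zipfile

def looks_like_epub(path: str) -> bool:
    """Check that the file is a ZIP whose `mimetype` entry marks it as EPUB."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        try:
            mimetype = archive.read("mimetype")
        except KeyError:  # the archive has no `mimetype` entry
            return False
    return mimetype.strip() == b"application/epub+zip"
```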

Acknowledgements
