Amanuensis 2.0: TEI XML Abbreviation Expansion Tool

Amanuensis, noun. /əˌmænjuˈensɪs/. Early 17th cent.: Latin, from (servus) a manu '(slave) at hand(writing), secretary' and -ensis 'belonging to'.

a person who writes down your words when you cannot write.
a literary assistant, especially one who writes or types for somebody or copies text.

Amanuensis 2.0: New Version

Amanuensis 2.0 is a significant upgrade from the original version, focusing specifically on TEI XML processing for early modern abbreviations. This new version provides a modern, modular architecture with enhanced capabilities for working with structured documents.

New Features in Version 2.0

XML-Native Processing: Works directly with XML nodes without extracting to plain text, preserving the structure and relationships between elements
TEI-Aware Handling: Special handling for TEI XML abbreviation structures including <abbr>, <g>, and <am> elements
TEI XML Processing: Parse TEI XML documents containing early modern abbreviations
Smart Suggestion System: Combine dictionary lookups, pattern matching, WordNet, and language models for better expansions
Interactive Interface: User-friendly command-line interface for reviewing and selecting expansions
Dataset Collection Mode: Preserve original documents while building training datasets from user selections
Dataset Creation: Build structured datasets for training language models on abbreviation expansion
Comprehensive Test Suite: Extensive testing framework to ensure reliability
Modern Architecture: Modular, maintainable code structure
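
As a rough illustration of how several suggestion sources could be merged into one ranked list, here is a minimal sketch. It is not the Amanuensis implementation: the `suggest` function, the sample dictionary entries, the patterns, and the source weights are all assumptions made for this example, and the WordNet and language-model sources are left out to keep it dependency-free.

```python
import re
from collections import defaultdict

# Hypothetical data: none of these entries, patterns, or weights come from Amanuensis.
DICTIONARY = {"wch": ["which"], "co\u0304mon": ["common"]}
PATTERNS = [
    (re.compile("\u0304"), "m"),  # a combining macron may mark an omitted m ...
    (re.compile("\u0304"), "n"),  # ... or an omitted n
]
SOURCE_WEIGHTS = {"dictionary": 1.0, "pattern": 0.6}


def suggest(abbreviation):
    """Return candidate expansions, best first, scored by summed source weight."""
    scores = defaultdict(float)
    # Source 1: exact dictionary lookup.
    for candidate in DICTIONARY.get(abbreviation, []):
        scores[candidate] += SOURCE_WEIGHTS["dictionary"]
    # Source 2: pattern-based rewriting of the abbreviation marker.
    for pattern, replacement in PATTERNS:
        if pattern.search(abbreviation):
            candidate = pattern.sub(replacement, abbreviation)
            scores[candidate] += SOURCE_WEIGHTS["pattern"]
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))


print(suggest("co\u0304mon"))
print(suggest("wch"))
```

A candidate proposed by more than one source accumulates weight, so agreement between sources pushes it to the top of the list shown to the user.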

XML-Native Processing Approach

In version 2.0, we've completely redesigned how TEI documents are processed to preserve structural information:

Direct XML Manipulation: Work directly with XML nodes instead of extracting to plain text
Node Relationships: Maintain parent-child relationships between elements
Structure Preservation: Handle complex TEI structures like <choice>, <abbr>, <expan>, <am>, <ex> properly
Special Element Support: Properly handle special elements like <g ref="char:cmbAbbrStroke"> for macrons and other early modern abbreviation markers
XPath Navigation: Use XPath for precise element location rather than string searching
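
The points above can be sketched in a few lines. This example is an assumption-laden illustration, not the project's code: it uses the standard library's `xml.etree.ElementTree` (which supports a subset of XPath) instead of lxml so that it runs without extra packages, and the sample document and the helper names `find_abbreviations` and `wrap_in_choice` are invented for the example.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Invented sample; &#x304; is the combining macron carried by the <g> element.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p>the <abbr>co<g ref="char:cmbAbbrStroke">&#x304;</g>mon</abbr>'
    ' good, <abbr>wch</abbr> we seek</p>'
    '</body></text></TEI>'
)


def find_abbreviations(root):
    """Return (element, surface text, has_macron) for each <abbr> node."""
    found = []
    for abbr in root.iterfind(".//tei:abbr", NS):
        surface = "".join(abbr.itertext())  # text of the node and its children
        macron = abbr.find("tei:g[@ref='char:cmbAbbrStroke']", NS)
        found.append((abbr, surface, macron is not None))
    return found


def wrap_in_choice(root, abbr, expansion):
    """Replace <abbr> in place with <choice><abbr/><expan/></choice>."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    parent = parents[abbr]
    index = list(parent).index(abbr)
    choice = ET.Element(f"{{{TEI}}}choice")
    # The text following the abbreviation must now follow <choice>.
    choice.tail, abbr.tail = abbr.tail, None
    parent.remove(abbr)
    parent.insert(index, choice)
    choice.append(abbr)
    expan = ET.SubElement(choice, f"{{{TEI}}}expan")
    expan.text = expansion
    return choice


root = ET.fromstring(SAMPLE)
for abbr, surface, has_macron in find_abbreviations(root):
    print(repr(surface), has_macron)
```

Because the abbreviation node is moved rather than rewritten as a string, its child `<g>` element and the surrounding text survive the edit unchanged.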

Original Features

Amanuensis is an application designed to accelerate normalization tasks in large historical corpora. It increases legibility by expanding abbreviations and replacing Unicode characters in a systematic and context-sensitive way. This type of pre-processing is instrumental to subsequent digital analyses and manipulations.

Unicode Character Replacement: A powerful conversion tool to clean up text by removing and/or replacing undesirable characters.
Dynamic Word Normalization: Expanding abbreviated words using Natural Language Processing, human inputs, and Large Language Models.
Comprehensive Logging: Every modification is tracked and stored in accessible JSON files, enabling further statistical analysis.
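
To make character replacement with logging concrete, here is a minimal sketch. The replacement table, the `replace_characters` helper, and the fields of each log record are assumptions for illustration; the project's actual mappings and log format may differ.

```python
import json

# Hypothetical replacement table; the project's real mappings are not shown here.
REPLACEMENTS = {"\u017f": "s", "\ua75b": "r"}  # long s, r rotunda


def replace_characters(text, table=REPLACEMENTS):
    """Return the cleaned text plus one log record per replaced character."""
    cleaned, log = [], []
    for position, char in enumerate(text):
        new = table.get(char, char)
        if new != char:
            # Record what changed and where, so the edit can be audited later.
            log.append({"position": position, "original": char, "replacement": new})
        cleaned.append(new)
    return "".join(cleaned), log


cleaned, log = replace_characters("ble\u017f\u017fed")
print(cleaned)
# ensure_ascii=False keeps the original characters readable in the JSON output.
print(json.dumps(log, ensure_ascii=False, indent=2))
```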

Installation

Prerequisites

Python 3.10 or higher
Required packages (install with pip):
- toml
- lxml
- rich
- nltk

Setup

Clone the repository:

```
git clone https://github.com/yourusername/amanuensis.git
cd amanuensis
```

Install dependencies:
```
pip install -r requirements.txt
```

Download NLTK data:

```
python -c "import nltk; nltk.download('wordnet')"
```

(Optional) Set up OpenAI API for enhanced suggestions:
```
export OPENAI_API_KEY=your_api_key_here
```

Usage

Using Amanuensis 2.0

```
python amanuensis.py
```

This will launch the interactive interface for the new version.

Command Line Options

```
python amanuensis.py --help
```

Options:

--config, -c: Path to configuration file (default: config.toml)
--input, -i: Input directory containing TEI XML files
--output, -o: Output directory for processed files
--quiet, -q: Run in non-interactive mode
--verbose, -v: Enable verbose logging
--process, -p: Process a specific TEI XML file
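
For readers who want to see how these options fit together, the following is a sketch of an equivalent `argparse` definition. The flags and the `config.toml` default come from the list above; the `build_parser` function, the help strings, and the choice of `store_true` for the two switches are assumptions, not the actual contents of amanuensis.py.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="amanuensis.py",
        description="Expand early modern abbreviations in TEI XML files.",
    )
    parser.add_argument("--config", "-c", default="config.toml",
                        help="Path to configuration file")
    parser.add_argument("--input", "-i",
                        help="Input directory containing TEI XML files")
    parser.add_argument("--output", "-o",
                        help="Output directory for processed files")
    # Assumed to be simple on/off switches.
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="Run in non-interactive mode")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose logging")
    parser.add_argument("--process", "-p", metavar="FILE",
                        help="Process a specific TEI XML file")
    return parser


args = build_parser().parse_args(["--process", "samples/document.xml", "-v"])
print(args.process, args.verbose, args.config)
```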

Examples

Process a specific file:

```
python amanuensis.py --process samples/document.xml
```

Process all files in a directory:

```
python amanuensis.py --input /path/to/tei/files --output /path/to/output
```

Using the Original Version

For the original version functionality:

```
./run.sh
```

Configuration

Configuration is managed through the config.toml file. Key settings include:

Input/output paths
Language model settings
User interface preferences
Dataset creation options
TEI XML processing settings

See config.toml for detailed configuration options.
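
As a rough idea of how such a file might be read, here is a sketch that parses TOML text and merges it over defaults. Every table and key name in it (`paths`, `language_model`, `dataset`, and their fields) is hypothetical, as is the `load_settings` helper; the real names are the ones in config.toml.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import toml as tomllib  # the `toml` package from the prerequisites

# Every table and key below is hypothetical; consult the real config.toml.
EXAMPLE_CONFIG = """
[paths]
input = "samples"
output = "output"

[language_model]
enabled = false

[dataset]
collect = true
"""


def load_settings(text):
    """Parse TOML text and overlay it on a set of assumed defaults."""
    settings = {
        "paths": {"input": ".", "output": "output"},
        "language_model": {"enabled": False},
        "dataset": {"collect": False},
    }
    for name, value in tomllib.loads(text).items():
        if isinstance(value, dict):
            # Merge table by table so unspecified keys keep their defaults.
            settings.setdefault(name, {}).update(value)
        else:
            settings[name] = value
    return settings


settings = load_settings(EXAMPLE_CONFIG)
print(settings["paths"]["input"], settings["dataset"]["collect"])
```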

Roadmap

Multilingual Support: Addition of French, Italian, Latin, and Spanish.
Beyond OpenAI: Compatibility with competing APIs.
Documentation: Basic documentation in English, French, and Spanish.
Web Interface: A web-based interface for easier interaction.

Feel free to suggest new features in the Issues section.

Dependencies

See requirements.txt

Contributing

Contributions are welcome! Please feel free to submit a pull request.

License

This project is licensed under the terms of the MIT license. For more details, see the LICENSE file.
