Amanuensis, noun. /əˌmænjuˈensɪs/. Early 17th cent.: Latin, from (servus) a manu '(slave) at handwriting, secretary' and -ensis 'belonging to'.
- a person who writes down your words when you cannot write.
- a literary assistant, especially one who writes or types from dictation or copies text.
Amanuensis 2.0 is a significant upgrade from the original version, focusing specifically on TEI XML processing for early modern abbreviations. This new version provides a modern, modular architecture with enhanced capabilities for working with structured documents.
- XML-Native Processing: Works directly with XML nodes without extracting to plain text, preserving the structure and relationships between elements
- TEI-Aware Handling: Special handling for TEI XML abbreviation structures, including `<abbr>`, `<g>`, and `<am>` elements
- TEI XML Processing: Parse TEI XML documents containing early modern abbreviations
- Smart Suggestion System: Combine dictionary lookups, pattern matching, WordNet, and language models for better expansions
- Interactive Interface: User-friendly command-line interface for reviewing and selecting expansions
- Dataset Collection Mode: Preserve original documents while building training datasets from user selections
- Dataset Creation: Build structured datasets for training language models on abbreviation expansion
- Comprehensive Test Suite: Extensive testing framework to ensure reliability
- Modern Architecture: Modular, maintainable code structure
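As a rough illustration of how suggestions from several sources might be merged, here is a minimal sketch that combines a dictionary lookup with pattern-based rewriting. It is not the project's actual implementation: `DICTIONARY`, `PATTERNS`, and `suggest` are hypothetical names, the entries are invented, and the WordNet and language-model sources are left out to keep the example self-contained.

```python
import re

MACRON = "\u0304"  # combining macron, often marking an omitted m or n

# Hypothetical lookup table: abbreviation -> known expansions.
DICTIONARY = {
    "wch": ["which"],
    f"co{MACRON}mon": ["common"],
}

# Hypothetical rewrite rules: a vowel carrying a macron may stand for
# vowel + "m" or vowel + "n".
PATTERNS = [
    (re.compile(f"([aeiou]){MACRON}"), r"\1m"),
    (re.compile(f"([aeiou]){MACRON}"), r"\1n"),
]


def suggest(abbreviation, limit=5):
    """Return de-duplicated candidates, dictionary hits first."""
    candidates = list(DICTIONARY.get(abbreviation, []))
    for pattern, replacement in PATTERNS:
        expanded, count = pattern.subn(replacement, abbreviation)
        if count:  # keep only rules that actually changed the input
            candidates.append(expanded)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(candidates))[:limit]


print(suggest(f"co{MACRON}mon"))
```

Putting dictionary hits first means a curated expansion always outranks a rule-generated guess. A real ranking would likely weigh the sources differently.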
In version 2.0, we've completely redesigned how TEI documents are processed to preserve structural information:
- Direct XML Manipulation: Work directly with XML nodes instead of extracting to plain text
- Node Relationships: Maintain parent-child relationships between elements
- Structure Preservation: Handle complex TEI structures like `<choice>`, `<abbr>`, `<expan>`, `<am>`, and `<ex>` properly
- Special Element Support: Properly handle special elements like `<g ref="char:cmbAbbrStroke">` for macrons and other early modern abbreviation markers
- XPath Navigation: Use XPath for precise element location rather than string searching
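To make the node-level approach concrete, the sketch below wraps a bare `<abbr>` in a `<choice>` with an `<expan>` sibling while leaving the original node, including its `<g>` child, untouched. It is an illustration under stated assumptions, not the project's code: `wrap_in_choice`, `q`, and the sample markup are hypothetical. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; `lxml` (listed in the requirements) offers full XPath support and `getparent()`, which would replace the parent map built here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def q(tag):
    """Return the namespace-qualified form of a TEI tag name."""
    return f"{{{TEI_NS}}}{tag}"


def wrap_in_choice(root, expansions):
    """Wrap each bare <abbr> in <choice><abbr/><expan/></choice>, in place."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for abbr in root.findall(".//tei:abbr", NS):
        parent = parent_of[abbr]
        if parent.tag == q("choice"):
            continue  # already paired with an expansion
        surface = "".join(abbr.itertext())  # text of the node and its children
        if surface not in expansions:
            continue
        choice = ET.Element(q("choice"))
        # Text that followed the abbreviation must now follow <choice>.
        choice.tail, abbr.tail = abbr.tail, None
        position = list(parent).index(abbr)
        parent.remove(abbr)
        parent.insert(position, choice)
        choice.append(abbr)  # the original node, <g> child included
        ET.SubElement(choice, q("expan")).text = expansions[surface]
    return root


# Invented sample: "comon" with an abbreviation stroke, expanded to "common".
SAMPLE = (
    f'<p xmlns="{TEI_NS}">the '
    '<abbr>co<g ref="char:cmbAbbrStroke"/>mon</abbr> good</p>'
)
root = wrap_in_choice(ET.fromstring(SAMPLE), {"comon": "common"})
print(ET.tostring(root, encoding="unicode"))
```

Because the `<abbr>` element is moved rather than rebuilt from a string, its attributes and children survive unchanged, which is the point of working on nodes instead of extracted text.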
Amanuensis is an application designed to accelerate normalization tasks in large historical corpora. It increases legibility by expanding abbreviations and replacing Unicode characters in a systematic and context-sensitive way. This type of pre-processing is instrumental to subsequent digital analyses and manipulations.
- Unicode Character Replacement: A powerful conversion tool to clean up text by removing and/or replacing undesirable characters.
- Dynamic Word Normalization: Expanding abbreviated words using Natural Language Processing, human inputs, and Large Language Models.
- Comprehensive Logging: Every modification is tracked and stored in accessible JSON files, enabling further statistical analysis.
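The character replacement and logging described above could look roughly like the following sketch. The replacement table, the `replace_characters` function, and the shape of the log entries are all assumptions made for illustration; the project's real tables and log format may differ.

```python
import json

# Hypothetical replacement table; the project's real table may differ.
REPLACEMENTS = {
    "\u017f": "s",  # long s
    "\ua75b": "r",  # r rotunda
    "\u00ad": "",   # soft hyphen, removed outright
}


def replace_characters(text, table=REPLACEMENTS):
    """Return the cleaned text plus one log entry per replaced character."""
    cleaned, log = [], []
    for position, char in enumerate(text):
        if char in table:
            log.append(
                {"position": position, "original": char, "replacement": table[char]}
            )
            cleaned.append(table[char])
        else:
            cleaned.append(char)
    return "".join(cleaned), log


cleaned, log = replace_characters("ble\u017f\u017fed")
print(cleaned)
# ensure_ascii=False keeps the original characters readable in the JSON log
print(json.dumps(log, ensure_ascii=False, indent=2))
```

Recording the position in the original string keeps every change traceable, and a flat list of entries is easy to aggregate later for statistics.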
- Python 3.10 or higher
- Required packages (install with pip):
  - toml
  - lxml
  - rich
  - nltk
- Clone the repository:

      git clone https://github.com/yourusername/amanuensis.git
      cd amanuensis

- Install dependencies:

      pip install -r requirements.txt

- Download NLTK data:

      python -c "import nltk; nltk.download('wordnet')"

- (Optional) Set up OpenAI API for enhanced suggestions:

      export OPENAI_API_KEY=your_api_key_here

Run `python amanuensis.py` to launch the interactive interface for the new version. To list the command-line options, run `python amanuensis.py --help`.
Options:
- `--config`, `-c`: Path to configuration file (default: `config.toml`)
- `--input`, `-i`: Input directory containing TEI XML files
- `--output`, `-o`: Output directory for processed files
- `--quiet`, `-q`: Run in non-interactive mode
- `--verbose`, `-v`: Enable verbose logging
- `--process`, `-p`: Process a specific TEI XML file
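
For readers who want to script around these options, the following `argparse` sketch mirrors the flags listed above. It is a reconstruction from this list only, not the parser in `amanuensis.py`: `build_parser` is a hypothetical name, and details such as whether `--process` and `--input` are mutually exclusive are not documented here and are left unconstrained.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="amanuensis.py")
    parser.add_argument("--config", "-c", default="config.toml",
                        help="path to configuration file")
    parser.add_argument("--input", "-i",
                        help="input directory containing TEI XML files")
    parser.add_argument("--output", "-o",
                        help="output directory for processed files")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="run in non-interactive mode")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="enable verbose logging")
    parser.add_argument("--process", "-p",
                        help="process a specific TEI XML file")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--process", "samples/document.xml", "-v"])
print(args)
```

Passing a list to `parse_args` makes the parser easy to exercise in tests without touching the real command line.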
Process a specific file:

    python amanuensis.py --process samples/document.xml

Process all files in a directory:

    python amanuensis.py --input /path/to/tei/files --output /path/to/output

For the original version functionality:

    ./run.sh

Configuration is managed through the `config.toml` file. Key settings include:
- Input/output paths
- Language model settings
- User interface preferences
- Dataset creation options
- TEI XML processing settings
See `config.toml` for detailed configuration options.
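
As a sketch of how such a file might be read and combined with defaults, the example below parses a small TOML string and overlays it on fallback values. Every table and key name in it (`paths`, `language_model`, `dataset`, and their fields) is invented for illustration and does not describe the project's real `config.toml`; `load_settings` is likewise a hypothetical helper.

```python
try:
    import tomllib as toml_reader  # standard library from Python 3.11
except ModuleNotFoundError:
    import toml as toml_reader  # the `toml` package from the requirements

# Hypothetical config: all table and key names are invented for illustration.
EXAMPLE = """
[paths]
input = "data/input"
output = "data/output"

[dataset]
collect = true
"""

# Hypothetical defaults used when a table or key is missing from the file.
DEFAULTS = {
    "paths": {"input": ".", "output": "out"},
    "language_model": {"enabled": False},
    "dataset": {"collect": False},
}


def load_settings(text, defaults=DEFAULTS):
    """Overlay parsed TOML tables on the defaults, one table at a time."""
    parsed = toml_reader.loads(text)
    return {
        table: {**values, **parsed.get(table, {})}
        for table, values in defaults.items()
    }


settings = load_settings(EXAMPLE)
print(settings)
```

Merging per table lets a config file override a single key without restating the whole section.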
- Multilingual Support: Addition of French, Italian, Latin, and Spanish.
- Beyond OpenAI: Compatibility with competing APIs.
- Documentation: Basic documentation in English, French, and Spanish.
- Web Interface: A web-based interface for easier interaction.
Feel free to suggest new features in the Issues section.
See `requirements.txt` for the full list of dependencies.
Contributions are welcome! Please feel free to submit a pull request.
This project is licensed under the terms of the MIT license. For more details, see the LICENSE file.