This repository contains the code and supplementary materials for the scientific article "LLM Extraction of Interpretable Features from Text." The aim of this project is to demonstrate how large language models (LLMs) can be used to extract interpretable features from textual data. We further show how these interpretable features can be used with action rules.
Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate the process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines, with a target variable serving as a proxy for research impact. An evaluation based on testing for statistically significant correlation with research impact showed that the Llama 2-generated features are semantically meaningful. We subsequently used these features in text classification to predict the binary target variable representing the citation rate in the CORD-19 dataset and the ordinal 5-class target representing an expert-awarded grade in the M17+ dataset. Machine-learning models trained on the LLM-generated features achieved predictive performance similar to the state-of-the-art SciBERT embedding model for scientific text. Not only did the LLM use just 62 features, compared to 768 in SciBERT embeddings, but these features were directly interpretable, corresponding to notions such as article methodological rigour, novelty, or grammatical correctness. Finally, we applied action rule mining, which yielded a small number of well-interpretable rules. Consistently competitive results obtained with the same LLM feature set across both thematically diverse datasets show that the approach generalizes across domains. We also expect this technique could be used not only in rule learning but also in other white-box methods. Our results are replicable thanks to the use of an open LLM.
To get started, clone this repository and install the necessary dependencies:
```bash
git clone https://github.com/vojtech-balek/llm-features.git
cd llm-features
pip install -r requirements.txt
```
Data is stored in the `data` folder.
Feature extraction follows the procedure described in the Methodology section of the article. The corresponding notebook is `feature_extraction.ipynb`.
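The notebook contains the authoritative implementation; the following is only a minimal sketch of the general idea, assuming a Hugging Face `transformers` text-generation pipeline and hypothetical feature names (the actual prompt and the full 62-feature set are defined in the notebook):

```python
import json
from transformers import pipeline

# Assumption: any instruction-tuned chat model works here; the article used Llama 2.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

# Hypothetical feature names for illustration only.
FEATURES = ["methodological_rigour", "novelty", "grammatical_correctness"]

def extract_features(abstract: str) -> dict:
    """Ask the LLM to rate an abstract on each feature and return a dict."""
    prompt = (
        "Rate the following scientific abstract on a 1-5 scale for each of "
        f"these aspects: {', '.join(FEATURES)}. "
        "Answer with a JSON object mapping aspect to score.\n\n"
        f"Abstract: {abstract}"
    )
    output = generator(prompt, max_new_tokens=128, do_sample=False,
                       return_full_text=False)
    # In practice the model response needs validation before parsing.
    return json.loads(output[0]["generated_text"])
```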
Analysis of the features generated for the CORD-19 and M17+ datasets, including a formal test of the relationship between the target and the generated features, is in `feature_interpretation.ipynb`.
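The exact test is specified in the notebook; as a minimal sketch, a per-feature correlation screen against the target could look like this (assuming numeric feature scores and an ordinal or binary target):

```python
import pandas as pd
from scipy.stats import spearmanr

def correlation_screen(features: pd.DataFrame, target: pd.Series,
                       alpha: float = 0.05) -> pd.DataFrame:
    """Spearman correlation of each feature with the target.

    A Bonferroni correction accounts for testing many features at once.
    """
    corrected_alpha = alpha / features.shape[1]
    rows = []
    for name in features.columns:
        rho, p = spearmanr(features[name], target, nan_policy="omit")
        rows.append({"feature": name, "rho": rho, "p": p,
                     "significant": p < corrected_alpha})
    return pd.DataFrame(rows).sort_values("p")
```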
Evaluate the performance of the models and the extracted features using `model_evaluation.ipynb`.
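The notebook contains the full evaluation; as a rough sketch, a standard scikit-learn classifier can be trained on the low-dimensional LLM feature matrix (the file layout and column names below are assumptions, not the repository's actual schema):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical layout: one row per article, 62 LLM-generated feature columns
# plus a binary "high_impact" target (the CORD-19 citation-rate proxy).
df = pd.read_csv("data/cord19_llm_features.csv")
X = df.drop(columns=["high_impact"])
y = df["high_impact"]

# Any white-box classifier works on this small, interpretable feature set.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```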
Action rules for the CORD-19 and M17+ datasets are mined in `action-CORD19.ipynb` and `action-M17Plus.ipynb`.
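The notebooks implement the actual mining procedure. Conceptually, an action rule recommends changing flexible attributes, under fixed stable conditions, so that the predicted class flips to a desired one. A minimal, library-free illustration of the structure (attribute values and class names are hypothetical):

```python
# An action rule pairs conditions on stable attributes with a recommended
# change in flexible attributes that moves an instance to a desired class.
# Illustrative example only; the notebooks use a dedicated mining algorithm.
action_rule = {
    "stable":   {"discipline": "biomedicine"},               # cannot change
    "flexible": {"methodological_rigour": ("low", "high")},  # change low -> high
    "effect":   {"high_impact": (0, 1)},                     # expected class flip
}

def applies(rule: dict, article: dict) -> bool:
    """Check that the stable conditions and the flexible 'before' values
    both match the given article."""
    stable_ok = all(article.get(k) == v for k, v in rule["stable"].items())
    flexible_ok = all(article.get(k) == before
                      for k, (before, _) in rule["flexible"].items())
    return stable_ok and flexible_ok

article = {"discipline": "biomedicine", "methodological_rigour": "low"}
if applies(action_rule, article):
    for attr, (_, after) in action_rule["flexible"].items():
        print(f"Recommend changing {attr} to '{after}' to reach high impact.")
```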