Tech News RAG-based Chatbot

This project consists of a proof of concept application for users to quickly obtain answers to queries on AI developments, supported by sources such as papers or blogs.

Overall System

Data Ingestion

The main components used here are the Qdrant vector database and the HuggingFace transformers library. The arXiv API serves as the data source, although other data sources are easy to add in the future. Information for the queried papers is cleaned, tokenized, and embedded using a pre-trained sentence transformer model from HuggingFace. The embeddings are uploaded to Qdrant along with relevant metadata. These steps are orchestrated with Dagster, which runs them daily. The entire data pipeline is deployed on AWS ECS.
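As a rough illustration of the cleaning step, the sketch below parses an Atom feed of the kind the arXiv API returns and normalizes the text fields into flat records. It is a minimal sketch and not the pipeline's actual code: the sample feed, the record layout, and the function names `clean_text` and `parse_feed` are assumptions made for this example, and the embedding and Qdrant upload steps are left out.

```python
import re
import xml.etree.ElementTree as ET

# Atom namespace used by arXiv API responses.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, hand-written sample shaped like an arXiv Atom response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2024-01-01T00:00:00Z</published>
    <title>An Example
      Paper Title</title>
    <summary>  First line of the abstract.
  Second line of the abstract.  </summary>
    <author><name>A. Author</name></author>
  </entry>
</feed>"""


def clean_text(text: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def parse_feed(xml_text: str) -> list:
    """Turn each Atom <entry> into a flat record (assumed layout)."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("atom:entry", ATOM):
        records.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
            "title": clean_text(entry.findtext("atom:title", default="", namespaces=ATOM)),
            "summary": clean_text(entry.findtext("atom:summary", default="", namespaces=ATOM)),
            "authors": [
                author.findtext("atom:name", default="", namespaces=ATOM)
                for author in entry.findall("atom:author", ATOM)
            ],
        })
    return records


records = parse_feed(SAMPLE_FEED)
print(records[0]["title"])
```

In a setup like the one described, the cleaned `summary` would be the text passed to the embedding model, with the remaining fields stored as metadata next to the vector.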

Model Fine-Tuning

The main components used here are Comet-ML, PyTorch, and multiple libraries from HuggingFace, namely peft, bitsandbytes, datasets, transformers and trl. Together, these libraries are used to load and efficiently fine-tune an LLM (in this case Falcon-7B-Instruct) on a Q&A dataset. Comet-ML is used to log the experiments, which allows metric comparison between runs, and to store critical artifacts such as the model file itself. These artifacts are then used in the last (inference) pipeline. Fine-tuning the model requires a GPU, so Beam is used to access the needed infrastructure.
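To give a sense of why adapter-based fine-tuning is efficient, the sketch below compares the parameter count of a full weight matrix with that of a low-rank (LoRA-style) update, one of the methods peft provides. Whether this project uses LoRA specifically is an assumption, and the layer size and rank are hypothetical values chosen only for illustration, not Falcon-7B-Instruct's actual dimensions.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in a dense d_out x d_in projection matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and adapter rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 8

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
ratio = lora / full

print(f"full: {full:,}  low-rank update: {lora:,}  ratio: {ratio:.4%}")
```

Under these assumed numbers, the adapter trains well under one percent of the weights of the layer it modifies, which is what makes fine-tuning a 7B-parameter model feasible on a single GPU.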

Chatbot

In this stage, a RAG-based chatbot application is orchestrated using LangChain. Based on the user's question, relevant documents are pulled from the Qdrant vector database and added to the prompt as context. The fine-tuned LLM is pulled from Comet-ML and used to generate the response. The application is deployed on Beam as an endpoint.
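The retrieve-then-generate flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: in the application the similarity search is done by Qdrant and the chain is assembled with LangChain, whereas here an in-memory list with toy three-dimensional vectors stands in for the vector database, and the prompt template, document fields, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, k=2):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, contexts):
    """Place the retrieved passages and their sources ahead of the question."""
    context_block = "\n".join(
        f"- {c['text']} (source: {c['source']})" for c in contexts
    )
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy stand-in for the vector database; vectors are made up for the example.
DOCS = [
    {"text": "Paper A studies retrieval.", "source": "paper-a", "vector": [1.0, 0.0, 0.0]},
    {"text": "Paper B studies quantization.", "source": "paper-b", "vector": [0.0, 1.0, 0.0]},
    {"text": "Paper C studies retrieval and ranking.", "source": "paper-c", "vector": [0.9, 0.1, 0.0]},
]

query_vec = [1.0, 0.0, 0.0]  # would come from the embedding model in practice
top = retrieve(query_vec, DOCS, k=2)
prompt = build_prompt("What is new in retrieval?", top)
print(prompt)
```

The resulting prompt is what would be handed to the fine-tuned LLM; keeping the source identifiers in the context is one simple way to let the answer cite its supporting papers or blogs.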

Guide

The three components are relatively decoupled and interact only through Comet-ML and Qdrant. As such, they can be run and deployed (if applicable) independently. For details on how to do this, check the README of each component.

Future Improvements

- Add automatic continuous deployment (CD) to ECS for the data ingestion pipeline
- Monitor/evaluate the LLM outputs (e.g. with Opik)
- Improve RAG performance
  - Add contextual retrieval
  - Add query expansion
  - Add self-query
- Add more data sources
