AIDB: Approximate Query Engine with ML Integration

📌 Introduction

With the explosion of unstructured data in modern databases, retrieving meaningful insights efficiently has become a critical challenge. AIDB (Approximate Intelligence Database) is designed to integrate approximate query processing (AQP) with machine learning (ML) models, enabling fast and intelligent query execution over large-scale datasets. This project implements a prototype system that supports sentiment analysis on IMDb movie reviews using both BERT and LSTM models.

Why IMDb Reviews?

IMDb reviews provide a rich source of real-world, unstructured textual data, making them ideal for testing sentiment analysis models. The dataset consists of positive and negative movie reviews, which allows us to evaluate the effectiveness of ML-powered query optimization for sentiment-based retrieval tasks.

Goals of This Project

  • Build an approximate query engine that executes SQL queries over text-based data.
  • Leverage ML models (BERT and LSTM) to perform sentiment classification.
  • Enable approximate queries using confidence intervals to speed up execution.
  • Optimize performance by reducing the number of queries required for sentiment computation.
  • Provide a modular, extensible design for future improvements in approximate querying.

πŸ” System Design & Implementation

Step 1: Dataset Collection & Preprocessing

  • Dataset Source: IMDb movie reviews dataset.
  • Cleaning: Tokenization, removal of stopwords, and conversion to lowercase.
  • Storage: Data stored in a SQLite database (imdb.db) for easy querying.
import pandas as pd
import sqlite3

# Load dataset
df = pd.read_csv("reviews.csv")

# Store in SQLite database
conn = sqlite3.connect("imdb.db")
df.to_sql("reviews", conn, if_exists="replace", index=False)
conn.close()
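The cleaning listed above happens before the DataFrame is written out. A minimal sketch of that step, assuming the text lives in a `review` column and using NLTK's English stopword list (both assumptions, not confirmed by the repository):

import re
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase, split into word tokens, and drop English stopwords
    tokens = re.findall(r"[a-z']+", str(text).lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

df["review"] = df["review"].apply(clean_text)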

Step 2: Building Machine Learning Models

BERT Model for Sentiment Classification

  • Pretrained BERT-base-uncased model fine-tuned on the IMDb dataset.
  • Uses Hugging Face's Transformers library.
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classification model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.load_state_dict(torch.load("model_bert.pth", map_location=device))
model.to(device)
model.eval()
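As a quick sanity check, a single review can be scored with the loaded model. predict_sentiment_bert is a hypothetical helper for illustration, not a repository function:

def predict_sentiment_bert(text):
    # Tokenize, truncate to BERT's 512-token limit, and move tensors to the active device
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two classes; index 1 is taken to be "positive"
    return torch.softmax(logits, dim=-1)[0, 1].item()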

LSTM Model for Sentiment Classification

  • Uses GloVe word embeddings for feature representation.
  • Trained on IMDb reviews using PyTorch.
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, pretrained_embeddings, embedding_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        # pretrained_embeddings: a (vocab_size, embedding_dim) tensor of GloVe vectors
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embedded = self.embedding(x)               # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embedded)          # (batch, seq_len, hidden_dim)
        last_output = lstm_out[:, -1, :]           # hidden state at the final time step
        return self.sigmoid(self.fc(last_output)) # probability of positive sentiment
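Restoring the trained model then looks roughly like the sketch below. The embedding and hidden dimensions are assumptions, and we assume vectorizer_lstm.pkl holds the GloVe embedding matrix (the file listing only says "word embeddings for LSTM"):

import pickle

with open("vectorizer_lstm.pkl", "rb") as f:
    glove_matrix = pickle.load(f)  # assumed: (vocab_size, 100) embedding matrix

lstm_model = LSTMModel(torch.as_tensor(glove_matrix, dtype=torch.float32),
                       embedding_dim=100, hidden_dim=128, output_dim=1)
lstm_model.load_state_dict(torch.load("model_lstm.pth", map_location="cpu"))
lstm_model.eval()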

Step 3: Executing Sentiment Queries

The query engine processes structured queries over unstructured text data and classifies sentiment using the trained ML models.

Standard SQL Query (Exact Sentiment Calculation)

SELECT AVG(sentiment) FROM reviews;

Python Implementation:

def query_sentiment():
    texts = get_all_reviews()
    scores = predict_sentiments(texts, batch_size=32)
    avg_score = np.mean(scores)
    print(f"Average Sentiment Score: {avg_score:.4f}")

Step 4: Approximate Query with Confidence Interval

Instead of scoring every review, we randomly sample a subset and attach a confidence interval to the estimate. Because inference cost scales with the number of reviews scored, the speedup is roughly 1/sample_ratio: about 10x for a 10% sample and about 100x for a 1% sample.

def query_sentiment_approx(sample_ratio=0.1):
    texts = get_all_reviews()
    # Uniform random sample without replacement
    sampled_texts = np.random.choice(texts, int(len(texts) * sample_ratio), replace=False)
    scores = predict_sentiments(sampled_texts, batch_size=32)
    avg_score = np.mean(scores)
    # Half-width of a 95% confidence interval (normal approximation)
    ci = 1.96 * np.std(scores) / np.sqrt(len(sampled_texts))
    print(f"Approximate Sentiment Score: {avg_score:.4f} ± {ci:.4f}")

📊 Evaluation & Performance

Model   Accuracy   Inference Time
BERT    91.3%      ~2.5 s/query
LSTM    87.1%      ~1.2 s/query

Key Observations

  • BERT achieves the highest accuracy but is computationally expensive.
  • LSTM trades a few points of accuracy for roughly half the inference time.
  • Approximate querying significantly reduces execution time, making real-time analysis feasible.
  • Confidence intervals help quantify the reliability of approximate queries.
  • Further performance tuning (e.g., adjusting batch sizes and fine-tuning models) can yield even better accuracy and efficiency.

🚀 Future Improvements

  • Support multi-class sentiment classification (e.g., neutral sentiment).
  • Optimize storage and indexing for faster data retrieval.
  • Explore hybrid models that combine CNNs with LSTMs for better generalization.
  • Scale system with distributed databases for large-scale deployment.
  • Investigate additional sampling techniques to improve accuracy-speed tradeoff.

📂 Project Structure

AIDB/
├── model_bert.py        # BERT sentiment analysis model
├── model_lstm.py        # LSTM sentiment analysis model
├── preprocess.py        # Data preprocessing
├── query_sentiment.py   # Query engine for sentiment analysis
├── imdb.db              # SQLite database storing reviews
├── reviews.csv          # Raw dataset of IMDb reviews
├── vectorizer_bert.pkl  # Tokenizer for BERT
├── vectorizer_lstm.pkl  # Word embeddings for LSTM
├── model_bert.pth       # Trained BERT model weights
└── model_lstm.pth       # Trained LSTM model weights

📧 Contact Information

Author: Yan Li
Email: [email protected]

For any inquiries or contributions, feel free to reach out!
