With the explosion of unstructured data in modern databases, retrieving meaningful insights efficiently has become a critical challenge. AIDB (Approximate Intelligence Database) is designed to integrate approximate query processing (AQP) with machine learning (ML) models, enabling fast and intelligent query execution over large-scale datasets. This project implements a prototype system that supports sentiment analysis on IMDb movie reviews using both BERT and LSTM models.
IMDb reviews provide a rich source of real-world, unstructured textual data, making them ideal for testing sentiment analysis models. The dataset consists of positive and negative movie reviews, which allows us to evaluate the effectiveness of ML-powered query optimization for sentiment-based retrieval tasks.
- Build an approximate query engine that executes SQL queries over text-based data.
- Leverage ML models (BERT and LSTM) to perform sentiment classification.
- Enable approximate queries using confidence intervals to speed up execution.
- Optimize performance by reducing the number of queries required for sentiment computation.
- Provide a modular, extensible design for future improvements in approximate querying.
- Dataset Source: IMDb movie reviews dataset.
- Cleaning: Tokenization, removal of stopwords, and conversion to lowercase.
- Storage: Data stored in a SQLite database (
imdb.db
) for easy querying.
import pandas as pd
import sqlite3
# Load dataset
df = pd.read_csv("reviews.csv")
# Store in SQLite database
conn = sqlite3.connect("imdb.db")
df.to_sql("reviews", conn, if_exists="replace", index=False)
conn.close()
- Pretrained BERT-base-uncased model fine-tuned on the IMDb dataset.
- Uses Hugging Face's Transformers library.
from transformers import BertTokenizer, BertForSequenceClassification
import torch
# Load trained BERT model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.load_state_dict(torch.load("model_bert.pth", map_location=device))
model.eval()
- Uses GloVe word embeddings for feature representation.
- Trained on IMDb reviews using PyTorch.
import torch
import torch.nn as nn
class LSTMModel(nn.Module):
def __init__(self, embedding_dim, hidden_dim, output_dim):
super(LSTMModel, self).__init__()
self.embedding = nn.Embedding.from_pretrained(vocab.vectors, freeze=False)
self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
self.fc = nn.Linear(hidden_dim, output_dim)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
embedded = self.embedding(x)
lstm_out, _ = self.lstm(embedded)
last_output = lstm_out[:, -1, :]
return self.sigmoid(self.fc(last_output))
The query engine processes structured queries over unstructured text data and classifies sentiment using the trained ML models.
SELECT AVG(sentiment) FROM reviews;
Python Implementation:
def query_sentiment():
texts = get_all_reviews()
scores = predict_sentiments(texts, batch_size=32)
avg_score = np.mean(scores)
print(f"Average Sentiment Score: {avg_score:.4f}")
Instead of analyzing all reviews, we randomly sample a subset and compute a confidence interval to approximate the result 100x faster.
def query_sentiment_approx(sample_ratio=0.1):
texts = get_all_reviews()
sampled_texts = np.random.choice(texts, int(len(texts) * sample_ratio), replace=False)
scores = predict_sentiments(sampled_texts, batch_size=32)
avg_score = np.mean(scores)
ci = 1.96 * np.std(scores) / np.sqrt(len(sampled_texts))
print(f"Approximate Sentiment Score: {avg_score:.4f} Β± {ci:.4f}")
Model | Accuracy | Inference Time |
---|---|---|
BERT | 91.3% | ~2.5s/query |
LSTM | 87.1% | ~1.2s/query |
- BERT achieves the highest accuracy but is computationally expensive.
- LSTM provides a balance between performance and speed.
- Approximate querying significantly reduces execution time, making real-time analysis feasible.
- Confidence intervals help quantify the reliability of approximate queries.
- Further performance tuning (e.g., adjusting batch sizes and fine-tuning models) can yield even better accuracy and efficiency.
- Support multi-class sentiment classification (e.g., neutral sentiment).
- Optimize storage and indexing for faster data retrieval.
- Explore hybrid models that combine CNNs with LSTMs for better generalization.
- Scale system with distributed databases for large-scale deployment.
- Investigate additional sampling techniques to improve accuracy-speed tradeoff.
AIDB/
βββ model_bert.py # BERT sentiment analysis model
βββ model_lstm.py # LSTM sentiment analysis model
βββ preprocess.py # Data preprocessing
βββ query_sentiment.py # Query engine for sentiment analysis
βββ imdb.db # SQLite database storing reviews
βββ reviews.csv # Raw dataset of IMDb reviews
βββ vectorizer_bert.pkl # Tokenizer for BERT
βββ vectorizer_lstm.pkl # Word embeddings for LSTM
βββ model_bert.pth # Trained BERT model weights
βββ model_lstm.pth # Trained LSTM model weights
Author: Yan Li
Email: [email protected]
For any inquiries or contributions, feel free to reach out!