Mini Spam Detection Engine for Emails
Why Choose This Project?
Email spam is a major cybersecurity and productivity problem: millions of spam and phishing emails are sent every day. This project builds a mini spam detection engine that classifies emails as spam or legitimate (ham) using machine learning, natural language processing (NLP), and rule-based filters. It helps organizations and individuals filter out harmful or irrelevant emails before they reach the inbox.
What You Get in This Project
- A mini spam detection system that processes raw email text.
- ML/NLP-based spam classifier (Naive Bayes / Logistic Regression / Random Forest).
- Rule-based filters for blacklisted domains, suspicious keywords, or attachments.
- Web-based interface or CLI tool to test emails.
- Accuracy reports and confusion matrix for evaluation.
Technology Stack
| Layer | Technology |
|---|---|
| Data | Enron Spam Dataset / SpamAssassin Corpus |
| Language | Python |
| ML Models | Naive Bayes, Logistic Regression, Random Forest, SVM |
| NLP Tools | NLTK, spaCy, scikit-learn (TfidfVectorizer) |
| Backend (optional) | Flask / Django API for email testing |
| Frontend (optional) | HTML, CSS, JavaScript for UI |
| Database (optional) | SQLite / PostgreSQL for storing results |
Key Features
| Feature | Description |
|---|---|
| Email Preprocessing | Cleans emails (tokenization, stopword removal, stemming) |
| Spam Classification | ML model predicts whether an email is spam or ham |
| Keyword & Rule Filters | Detects suspicious patterns like "win money", "lottery", "free gift" |
| Blacklist Matching | Checks against blacklisted IPs, domains, or senders |
| Attachment Scanning | Flags potentially dangerous file types (e.g., .exe, .js) |
| Accuracy Evaluation | Reports precision, recall, F1-score |
| User Interface | Web app/CLI for entering emails and getting a spam score |
| Confidence Score | Displays spam probability (e.g., 92% spam) |
How It Works
1. Data Preprocessing
- Load the raw email dataset.
- Clean the subject and body (remove HTML tags, punctuation, and stopwords).
- Convert the text into numerical features using Bag of Words or TF-IDF.
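The cleaning and Bag of Words steps can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the function names `clean_email` and `bag_of_words` and the short `STOPWORDS` set are assumptions made for the example, and a real pipeline would more likely use the stopword lists from NLTK or spaCy.

```python
import re
from collections import Counter
from html import unescape

# Small illustrative stopword list (an assumption; NLTK and spaCy ship fuller lists).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "you", "your", "for", "on", "this"}


def clean_email(subject: str, body: str) -> list[str]:
    """Return lowercase tokens with HTML tags, punctuation, and stopwords removed."""
    text = f"{subject} {body}"
    text = re.sub(r"<[^>]+>", " ", text)               # drop HTML tags
    text = unescape(text)                               # decode entities such as &amp;
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # drop punctuation and symbols
    return [tok for tok in text.split() if tok not in STOPWORDS]


def bag_of_words(tokens: list[str]) -> Counter:
    """Count how often each token occurs (a Bag of Words representation)."""
    return Counter(tokens)


tokens = clean_email("You WON!", "<p>Claim your <b>free</b> gift &amp; win money, free!</p>")
counts = bag_of_words(tokens)
print(tokens)
print(counts.most_common(3))
```

TF-IDF builds on the same counts by down-weighting words that appear in many emails, so in practice the counting step is usually delegated to a vectorizer rather than written by hand.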
2. Model Training
- Train ML models (Naive Bayes, Logistic Regression, Random Forest).
- Evaluate on held-out test data: measure accuracy, precision, recall, and F1.
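A training and evaluation run might look like the sketch below, assuming scikit-learn is installed. The eight example emails are made up purely for illustration and stand in for a real corpus such as the Enron Spam Dataset or the SpamAssassin Corpus. Naive Bayes is used here only because it is the first model listed; Logistic Regression or Random Forest could be swapped into the same pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration only; a real run would load a labeled dataset.
texts = [
    "win money now claim your free gift",
    "lottery winner click here for your prize",
    "free gift card waiting for you",
    "urgent offer win a free vacation",
    "meeting moved to three pm tomorrow",
    "please review the attached project report",
    "lunch on friday with the team",
    "notes from the quarterly planning call",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out 25% of the data; stratify keeps both classes in the test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF features feeding a Multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
report = classification_report(y_test, y_pred, zero_division=0)
print(cm)
print(report)

# Spam probability for a new email, usable as a confidence score.
spam_index = list(model.classes_).index("spam")
spam_probability = model.predict_proba(["win a free lottery prize"])[0][spam_index]
print(f"spam probability: {spam_probability:.2f}")
```

With a corpus this small the reported metrics are not meaningful; the point is only the shape of the workflow (split, fit, predict, report).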
3. Spam Detection Workflow
- The user submits the email text.
- The text is preprocessed and transformed with TF-IDF.
- The ML model assigns a spam or not-spam label with a probability.
- Rule-based filters add extra detection (keywords, sender blacklist).
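One possible way to combine the model's probability with the rule-based filters is sketched below. The combination rule (adding a fixed boost per rule hit) is an assumption chosen for simplicity, not a method prescribed by the project. The names `rule_hits` and `classify`, the `boost_per_hit` and `threshold` values, and the domain `spam-example.test` are all hypothetical; `ml_spam_probability` stands in for whatever the trained classifier returns.

```python
# Hypothetical rule data; a real deployment would load these from maintained lists.
SUSPICIOUS_PHRASES = ("win money", "lottery", "free gift")
BLACKLISTED_DOMAINS = {"spam-example.test"}


def rule_hits(sender: str, text: str) -> list[str]:
    """Return a description of every rule the email triggers."""
    hits = []
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BLACKLISTED_DOMAINS:
        hits.append(f"blacklisted domain: {domain}")
    lowered = text.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            hits.append(f"suspicious phrase: {phrase}")
    return hits


def classify(sender, text, ml_spam_probability, threshold=0.5, boost_per_hit=0.15):
    """Raise the ML probability by a fixed amount per rule hit, capped at 1.0."""
    hits = rule_hits(sender, text)
    score = min(1.0, ml_spam_probability + boost_per_hit * len(hits))
    label = "Spam" if score >= threshold else "Ham"
    return {"label": label, "score": round(score, 2), "rule_hits": hits}


result = classify("promo@spam-example.test", "Claim your FREE GIFT and win money now", 0.40)
clean = classify("alice@example.org", "Meeting notes attached", 0.10)
print(result)
print(clean)
```

Returning the list of rule hits alongside the score also gives the interface something concrete to show when explaining a decision.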
4. Output & Dashboard
- Shows the spam score (%) and the decision (Spam/Ham).
- An admin or user can mark false positives and false negatives to improve the model.
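The feedback step could be backed by the optional SQLite database from the technology stack. The sketch below is one assumed design: the `feedback` table, its columns, and the helper names `open_feedback_store`, `record_correction`, and `retraining_samples` are hypothetical. It stores each reviewed prediction and later returns only the misclassified emails, with their corrected labels, as candidates for retraining.

```python
import sqlite3


def open_feedback_store(path=":memory:"):
    """Open (or create) the feedback database; ':memory:' keeps it in RAM for the demo."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               id INTEGER PRIMARY KEY,
               email_text TEXT NOT NULL,
               predicted_label TEXT NOT NULL,
               corrected_label TEXT NOT NULL
           )"""
    )
    return conn


def record_correction(conn, email_text, predicted_label, corrected_label):
    """Store the label the model predicted and the label the reviewer chose."""
    conn.execute(
        "INSERT INTO feedback (email_text, predicted_label, corrected_label) VALUES (?, ?, ?)",
        (email_text, predicted_label, corrected_label),
    )
    conn.commit()


def retraining_samples(conn):
    """Return (text, corrected_label) pairs for emails the model got wrong."""
    return conn.execute(
        "SELECT email_text, corrected_label FROM feedback "
        "WHERE predicted_label != corrected_label ORDER BY id"
    ).fetchall()


conn = open_feedback_store()
record_correction(conn, "Team lunch on Friday", "spam", "ham")   # false positive
record_correction(conn, "Win money now", "spam", "spam")         # prediction confirmed
samples = retraining_samples(conn)
print(samples)
```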
Security Features
- Blacklist Integration → Detects known spammer domains/IPs.
- Heuristic Rules → Flags emails with phishing-like patterns (too many links, obfuscated text).
- Attachment Filtering → Blocks dangerous file types.
- Model Retraining → Continuously improves detection with new spam samples.
- Explainability → Highlights which words/phrases triggered spam detection.
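The heuristic rules and attachment filtering could be prototyped as below. Everything here is a rough sketch under assumptions: the `MAX_LINKS` limit of 5, the regular expressions, and the function names are illustrative choices, and the extension list only contains the two file types named as examples earlier. The obfuscation pattern is deliberately crude and only catches single letters separated by punctuation, such as "f.r.e.e".

```python
import re
from pathlib import PurePosixPath

DANGEROUS_EXTENSIONS = {".exe", ".js"}   # the two examples from the feature list
MAX_LINKS = 5                             # assumed limit before an email counts as link-heavy

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
# Crude check: at least four single letters separated by ., -, _ or *
OBFUSCATED_PATTERN = re.compile(r"\b(?:[A-Za-z][.\-_*]){3,}[A-Za-z]\b")


def count_links(text: str) -> int:
    """Count http/https URLs in the email text."""
    return len(URL_PATTERN.findall(text))


def dangerous_attachments(filenames: list[str]) -> list[str]:
    """Return the attachment names whose final extension is on the block list."""
    return [
        name for name in filenames
        if PurePosixPath(name.lower()).suffix in DANGEROUS_EXTENSIONS
    ]


def security_flags(text: str, attachment_names: list[str]) -> list[str]:
    """Collect human-readable reasons why an email looks suspicious."""
    flags = []
    links = count_links(text)
    if links > MAX_LINKS:
        flags.append(f"too many links ({links})")
    if OBFUSCATED_PATTERN.search(text):
        flags.append("obfuscated text")
    for name in dangerous_attachments(attachment_names):
        flags.append(f"dangerous attachment: {name}")
    return flags


email_text = "Get your f.r.e.e prize: " + " ".join(f"http://link{i}.test" for i in range(6))
flags = security_flags(email_text, ["invoice.pdf.exe", "notes.txt"])
print(flags)
```

Because each flag is a plain-text reason, the same list can double as a simple form of explainability for the rule-based part of the decision.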