```python
# Example simple risk score (0-10)
risk = 0
risk += int(upper_ratio > 0.4) * 1
risk += int(digit_ratio > 0.2) * 1
risk += int(has_action_verb) * 1
risk += int(has_suspicious_keyword) * 1
risk += int(domain_age_days < 30) * 2
risk += int(tld not in {'com', 'org', 'net', 'gov', 'edu'}) * 1
risk += int(num_hyphens > 2) * 1
risk += int(url_entropy > 4.0) * 1
risk = min(risk, 10)
```

A more sophisticated approach is to feed all raw features into a gradient-boosted tree model (XGBoost, LightGBM), which automatically learns interaction effects (e.g., "high digit ratio and unknown TLD").

5. Practical Implementation Checklist

| Step | Action | Tool / Library |
|------|--------|----------------|
| 1 | Collect a labeled corpus (spam vs. legitimate subjects). | CSV / Parquet |
| 2 | Parse each subject for the features above. | `re`, `tldextract`, `email`, `nltk`, `sklearn` |
| 3 | Enrich URLs via external APIs (WHOIS, VirusTotal, Google Safe Browsing). | `python-whois`, `requests` |
| 4 | Vectorise text (TF-IDF, word embeddings) for deeper semantic signals. | `sklearn`, `gensim`, `sentence-transformers` |
| 5 | Scale numeric columns (StandardScaler or MinMaxScaler) if using linear models. | `sklearn.preprocessing` |
| 6 | Train and evaluate (cross-validation, ROC-AUC, PR-AUC). | `sklearn.model_selection` |
| 7 | Deploy as a micro-service (FastAPI/Flask) that receives a subject line and returns a risk score plus optional explanations (e.g., "high digit ratio, unknown TLD"). | FastAPI, Docker |
| 8 | Monitor drift: keep an eye on feature distributions (e.g., a sudden rise in new TLDs). | Prometheus + Grafana |

6. Example Code Snippet (End-to-End)

```python
import re
import datetime
import numpy as np
import tldextract
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
```
| # | Feature | Why it matters | Extraction |
|---|---------|----------------|------------|
| 13 | Registered Domain | `www.mazabd.click` uses a new TLD (`.click`) frequently used by malicious actors. | `tldextract.extract(subject).registered_domain` |
| 14 | Domain Age (in days) | Newly registered domains are riskier. | WHOIS API → `creation_date` → `today - creation_date` |
| 15 | Domain Reputation Score | Public blacklists (VirusTotal, Google Safe Browsing) give a numeric trust rating. | Query the API → `reputation_score` |
| 16 | Top-Level-Domain (TLD) Popularity | `.click`, `.xyz`, `.top` are over-represented in phishing. | Encode the TLD as categorical (one-hot) or assign a risk weight (e.g., `.com` = 0, others = 1). |
| 17 | Number of Sub-domains | More sub-domains → higher chance of URL shortening or obfuscation. | `subject_url.count('.') - 1` |
| 18 | Presence of Hyphens in Domain | Hyphens are often used to mimic legitimate names (`mazabd`). | `'-' in domain` (boolean) |
| 19 | URL Length | Very long URLs are suspicious. | `len(url)` |
| 20 | URL Entropy | Randomly generated strings boost entropy. | Same entropy formula as above, applied to `url`. |
| 21 | IDN / Punycode | Internationalised domain names can hide malicious look-alike domains. | `url.startswith('xn--')` |
| 22 | SSL Certificate Validity | Self-signed or expired certs are a warning sign (if you later fetch the URL). | Use `ssl` / `requests` to check `cert.notAfter`. |
| 23 | IP Address in URL | Direct IP links are uncommon in legitimate business mail. | `re.search(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', url)` |

3. Structural / Formatting Features

| # | Feature | Why it matters | Extraction |
|---|---------|----------------|------------|
| 24 | Number of Hyphens (`-`) | Overuse of hyphens often separates "spammy" tokens. | `subject.count('-')` |
| 25 | Pattern of Numeric Tokens | The sequence `840 -2024` is a "number-dash-year" pattern typical of fake invoice titles. | `re.search(r'\b\d{3,}\s*-\s*\d{4}\b', subject)` → boolean |
| 26 | Presence of Ellipsis (`...`) | Indicates truncation; spammers often hide the rest of a malicious URL. | `subject.endswith('...')` |
| 27 | Bracket/Parenthesis Balance | Unbalanced punctuation is a heuristic for malformed messages. | `subject.count('(') != subject.count(')')` |
| 28 | Whitespace Anomalies (multiple spaces, tabs) | Spam generators sometimes add extra spaces to bypass simple filters. | `re.search(r'\s{2,}', subject)` |
| 29 | Encoding Flags (e.g., `=?UTF-8?B?...?=`) | MIME-encoded subjects can hide malicious strings. | Detect with `email.header.decode_header`. |
| 30 | Subject Prefix / Tag Count | Tags like `[Urgent]`, `[Notice]` can be abused. | `len(re.findall(r'\[.*?\]', subject))` |

4. Aggregated / Meta-Features

You can combine the raw values into risk scores:
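The additive scoring scheme from section 4 can be wrapped in a small helper that consumes the extracted feature dict. A minimal sketch: the key names, thresholds, and weights below mirror the simple 0-10 score shown earlier but are illustrative assumptions, not a fixed API.

```python
def risk_score(f: dict) -> int:
    """Additive 0-10 risk score over extracted features (illustrative weights)."""
    risk = 0
    risk += int(f["upper_ratio"] > 0.4)
    risk += int(f["digit_ratio"] > 0.2)
    risk += int(f["has_action_verb"])
    risk += int(f["has_suspicious_kw"])
    risk += 2 * int(f["domain_age_days"] < 30)   # new domain weighs double
    risk += int(f["tld"] not in {"com", "org", "net", "gov", "edu"})
    risk += int(f["hyphen_cnt"] > 2)
    risk += int(f["url_entropy"] > 4.0)
    return min(risk, 10)

# Hypothetical feature values for a subject like the example above
features = {
    "upper_ratio": 0.1, "digit_ratio": 0.25, "has_action_verb": 1,
    "has_suspicious_kw": 1, "domain_age_days": 12, "tld": "click",
    "hyphen_cnt": 4, "url_entropy": 3.2,
}
score = risk_score(features)
print(score)  # 7
```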
```python
    # ---- Build dict ---------------------------------------------------------
    return {
        "n_tokens": n_tokens,
        "n_chars": n_chars,
        "avg_token_len": avg_token_len,
        "upper_ratio": upper_ratio,
        "digit_ratio": digit_ratio,
        "stop_ratio": stop_ratio,
        "has_action_verb": int(has_action),
        "has_suspicious_kw": int(has_suspicious),
        "hyphen_cnt": hyphen_cnt,
        "ellipsis": int(ellipsis),
        "numeric_pattern": int(numeric_pattern),
        "domain_present": int(bool(domain)),
        "registered_domain": registered,
        "tld": tld,
        "subdomain_cnt": subdomain_cnt,
        "hyphen_in_domain": int(hyphen_in_domain),
        "char_entropy": char_entropy,
        "domain_age_days": domain_age_days,
        "domain_risk": domain_risk,
    }
```
Below are ideas for turning a subject line such as "Download WORK - 840 -2024- Bengla -www.mazabd.click..." into a feature vector that can be fed to a spam-/phishing-detection model (e.g., a classic-ML classifier, a gradient-boosted tree, or a shallow neural net). The ideas are grouped by what the feature describes, why it matters, and how to compute it in a reproducible way (Python-friendly pseudo-code is included).

1. Text-Based Features (what the subject says)

| # | Feature | Why it matters | Simple extraction (Python) |
|---|---------|----------------|-----------------------------|
| 1 | Token Count | Very short or very long subjects are atypical for legitimate business mail. | `len(subject.split())` |
| 2 | Character Count | Spam often packs many characters to hide keywords. | `len(subject)` |
| 3 | Average Token Length | Long words → possible obfuscation. | `np.mean([len(t) for t in subject.split()])` |
| 4 | Upper-case Ratio | Excessive caps = "shouting", common in phishing. | `sum(1 for c in subject if c.isupper()) / len(subject)` |
| 5 | Digit Ratio | High proportion of numbers (e.g., "840-2024") is a red flag. | `sum(c.isdigit() for c in subject) / len(subject)` |
| 6 | Presence of Action Verbs (download, click, open, update, ...) | Direct calls-to-action are a hallmark of malicious prompts. | `any(v in subject.lower() for v in ("download", "click", "open", "update", "verify"))` |
| 7 | Suspicious Keywords (work, urgent, invoice, account, password, ...) | Common lure words. | `any(k in subject.lower() for k in suspicious_word_list)` |
| 8 | Stop-Word Ratio | Spam often reduces natural language flow → low stop-word density. | `stop_words = set(nltk.corpus.stopwords.words('english')); stop_ratio = sum(1 for t in tokens if t.lower() in stop_words) / len(tokens)` |
| 9 | N-gram TF-IDF Scores (bi-grams, tri-grams) | Captures patterns like "download work", "840-2024". | Use `sklearn.feature_extraction.text.TfidfVectorizer(ngram_range=(2, 3))` on a corpus of subjects. |
| 10 | Language Detection | "Bengla" hints at a language mismatch (English subject + foreign term). | `langdetect.detect(subject)`; flag if not the primary language of the organization. |
| 11 | Spell-Check Ratio | Misspellings ("Bengla" vs "Bangla") are common in malicious mail. | `spellchecker.unknown(tokens)` → proportion. |
| 12 | Entropy of Characters | High entropy can indicate random strings or encoded data. | `probs = np.bincount(list(subject.encode())) / len(subject); entropy = -sum(p * np.log2(p) for p in probs if p > 0)` |

2. URL-Centric Features (what the subject exposes)

Even though the URL lives after the dash, the presence and shape of a domain in the subject is a strong signal:

Download WORK - 840 -2024- Bengla -www.mazabd.click...
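Before tabulating the URL features, it helps to see what a permissive domain pattern actually captures on the example subject. A small sketch using only the standard library; note that because the character class allows hyphens, the match can swallow the hyphen that precedes `www`, so it is stripped afterwards:

```python
import re

subject = "Download WORK - 840 -2024- Bengla -www.mazabd.click..."

# Very permissive "looks like a domain" pattern; it may pick up a leading
# hyphen from the surrounding text, so strip it off the match.
m = re.search(r'([a-z0-9-]+\.)+[a-z]{2,}', subject, re.I)
domain = m.group(0).lstrip('-') if m else ''
print(domain)  # www.mazabd.click
```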
```python
    # ---- Textual cues -------------------------------------------------------
    upper_ratio = sum(c.isupper() for c in subject) / max(n_chars, 1)
    digit_ratio = sum(c.isdigit() for c in subject) / max(n_chars, 1)
    avg_token_len = np.mean([len(t) for t in tokens]) if tokens else 0
    has_action = any(v in subject.lower()
                     for v in ("download", "click", "open", "update", "verify"))
    has_suspicious = any(v in subject.lower() for v in suspicious_word_list)
    stop_ratio = sum(t.lower() in stop_words for t in tokens) / max(n_tokens, 1)
    hyphen_cnt = subject.count('-')
    ellipsis = subject.endswith('...')
    numeric_pattern = bool(re.search(r'\b\d{3,}\s*-\s*\d{4}\b', subject))
```
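As a sanity check, the "number-dash-year" regex from the cues above does fire on the example subject, even with the stray space before the hyphen:

```python
import re

subject = "Download WORK - 840 -2024- Bengla -www.mazabd.click..."

# \d{3,} then optional whitespace, a hyphen, optional whitespace, then a year
match = re.search(r'\b\d{3,}\s*-\s*\d{4}\b', subject)
print(match.group(0) if match else None)  # 840 -2024
```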
```python
def entropy(s):
    """Shannon entropy of a string."""
    probs = np.bincount(list(s.encode())) / len(s)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))
```
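Two quick checks make the helper's behaviour concrete (the function is repeated here so the snippet runs standalone): a single repeated byte carries no information, while two equally likely bytes carry exactly one bit.

```python
import numpy as np

def entropy(s):
    """Shannon entropy of a string, in bits per byte."""
    probs = np.bincount(list(s.encode())) / len(s)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

low = entropy("aaaa")   # one repeated symbol -> 0 bits
high = entropy("abab")  # two equally likely symbols -> 1 bit
```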
```python
suspicious_word_list = (
    "download", "click", "open", "update", "verify", "invoice", "account",
    "password", "login", "security", "confirm",
)
```
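One subtlety of the `any(w in subject.lower() ...)` check is that it matches substrings, not tokens: on the example subject, "click" fires on the `.click` TLD, not on a verb. That is arguably useful here, but a token-level check would be stricter. A quick demonstration:

```python
suspicious_word_list = (
    "download", "click", "open", "update", "verify", "invoice", "account",
    "password", "login", "security", "confirm",
)

subject = "Download WORK - 840 -2024- Bengla -www.mazabd.click..."
hits = [w for w in suspicious_word_list if w in subject.lower()]
print(hits)  # ['download', 'click']  -- 'click' comes from the .click TLD
```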
```python
def extract_features(subject: str) -> dict:
    # ---- Basic tokenisation -------------------------------------------------
    tokens = re.split(r'\s+', subject.strip())
    n_tokens = len(tokens)
    n_chars = len(subject)
```
```python
stop_words = set("""a about after all also an and any are as at be because been
but by can cannot could did do does each for from further had has have having
he her here hers herself him himself his how i if in into is it its itself just
me more most my myself no not of off on once only or other our out over own
same she should so some such than that the their then there these they this
those through to too under until up very was we were what when where which
while who whom why will with you your yours yourself""".split())
```
```python
    # Dummy placeholders for reputation / age (replace with real API calls)
    domain_age_days = 9999   # e.g., today - creation_date
    domain_risk = 0          # 0 = clean, 1 = flagged
```
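To replace the placeholder, a WHOIS lookup (e.g., via the `python-whois` package) would yield a `creation_date`; the age computation itself is plain `datetime` arithmetic. A minimal sketch, where `age_in_days` is a hypothetical helper and the dates are made up for illustration:

```python
import datetime

def age_in_days(creation_date, today=None):
    """Days since registration; creation_date would come from a WHOIS lookup."""
    today = today or datetime.date.today()
    return (today - creation_date).days

# A domain registered 10 days ago would score as "very new" (< 30 days)
age = age_in_days(datetime.date(2024, 1, 1), today=datetime.date(2024, 1, 11))
print(age)  # 10
```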
```python
    # ---- Entropy ------------------------------------------------------------
    char_entropy = entropy(subject)
```
```python
    # ---- URL / domain cues --------------------------------------------------
    # Grab anything that looks like a domain (very permissive)
    domain_match = re.search(r'([a-z0-9-]+\.)+[a-z]{2,}', subject, re.I)
    domain = domain_match.group(0) if domain_match else ''
    ext = tldextract.extract(domain)
    registered = f"{ext.domain}.{ext.suffix}" if ext.suffix else ''
    tld = ext.suffix or ''
    subdomain_cnt = domain.count('.') - 1 if domain else 0
    hyphen_in_domain = '-' in ext.domain
```
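Since the end-to-end snippet is shown in fragments, here is a condensed, standard-library-only sketch of a few of the same cues, runnable as one piece on the example subject. `mini_features` is a hypothetical helper for illustration; key names follow the dict built above, and the naive `rsplit` TLD split stands in for `tldextract`:

```python
import re

def mini_features(subject: str) -> dict:
    """Condensed sketch of a few of the cues above (stdlib only)."""
    tokens = subject.split()
    n_chars = max(len(subject), 1)
    m = re.search(r'([a-z0-9-]+\.)+[a-z]{2,}', subject, re.I)
    domain = m.group(0).lstrip('-') if m else ''
    return {
        "n_tokens": len(tokens),
        "digit_ratio": sum(c.isdigit() for c in subject) / n_chars,
        "hyphen_cnt": subject.count('-'),
        "ellipsis": subject.endswith('...'),
        "domain": domain,
        "tld": domain.rsplit('.', 1)[-1] if domain else '',
    }

feats = mini_features("Download WORK - 840 -2024- Bengla -www.mazabd.click...")
print(feats["domain"], feats["tld"])  # www.mazabd.click click
```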