Spam Detection: Interview Walkthrough
Prerequisites: Problem Definition | Components
Spam detection is one of the most common MLSD cases because it tests five competencies at once: multi-layer architecture (rules + ML), adversarial robustness, the precision/recall trade-off, feedback loops, and scaling to 1M+ msg/sec. Below is a step-by-step framework for a 45-60 minute interview, with concrete phrasing that separates a senior answer from a junior one.
Interview Framework (45-60 min)
0-5 min: Clarifying questions
5-15 min: High-level design
15-30 min: Deep dive (features, model, adversarial handling)
30-45 min: Trade-offs & operations
45-60 min: Extensions
Step 1: Clarifying Questions (5 min)
**Scope:**
- What type? (email, SMS, reviews, comments)
- What actions? (block, quarantine, flag)
**Scale:**
- Message volume?
- Current spam rate?
**Requirements:**
- Latency budget?
- Precision vs Recall priority?
- Feedback mechanism?
Step 2: High-Level Design (10 min)
Architecture
graph TD
MSG["Message Input"] --> BL["Blocklist & Rules"]
BL --> ML["ML Classifier"]
ML --> ENS["Ensemble Decision"]
ENS --> INBOX["INBOX<br/>(deliver)"]
ENS --> SPAM["SPAM<br/>(block)"]
ENS --> QR["QUARANTINE<br/>(review)"]
style MSG fill:#e8eaf6,stroke:#3f51b5
style BL fill:#fff3e0,stroke:#ef6c00
style ML fill:#f3e5f5,stroke:#9c27b0
style ENS fill:#e8eaf6,stroke:#3f51b5
style INBOX fill:#e8f5e9,stroke:#4caf50
style SPAM fill:#fce4ec,stroke:#c62828
style QR fill:#fff3e0,stroke:#ef6c00
Pipeline
"Three layers of defense:
1. **Rules & Blocklists** (fast, high precision)
- Known spam domains
- Known spam phrases
- IP blocklists
- Catches 60% of spam instantly
2. **ML Classifier** (accurate, handles novel)
- Content analysis
- Sender reputation
- Behavioral features
- Catches remaining spam
3. **Ensemble Decision**
- Combine rule scores + ML scores
- Threshold for spam/ham/quarantine
- Confidence-based routing"
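The three layers above can be sketched as a short routing function. This is a minimal illustration, not a production design: the blocklist entries, the `Message` fields, the threshold values, and the `ml_model` callable are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical blocklist entries, for illustration only.
BLOCKED_DOMAINS = {"spam-domain.example"}
BLOCKED_PHRASES = ("free money", "act now")


@dataclass
class Message:
    sender_domain: str
    text: str


def rule_score(msg: Message) -> float:
    """Layer 1: return 1.0 on a blocklist hit, else 0.0."""
    if msg.sender_domain in BLOCKED_DOMAINS:
        return 1.0
    lowered = msg.text.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return 1.0
    return 0.0


def classify(
    msg: Message,
    ml_model: Callable[[Message], float],
    spam_threshold: float = 0.9,    # assumed value
    review_threshold: float = 0.5,  # assumed value
) -> str:
    """Run rules first, call the ML model only if rules pass, then route."""
    if rule_score(msg) >= 1.0:
        return "SPAM"  # Known-bad: skip the more expensive ML call.
    score = ml_model(msg)  # Layer 2: spam probability in [0, 1].
    # Layer 3: confidence-based routing.
    if score >= spam_threshold:
        return "SPAM"
    if score >= review_threshold:
        return "QUARANTINE"
    return "INBOX"
```

Short-circuiting on a rule hit is one possible design choice; a variant closer to "combine rule scores + ML scores" would feed the rule score into the final decision as one more signal.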
Step 3: Deep Dive (15 min)
Feature Engineering
"Key features for spam detection..."
"1. **Content Features**
- TF-IDF of text
- Spam keyword presence
- Link count and domains
- Attachment types
- BERT embeddings for semantic meaning
2. **Sender Features**
- Account age
- Historical spam rate
- Volume in last 24h
- Recipient diversity
- Authentication (SPF, DKIM)
3. **Network Features**
- IP reputation
- Domain age
- Sender-recipient graph
- Cluster membership
4. **Behavioral Features**
- Sending velocity
- Time of sending
- Reply rate to sender
- Similar messages sent"
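A few of the content and sender features listed above could be computed roughly as follows. The keyword list, the function names, and the exact feature definitions (for example, recipient diversity as distinct recipients divided by messages sent) are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

SPAM_KEYWORDS = ("winner", "free", "urgent")  # hypothetical keyword list
URL_PATTERN = re.compile(r"https?://\S+")


def content_features(text: str) -> dict:
    """Link and keyword features from the raw message text."""
    urls = URL_PATTERN.findall(text)
    domains = {urlparse(u).hostname for u in urls}
    domains.discard(None)  # urlparse yields None when no host is present
    lowered = text.lower()
    return {
        "link_count": len(urls),
        "distinct_link_domains": len(domains),
        "spam_keyword_hits": sum(kw in lowered for kw in SPAM_KEYWORDS),
    }


def sender_features(
    account_age_days: float,
    sent_last_24h: int,
    distinct_recipients_24h: int,
    spf_pass: bool,
    dkim_pass: bool,
) -> dict:
    """Sender-level features; inputs are assumed to come from a profile store."""
    return {
        "account_age_days": account_age_days,
        "sent_last_24h": sent_last_24h,
        # Near 1.0 means almost every message went to a different recipient.
        "recipient_diversity": distinct_recipients_24h / max(sent_last_24h, 1),
        "auth_pass": int(spf_pass and dkim_pass),
    }
```

The resulting dictionaries can be merged into one row of tabular features for a gradient boosting model, while the raw text goes to the text model separately.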
Model Architecture
"Ensemble of models..."
"Why ensemble?
- Different models catch different patterns
- Reduces single point of failure
- Harder to game
Components:
1. **Gradient Boosting (XGBoost)**
- Tabular features
- Fast inference
- Good at capturing patterns in structured features
2. **BERT Classifier**
- Content understanding
- Semantic similarity
- Handles paraphrasing
3. **Graph Neural Network**
- Sender-recipient patterns
- Coordinated spam detection
- Community detection
Combination:
Score = 0.4×XGBoost + 0.4×BERT + 0.2×GNN"
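The weighted combination above amounts to a few lines of code. This sketch assumes each model outputs a spam probability in [0, 1]; the dictionary keys are names invented for the example, and the fixed weights are taken from the formula (in practice they would likely be tuned on validation data).

```python
# Weights from the formula above; model names are illustrative keys.
WEIGHTS = {"xgboost": 0.4, "bert": 0.4, "gnn": 0.2}


def ensemble_score(scores: dict) -> float:
    """Weighted average of per-model spam probabilities."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing model scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, `ensemble_score({"xgboost": 1.0, "bert": 0.5, "gnn": 0.0})` gives 0.4 + 0.2 + 0.0 = 0.6.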
Handling Adversarial Spam
"Spammers constantly adapt..."
"Evasion techniques:
1. L33t sp34k: 'v1agra'
2. Unicode tricks
3. Text in images
4. URL shorteners
5. Compromised accounts
Defenses:
1. **Robust Preprocessing**
- Normalize unicode
- Decode l33t speak
- OCR for image spam
2. **Adversarial Training**
- Train on adversarial examples
- Data augmentation with variations
3. **Continuous Learning**
- Daily model updates
- User feedback integration
4. **Multi-signal Fusion**
- Content alone not enough
- Combine with sender, network"
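The "robust preprocessing" defense can be illustrated with a small normalizer. The substitution table is a hypothetical, deliberately tiny mapping. Unicode NFKC normalization folds compatibility forms such as fullwidth letters, but it does not map cross-script look-alikes (for example Cyrillic "а" to Latin "a"); that would need a separate confusables table.

```python
import unicodedata

# Hypothetical l33t substitution table; a real one would be larger.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def normalize(text: str) -> str:
    """Return a normalized copy of the text for spam matching."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    text = text.casefold()                      # aggressive lowercasing
    return text.translate(LEET_MAP)             # undo common substitutions
```

Because the digit mapping also rewrites legitimate numbers ("100" becomes "ioo"), one reasonable design is to use the normalized string as an additional input alongside the raw text rather than as a replacement for it.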
Step 4: Trade-offs & Operations (15 min)
Precision vs Recall
"Critical trade-off..."
"High Precision (99.9%):
- Very few legit emails blocked
- Some spam gets through
- Users complain about spam
High Recall (99%):
- Catches almost all spam
- More false positives
- Important emails blocked
My approach:
- Tiered thresholds:
- Block (high confidence): 99% precision
- Quarantine (medium): 95% precision
- Flag (low): 90% precision
- Different by importance:
- Transactional: require higher confidence before blocking (false positives are costly)
- Promotional: allow blocking at lower confidence"
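One way to turn precision targets into score thresholds is to pick, on a labeled validation set, the lowest score cutoff whose precision still meets the target. The sketch below assumes labels are 1 for spam and 0 for ham; the function names and the tier names in `route` are illustrative.

```python
def threshold_for_precision(scores, labels, target):
    """Lowest cutoff t such that precision of {score >= t} meets target.

    Returns None if no cutoff reaches the target precision.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    best = None
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Consume all examples tied at this score before measuring precision.
        while i < len(pairs) and pairs[i][0] == cutoff:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        if tp / (tp + fp) >= target:
            best = cutoff  # Keep lowering the cutoff while the target holds.
    return best


def route(score, block_t, quarantine_t, flag_t):
    """Map a spam score to an action using tiered thresholds."""
    if score >= block_t:
        return "block"
    if score >= quarantine_t:
        return "quarantine"
    if score >= flag_t:
        return "flag"
    return "deliver"
```

Choosing the lowest qualifying cutoff maximizes recall within each tier. Per-category behavior can then be expressed by computing separate thresholds on transactional and promotional validation data.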
Feedback Loop
"User feedback is critical..."
"Signals:
1. 'Mark as spam' button
2. 'Not spam' from spam folder
3. Ignore/delete without opening
Implementation:
1. Real-time feedback collection
2. Weight recent feedback higher
3. Per-user models (optional)
4. Aggregate patterns for global model
5. Detect feedback manipulation (spammers)"
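"Weight recent feedback higher" could be implemented with exponential time decay. This is a sketch under assumptions: the 7-day half-life and the event format `(age_days, is_spam_report)` are made up for the example.

```python
def feedback_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Weight halves every half_life_days; fresh feedback has weight 1.0."""
    return 0.5 ** (age_days / half_life_days)


def weighted_spam_rate(events, half_life_days: float = 7.0) -> float:
    """Decay-weighted share of spam reports.

    events: iterable of (age_days, is_spam_report) with is_spam_report in {0, 1},
    where 1 is a 'Mark as spam' click and 0 is a 'Not spam' click.
    """
    num = den = 0.0
    for age_days, is_spam in events:
        w = feedback_weight(age_days, half_life_days)
        num += w * is_spam
        den += w
    return num / den if den else 0.0
```

With one fresh spam report and one week-old "not spam" report, the weighted rate is 1.0 / 1.5, about 0.67, instead of the unweighted 0.5.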
Cold Start
"New senders have no history..."
"Solutions:
1. **Content-only classification**
- Rely on message content
- Stricter threshold for new senders
2. **Domain/IP reputation**
- Use domain age
- IP cluster reputation
3. **Authentication signals**
- SPF, DKIM, DMARC
- Verified senders pass
4. **Gradual trust building**
- Start with quarantine
- Build reputation over time"
Step 5: Extensions (10 min)
Common Questions
Q: How to detect coordinated spam attacks?
"Graph-based detection:
1. Build sender-recipient graph
2. Detect unusual clusters
3. Similar content from many senders
4. Time correlation analysis
Example:
- 100 new accounts
- All sending similar messages
- In short time window
→ Coordinated attack"
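The example above (many new accounts, similar messages, short time window) can be sketched with exact Jaccard similarity on word shingles. The `Event` fields and all default thresholds are illustrative assumptions, and the quadratic scan is only suitable for small batches; at scale a sketching scheme such as MinHash would replace the pairwise comparison.

```python
from dataclasses import dataclass


@dataclass
class Event:
    sender: str
    timestamp: float  # seconds since epoch
    text: str
    account_age_days: float


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_coordinated(events, window_s=3600, min_senders=10,
                     min_sim=0.8, max_age_days=7):
    """Return a set of new senders that sent near-identical text within one window.

    Returns an empty set if no group reaches min_senders.
    """
    new = sorted((e for e in events if e.account_age_days <= max_age_days),
                 key=lambda e: e.timestamp)
    for seed in new:
        seed_sh = shingles(seed.text)
        senders = {
            e.sender for e in new
            if 0 <= e.timestamp - seed.timestamp <= window_s
            and jaccard(seed_sh, shingles(e.text)) >= min_sim
        }
        if len(senders) >= min_senders:
            return senders
    return set()
```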
Q: How to handle phishing specifically?
"Phishing has specific signals:
1. URL analysis
- Domain similarity to real brands
- Typosquatting detection
- URL shortener expansion
2. Visual similarity
- Compare to known brand emails
- Logo detection
3. Urgency signals
- 'Your account will be closed'
- 'Verify immediately'"
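Typosquatting detection from the URL analysis step can be approximated with edit distance against a list of protected brand domains. The brand list and the `max_distance` cutoff are hypothetical; a real system would also need homoglyph handling and a much larger brand set.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


KNOWN_BRANDS = ("paypal.com", "google.com")  # hypothetical protected list


def looks_like_typosquat(domain: str, max_distance: int = 2):
    """Return the brand a domain imitates, or None.

    An exact match (distance 0) is the real brand, so it is not flagged.
    """
    domain = domain.lower()
    for brand in KNOWN_BRANDS:
        if 0 < edit_distance(domain, brand) <= max_distance:
            return brand
    return None
```

For instance, `looks_like_typosquat("paypa1.com")` returns `"paypal.com"` because a single substitution separates the two strings.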
Interview Checklist
Must Cover:
- Multi-layer defense (rules + ML)
- Content + sender + network features
- Precision vs recall trade-off
- Adversarial handling
- Feedback loop
Good to Cover:
- Cold start handling
- Ensemble approach
- Continuous learning
- Phishing detection
- Coordinated attacks
Red Flags:
- Content-only analysis
- Ignoring adversarial nature
- No feedback mechanism
- Static model (no updates)
- Ignoring precision/recall trade-off
Sample Script
Interviewer: "Design spam detection for Gmail"
You: "Let me clarify - email spam including phishing,
or just promotional spam?"
Interviewer: "All types including phishing"
You: "And the scale?"
Interviewer: "300 billion emails per day"
You: "Here's my approach:
[Draw architecture]
Three-layer defense:
1. Fast rules/blocklists for known spam
2. ML classifier for sophisticated spam
3. Ensemble decision with confidence
For features:
- Content (TF-IDF, BERT embeddings)
- Sender reputation (history, authentication)
- Network signals (IP, domain, graph)
For the model, I'd use an ensemble:
- XGBoost for tabular features
- BERT for content understanding
- GNN for coordinated attacks
For adversarial handling:
- Robust preprocessing
- Adversarial training
- Continuous model updates
Critical: Very high precision (99.9%)
because blocking legit email is worse
than letting some spam through.
Shall I dive into any component?"
Common Misconceptions
Misconception: discussing content features alone is enough
In interviews, candidates often focus only on text features (TF-IDF, BERT). A strong answer covers four groups: content (text, links), sender (reputation, account age), network (IP, domain, graph), and behavioral (velocity, time patterns). Spammers can change the text, but they cannot fake reputation across all of these signals at once.
Misconception: adversarial handling is just regex for l33t speak
Regex normalization catches 5-10% of obfuscations. A full defense includes unicode normalization (homoglyphs), OCR for image spam, URL shortener expansion plus a domain similarity check (typosquatting), and, most importantly, adversarial training on synthetic examples together with daily model updates from user feedback. Without continuous learning, the model goes stale within 2-4 weeks.
Questions with Graded Answers
Where would you start when designing a spam detection system in an interview?
Weak: "I'd start by choosing the model, BERT or XGBoost." This skips the clarifying questions and jumps straight to implementation.
Strong: "The first 5 minutes go to clarifying questions: (1) Type of spam: email, SMS, reviews? This determines the available signals. (2) Volume: 1K or 1B msg/day? This determines the architecture. (3) Current spam rate? This determines the class imbalance strategy. (4) Latency budget? Email at 50ms vs chat at 10ms. (5) Precision vs recall priority? This determines the threshold strategy. Only then do I draw the architecture."
How would you handle coordinated spam attacks (100 new accounts, identical messages)?
Weak: "Content-based detection will catch the identical messages." This does not scale, because spammers paraphrase.
Strong: "Graph-based detection: (1) Build a sender-recipient graph. (2) Detect unusual clusters, such as many new accounts connected to the same recipients. (3) Run near-duplicate detection on the text with MinHash/SimHash. (4) Apply time correlation to find bursts of similar messages in a short window. Combined rule: if 10+ new accounts send messages with similarity > 0.8 within a 1-hour window, treat it as a coordinated attack with confidence > 0.95."