Spam Detection: Interview Walkthrough

Prerequisites: Problem Definition | Components

Spam detection is one of the most common ML system design (MLSD) cases because it tests five competencies at once: multi-layer architecture (rules + ML), adversarial robustness, the precision/recall trade-off, feedback loops, and scaling to 1M+ msg/sec. Below is a step-by-step framework for a 45-60 minute interview, with concrete phrases that distinguish a senior answer from a junior one.

Interview Framework (45-60 min)

0-5 min:   Clarifying questions
5-15 min:  High-level design
15-30 min: Deep dive (features, model, adversarial handling)
30-45 min: Trade-offs & operations
45-60 min: Extensions

Step 1: Clarifying Questions (5 min)

**Scope:**
- What type? (email, SMS, reviews, comments)
- What actions? (block, quarantine, flag)

**Scale:**
- Message volume?
- Current spam rate?

**Requirements:**
- Latency budget?
- Precision vs Recall priority?
- Feedback mechanism?

Step 2: High-Level Design (10 min)

Architecture

graph TD
    MSG["Message Input"] --> BL["Blocklist & Rules"]
    BL --> ML["ML Classifier"]
    ML --> ENS["Ensemble Decision"]
    ENS --> INBOX["INBOX<br/>(deliver)"]
    ENS --> SPAM["SPAM<br/>(block)"]
    ENS --> QR["QUARANTINE<br/>(review)"]

    style MSG fill:#e8eaf6,stroke:#3f51b5
    style BL fill:#fff3e0,stroke:#ef6c00
    style ML fill:#f3e5f5,stroke:#9c27b0
    style ENS fill:#e8eaf6,stroke:#3f51b5
    style INBOX fill:#e8f5e9,stroke:#4caf50
    style SPAM fill:#fce4ec,stroke:#c62828
    style QR fill:#fff3e0,stroke:#ef6c00

Pipeline

"Three layers of defense:

1. **Rules & Blocklists** (fast, high precision)
   - Known spam domains
   - Known spam phrases
   - IP blocklists
   - Catches 60% of spam instantly

2. **ML Classifier** (accurate, handles novel spam)
   - Content analysis
   - Sender reputation
   - Behavioral features
   - Catches remaining spam

3. **Ensemble Decision**
   - Combine rule scores + ML scores
   - Threshold for spam/ham/quarantine
   - Confidence-based routing"
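
The three layers above can be sketched as a small routing function. This is a minimal illustration, not a production design: the blocklist contents, the `Message` type, the placeholder `ml_score`, and the `block_at` / `quarantine_at` thresholds are all assumptions introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical blocklists; a real system would load these from a managed store.
BLOCKED_DOMAINS = {"spam-example.test"}
BLOCKED_PHRASES = ("free money", "act now")


@dataclass
class Message:
    sender_domain: str
    text: str


def rule_score(msg: Message) -> float:
    """Layer 1: cheap, high-precision checks. Returns 1.0 on a blocklist hit."""
    if msg.sender_domain in BLOCKED_DOMAINS:
        return 1.0
    lowered = msg.text.lower()
    return 1.0 if any(p in lowered for p in BLOCKED_PHRASES) else 0.0


def ml_score(msg: Message) -> float:
    """Layer 2 placeholder: stands in for a trained classifier's spam probability."""
    return min(1.0, msg.text.count("http") * 0.3)


def decide(msg: Message, block_at: float = 0.9, quarantine_at: float = 0.5) -> str:
    """Layer 3: combine both layers and route by confidence."""
    if rule_score(msg) >= 1.0:
        return "SPAM"  # short-circuit: skip the expensive model on a rule hit
    score = ml_score(msg)
    if score >= block_at:
        return "SPAM"
    if score >= quarantine_at:
        return "QUARANTINE"
    return "INBOX"
```

The short-circuit on a rule hit is the point worth stating aloud in an interview: the cheap layer runs first so the model only sees traffic the rules could not settle.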

Step 3: Deep Dive (15 min)

Feature Engineering

"Key features for spam detection..."

"1. **Content Features**
   - TF-IDF of text
   - Spam keyword presence
   - Link count and domains
   - Attachment types
   - BERT embeddings for semantic meaning

2. **Sender Features**
   - Account age
   - Historical spam rate
   - Volume in last 24h
   - Recipient diversity
   - Authentication (SPF, DKIM)

3. **Network Features**
   - IP reputation
   - Domain age
   - Sender-recipient graph
   - Cluster membership

4. **Behavioral Features**
   - Sending velocity
   - Time of sending
   - Reply rate to sender
   - Similar messages sent"
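
A few of the content and sender features listed above could be computed as follows. This is a rough sketch using only the standard library; the keyword list, the function names, and the feature names are hypothetical, and TF-IDF or embedding features would come from a separate ML library.

```python
import re
from urllib.parse import urlparse

SPAM_KEYWORDS = {"winner", "free", "urgent"}  # hypothetical keyword list
URL_RE = re.compile(r"https?://\S+")


def content_features(text: str) -> dict:
    """Link and keyword features derived from the message body alone."""
    urls = URL_RE.findall(text)
    domains = {urlparse(u).netloc.lower() for u in urls}
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "link_count": len(urls),
        "distinct_link_domains": len(domains),
        "spam_keyword_hits": len(words & SPAM_KEYWORDS),
    }


def sender_features(account_age_days: int, sent_24h: int,
                    spam_reports: int, total_sent: int) -> dict:
    """Sender history features; counts are assumed to come from a feature store."""
    return {
        "account_age_days": account_age_days,
        "volume_24h": sent_24h,
        # Guard against division by zero for senders with no history.
        "historical_spam_rate": spam_reports / total_sent if total_sent else 0.0,
    }
```

Both functions return flat dictionaries so the output can be fed directly into a tabular model.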

Model Architecture

"Ensemble of models..."

"Why ensemble?
- Different models catch different patterns
- Reduces single point of failure
- Harder to game

Components:

1. **Gradient Boosting (XGBoost)**
   - Tabular features
   - Fast inference
   - Good at learning patterns across structured signals

2. **BERT Classifier**
   - Content understanding
   - Semantic similarity
   - Handles paraphrasing

3. **Graph Neural Network**
   - Sender-recipient patterns
   - Coordinated spam detection
   - Community detection

Combination:
Score = 0.4×XGBoost + 0.4×BERT + 0.2×GNN"
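
The weighted combination can be written out directly. In this sketch the weights are the ones from the formula above, while the function name and the assumption that each model emits a spam probability in [0, 1] are illustrative.

```python
from typing import Mapping

# Weights from the formula above; in practice they would be tuned on validation data.
WEIGHTS = {"xgboost": 0.4, "bert": 0.4, "gnn": 0.2}


def ensemble_score(scores: Mapping[str, float]) -> float:
    """Weighted average of per-model spam probabilities."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing model scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the weights sum to 1, the result stays in [0, 1] and can be compared against the same thresholds as a single model's probability.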

Handling Adversarial Spam

"Spammers constantly adapt..."

"Evasion techniques:
1. L33t sp34k: 'v1agra'
2. Unicode tricks
3. Text in images
4. URL shorteners
5. Compromised accounts

Defenses:

1. **Robust Preprocessing**
   - Normalize unicode
   - Decode l33t speak
   - OCR for image spam

2. **Adversarial Training**
   - Train on adversarial examples
   - Data augmentation with variations

3. **Continuous Learning**
   - Daily model updates
   - User feedback integration

4. **Multi-signal Fusion**
   - Content alone is not enough
   - Combine with sender, network"
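
The "robust preprocessing" defense can be illustrated with a small normalizer. This is a sketch under stated assumptions: the leet mapping is a hypothetical, deliberately tiny table, and the function name is invented for the example. Note that NFKC folds compatibility forms such as fullwidth letters but does not map cross-script homoglyphs (for example Cyrillic "а" to Latin "a"); that would need a separate confusables table.

```python
import unicodedata

# Hypothetical, minimal leet table; real tables are larger and context-aware.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(text: str) -> str:
    """Fold common obfuscations before feature extraction."""
    # NFKC maps compatibility characters (fullwidth letters, ligatures) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop zero-width and other invisible format characters (Unicode category Cf).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.lower().translate(LEET_MAP)
```

A blanket digit mapping also rewrites legitimate numbers, so in practice the normalized text would be used as an extra feature view alongside the raw text rather than as a replacement for it.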

Step 4: Trade-offs & Operations (15 min)

Precision vs Recall

"Critical trade-off..."

"High Precision (99.9%):
- Very few legit emails blocked
- Some spam gets through
- Users complain about spam

High Recall (99%):
- Catches almost all spam
- More false positives
- Important emails blocked

My approach:
- Tiered thresholds:
  - Block (high confidence): 99% precision
  - Quarantine (medium): 95% precision
  - Flag (low): 90% precision

- Different by importance:
  - Transactional: require higher confidence before blocking (a false positive is costly)
  - Promotional: block at lower confidence (a false positive is cheap)"
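
One way to turn the tiered precision targets into score thresholds is to scan a labeled validation set from the highest score downward. The sketch below is illustrative: the function name is hypothetical, and it assumes `labels` uses 1 for spam and 0 for legitimate mail and that a message is flagged when its score is at or above the threshold.

```python
def threshold_for_precision(scores, labels, target):
    """Lowest threshold whose precision on the validation data meets `target`.

    Returns None if no threshold reaches the target.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume every example tied at this score before measuring precision.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= target:
            best = score
    return best


def tier_thresholds(scores, labels):
    """Thresholds for the three tiers named above."""
    targets = {"block": 0.99, "quarantine": 0.95, "flag": 0.90}
    return {tier: threshold_for_precision(scores, labels, p)
            for tier, p in targets.items()}
```

Choosing the lowest qualifying threshold maximizes recall at each tier while still meeting that tier's precision target on the validation data.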

Feedback Loop

"User feedback is critical..."

"Signals:
1. 'Mark as spam' button
2. 'Not spam' from spam folder
3. Ignore/delete without opening

Implementation:
1. Real-time feedback collection
2. Weight recent feedback higher
3. Per-user models (optional)
4. Aggregate patterns for global model
5. Detect feedback manipulation (spammers)"
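
"Weight recent feedback higher" can be made concrete with exponential decay. This is a minimal sketch; the 7-day half-life and both function names are assumptions chosen for illustration.

```python
def feedback_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Weight halves every `half_life_days`; fresh feedback has weight 1.0."""
    return 0.5 ** (age_days / half_life_days)


def weighted_spam_rate(events) -> float:
    """Recency-weighted share of spam reports.

    events: iterable of (is_spam_report, age_days) pairs, where a
    'Not spam' correction is recorded with is_spam_report=False.
    """
    num = den = 0.0
    for is_spam, age_days in events:
        w = feedback_weight(age_days)
        num += w * is_spam
        den += w
    return num / den if den else 0.0
```

The same weights can be passed as per-example sample weights when retraining, so that last week's reports influence the model more than last quarter's.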

Cold Start

"New senders have no history..."

"Solutions:

1. **Content-only classification**
   - Rely on message content
   - Higher threshold for new senders

2. **Domain/IP reputation**
   - Use domain age
   - IP cluster reputation

3. **Authentication signals**
   - SPF, DKIM, DMARC
   - Verified senders pass

4. **Gradual trust building**
   - Start with quarantine
   - Build reputation over time"
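
Gradual trust building can be modeled as a smoothed spam rate that starts at a cautious prior and moves toward the sender's observed behavior as evidence accumulates. The prior rate of 0.5 and the prior strength of 20 messages below are hypothetical values, as is the function name.

```python
def smoothed_spam_rate(spam_count: int, total_count: int,
                       prior_rate: float = 0.5, prior_strength: int = 20) -> float:
    """Additive smoothing: behaves as if `prior_strength` messages at
    `prior_rate` had already been observed for every new sender."""
    return (spam_count + prior_rate * prior_strength) / (total_count + prior_strength)
```

A sender with no history scores exactly the prior, while a sender with 180 clean messages drops to (0 + 10) / (180 + 20) = 0.05, which is the "build reputation over time" behavior described above.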

Step 5: Extensions (10 min)

Common Questions

Q: How to detect coordinated spam attacks?

"Graph-based detection:

1. Build sender-recipient graph
2. Detect unusual clusters
3. Similar content from many senders
4. Time correlation analysis

Example:
- 100 new accounts
- All sending similar messages
- In short time window
→ Coordinated attack"

Q: How to handle phishing specifically?

"Phishing has specific signals:

1. URL analysis
   - Domain similarity to real brands
   - Typosquatting detection
   - URL shortener expansion

2. Visual similarity
   - Compare to known brand emails
   - Logo detection

3. Urgency signals
   - 'Your account will be closed'
   - 'Verify immediately'"
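
Typosquatting detection is often approximated by measuring edit distance between a link's domain and a list of protected brand domains. The sketch below assumes a tiny hypothetical brand list and a maximum distance of 2; a real system would also handle subdomains and homoglyphs.

```python
KNOWN_BRANDS = ("paypal.com", "google.com")  # hypothetical protected domains


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def looks_like_typosquat(domain: str, max_distance: int = 2):
    """Return the imitated brand domain, or None if nothing is close."""
    candidate = domain.lower()
    for brand in KNOWN_BRANDS:
        distance = edit_distance(candidate, brand)
        # Distance 0 is the real brand itself, which is not a typosquat.
        if 0 < distance <= max_distance:
            return brand
    return None
```

For example, `looks_like_typosquat("paypa1.com")` returns `"paypal.com"`, while the genuine `paypal.com` returns `None`.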

Interview Checklist

Must Cover:

Good to Cover:

Red Flags:

Sample Script

Interviewer: "Design spam detection for Gmail"

You: "Let me clarify - email spam including phishing,
or just promotional spam?"

Interviewer: "All types including phishing"

You: "And the scale?"

Interviewer: "300 billion emails per day"

You: "Here's my approach:
[Draw architecture]

Three-layer defense:
1. Fast rules/blocklists for known spam
2. ML classifier for sophisticated spam
3. Ensemble decision with confidence

For features:
- Content (TF-IDF, BERT embeddings)
- Sender reputation (history, authentication)
- Network signals (IP, domain, graph)

For the model, I'd use an ensemble:
- XGBoost for tabular features
- BERT for content understanding
- GNN for coordinated attacks

For adversarial handling:
- Robust preprocessing
- Adversarial training
- Continuous model updates

Critical: Very high precision (99.9%)
because blocking legit email is worse
than letting some spam through.

Shall I dive into any component?"

Common Misconceptions

Misconception: discussing content features alone is enough

In interviews, candidates often focus only on text features (TF-IDF, BERT). A strong answer covers four groups: content (text, links), sender (reputation, account age), network (IP, domain, graph), and behavioral (velocity, time patterns). Spammers can change the text, but they cannot fake reputation across all of these signals at once.

Misconception: adversarial handling is just regex for l33t speak

Regex normalization catches only 5-10% of obfuscations. A complete defense includes Unicode normalization (homoglyphs), OCR for image spam, URL shortener expansion plus a domain similarity check (typosquatting), and, most importantly, adversarial training on synthetic examples combined with daily model updates from user feedback. Without continuous learning, the model goes stale within 2-4 weeks.

Questions with Graded Answers

Where do you start when designing a spam detection system in an interview?

Weak answer: "I'd start by choosing the model, BERT or XGBoost" -- skips clarifying questions and jumps straight to implementation

Strong answer: "The first 5 minutes go to clarifying questions: (1) Type of spam: email, SMS, reviews? This determines which signals are available. (2) Volume: 1K or 1B msg/day? This determines the architecture. (3) Current spam rate? This determines the class imbalance strategy. (4) Latency budget? Email 50ms vs chat 10ms. (5) Precision vs Recall priority? This determines the threshold strategy. Only then do I draw the architecture."

How do you handle coordinated spam attacks (100 new accounts, identical messages)?

Weak answer: "Content-based detection will catch the identical messages" -- does not scale, and spammers paraphrase

Strong answer: "Graph-based detection: (1) Build the sender-recipient graph, (2) Detect unusual clusters, such as many new accounts linked to the same recipients, (3) Near-duplicate detection with MinHash/SimHash on the text, (4) Time correlation: a burst of similar messages within a short window. Combination: if 10+ new accounts send messages with similarity > 0.8 within a 1-hour window, treat it as a coordinated attack with confidence > 0.95."
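
The combination rule in that answer (10+ new accounts, similarity > 0.8, 1-hour window) can be sketched as follows. For readability this example computes exact Jaccard similarity over word shingles, which is the quantity MinHash approximates at scale; the function names and the quadratic scan are illustrative assumptions, not a production design.

```python
def shingles(text: str, k: int = 3) -> frozenset:
    """Set of k-word shingles; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return frozenset([" ".join(words)])
    return frozenset(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_coordinated(messages, min_accounts=10, min_similarity=0.8, window_s=3600):
    """messages: (sender_id, timestamp_seconds, text) tuples from NEW accounts only.

    Flags an attack when enough distinct senders post near-duplicates of one
    message within `window_s` seconds of it. O(n^2) in the number of messages.
    """
    msgs = sorted(messages, key=lambda m: m[1])
    sets = [shingles(m[2]) for m in msgs]
    for i, anchor in enumerate(msgs):
        senders = set()
        for j in range(i, len(msgs)):
            if msgs[j][1] - anchor[1] > window_s:
                break  # sorted by time, so later messages are outside the window
            if jaccard(sets[i], sets[j]) > min_similarity:
                senders.add(msgs[j][0])
        if len(senders) >= min_accounts:
            return True
    return False
```

At production volume the pairwise comparison would be replaced by MinHash signatures with locality-sensitive hashing, so that only candidate near-duplicates are compared.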