DL Interview: GNN & Reinforcement Learning¶
~8 min read
Graph Neural Networks (GCN, GAT, GraphSAGE, GIN, Over-smoothing, WL test, Heterogeneous Graphs, Scalability), Reinforcement Learning (Q-Learning, DQN, Policy Gradient, PPO, SAC, Actor-Critic, Exploration).
Graph Neural Networks (GNN)¶
Sources: 350 GNN Interview Questions (HackMD), PyTorch Geometric docs, Lilian Weng's blog
Q: What is Message Passing in GNNs?¶
A:
Core idea: nodes aggregate information from their neighbors and update their own representations.
Key components:
1. Aggregate: collect neighbor features (sum, mean, max, attention-weighted)
2. Update: apply a transformation (MLP, GRU)
3. Readout: pooling for graph-level tasks
# Message passing in PyG: a custom layer via the MessagePassing base class
from torch_scatter import scatter_mean
from torch_geometric.nn import MessagePassing

class CustomGNN(MessagePassing):
    def __init__(self):
        super().__init__()  # aggregate() below overrides the default aggregation

    def forward(self, x, edge_index):
        return self.propagate(edge_index, x=x)

    def message(self, x_j):  # x_j = neighbor features
        return x_j

    def aggregate(self, inputs, index, ptr=None, dim_size=None):
        return scatter_mean(inputs, index, dim=0, dim_size=dim_size)
Q: GCN vs GAT vs GraphSAGE -- what's the difference?¶
A:
| Aspect | GCN | GAT | GraphSAGE |
|---|---|---|---|
| Aggregation | Fixed weighted sum | Learned attention | Sample + aggregate |
| Weight | \(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\) | Attention \(\alpha_{ij}\) | Mean/LSTM/Pool |
| Learning | Transductive | Transductive | Inductive |
| New nodes | Retrain | Retrain | Direct inference |
| Complexity | \(O(\|E\|d)\) | \(O(\|E\|dh)\) | \(O(b \cdot k^L d)\) |
GCN formula: $$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
GAT attention: $$\alpha_{ij} = \text{softmax}\left(\text{LeakyReLU}(a^T [Wh_i \| Wh_j])\right)$$
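For orientation, a minimal sketch of the corresponding built-in PyG layers; the layer sizes and the `data` object are illustrative assumptions, not from the source:

# Built-in PyG layers; in/out sizes (16 -> 32) are illustrative
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, SAGEConv

gcn = GCNConv(in_channels=16, out_channels=32)            # fixed normalized-adjacency weighting
gat = GATConv(in_channels=16, out_channels=32, heads=4)   # learned attention per edge
sage = SAGEConv(in_channels=16, out_channels=32)          # sample-and-aggregate (mean by default)

h = F.relu(gat(data.x, data.edge_index))  # shape: [num_nodes, 32 * heads]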
GraphSAGE sampling:
# Mini-batch training with neighbor sampling (PyG)
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(data, num_neighbors=[25, 10], batch_size=1024)
for batch in loader:
    out = model(batch.x, batch.edge_index)
Q: Transductive vs Inductive learning in GNNs?¶
A:
| Transductive | Inductive |
|---|---|
| Test nodes seen during training | Test nodes unseen during training |
| GCN, GAT | GraphSAGE |
| Full graph in memory | Mini-batch training |
| Can't generalize to new graphs | Generalizes to new graphs/nodes |
When Inductive:
- New users in a social network
- New molecules in drug discovery
- Production deployment
When Transductive:
- Fixed knowledge graph
- Semi-supervised node classification
- Research benchmarks (Cora, PubMed)
Q: What is over-smoothing in GNNs?¶
A:
Problem: as the GNN gets deeper, the representations of all nodes become nearly identical.
Why it happens:
- Message passing = repeated Laplacian smoothing
- After many layers, node features converge to a stationary distribution
- Node representations become indistinguishable
Detection:
- Node embedding similarity increases with depth
- Performance degrades after 3-5 layers
Solutions:
1. Jumping Knowledge (JK): combine outputs from all layers
2. Residual connections: \(h^{(l+1)} = h^{(l)} + \text{GNN}(h^{(l)})\)
3. PairNorm: normalize across nodes, not features
4. Fewer layers: usually 2-3 layers are enough
5. APPNP: Personalized PageRank propagation
# JK-Net pattern: concatenate (or max/attention-pool) the outputs of all layers
import torch
import torch.nn as nn

class JKNet(nn.Module):
    def __init__(self, convs):
        super().__init__()
        self.convs = nn.ModuleList(convs)

    def forward(self, x, edge_index):
        layer_outputs = []
        for conv in self.convs:
            x = conv(x, edge_index)
            layer_outputs.append(x)
        return torch.cat(layer_outputs, dim=-1)  # or max / attention over layers
Q: GIN (Graph Isomorphism Network) -- why is it more powerful than GCN?¶
A:
WL test connection: GIN distinguishes non-isomorphic graphs as well as the Weisfeiler-Lehman (1-WL) test does.
Key difference: Sum aggregation (not mean/max) preserves more information.
GIN update: $$h_v^{(k)} = \text{MLP}^{(k)}\left((1 + \epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\right)$$
Why sum > mean:
- Mean: {1,1,1} and {1} -> same mean (1.0), but different multisets!
- Sum: {1,1,1} -> 3, {1} -> 1 -> different, distinguishable
- Max: {1,1,1} -> 1, {1} -> 1 -> same, loses cardinality
- Sum preserves multiset cardinality information, critical for WL-test expressiveness
When GIN: Graph classification, fingerprinting, isomorphism tasks.
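A minimal GIN sketch with PyG's GINConv; the MLP sizes and the `data` object are illustrative assumptions:

# GINConv wraps an MLP and uses sum aggregation; sizes (16 -> 32) are illustrative
import torch.nn as nn
from torch_geometric.nn import GINConv, global_add_pool

mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
conv = GINConv(mlp, train_eps=True)            # learnable epsilon

h = conv(data.x, data.edge_index)              # node embeddings
graph_emb = global_add_pool(h, data.batch)     # sum readout for graph classification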
Q: What is the WL (Weisfeiler-Lehman) test and how does it relate to GNNs?¶
A:
WL test: a heuristic test for graph isomorphism based on iterative recoloring (color refinement) of node labels.
Connection to GIN:
- GIN (Graph Isomorphism Network) matches the discriminative power of the WL test -- it can distinguish everything the WL test distinguishes
- GCN/GAT are not WL-equivalent (information is lost with mean/max aggregation)
GIN update rule: $$h_v^{(l+1)} = \text{MLP}\left((1 + \epsilon) \cdot h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l)}\right)$$
Key insight: sum aggregation preserves the full multiset of neighbor features (mean/max do not).
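A toy illustration of this insight on two hypothetical neighbor multisets:

# Two different neighbor multisets: {1, 1, 1} vs {1}
import torch

a = torch.tensor([1., 1., 1.])
b = torch.tensor([1.])

print(a.mean(), b.mean())  # 1.0 vs 1.0 -> mean cannot tell them apart
print(a.max(),  b.max())   # 1.0 vs 1.0 -> max cannot either
print(a.sum(),  b.sum())   # 3.0 vs 1.0 -> sum distinguishes the multisets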
Q: Heterogeneous graphs -- how to handle them?¶
A:
Heterogeneous graph: multiple node and edge types (e.g., user-item-author-paper).
Approaches:
1. R-GCN (Relational GCN): separate weights for each edge type
2. HAN (Heterogeneous Attention Network): attention over metapaths
3. CompGCN: composition functions for relations
R-GCN aggregation: $$h_v^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{|\mathcal{N}_r(v)|} W_r^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)}\right)$$
where \(r\) is the relation type and \(W_r\) are relation-specific weights.
# R-GCN in PyG
from torch_geometric.nn import FastRGCNConv

conv = FastRGCNConv(in_channels, out_channels, num_relations=5)
out = conv(x, edge_index, edge_type)  # edge_type: relation index (0..4) for each edge
Metapath example: User -> Paper -> Author -> Paper -> User (U-P-A-P-U similarity)
Q: How are GNNs used in recommendation?¶
A:
User-Item bipartite graph:
- Nodes: users and items
- Edges: interactions (click, purchase, rating)
Approaches:
1. GraphSAGE-Rec: embed users/items via a GNN, recommend by embedding similarity
2. PinSage (Pinterest): random-walk sampling + GraphSAGE for pins
3. LightGCN: simplified GCN without nonlinearities or feature transformations
Key insight: GNN captures collaborative filtering through graph structure.
# LightGCN propagation: K hops of linear propagation, then average the layer outputs
import torch

def LightGCN_propagate(user_emb, item_emb, adj, num_layers=3):
    # adj: normalized (users + items) x (users + items) adjacency matrix (sparse)
    all_emb = torch.cat([user_emb, item_emb])
    layer_embs = [all_emb]
    for _ in range(num_layers):
        all_emb = torch.sparse.mm(adj, all_emb)  # one hop, no nonlinearity, no weights
        layer_embs.append(all_emb)
    return torch.stack(layer_embs).mean(dim=0)   # final embedding = mean over layers
Q: GNN scalability on large graphs -- what are the techniques?¶
A:
Problems: full-graph training holds all node features and adjacency in memory (up to \(O(|V|^2)\) for a dense adjacency); loading the full graph is infeasible for billions of nodes.
Solutions:
- Neighbor sampling (GraphSAGE): sample k neighbors instead of all
- Cluster-GCN: partition the graph into clusters, train on subgraphs
- GraphSAGE mini-batch: sample a subgraph around the target nodes
- PinSage (Pinterest): random-walk sampling at production scale
- Distributed training: split the graph across machines (DGL, PyG)
Production tips:
- Pre-compute embeddings for static graphs
- Use approximate nearest neighbor search for inference
- Cache frequently accessed node features
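As an illustration, a Cluster-GCN-style training loop sketch in PyG; the partition count, batch size, and the `data`/`model` names are assumptions:

# Partition the graph with METIS, then train on sampled groups of clusters
from torch_geometric.loader import ClusterData, ClusterLoader

cluster_data = ClusterData(data, num_parts=1500)                     # graph -> 1500 partitions
loader = ClusterLoader(cluster_data, batch_size=20, shuffle=True)    # 20 clusters per step

for sub in loader:                        # each `sub` is a small subgraph
    out = model(sub.x, sub.edge_index)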
Q: Common GNN datasets and benchmarks?¶
A:
Node classification:
- Cora, Citeseer, Pubmed: citation networks (papers = nodes, citations = edges)
- PPI: protein-protein interaction (inductive setting)
- OGBN-Arxiv: ~170K papers with ~1.2M citation edges, large-scale benchmark
Graph classification:
- QM9: molecular properties (~130K molecules)
- PROTEINS: protein function prediction
- OGBG-MolHIV: molecule HIV inhibition
Link prediction:
- ogbl-ddi: drug-drug interactions
- ogbl-collab: author collaboration
Knowledge graphs:
- FB15k-237: Freebase subset
- WN18RR: WordNet relations
Libraries: PyTorch Geometric (PyG), DGL, OGB (Open Graph Benchmark)
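A quick sketch of loading an OGB node-classification benchmark in PyG format (assumes the `ogb` package is installed):

from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name='ogbn-arxiv')  # downloads on first use
data = dataset[0]                                    # a single large graph
split_idx = dataset.get_idx_split()                  # train/valid/test node indices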
Reinforcement Learning¶
Sources: Analytics Vidhya Actor-Critic, Index.dev RL Questions, Stable-Baselines3 docs
Q: Value-based vs Policy-based vs Actor-Critic -- what's the difference?¶
A:
| Approach | Learns | Pros | Cons |
|---|---|---|---|
| Value-based (Q-learning, DQN) | \(Q(s,a)\) = value of action | Sample efficient | Continuous action hard |
| Policy-based (REINFORCE) | \(\pi_\theta(a\|s)\) directly | Works with continuous | High variance |
| Actor-Critic | Both \(Q(s,a)\) + \(\pi_\theta(a\|s)\) | Best of both | More complex |
When to use:
- Value-based: discrete actions, need sample efficiency (Atari games)
- Policy-based: continuous or high-dimensional action spaces (robotics)
- Actor-Critic: general purpose, best of both worlds
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Softmax(dim=-1)
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.actor(x), self.critic(x)
Q: Q-Learning -- how does it work?¶
A:
Core idea: Learn Q(s,a) = expected cumulative reward from state s, action a.
Bellman equation: $$Q^*(s,a) = r + \gamma \max_{a'} Q^*(s', a')$$
Update rule: $$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$
Where:
- \(\alpha\) = learning rate
- \(\gamma\) = discount factor (0.99 typical)
- \(r\) = immediate reward
- \(\max_{a'} Q(s',a')\) = best achievable future value
Tabular Q-learning:
import random
import numpy as np

Q = np.zeros((n_states, n_actions))
for episode in range(n_episodes):
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, done, _ = env.step(action)  # classic gym step API
        # Q-learning update toward the TD target r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
Q: Deep Q-Network (DQN) -- key innovations?¶
A:
Problem: Tabular Q-learning doesn't scale to large/continuous state spaces.
Solution: a neural network approximates \(Q(s,a;\theta)\).
Two key innovations:
1. Experience Replay:
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
Why: Breaks correlation between consecutive samples, improves data efficiency.
2. Target Network:
# Separate network for stable targets
if step % target_update_freq == 0:
    target_network.load_state_dict(policy_network.state_dict())
Why: Prevents oscillations from chasing a moving target.
DQN Loss: $$L(\theta) = \mathbb{E}_{(s,a,r,s')} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \right)^2 \right]$$
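Putting the pieces together, a hedged sketch of one DQN optimization step; `policy_net`, `target_net`, and `buffer` follow the snippets above, and shapes/hyperparameters are illustrative:

import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    batch = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = [
        torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)
    ]
    # Q(s, a; theta) for the actions actually taken
    q_sa = policy_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # TD target from the frozen target network (theta^-)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1.0 - dones)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()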
Q: Double DQN vs Dueling DQN?¶
A:
Double DQN: Addresses overestimation bias.
Standard DQN: \(\max_{a'} Q(s',a')\) tends to overestimate.
DDQN solution: decouple action selection from evaluation. $$Y^{\text{DDQN}} = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s',a';\theta);\ \theta^-\right)$$
Online network chooses action, target network evaluates it.
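A minimal sketch of the DDQN target computation, reusing the hypothetical tensor names from the DQN sketch above:

import torch

# Action selection by the online network, evaluation by the target network
with torch.no_grad():
    best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # argmax_a' Q(s', a'; theta)
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # Q(s', a*; theta^-)
    ddqn_target = rewards + gamma * next_q * (1.0 - dones)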
Dueling DQN: decomposes Q into a state value plus an action advantage: $$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$$
Where:
- \(V(s)\) = value of being in state \(s\)
- \(A(s,a)\) = advantage of taking action \(a\) vs the average action
When: Useful when action choice doesn't affect value much in some states.
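A sketch of a dueling network head (layer sizes are illustrative):

import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)             # V(s)
        self.advantage = nn.Linear(64, act_dim)   # A(s, a)

    def forward(self, x):
        h = self.features(x)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)  # Q = V + A - mean(A)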
Q: Policy Gradient Theorem -- what's the gist?¶
A:
Goal: Maximize expected return \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_t \gamma^t r_t]\).
Theorem: the gradient can be computed without knowing the environment dynamics. $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$
Where:
- \(G_t = \sum_{k=t}^T \gamma^{k-t} r_k\) = return from timestep \(t\)
- \(\nabla_\theta \log \pi_\theta(a_t|s_t)\) = gradient of the action log-probability
Intuition:
- Increase the probability of actions that led to high returns
- Decrease the probability of actions that led to low returns
Q: REINFORCE algorithm -- how does it work?¶
A:
Vanilla policy gradient:
import torch

def train_reinforce(policy, optimizer, trajectories, gamma=0.99):
    policy_loss = []
    for states, actions, rewards in trajectories:
        # Compute discounted returns by backwards accumulation
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        # Normalize (reduces variance)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)
        # Accumulate policy-gradient terms
        for state, action, G in zip(states, actions, returns):
            probs = policy(state)
            log_prob = torch.log(probs[action])
            policy_loss.append(-log_prob * G)  # negative for gradient ascent
    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()
Problem: High variance in returns.
Solution: Use advantage function \(A(s,a) = Q(s,a) - V(s)\) as baseline.
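For example, a hedged sketch of subtracting a learned value baseline; `value_net`, `states`, `returns`, and `log_probs` are hypothetical names following the REINFORCE snippet:

import torch.nn.functional as F

# Replace raw returns G_t with advantages A_t = G_t - V(s_t)
values = value_net(states).squeeze(-1)       # V(s_t) from a learned critic / baseline
advantages = returns - values.detach()       # lower-variance weights for the log-probs
policy_loss = -(log_probs * advantages).sum()
value_loss = F.mse_loss(values, returns)     # train the baseline toward the observed returns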
Q: PPO (Proximal Policy Optimization) -- why is it so popular?¶
A:
Problem: Policy gradient can make destructively large updates.
PPO solution: Clip the policy update to stay close to old policy.
Clipped objective: $$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right]$$
Where:
- \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) = probability ratio
- \(\hat{A}_t\) = estimated advantage
- \(\epsilon\) = clipping parameter (0.2 typical)
Why popular:
- Stable training (no huge policy jumps)
- Works for discrete and continuous actions
- Simple to implement
- Good default choice for most problems
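A minimal sketch of the clipped surrogate loss itself (tensor names are hypothetical):

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # minimize the negative objective

In practice a library implementation handles this loss internally: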
from stable_baselines3 import PPO
model = PPO('MlpPolicy', env, learning_rate=3e-4,
n_steps=2048, batch_size=64, n_epochs=10, clip_range=0.2)
model.learn(total_timesteps=100000)
Q: Actor-Critic methods -- what are they?¶
A:
Idea: Combine value-based and policy-based methods.
- Actor: policy \(\pi_\theta(a|s)\) -- chooses actions
- Critic: value function \(V_\phi(s)\) -- evaluates states
Advantage estimation: $$A(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
Loss: $$L = L_{\text{actor}} + c_1 L_{\text{critic}} + c_2 L_{\text{entropy}}$$
Variants:
- A2C (Advantage Actor-Critic): synchronous updates across parallel environments
- A3C: asynchronous, multiple workers
- SAC (Soft Actor-Critic): off-policy, entropy regularization, continuous actions
- TD3 (Twin Delayed DDPG): twin critics, delayed updates, continuous actions
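A hedged sketch of the combined A2C-style loss for the ActorCritic network shown earlier (coefficients are typical defaults, not prescribed values):

import torch
import torch.nn.functional as F

def a2c_loss(model, states, actions, rewards, next_states, dones, gamma=0.99,
             value_coef=0.5, entropy_coef=0.01):
    probs, values = model(states)
    _, next_values = model(next_states)
    # One-step advantage: A = r + gamma * V(s') - V(s)
    targets = rewards + gamma * next_values.squeeze(-1) * (1 - dones)
    advantages = (targets - values.squeeze(-1)).detach()

    dist = torch.distributions.Categorical(probs)
    actor_loss = -(dist.log_prob(actions) * advantages).mean()
    critic_loss = F.mse_loss(values.squeeze(-1), targets.detach())
    entropy = dist.entropy().mean()
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy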
Q: SAC (Soft Actor-Critic) -- key features?¶
A:
Objective: maximize reward plus entropy. $$J(\pi) = \mathbb{E}_\tau \left[ \sum_t \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right) \right]$$
Where \(\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]\) is entropy.
Key features:
1. Entropy bonus: encourages exploration, prevents premature convergence
2. Off-policy: uses a replay buffer, sample efficient
3. Automatic temperature: \(\alpha\) is tuned automatically
When SAC:
- Continuous control (robotics, MuJoCo)
- Need sample efficiency
- Want robust exploration
from stable_baselines3 import SAC
model = SAC('MlpPolicy', env, learning_rate=3e-4,
buffer_size=1000000, batch_size=256, ent_coef='auto')
Q: When to use which RL algorithm?¶
A:
Decision tree:
Discrete actions?
+-- Yes -> Simple environment?
| +-- Yes -> DQN
| +-- No -> PPO
+-- No (Continuous) -> Need sample efficiency?
+-- Yes -> SAC (or TD3)
+-- No -> PPO
| Algorithm | Actions | Sample Eff. | Stability | Use Case |
|---|---|---|---|---|
| DQN | Discrete | High | Medium | Atari, games |
| PPO | Both | Medium | High | General purpose, robotics |
| SAC | Continuous | High | High | Continuous control, robotics |
| TD3 | Continuous | High | High | Faster than SAC, less tuning |
Practical tips:
1. Start with PPO -- most robust default
2. Use Stable-Baselines3 -- don't reinvent the wheel
3. Monitor: episode reward, value loss, entropy
4. Tune: learning rate and batch size first
Q: Exploration-exploitation trade-off in RL?¶
A:
Problem: Balance exploring new states vs exploiting known good actions.
Approaches:
1. epsilon-greedy (DQN):
epsilon = max(epsilon_end, epsilon_start * decay**step)
if random.random() < epsilon:
    action = random_action()   # Explore
else:
    action = argmax(Q(state))  # Exploit
2. Entropy regularization (SAC, PPO): $$L = L_{\text{task}} - \alpha \mathcal{H}(\pi)$$
Higher entropy = more exploration.
# Entropy regularization in PPO
entropy = -torch.sum(probs * torch.log(probs), dim=-1)
loss = policy_loss - entropy_coef * entropy.mean() # Maximize entropy
3. Curiosity-driven: add an intrinsic reward for visiting novel states. $$r_{\text{total}} = r_{\text{external}} + \eta \cdot r_{\text{intrinsic}}$$
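A hedged sketch of a prediction-error intrinsic reward in the spirit of curiosity-driven methods; `forward_model` is a hypothetical learned dynamics model:

# Intrinsic reward = error of a learned forward model predicting the next state
import torch.nn.functional as F

def intrinsic_reward(forward_model, state, action, next_state, eta=0.01):
    predicted_next = forward_model(state, action)
    error = F.mse_loss(predicted_next, next_state, reduction='none').mean(dim=-1)
    return eta * error  # high where the dynamics are poorly predicted (novel states)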
Q: Reward hacking -- what is it and how to avoid it?¶
A:
Problem: Agent finds unintended ways to maximize reward.
Examples:
- Boat racing: spins in circles to collect power-ups instead of finishing the race
- CoastRunners: drives backwards for a higher score
- Video games: pauses the game to avoid losing
Solutions:
1. Careful reward design: align the reward with the true objective
2. Reward shaping: potential-based shaping doesn't change the optimal policy
3. Multiple objectives: combine several metrics
4. Human oversight: regularly evaluate agent behavior
5. Inverse RL: learn the reward from demonstrations
Potential-based shaping: $$F(s,a,s') = \gamma \Phi(s') - \Phi(s)$$
This guarantees that the optimal policy stays unchanged.
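A tiny sketch of applying potential-based shaping; the potential \(\Phi\) used here (negative distance to a goal) is a hypothetical example:

# Shaped reward: r' = r + gamma * phi(s') - phi(s)
def potential(state, goal):
    return -abs(goal - state)  # example potential: negative distance to the goal

def shaped_reward(reward, state, next_state, goal, gamma=0.99):
    return reward + gamma * potential(next_state, goal) - potential(state, goal)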