DL Interview: GNN & Reinforcement Learning¶
~8 min read
Graph Neural Networks (GCN, GAT, GraphSAGE, GIN, Over-smoothing, WL test, Heterogeneous Graphs, Scalability), Reinforcement Learning (Q-Learning, DQN, Policy Gradient, PPO, SAC, Actor-Critic, Exploration).
Graph Neural Networks (GNN)¶
Sources: 350 GNN Interview Questions (HackMD), PyTorch Geometric docs, Lilian Weng's blog
Q: What is Message Passing in GNNs?¶
A:
Core idea: nodes aggregate information from their neighbors and update their own representations.
Key components:
1. Aggregate: collect neighbor features (sum, mean, max, attention-weighted)
2. Update: apply a transformation (MLP, GRU)
3. Readout: pooling for graph-level tasks
# Message passing in PyG: a custom layer via the MessagePassing base class
from torch_scatter import scatter_mean
from torch_geometric.nn import MessagePassing

class CustomGNN(MessagePassing):
    def __init__(self):
        super().__init__()  # aggregate() below overrides the default aggregation

    def forward(self, x, edge_index):
        return self.propagate(edge_index, x=x)

    def message(self, x_j):  # x_j = neighbor features
        return x_j

    def aggregate(self, inputs, index, ptr=None, dim_size=None):
        return scatter_mean(inputs, index, dim=0, dim_size=dim_size)
Q: GCN vs GAT vs GraphSAGE -- what's the difference?¶
A:
| Aspect | GCN | GAT | GraphSAGE |
|---|---|---|---|
| Aggregation | Fixed weighted sum | Learned attention | Sample + aggregate |
| Weight | \(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\) | Attention \(\alpha_{ij}\) | Mean/LSTM/Pool |
| Learning | Transductive | Transductive | Inductive |
| New nodes | Retrain | Retrain | Direct inference |
| Complexity | \(O(\|E\|d)\) | \(O(\|E\|dh)\) | \(O(b \cdot k^L d)\) |
GCN formula: $$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
GAT attention: $$\alpha_{ij} = \text{softmax}\left(\text{LeakyReLU}(a^T [Wh_i \| Wh_j])\right)$$
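For orientation, a minimal sketch of the corresponding built-in PyG layers; the layer sizes and the `data` object are illustrative assumptions, not from the source:

# Built-in PyG layers; in/out sizes (16 -> 32) are illustrative
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, SAGEConv

gcn = GCNConv(in_channels=16, out_channels=32)            # fixed normalized-adjacency weighting
gat = GATConv(in_channels=16, out_channels=32, heads=4)   # learned attention per edge
sage = SAGEConv(in_channels=16, out_channels=32)          # sample-and-aggregate (mean by default)

h = F.relu(gat(data.x, data.edge_index))  # shape: [num_nodes, 32 * heads]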
GraphSAGE sampling:
# Mini-batch training with neighbor sampling (PyG)
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(data, num_neighbors=[25, 10], batch_size=1024)
for batch in loader:
    out = model(batch.x, batch.edge_index)
Q: Transductive vs Inductive learning in GNNs?¶
A:
| Transductive | Inductive |
|---|---|
| Test nodes seen during training | Test nodes unseen during training |
| GCN, GAT | GraphSAGE |
| Full graph in memory | Mini-batch training |
| Can't generalize to new graphs | Generalizes to new graphs/nodes |
When Inductive:
- New users in a social network
- New molecules in drug discovery
- Production deployment
When Transductive:
- Fixed knowledge graph
- Semi-supervised node classification
- Research benchmarks (Cora, PubMed)
Q: What is over-smoothing in GNNs?¶
A:
Problem: as the GNN gets deeper, the representations of all nodes become nearly identical.
Why it happens:
- Message passing = repeated Laplacian smoothing
- After many layers, node features converge to a stationary distribution
- Node representations become indistinguishable
Detection:
- Node embedding similarity increases with depth
- Performance degrades after 3-5 layers
Solutions:
1. Jumping Knowledge (JK): combine outputs from all layers
2. Residual connections: \(h^{(l+1)} = h^{(l)} + \text{GNN}(h^{(l)})\)
3. PairNorm: normalize across nodes, not features
4. Fewer layers: usually 2-3 layers are enough
5. APPNP: Personalized PageRank propagation
# JK-Net pattern: concatenate (or max/attention-pool) the outputs of all layers
import torch
import torch.nn as nn

class JKNet(nn.Module):
    def __init__(self, convs):
        super().__init__()
        self.convs = nn.ModuleList(convs)

    def forward(self, x, edge_index):
        layer_outputs = []
        for conv in self.convs:
            x = conv(x, edge_index)
            layer_outputs.append(x)
        return torch.cat(layer_outputs, dim=-1)  # or max / attention over layers
Q: GIN (Graph Isomorphism Network) -- why is it more powerful than GCN?¶
A:
WL test connection: GIN distinguishes non-isomorphic graphs as well as the Weisfeiler-Lehman (1-WL) test does.
Key difference: Sum aggregation (not mean/max) preserves more information.
GIN update: $$h_v^{(k)} = \text{MLP}^{(k)}\left((1 + \epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\right)$$
Why sum > mean:
- Mean: {1,1,1} and {1} -> same mean (1.0), but different multisets!
- Sum: {1,1,1} -> 3, {1} -> 1 -> different, distinguishable
- Max: {1,1,1} -> 1, {1} -> 1 -> same, loses cardinality
- Sum preserves multiset cardinality information, critical for WL-test expressiveness
When GIN: Graph classification, fingerprinting, isomorphism tasks.
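A minimal GIN sketch with PyG's GINConv; the MLP sizes and the `data` object are illustrative assumptions:

# GINConv wraps an MLP and uses sum aggregation; sizes (16 -> 32) are illustrative
import torch.nn as nn
from torch_geometric.nn import GINConv, global_add_pool

mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
conv = GINConv(mlp, train_eps=True)            # learnable epsilon

h = conv(data.x, data.edge_index)              # node embeddings
graph_emb = global_add_pool(h, data.batch)     # sum readout for graph classification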
Q: What is the WL (Weisfeiler-Lehman) test and how does it relate to GNNs?¶
A:
WL test: a heuristic test for graph isomorphism based on iterative recoloring (color refinement) of node labels.
Connection to GIN:
- GIN (Graph Isomorphism Network) matches the discriminative power of the WL test -- it can distinguish everything the WL test distinguishes
- GCN/GAT are not WL-equivalent (information is lost with mean/max aggregation)
GIN update rule: $$h_v^{(l+1)} = \text{MLP}\left((1 + \epsilon) \cdot h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l)}\right)$$
Key insight: sum aggregation preserves the full multiset of neighbor features (mean/max do not).
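A toy illustration of this insight on two hypothetical neighbor multisets:

# Two different neighbor multisets: {1, 1, 1} vs {1}
import torch

a = torch.tensor([1., 1., 1.])
b = torch.tensor([1.])

print(a.mean(), b.mean())  # 1.0 vs 1.0 -> mean cannot tell them apart
print(a.max(),  b.max())   # 1.0 vs 1.0 -> max cannot either
print(a.sum(),  b.sum())   # 3.0 vs 1.0 -> sum distinguishes the multisets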
Q: Heterogeneous graphs -- how to handle them?¶
A:
Heterogeneous graph: multiple node and edge types (e.g., user-item-author-paper).
Approaches:
1. R-GCN (Relational GCN): separate weights for each edge type
2. HAN (Heterogeneous Attention Network): attention over metapaths
3. CompGCN: composition functions for relations
R-GCN aggregation: $$h_v^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{|\mathcal{N}_r(v)|} W_r^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)}\right)$$
where \(r\) is the relation type and \(W_r\) are relation-specific weights.
# R-GCN in PyG
from torch_geometric.nn import FastRGCNConv

conv = FastRGCNConv(in_channels, out_channels, num_relations=5)
out = conv(x, edge_index, edge_type)  # edge_type: relation index (0..4) for each edge
Metapath example: User -> Paper -> Author -> Paper -> User (U-P-A-P-U similarity)
Q: How are GNNs used in recommendation?¶
A:
User-Item bipartite graph:
- Nodes: users and items
- Edges: interactions (click, purchase, rating)
Approaches:
1. GraphSAGE-Rec: embed users/items via a GNN, recommend by embedding similarity
2. PinSage (Pinterest): random-walk sampling + GraphSAGE for pins
3. LightGCN: simplified GCN without nonlinearities or feature transformations
Key insight: GNN captures collaborative filtering through graph structure.
# LightGCN propagation: K hops of linear propagation, then average the layer outputs
import torch

def LightGCN_propagate(user_emb, item_emb, adj, num_layers=3):
    # adj: normalized (users + items) x (users + items) adjacency matrix (sparse)
    all_emb = torch.cat([user_emb, item_emb])
    layer_embs = [all_emb]
    for _ in range(num_layers):
        all_emb = torch.sparse.mm(adj, all_emb)  # one hop, no nonlinearity, no weights
        layer_embs.append(all_emb)
    return torch.stack(layer_embs).mean(dim=0)   # final embedding = mean over layers
Q: GNN scalability on large graphs -- what are the techniques?¶
A:
Problems: full-graph training holds all node features and adjacency in memory (up to \(O(|V|^2)\) for a dense adjacency); loading the full graph is infeasible for billions of nodes.
Solutions:
- Neighbor sampling (GraphSAGE): sample k neighbors instead of all
- Cluster-GCN: partition the graph into clusters, train on subgraphs
- GraphSAGE mini-batch: sample a subgraph around the target nodes
- PinSage (Pinterest): random-walk sampling at production scale
- Distributed training: split the graph across machines (DGL, PyG)
Production tips:
- Pre-compute embeddings for static graphs
- Use approximate nearest neighbor search for inference
- Cache frequently accessed node features
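As an illustration, a Cluster-GCN-style training loop sketch in PyG; the partition count, batch size, and the `data`/`model` names are assumptions:

# Partition the graph with METIS, then train on sampled groups of clusters
from torch_geometric.loader import ClusterData, ClusterLoader

cluster_data = ClusterData(data, num_parts=1500)                     # graph -> 1500 partitions
loader = ClusterLoader(cluster_data, batch_size=20, shuffle=True)    # 20 clusters per step

for sub in loader:                        # each `sub` is a small subgraph
    out = model(sub.x, sub.edge_index)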
Q: Common GNN datasets and benchmarks?¶
A:
Node classification:
- Cora, Citeseer, Pubmed: citation networks (papers = nodes, citations = edges)
- PPI: protein-protein interaction (inductive setting)
- OGBN-Arxiv: ~170K papers with ~1.2M citation edges, large-scale benchmark
Graph classification:
- QM9: molecular properties (~130K molecules)
- PROTEINS: protein function prediction
- OGBG-MolHIV: molecule HIV inhibition
Link prediction:
- ogbl-ddi: drug-drug interactions
- ogbl-collab: author collaboration
Knowledge graphs:
- FB15k-237: Freebase subset
- WN18RR: WordNet relations
Libraries: PyTorch Geometric (PyG), DGL, OGB (Open Graph Benchmark)
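A quick sketch of loading an OGB node-classification benchmark in PyG format (assumes the `ogb` package is installed):

from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name='ogbn-arxiv')  # downloads on first use
data = dataset[0]                                    # a single large graph
split_idx = dataset.get_idx_split()                  # train/valid/test node indices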
Reinforcement Learning¶
Sources: Analytics Vidhya Actor-Critic, Index.dev RL Questions, Stable-Baselines3 docs
Q: Value-based vs Policy-based vs Actor-Critic -- what's the difference?¶
A:
| Approach | Learns | Pros | Cons |
|---|---|---|---|
| Value-based (Q-learning, DQN) | \(Q(s,a)\) = value of action | Sample efficient | Continuous action hard |
| Policy-based (REINFORCE) | \(\pi_\theta(a\|s)\) directly | Works with continuous | High variance |
| Actor-Critic | Both \(Q(s,a)\) + \(\pi_\theta(a\|s)\) | Best of both | More complex |
When to use:
- Value-based: discrete actions, need sample efficiency (Atari games)
- Policy-based: continuous or high-dimensional action spaces (robotics)
- Actor-Critic: general purpose, best of both worlds
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Softmax(dim=-1)
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.actor(x), self.critic(x)
Q: Q-Learning -- how does it work?¶
A:
Core idea: Learn Q(s,a) = expected cumulative reward from state s, action a.
Bellman equation: $$Q^*(s,a) = r + \gamma \max_{a'} Q^*(s', a')$$
Update rule: $$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$
Where:
- \(\alpha\) = learning rate
- \(\gamma\) = discount factor (0.99 typical)
- \(r\) = immediate reward
- \(\max_{a'} Q(s',a')\) = best achievable future value
Tabular Q-learning:
import random
import numpy as np

Q = np.zeros((n_states, n_actions))
for episode in range(n_episodes):
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, done, _ = env.step(action)  # classic gym step API
        # Q-learning update toward the TD target r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
Q: Deep Q-Network (DQN) -- key innovations?¶
A:
Problem: Tabular Q-learning doesn't scale to large/continuous state spaces.
Solution: a neural network approximates \(Q(s,a;\theta)\).
Two key innovations:
1. Experience Replay:
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
Why: Breaks correlation between consecutive samples, improves data efficiency.
2. Target Network:
# Separate network for stable targets
if step % target_update_freq == 0:
    target_network.load_state_dict(policy_network.state_dict())
Why: Prevents oscillations from chasing a moving target.
DQN Loss: $$L(\theta) = \mathbb{E}_{(s,a,r,s')} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \right)^2 \right]$$
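Putting the pieces together, a hedged sketch of one DQN optimization step; `policy_net`, `target_net`, and `buffer` follow the snippets above, and shapes/hyperparameters are illustrative:

import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    batch = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = [
        torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)
    ]
    # Q(s, a; theta) for the actions actually taken
    q_sa = policy_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    # TD target from the frozen target network (theta^-)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1.0 - dones)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()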
Q: Double DQN vs Dueling DQN?¶
A:
Double DQN: Addresses overestimation bias.
Standard DQN: \(\max_{a'} Q(s',a')\) tends to overestimate.
DDQN solution: decouple action selection from evaluation. $$Y^{\text{DDQN}} = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s',a';\theta);\ \theta^-\right)$$
Online network chooses action, target network evaluates it.
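A minimal sketch of the DDQN target computation, reusing the hypothetical tensor names from the DQN sketch above:

import torch

# Action selection by the online network, evaluation by the target network
with torch.no_grad():
    best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # argmax_a' Q(s', a'; theta)
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # Q(s', a*; theta^-)
    ddqn_target = rewards + gamma * next_q * (1.0 - dones)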
Dueling DQN: decomposes Q into a state value plus an action advantage: $$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$$
Where:
- \(V(s)\) = value of being in state \(s\)
- \(A(s,a)\) = advantage of taking action \(a\) vs the average action
When: Useful when action choice doesn't affect value much in some states.
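A sketch of a dueling network head (layer sizes are illustrative):

import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)             # V(s)
        self.advantage = nn.Linear(64, act_dim)   # A(s, a)

    def forward(self, x):
        h = self.features(x)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)  # Q = V + A - mean(A)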
Q: Policy Gradient Theorem -- what's the gist?¶
A:
Goal: Maximize expected return \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_t \gamma^t r_t]\).
Theorem: the gradient can be computed without knowing the environment dynamics. $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$
Where:
- \(G_t = \sum_{k=t}^T \gamma^{k-t} r_k\) = return from timestep \(t\)
- \(\nabla_\theta \log \pi_\theta(a_t|s_t)\) = gradient of the action log-probability
Intuition:
- Increase the probability of actions that led to high returns
- Decrease the probability of actions that led to low returns
Q: REINFORCE algorithm -- how does it work?¶
A:
Vanilla policy gradient:
import torch

def train_reinforce(policy, optimizer, trajectories, gamma=0.99):
    policy_loss = []
    for states, actions, rewards in trajectories:
        # Compute discounted returns by backwards accumulation
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        # Normalize (reduces variance)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)
        # Accumulate policy-gradient terms
        for state, action, G in zip(states, actions, returns):
            probs = policy(state)
            log_prob = torch.log(probs[action])
            policy_loss.append(-log_prob * G)  # negative for gradient ascent
    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()
Problem: High variance in returns.
Solution: Use advantage function \(A(s,a) = Q(s,a) - V(s)\) as baseline.
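For example, a hedged sketch of subtracting a learned value baseline; `value_net`, `states`, `returns`, and `log_probs` are hypothetical names following the REINFORCE snippet:

import torch.nn.functional as F

# Replace raw returns G_t with advantages A_t = G_t - V(s_t)
values = value_net(states).squeeze(-1)       # V(s_t) from a learned critic / baseline
advantages = returns - values.detach()       # lower-variance weights for the log-probs
policy_loss = -(log_probs * advantages).sum()
value_loss = F.mse_loss(values, returns)     # train the baseline toward the observed returns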
Q: PPO (Proximal Policy Optimization) -- why is it so popular?¶
A:
Problem: Policy gradient can make destructively large updates.
PPO solution: Clip the policy update to stay close to old policy.
Clipped objective: $$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right]$$
Where:
- \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) = probability ratio
- \(\hat{A}_t\) = estimated advantage
- \(\epsilon\) = clipping parameter (0.2 typical)
Why popular:
- Stable training (no huge policy jumps)
- Works for discrete and continuous actions
- Simple to implement
- Good default choice for most problems
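A minimal sketch of the clipped surrogate loss itself (tensor names are hypothetical):

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # minimize the negative objective

In practice a library implementation handles this loss internally: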
from stable_baselines3 import PPO
model = PPO('MlpPolicy', env, learning_rate=3e-4,
n_steps=2048, batch_size=64, n_epochs=10, clip_range=0.2)
model.learn(total_timesteps=100000)
Q: Actor-Critic methods -- what are they?¶
A:
Idea: Combine value-based and policy-based methods.
- Actor: policy \(\pi_\theta(a|s)\) -- chooses actions
- Critic: value function \(V_\phi(s)\) -- evaluates states
Advantage estimation: $$A(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
Loss: $$L = L_{\text{actor}} + c_1 L_{\text{critic}} + c_2 L_{\text{entropy}}$$
Variants:
- A2C (Advantage Actor-Critic): synchronous updates across parallel environments
- A3C: asynchronous, multiple workers
- SAC (Soft Actor-Critic): off-policy, entropy regularization, continuous actions
- TD3 (Twin Delayed DDPG): twin critics, delayed updates, continuous actions
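A hedged sketch of the combined A2C-style loss for the ActorCritic network shown earlier (coefficients are typical defaults, not prescribed values):

import torch
import torch.nn.functional as F

def a2c_loss(model, states, actions, rewards, next_states, dones, gamma=0.99,
             value_coef=0.5, entropy_coef=0.01):
    probs, values = model(states)
    _, next_values = model(next_states)
    # One-step advantage: A = r + gamma * V(s') - V(s)
    targets = rewards + gamma * next_values.squeeze(-1) * (1 - dones)
    advantages = (targets - values.squeeze(-1)).detach()

    dist = torch.distributions.Categorical(probs)
    actor_loss = -(dist.log_prob(actions) * advantages).mean()
    critic_loss = F.mse_loss(values.squeeze(-1), targets.detach())
    entropy = dist.entropy().mean()
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy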
Q: SAC (Soft Actor-Critic) -- key features?¶
A:
Objective: maximize reward plus entropy. $$J(\pi) = \mathbb{E}_\tau \left[ \sum_t \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right) \right]$$
Where \(\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]\) is entropy.
Key features:
1. Entropy bonus: encourages exploration, prevents premature convergence
2. Off-policy: uses a replay buffer, sample efficient
3. Automatic temperature: \(\alpha\) is tuned automatically
When SAC:
- Continuous control (robotics, MuJoCo)
- Need sample efficiency
- Want robust exploration
from stable_baselines3 import SAC
model = SAC('MlpPolicy', env, learning_rate=3e-4,
buffer_size=1000000, batch_size=256, ent_coef='auto')
Q: When to use which RL algorithm?¶
A:
Decision tree:
Discrete actions?
+-- Yes -> Simple environment?
| +-- Yes -> DQN
| +-- No -> PPO
+-- No (Continuous) -> Need sample efficiency?
+-- Yes -> SAC (or TD3)
+-- No -> PPO
| Algorithm | Actions | Sample Eff. | Stability | Use Case |
|---|---|---|---|---|
| DQN | Discrete | High | Medium | Atari, games |
| PPO | Both | Medium | High | General purpose, robotics |
| SAC | Continuous | High | High | Continuous control, robotics |
| TD3 | Continuous | High | High | Faster than SAC, less tuning |
Practical tips:
1. Start with PPO -- most robust default
2. Use Stable-Baselines3 -- don't reinvent the wheel
3. Monitor: episode reward, value loss, entropy
4. Tune: learning rate and batch size first
Q: Exploration-exploitation trade-off in RL?¶
A:
Problem: Balance exploring new states vs exploiting known good actions.
Approaches:
1. epsilon-greedy (DQN):
epsilon = max(epsilon_end, epsilon_start * decay**step)
if random.random() < epsilon:
    action = random_action()   # Explore
else:
    action = argmax(Q(state))  # Exploit
2. Entropy regularization (SAC, PPO): $$L = L_{\text{task}} - \alpha \mathcal{H}(\pi)$$
Higher entropy = more exploration.
# Entropy regularization in PPO
entropy = -torch.sum(probs * torch.log(probs), dim=-1)
loss = policy_loss - entropy_coef * entropy.mean() # Maximize entropy
3. Curiosity-driven: add an intrinsic reward for visiting novel states. $$r_{\text{total}} = r_{\text{external}} + \eta \cdot r_{\text{intrinsic}}$$
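A hedged sketch of a prediction-error intrinsic reward in the spirit of curiosity-driven methods; `forward_model` is a hypothetical learned dynamics model:

# Intrinsic reward = error of a learned forward model predicting the next state
import torch.nn.functional as F

def intrinsic_reward(forward_model, state, action, next_state, eta=0.01):
    predicted_next = forward_model(state, action)
    error = F.mse_loss(predicted_next, next_state, reduction='none').mean(dim=-1)
    return eta * error  # high where the dynamics are poorly predicted (novel states)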
Q: Reward hacking -- what is it and how to avoid it?¶
A:
Problem: Agent finds unintended ways to maximize reward.
Examples:
- Boat racing: spins in circles to collect power-ups instead of finishing the race
- CoastRunners: drives backwards for a higher score
- Video games: pauses the game to avoid losing
Solutions:
1. Careful reward design: align the reward with the true objective
2. Reward shaping: potential-based shaping doesn't change the optimal policy
3. Multiple objectives: combine several metrics
4. Human oversight: regularly evaluate agent behavior
5. Inverse RL: learn the reward from demonstrations
Potential-based shaping: $$F(s,a,s') = \gamma \Phi(s') - \Phi(s)$$
This guarantees that the optimal policy stays unchanged.
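A tiny sketch of applying potential-based shaping; the potential \(\Phi\) used here (negative distance to a goal) is a hypothetical example:

# Shaped reward: r' = r + gamma * phi(s') - phi(s)
def potential(state, goal):
    return -abs(goal - state)  # example potential: negative distance to the goal

def shaped_reward(reward, state, next_state, goal, gamma=0.99):
    return reward + gamma * potential(next_state, goal) - potential(state, goal)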