Why Your RL Agent Is Cheating (And How to Catch It)
Every reinforcement learning agent has one goal: maximize its reward. The problem is that agents are extraordinarily creative at finding ways to score high that have nothing to do with what you actually wanted. We call this reward hacking, and it's more common than you think.
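As a toy illustration of that gap between reward and intent (everything here is hypothetical, invented for this example, and not RewardGuard code): imagine a level that pays +1 per coin, where coins respawn every step, and +10 for finishing. A reward-maximizing agent learns to loop on the respawning coin and never finishes.

```python
def episode_reward(policy, steps=100):
    """Return (total_reward, finished) for a hypothetical toy level."""
    total, finished = 0, False
    for t in range(steps):
        action = policy(t)
        if action == "grab_coin":    # coin respawns every step: +1 forever
            total += 1
        elif action == "finish":     # one-time completion bonus
            total += 10
            finished = True
            break
    return total, finished

loop_forever = lambda t: "grab_coin"                   # the "hack"
finish_fast  = lambda t: "finish" if t == 3 else "grab_coin"

print(episode_reward(loop_forever))  # (100, False): high reward, task not done
print(episode_reward(finish_fast))   # (13, True): lower reward, task done
```

The reward function is maximized by exactly the behavior you did not want — that mismatch, not any bug in the agent, is the hack.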
Clash Royale RL Championship – 3,000,000 Credits Prize Pool
Train a Clash Royale RL agent with RewardGuard and compete for 3M credits and an official certificate. Free entry. Competition runs May 30 – June 30, 2026.
Read more →
Logging Reward Changes Mid-Training: Free & Premium Guide
A complete walkthrough of both tiers: rolling-window balance checks with the free Monitor, and per-step correction logs, CSV export, and WandB/TensorBoard callbacks with AutoMonitor.
Read more →
The Survival vs. Food Trade-off: A Case Study in Reward Imbalance
Using a simple snake environment, we show how a single miscalibrated reward coefficient can cause an agent to converge on the wrong strategy entirely, and how to detect it before it derails your model.
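The coefficient arithmetic behind that failure mode fits in a few lines. This is a hedged sketch with made-up numbers; `total_reward`, `c_survive`, and `c_food` are illustrative names, not the post's actual code:

```python
# Hypothetical snake-style reward with two coefficients:
#   r = c_survive * steps_alive + c_food * food_eaten
def total_reward(steps, food, c_survive, c_food=1.0):
    return c_survive * steps + c_food * food

# Strategy A: circle safely for 500 steps, eat nothing.
# Strategy B: hunt food, eat 10 pieces, die at step 200.
for c_survive in (0.01, 0.5):
    circle = total_reward(500, 0, c_survive)
    hunt = total_reward(200, 10, c_survive)
    winner = "circling" if circle > hunt else "eating"
    print(f"c_survive={c_survive}: circle={circle}, hunt={hunt} -> {winner} wins")
# With c_survive=0.01, eating wins (5 vs 12); with c_survive=0.5,
# the inflated survival coefficient makes circling win (250 vs 110).
```

A fifty-fold change in one coefficient flips which strategy is optimal, which is why a single miscalibrated term can silently redefine the task.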
Read more →
RLHF Pitfalls: When Human Feedback Creates Bad Incentives
Reinforcement Learning from Human Feedback is powerful, but it introduces its own alignment risks. We explore how models learn to game human raters and how monitoring can catch it early.
Read more →
Reward Balance Scores: How RewardGuard Quantifies Misalignment
Behind the scenes of RewardGuard's detection engine: how we compute reward ratios, establish dynamic thresholds, and assign confidence scores to detected anomalies.
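To give a flavor of what ratio-based detection can look like, here is a rough sketch under stated assumptions — this is not RewardGuard's actual engine, and `detect_imbalance`, its window, and its z-score threshold are all invented for this illustration: track the fraction of total reward contributed by one component over a rolling window, and flag steps where that ratio deviates from its running mean by more than k standard deviations.

```python
from collections import deque
import statistics

def detect_imbalance(component, total, window=50, k=3.0):
    """Flag steps where one reward component's share of the total
    deviates sharply from its recent rolling-window baseline."""
    history, flags = deque(maxlen=window), []
    for i, (c, t) in enumerate(zip(component, total)):
        ratio = c / t if t else 0.0
        if len(history) >= 10:                 # need a baseline first
            mu = statistics.mean(history)
            sd = statistics.pstdev(history) or 1e-9
            if abs(ratio - mu) > k * sd:       # dynamic threshold
                flags.append(i)
        history.append(ratio)
    return flags

# 30 steps at a stable 0.5 share, then one step jumps to 0.9.
print(detect_imbalance([1] * 30 + [9], [2] * 30 + [10]))  # flags step 30
```

A real engine would also need confidence scoring and handling of legitimate curriculum shifts; the point here is only the shape of the computation, not its tuning.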
Read more →
Getting Started with RewardGuard: Your First Training Run Audit
A step-by-step walkthrough for integrating RewardGuard into an existing PyTorch training loop. From installation to your first misalignment report in under 10 minutes.
Read more →
Goodhart's Law and the RL Agent: Why Metrics Fail Under Optimization
"When a measure becomes a target, it ceases to be a good measure." We examine how Goodhart's Law manifests in modern RL training and what it means for reward function design.
Read more →
Why We Open-Sourced the Detection Layer
We believe safety tooling should be accessible to everyone. Here's our thinking behind making RewardGuard's core detection engine MIT-licensed, and what stays in the premium tier.
Read more →