RewardGuard is designed to drop into existing training workflows with minimal friction. You don't need to change your model architecture, your optimizer, or your reward function. You just need to tell RewardGuard what your reward components are and log them at each step — it handles the rest.
This tutorial uses a simple PyTorch RL loop, but the same pattern works with JAX, Stable Baselines 3, and any Gym-compatible environment.
Install the Package
The free package is open-source and available on PyPI:
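Assuming the package is published under the name shown below (check the project's PyPI page for the exact name):

```shell
# Install the free, open-source package from PyPI
pip install rewardguard
```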
For the premium package (auto-adjustment), you'll need a license key from your dashboard:
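A hypothetical install flow; the premium package name and environment variable below are illustrative, so consult your dashboard for the actual instructions:

```shell
# Install the premium package and point it at your license key
pip install rewardguard-premium
export REWARDGUARD_LICENSE_KEY="<key-from-your-dashboard>"
```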
Identify Your Reward Components
Before adding monitoring, identify the distinct components that make up your reward signal. If your reward function looks like this:
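A reward function of the shape described might look like the following sketch. The environment signals and weights are illustrative, not prescriptive; returning the components alongside the total makes them easy to log separately later:

```python
def compute_reward(alive, dist_to_goal, prev_dist_to_goal, ate_food, died):
    # Each named entry below is one reward component.
    components = {
        "survival":   0.01 if alive else 0.0,            # small per-step bonus
        "goal_dist":  prev_dist_to_goal - dist_to_goal,  # progress toward goal
        "food_bonus": 1.0 if ate_food else 0.0,          # pickup bonus
        "death_pen":  -10.0 if died else 0.0,            # terminal penalty
    }
    return sum(components.values()), components
```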
The components are survival, goal_dist, food_bonus, and death_pen. RewardGuard needs each of these separately — not just the total.
Initialize the Monitor
The primary parameter tells RewardGuard which component represents genuine task progress. Any component whose accumulated reward exceeds the primary component's by more than the threshold factor will trigger an alert.
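Here is a sketch of initialization. Since this tutorial can't reproduce the library's source, the snippet includes a minimal stand-in Monitor that implements the alert rule just described, so it runs on its own; the real rewardguard Monitor is assumed to take similar arguments:

```python
class Monitor:
    """Minimal stand-in for rewardguard's Monitor (illustrative only)."""

    def __init__(self, components, primary, threshold=3.0):
        self.primary = primary
        self.threshold = threshold
        self.totals = {name: 0.0 for name in components}

    def log(self, **rewards):
        # Accumulate each component's reward over the run.
        for name, value in rewards.items():
            self.totals[name] += value

    def analyze(self):
        # Alert on any component whose accumulated reward exceeds the
        # primary component's by more than the threshold factor.
        base = self.totals[self.primary]
        return [name for name, total in self.totals.items()
                if name != self.primary and total > self.threshold * base]

monitor = Monitor(
    components=["survival", "goal_dist", "food_bonus", "death_pen"],
    primary="goal_dist",   # the component that tracks genuine task progress
    threshold=3.0,         # alert factor (an assumed default)
)
```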
Log Components in Your Training Loop
Add a single logging call inside your step loop:
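The per-step call might look like this. The loop body and component values are placeholders for your own environment step and reward computation, and a tiny accumulating stand-in is included so the sketch runs as written:

```python
class Monitor:   # tiny stand-in with a rewardguard-style log() method
    def __init__(self):
        self.totals = {}

    def log(self, **rewards):
        for name, value in rewards.items():
            self.totals[name] = self.totals.get(name, 0.0) + value

monitor = Monitor()
for step in range(1000):
    # ... env.step(action), compute reward components here ...
    survival, goal_dist, food_bonus, death_pen = 0.01, 0.05, 0.0, 0.0

    # The one extra line: log each component separately, not just the total.
    monitor.log(survival=survival, goal_dist=goal_dist,
                food_bonus=food_bonus, death_pen=death_pen)
```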
Reading the Report
When RewardGuard detects a problem, the summary looks like this:
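The exact layout will depend on your version of the tool; the following is an illustrative mock-up, not verbatim output:

```
RewardGuard summary (illustrative)
  primary : goal_dist    total =  12.4
  ALERT   : food_bonus   total =  61.7   (5.0x primary, threshold 3.0x)
  ok      : survival     total =   8.1
  ok      : death_pen    total =  -4.0
```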
Enable Auto-Adjustment (Premium)
With a premium license, replace rg.Monitor with rg.PremiumMonitor and add auto_adjust=True. When hacking is detected, the monitor will automatically rebalance your reward weights without stopping training. The adjustment is logged to the report.
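The behavior described above can be sketched as follows. The rebalancing rule here (halving the offending component's weight) is purely illustrative and stands in for the premium package's actual algorithm:

```python
class PremiumMonitor:
    """Stand-in sketch of premium auto-adjustment (not the real algorithm)."""

    def __init__(self, components, primary, threshold=3.0, auto_adjust=False):
        self.primary = primary
        self.threshold = threshold
        self.auto_adjust = auto_adjust
        self.totals = {c: 0.0 for c in components}
        self.weights = {c: 1.0 for c in components}
        self.adjustments = []   # adjustments get logged to the report

    def log(self, **rewards):
        # Components are scaled by their current weight before accumulating.
        for name, value in rewards.items():
            self.totals[name] += self.weights[name] * value

    def analyze(self):
        base = self.totals[self.primary]
        flagged = [c for c, t in self.totals.items()
                   if c != self.primary and t > self.threshold * base]
        if self.auto_adjust:
            for c in flagged:
                # Illustrative rebalance: halve the offending weight.
                self.weights[c] *= 0.5
                self.adjustments.append((c, self.weights[c]))
        return flagged

monitor = PremiumMonitor(
    components=["survival", "goal_dist", "food_bonus", "death_pen"],
    primary="goal_dist",
    auto_adjust=True,
)
```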
Integrating with CI/CD
For production workflows, you want monitoring to fail the run automatically if reward hacking is detected above a severity threshold. The report object exposes a severity score from 0 to 1:
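A check of that kind might look like the sketch below. The attribute name `severity`, the threshold value, and the mock report object are assumptions for illustration; exiting nonzero is what makes CI mark the run as failed:

```python
import sys
from types import SimpleNamespace

SEVERITY_THRESHOLD = 0.8   # tune to your tolerance; 0.8 is illustrative

def check_report(report, threshold=SEVERITY_THRESHOLD):
    # Fail the run if the severity score crosses the threshold.
    if report.severity >= threshold:
        print(f"Reward hacking detected (severity {report.severity:.2f})",
              file=sys.stderr)
        sys.exit(1)

# A mock report object standing in for the analyzer's output:
check_report(SimpleNamespace(severity=0.2))   # below threshold: run continues
```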
Add this check after each evaluation step in your training loop, and your CI system will catch reward hacking before the run completes — saving compute and giving you a clear signal about what went wrong.
That's it. One class and two methods (Monitor, log(), analyze()), one extra line per training step, and you have continuous reward-balance monitoring integrated into your existing loop. The free package gives you detection and diagnosis; the premium package closes the loop with automatic correction.