RewardGuard Documentation

API reference for the rewardguard (free) and rewardguard-premium packages. Both packages detect reward component imbalances in reinforcement learning training loops. The premium package adds statistical detection, automatic weight correction, and a full per-step audit trail.

rewardguard — Free (MIT)
  • Rolling-window balance analysis
  • Per-component imbalance detection
  • Suggested weight multipliers
  • Log-file post-hoc analysis
  • Matplotlib visualization
rewardguard-premium — Proprietary
  • Statistical z-score detection
  • Continuous 0–1 alignment score
  • Automatic reward weight correction
  • Full timestamped correction log
  • CSV / JSON export, save & resume
  • WandB, TensorBoard, SB3 callbacks

Installation

Free package

Available on PyPI, no license required.

pip install rewardguard

Premium package

Install via PyPI after purchasing at rewardguard.dev/premium. The package authenticates with your RewardGuard account at runtime — no extra index needed.

pip install rewardguard-premium

The premium package depends on the free package. Import premium features from rewardguard_premium.

Authentication

After installing, sign in once with rewardguard-premium login. Your session is saved to ~/.rewardguard/session.json and refreshed automatically — no sign-in needed on subsequent runs.

Quick Start

Free — live in-loop monitoring

import rewardguard as rg

monitor = rg.Monitor(
    expected={"task": 0.7, "safety": 0.3},
    tolerance=5.0,
    window=200,
)

for episode in range(num_episodes):
    for step in range(max_steps):
        r_task, r_safety = env.step(action)
        monitor.step({"task": r_task, "safety": r_safety})

monitor.print_report()

Premium — auto-correction

from rewardguard_premium import AutoMonitor

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    baseline_steps=3000,
    auto_correct=True,
)

for step_idx in range(total_steps):
    rewards = env.step(action)
    snapshot = monitor.step(rewards)
    if snapshot and snapshot.flag == "critical":
        env.set_reward_weights(monitor.weights)

monitor.save("run_state.json")
monitor.to_csv("audit.csv")

Monitor Free

The primary live in-loop API. Zero external dependencies. Drop one monitor.step() call inside your training loop — call check() or print_report() whenever you want an analysis.

import rewardguard as rg

monitor = rg.Monitor(...)

Constructor

Parameter   | Type             | Default  | Description
expected    | Dict[str, float] | required | Target distribution of reward components. Values are relative weights; they are normalized to percentages automatically. Example: {"task": 3, "safety": 1} → task 75%, safety 25%.
tolerance   | float            | 5.0      | Percentage-point tolerance before a component is flagged as imbalanced (±pp). A difference of ≤ tolerance is "ok"; ≤ 3× tolerance is "warning"; > 3× is "critical".
window      | int              | 200      | Number of recent steps used for rolling analysis. Older steps are retained in memory up to max_history.
max_history | int              | 100 000  | Hard cap on stored steps. Implemented as a deque; the oldest entries are evicted in O(1) once the cap is reached.
Validation

All expected weights must be ≥ 0 and finite, and must sum to a positive number. tolerance and window must each be > 0. Violations raise ValueError at construction time.
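The normalization and validation rules above can be sketched in plain Python. This is a hedged illustration of the documented behavior, not the library's actual implementation:

```python
import math

def normalize_expected(expected):
    """Normalize relative weights to percentages, per the documented rules."""
    if not expected:
        raise ValueError("expected must not be empty")
    for name, w in expected.items():
        if not math.isfinite(w) or w < 0:
            raise ValueError(f"weight for {name!r} must be finite and >= 0")
    total = sum(expected.values())
    if total <= 0:
        raise ValueError("expected weights must sum to a positive number")
    return {name: 100.0 * w / total for name, w in expected.items()}

print(normalize_expected({"task": 3, "safety": 1}))
# {'task': 75.0, 'safety': 25.0}
```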

monitor.step(rewards, episode_done=False)

Record reward components for one environment step. Call once per step, every step.

Parameter    | Type             | Description
rewards      | Dict[str, float] | Component values for this step, e.g. {"task": 1.2, "safety": -0.1}. All values must be finite; NaN or Inf raises ValueError immediately.
episode_done | bool             | Unused in the free tier; accepted so code written against the premium AutoMonitor API works unchanged when downgraded.
monitor.step({"task": r_task, "safety": r_safety})

monitor.check() → AnalysisResult

Compute a balance analysis over the current rolling window. Returns an AnalysisResult. Does not modify state — safe to call as often as needed.

result = monitor.check()

# Overall severity: "ok" / "warning" / "critical"
print(result.severity)

# Per-component details
for comp, info in result.imbalance_report.items():
    print(
        f"{comp}: {info['real']:.1f}% real vs "
        f"{info['expected']:.1f}% expected → {info['recommendation']}"
    )

# Suggested weight multipliers to rebalance
print(result.suggested_reward_weights)
Raises

ValueError if no steps have been recorded yet.

monitor.print_report()

Print a formatted balance table to stdout. Equivalent to calling check() and passing the result to RewardGuard().print_analysis_report().

Example output
============================================================
REWARDGUARD ANALYSIS REPORT
*** OVERALL SEVERITY: WARNING ***
============================================================
Episodes analyzed : 8470
Sources found     : safety, task

Source          Real %     Expected %   Diff       Severity
--------------- ---------- ------------ ---------- --------
task            52.3       70.0         -17.7      WARNING
safety          47.7       30.0         +17.7      WARNING

Suggested weight multipliers:
  safety: 0.84x  <-- ADJUST
  task:   1.05x  <-- ADJUST

Actions needed:
  • task: Increase weight by ~17.7%
  • safety: Decrease weight by ~17.7%
============================================================

monitor.reset()

Clear all accumulated history without changing the configuration. step_count is reset to 0. Useful between distinct training phases.

monitor.reset()
print(monitor.step_count)  # → 0

Properties

Property   | Type             | Description
step_count | int              | Total steps recorded since creation or last reset.
expected   | Dict[str, float] | Normalized target percentage distribution (sums to 100). Read-only copy.

AnalysisResult Free

Returned by monitor.check() and analyze_balance(). A dataclass — all fields are read-only after construction.

Field                    | Type             | Description
real_percentages         | Dict[str, float] | Observed percentage share of each component over the rolling window, measured by reward magnitude.
expected_percentages     | Dict[str, float] | Target percentage distribution (normalized, sums to 100).
imbalance_report         | Dict[str, Dict]  | Per-component breakdown. Each inner dict has keys: real, expected, difference, abs_difference, status ("balanced"/"imbalanced"), severity ("ok"/"warning"/"critical"), recommendation (string), unexpected (bool).
suggested_reward_weights | Dict[str, float] | Recommended multipliers to rebalance components. Values > 1.0 mean increase that component's weight; < 1.0 means decrease. Clamped to [0.1, 5.0]. A value of 5.0 signals reward hacking (component completely absent).
episode_count            | int              | Number of steps (or episodes, for the log-based API) included in the analysis.
sources_found            | List[str]        | Sorted list of component names seen in the real data.
severity                 | str              | Overall severity: "ok", "warning", or "critical". Elevated to the worst component severity.
unexpected_sources       | List[str]        | Components present in the real data but absent from expected. These are flagged but not penalized.

Severity thresholds

Severity | Condition
ok       | abs_difference ≤ tolerance
warning  | tolerance < abs_difference ≤ 3 × tolerance
critical | abs_difference > 3 × tolerance, or component completely absent
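The thresholds above amount to a small classification function. A sketch of that rule (illustrative only; the library computes this internally):

```python
def classify_severity(abs_difference, tolerance=5.0):
    """Map a percentage-point deviation to a severity, per the documented thresholds."""
    if abs_difference <= tolerance:
        return "ok"
    if abs_difference <= 3 * tolerance:
        return "warning"
    return "critical"

print(classify_severity(4.2))   # ok
print(classify_severity(12.0))  # warning
print(classify_severity(17.7))  # critical
```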

Log-based API Free

For post-hoc analysis of training log files rather than live in-loop monitoring.

Module-level convenience functions

rg.parse_logs(raw_text) → List[EpisodeData]

Parse a multi-episode training log from a raw string. Each episode block must start with a header matching Ep <N> | <STATUS> | reward=<value>. Returns a list of EpisodeData objects.

episodes = rg.parse_logs(open("training.log").read())
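The documented `Ep <N> | <STATUS> | reward=<value>` header shape can be matched with a pattern like the one below. The sample line and regex are illustrative assumptions; the library's actual parser may be more permissive:

```python
import re

# Matches headers of the form "Ep <N> | <STATUS> | reward=<value>"
HEADER = re.compile(r"Ep\s+(\d+)\s*\|\s*(\w+)\s*\|\s*reward=(-?\d+(?:\.\d+)?)")

line = "Ep 12 | SUCCESS | reward=3.75"
m = HEADER.match(line)
assert m is not None
ep_num, status, reward = int(m.group(1)), m.group(2), float(m.group(3))
print(ep_num, status, reward)  # 12 SUCCESS 3.75
```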

rg.analyze_balance(parsed_data, expected_percentages) → AnalysisResult

Analyze balance between observed and expected distributions across a list of episodes.

Parameter            | Type              | Description
parsed_data          | List[EpisodeData] | Output of parse_logs().
expected_percentages | Dict[str, float]  | Target percentage per source. Need not sum to 100; normalized automatically.
episodes = rg.parse_logs(open("training.log").read())
result = rg.analyze_balance(episodes, {"task": 60, "safety": 40})
rg.RewardGuard().print_analysis_report(result)

rg.recommend_weights(real_percentages, expected_percentages) → Dict[str, float]

Standalone function to compute suggested weight multipliers from two distributions, without needing episode data.
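One plausible sketch of this rule, assuming a simple expected/real ratio clamped to [0.1, 5.0] with 5.0 for a fully absent component, as documented under AnalysisResult. The library's actual multipliers appear more conservative than a raw ratio, so treat this as illustrative only:

```python
def recommend_weights(real_percentages, expected_percentages):
    """Illustrative multiplier rule: expected/real, clamped to [0.1, 5.0]."""
    weights = {}
    for comp, expected in expected_percentages.items():
        real = real_percentages.get(comp, 0.0)
        if real == 0.0:
            weights[comp] = 5.0  # component completely absent: reward-hacking signal
        else:
            weights[comp] = min(max(expected / real, 0.1), 5.0)
    return weights

print(recommend_weights({"task": 0.0, "safety": 100.0}, {"task": 60, "safety": 40}))
# {'task': 5.0, 'safety': 0.4}
```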

RewardGuard class

The log-based facade that wraps LogParser and RewardAnalyzer. Useful when you need stateful parsing across multiple calls.

rg_instance = rg.RewardGuard(tolerance=5.0)
episodes = rg_instance.parse_logs(raw_text)
result = rg_instance.analyze_balance(episodes, {"task": 60, "safety": 40})
rg_instance.print_analysis_report(result)

plot_balance() Free

Render a two-panel matplotlib figure from an AnalysisResult. Requires pip install matplotlib.

Left panel: grouped horizontal bar chart showing real vs expected percentage per component, bars colored by severity. Right panel: suggested weight multipliers with a dashed "no change" line at 1.0.

from rewardguard import plot_balance

plot_balance(monitor.check())

# Save headlessly (no display)
plot_balance(monitor.check(), save_path="report.png", show=False)
Parameter | Type           | Default  | Description
result    | AnalysisResult | required | Output of monitor.check() or analyze_balance().
title     | str or None    | None     | Optional chart title. Defaults to a summary string including step count and severity.
save_path | str or None    | None     | If provided, save the figure to this path (e.g. "report.png").
show      | bool           | True     | Call plt.show() after rendering. Set to False when saving headlessly in a CI system.

AutoMonitor Premium

Drop-in superset of Monitor. Every free-tier method works unchanged. Extends it with baseline learning, z-score detection, a continuous alignment score, automatic weight correction, framework callbacks, and export/persistence.

from rewardguard_premium import AutoMonitor

Constructor

Accepts all Monitor parameters plus the following:

Parameter             | Type                      | Default | Description
baseline_steps        | int                       | 300     | Steps collected before z-score detection activates. During warm-up, step() returns None.
z_threshold           | float or Dict[str, float] | 2.5     | Z-score threshold at which the alignment score equals 0.5 and the component is flagged. Pass a single float to apply the same threshold to all components, or a dict of {"component": threshold} for per-component sensitivity.
sigmoid_steepness     | float                     | 1.2     | Controls how sharply the 0–1 alignment score drops around z_threshold.
auto_correct          | bool                      | True    | If True, automatically adjust weights when a component is flagged and the confidence window is satisfied.
correction_rate       | float                     | 0.2     | Fraction of the required correction applied per auto-correct call. Lower values give smoother but slower convergence.
correction_rate_decay | float                     | 0.0     | Amount by which correction_rate is reduced after each correction. Set > 0 to make corrections progressively more conservative over time.
min_confidence_steps  | int                       | 50      | Minimum post-baseline steps before the first automatic correction is allowed.
drift_window          | int                       | 30      | Number of recent snapshots used to estimate drift velocity (slope of the alignment score).
starvation_window     | int                       | 20      | Consecutive steps a component must be near-zero to trigger a starvation alert.
starvation_threshold  | float                     | 1.0     | Absolute value below which a component's reward is considered "near-zero" for starvation detection.
callbacks             | List[Callable]            | []      | Callables invoked with the AlignmentSnapshot after each post-baseline step. See Callbacks.
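The documented behavior of z_threshold and sigmoid_steepness (score equals 0.5 exactly at the threshold, with steepness controlling how sharply it drops) is consistent with a sigmoid of the following shape. The exact formula is an assumption, not the package's verified implementation:

```python
import math

def alignment_score(max_z, z_threshold=2.5, steepness=1.2):
    """Sigmoid mapping of the maximum per-component z-score to [0, 1].
    Equals 0.5 at z == z_threshold, per the documented behavior."""
    return 1.0 / (1.0 + math.exp(steepness * (abs(max_z) - z_threshold)))

print(alignment_score(2.5))  # 0.5 exactly at the threshold
print(alignment_score(0.0))  # close to 1.0: well aligned
print(alignment_score(6.0))  # close to 0.0: strongly flagged
```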

monitor.step(rewards) → AlignmentSnapshot | None

Same signature as the free step(). During the first baseline_steps steps, returns None — the monitor is learning the baseline distribution. After warm-up, returns an AlignmentSnapshot every call.

for step_idx in range(total_steps):
    action = policy.act(state)
    next_state, _, done, info = env.step(action)

    snapshot = monitor.step({
        "task": info["task_reward"],
        "safety": info["safety_reward"],
    })

    # snapshot is None during baseline warm-up
    if snapshot is not None:
        if snapshot.flag == "critical":
            env.set_reward_weights(monitor.weights)

    state = next_state if not done else env.reset()

Properties

Property             | Type                    | Description
weights              | Dict[str, float]        | Current reward weight multipliers. Start at 1.0 and drift as auto-correction runs. Always read this property fresh, never a stale copy, when applying weights to your environment.
alignment_score      | float                   | Most recent alignment score in [0, 1]. Shortcut for monitor.snapshots[-1].alignment_score. Returns 1.0 during warm-up, since no deviation has been detected yet.
is_baseline_complete | bool                    | True once the baseline warm-up window has been filled and z-score detection is active.
snapshots            | List[AlignmentSnapshot] | All alignment snapshots produced since the baseline warm-up completed. The full per-step audit trail.
step_count           | int                     | Inherited from Monitor. Total steps recorded.

AlignmentSnapshot Premium

A point-in-time alignment measurement produced by AutoMonitor.step() after the baseline warm-up completes. Timestamped by global step index.

Field               | Type             | Description
step                | int              | Global step index when this snapshot was taken.
alignment_score     | float            | 0.0 (fully misaligned) to 1.0 (fully aligned). Sigmoid-mapped from the maximum per-component z-score.
component_ratios    | Dict[str, float] | Rolling-window percentage share of each component at this step.
z_scores            | Dict[str, float] | Per-component deviation from the learned baseline, in standard deviations. Values above z_threshold trigger flagging.
drift_velocity      | float            | Slope of alignment_score over the last drift_window snapshots. Negative means a worsening trend; positive means recovering.
flag                | str              | "ok" / "warning" / "critical", based on the maximum z-score vs. the threshold.
corrections_applied | Dict[str, float] | Mapping of component → new weight for any automatic corrections applied at this step. Empty dict if no correction was made.
starvation_alerts   | List[str]        | Components that have been near-zero for starvation_window consecutive steps; a strong reward-hacking signal.
# Iterate the correction history after training
for snap in monitor.snapshots:
    if snap.corrections_applied:
        print(
            f"step={snap.step:6d} score={snap.alignment_score:.3f} "
            f"flag={snap.flag:<8s} corrections={snap.corrections_applied}"
        )
Example output
step=  4120 score=0.431 flag=warning  corrections={'safety': 1.0460}
step=  5300 score=0.318 flag=critical corrections={'safety': 1.0955, 'task': 0.9740}
step=  6480 score=0.402 flag=critical corrections={'safety': 1.1243}
step=  9400 score=0.693 flag=warning  corrections={'safety': 1.0412}
step= 11040 score=0.812 flag=ok       corrections={}
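drift_velocity is described as the slope of the alignment score over the last drift_window snapshots. A least-squares sketch of that estimate follows; the library's actual estimator is an assumption:

```python
def drift_velocity(scores):
    """Least-squares slope of recent alignment scores vs. step index."""
    n = len(scores)
    if n < 2:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

print(drift_velocity([0.9, 0.8, 0.7, 0.6]))  # ≈ -0.1 per step: worsening
```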

snapshot.to_dict()

Serialize to a plain dict suitable for JSON encoding.

Exports Premium

Three methods give you the complete per-step record for downstream analysis, dashboards, or CI artifacts.

monitor.to_csv(path=None) → str

Returns the full snapshot history as a CSV string. Optionally writes to disk. Columns: step, alignment_score, flag, drift_velocity, starvation_alerts, then ratio_<comp> and z_<comp> for every component.

csv_str = monitor.to_csv("audit_run_42.csv")
CSV preview
step,alignment_score,flag,drift_velocity,starvation_alerts,ratio_safety,z_safety,ratio_task,z_task
301,0.983241,ok,0.000000,,29.84,-0.1234,70.16,0.1234
302,0.981002,ok,0.000000,,30.12,-0.0981,69.88,0.0981
...
412,0.431008,warning,-0.002341,,47.23,2.6801,52.77,-2.6801
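The audit CSV can be consumed with the standard library alone. A sketch of filtering the documented columns; the inline sample rows here are hypothetical:

```python
import csv
import io

sample = """step,alignment_score,flag,drift_velocity,starvation_alerts,ratio_safety,z_safety,ratio_task,z_task
301,0.983241,ok,0.000000,,29.84,-0.1234,70.16,0.1234
412,0.431008,warning,-0.002341,,47.23,2.6801,52.77,-2.6801
"""

# Keep only rows where the monitor raised a flag
flagged = [
    row for row in csv.DictReader(io.StringIO(sample))
    if row["flag"] != "ok"
]
print([(r["step"], r["alignment_score"]) for r in flagged])
# [('412', '0.431008')]
```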

monitor.to_json(path=None) → str

Returns full state as a JSON string — including baseline statistics, current weights, and all snapshots. Optionally writes to disk.

json_str = monitor.to_json("audit_run_42.json")

monitor.print_report()

Prints the free-tier balance table plus the premium alignment state: current alignment score, drift velocity, current weights, and per-component z-scores.

Save & Resume Premium

Long runs can be checkpointed and resumed. The saved state includes baseline statistics, all snapshots, and the current weights — a resumed run continues seamlessly from where it left off.

# Save at any checkpoint
monitor.save("run_42_state.json")

# Resume in a new process
monitor = AutoMonitor.load("run_42_state.json")
print(monitor.step_count)      # picks up where it left off
print(monitor.weights)         # previously learned weights
print(len(monitor.snapshots))  # all historical snapshots intact
Override on load

Pass any constructor keyword argument to AutoMonitor.load() to override a saved setting — for example AutoMonitor.load("state.json", auto_correct=False) to replay the saved history in read-only mode.

Callbacks Premium

Pass a list of callables to the callbacks constructor parameter. Each callback is invoked with the AlignmentSnapshot after every post-baseline step. Three built-in factory functions are provided.

Weights & Biases

import wandb
from rewardguard_premium import AutoMonitor, make_wandb_callback

wandb.init(project="my-rl-run")

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    callbacks=[make_wandb_callback()],
)

Logs rewardguard/alignment_score, rewardguard/drift_velocity, rewardguard/ratio/<comp>, and rewardguard/z_score/<comp> at each step.

TensorBoard

from torch.utils.tensorboard import SummaryWriter
from rewardguard_premium import AutoMonitor, make_tensorboard_callback

writer = SummaryWriter("runs/my_run")

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    callbacks=[make_tensorboard_callback(writer)],
)

Stable-Baselines3

The SB3 callback reads info["reward_components"] from each environment step. Your environment must include this key.

from stable_baselines3 import PPO
from rewardguard_premium import AutoMonitor, make_sb3_callback

monitor = AutoMonitor(expected={"task": 0.7, "safety": 0.3})
cb = make_sb3_callback(monitor)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=5_000_000, callback=cb)

monitor.to_csv("sb3_run.csv")

Custom callback

Any callable that accepts an AlignmentSnapshot works.

def my_callback(snapshot):
    if snapshot.corrections_applied:
        my_logger.info(
            "step=%d corrections=%s",
            snapshot.step,
            snapshot.corrections_applied,
        )

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    callbacks=[my_callback],
)

plot_session() Premium

Render a multi-panel timeline figure from a completed AutoMonitor session. Requires pip install matplotlib.

from rewardguard_premium import plot_session

plot_session(monitor)

# Save headlessly
plot_session(monitor, save_path="session_report.png", show=False)

CI/CD Integration

Fail a training run automatically when reward hacking is detected above a severity threshold. The AnalysisResult.severity string is the simplest gate; for more precision use the alignment score from the premium tier.

Free — severity gate

result = monitor.check()
if result.severity == "critical":
    raise RuntimeError("Training aborted: reward hacking detected")

Premium — alignment score gate

# Check after each evaluation epoch
if monitor.snapshots:
    latest = monitor.snapshots[-1]
    if latest.alignment_score < 0.4:
        monitor.to_csv("ci_artifact_failed_run.csv")
        raise RuntimeError(
            f"Training aborted: alignment score "
            f"{latest.alignment_score:.2f} below threshold"
        )

Authentication Premium

The premium package authenticates with your RewardGuard account at runtime — not at import time. This allows offline imports, type-checking, and unit tests without credentials.

Option 1 — Interactive sign-in (recommended)

Run once before using the package. The session is saved and refreshed automatically.

rewardguard-premium login

Option 2 — API token (CI / automated environments)

Generate a token on your dashboard and export it in your CI environment. No password is stored or transmitted.

export REWARDGUARD_API_TOKEN='your-api-token-here'

Option 3 — Environment credentials

export REWARDGUARD_EMAIL='you@example.com'
export REWARDGUARD_PASSWORD='...'
Offline grace period

If the package cannot reach the server, it runs in offline mode for up to 24 hours from the last successful verification. After that, a LicenseError is raised on the next AutoMonitor instantiation.

Checking status / signing out

rewardguard-premium status   # show current session
rewardguard-premium logout   # sign out and remove session
from rewardguard_premium import clear_session

clear_session()  # clear the in-process cache and remove the session file