RewardGuard Documentation

API reference for the rewardguard (free) and rewardguard-premium packages. Both packages detect reward component imbalances in reinforcement learning training loops. The premium package adds statistical detection, automatic weight correction, and a full per-step audit trail.

rewardguard — Free (MIT)
  • Rolling-window balance analysis
  • Per-component imbalance detection
  • Suggested weight multipliers
  • Log-file post-hoc analysis
  • Matplotlib visualization
rewardguard-premium — Proprietary
  • Statistical z-score detection
  • Continuous 0–1 alignment score
  • Automatic reward weight correction
  • Full timestamped correction log
  • CSV / JSON export, save & resume
  • WandB, TensorBoard, SB3 callbacks

Installation

Free package

Available on PyPI, no license required.

pip install rewardguard

Premium package

Install via PyPI after purchasing at rewardguard.dev/premium. The package authenticates with your RewardGuard account at runtime — no extra index needed.

pip install rewardguard-premium

The premium package depends on the free package. Import premium features from rewardguard_premium.

Authentication

After installing, sign in once with rewardguard-premium login. Your session is saved to ~/.rewardguard/session.json and refreshed automatically — no sign-in needed on subsequent runs.

Quick Start

Free — live in-loop monitoring

import rewardguard as rg

monitor = rg.Monitor(
    expected={"task": 0.7, "safety": 0.3},
    tolerance=5.0,
    window=200,
)

for episode in range(num_episodes):
    for step in range(max_steps):
        r_task, r_safety = env.step(action)
        monitor.step({"task": r_task, "safety": r_safety})

monitor.print_report()

Premium — auto-correction

from rewardguard_premium import AutoMonitor

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    baseline_steps=3000,
    auto_correct=True,
)

for step_idx in range(total_steps):
    rewards = env.step(action)
    snapshot = monitor.step(rewards)
    if snapshot and snapshot.flag == "critical":
        env.set_reward_weights(monitor.weights)

monitor.save("run_state.json")
monitor.to_csv("audit.csv")

Monitor Free

The primary live in-loop API. Zero external dependencies. Drop one monitor.step() call inside your training loop — call check() or print_report() whenever you want an analysis.

import rewardguard as rg

monitor = rg.Monitor(...)

Constructor

Parameter   | Type             | Default  | Description
expected    | Dict[str, float] | required | Target distribution of reward components. Values are relative weights; they are normalized to percentages automatically. Example: {"task": 3, "safety": 1} → task 75%, safety 25%.
tolerance   | float            | 5.0      | Percentage-point tolerance before a component is flagged as imbalanced (±pp). A difference of ≤ tolerance is "ok"; ≤ 3× tolerance is "warning"; > 3× is "critical".
window      | int              | 200      | Number of recent steps used for rolling analysis. Older steps are retained in memory up to max_history.
max_history | int              | 100 000  | Hard cap on stored steps. Implemented as a deque; the oldest entries are evicted in O(1) once the cap is reached.
Validation

All expected weights must be ≥ 0 and finite, and must sum to a positive number. tolerance and window must each be > 0. Violations raise ValueError at construction time.
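The normalization and validation rules above can be sketched in plain Python. This is a hedged illustration of the documented behavior, not the library's actual implementation:

```python
import math

def normalize_expected(expected):
    """Normalize relative weights to percentages, per the documented rules."""
    if not expected:
        raise ValueError("expected must not be empty")
    for name, w in expected.items():
        if not math.isfinite(w) or w < 0:
            raise ValueError(f"weight for {name!r} must be finite and >= 0")
    total = sum(expected.values())
    if total <= 0:
        raise ValueError("expected weights must sum to a positive number")
    return {name: 100.0 * w / total for name, w in expected.items()}

print(normalize_expected({"task": 3, "safety": 1}))
# {'task': 75.0, 'safety': 25.0}
```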

monitor.step(rewards, episode_done=False)

Record reward components for one environment step. Call once per step, every step.

Parameter    | Type             | Description
rewards      | Dict[str, float] | Component values for this step, e.g. {"task": 1.2, "safety": -0.1}. All values must be finite; NaN or Inf raises ValueError immediately.
episode_done | bool             | Unused in the free tier; accepted so code written against the premium AutoMonitor API works unchanged when downgraded.
monitor.step({"task": r_task, "safety": r_safety})

monitor.check() → AnalysisResult

Compute a balance analysis over the current rolling window. Returns an AnalysisResult. Does not modify state — safe to call as often as needed.

result = monitor.check()

# Overall severity: "ok" / "warning" / "critical"
print(result.severity)

# Per-component details
for comp, info in result.imbalance_report.items():
    print(
        f"{comp}: {info['real']:.1f}% real vs "
        f"{info['expected']:.1f}% expected → {info['recommendation']}"
    )

# Suggested weight multipliers to rebalance
print(result.suggested_reward_weights)
Raises

ValueError if no steps have been recorded yet.

monitor.print_report()

Print a formatted balance table to stdout. Equivalent to calling check() and passing the result to RewardGuard().print_analysis_report().

Example output
============================================================
REWARDGUARD ANALYSIS REPORT
*** OVERALL SEVERITY: WARNING ***
============================================================
Episodes analyzed : 8470
Sources found     : safety, task

Source          Real %     Expected %   Diff       Severity
--------------- ---------- ------------ ---------- --------
task            52.3       70.0         -17.7      WARNING
safety          47.7       30.0         +17.7      WARNING

Suggested weight multipliers:
  safety: 0.84x  <-- ADJUST
  task:   1.05x  <-- ADJUST

Actions needed:
  • task: Increase weight by ~17.7%
  • safety: Decrease weight by ~17.7%
============================================================

monitor.reset()

Clear all accumulated history without changing the configuration. step_count is reset to 0. Useful between distinct training phases.

monitor.reset()
print(monitor.step_count)  # → 0

Properties

Property   | Type             | Description
step_count | int              | Total steps recorded since creation or last reset.
expected   | Dict[str, float] | Normalized target percentage distribution (sums to 100). Read-only copy.

AnalysisResult Free

Returned by monitor.check() and analyze_balance(). A dataclass — all fields are read-only after construction.

Field                    | Type             | Description
real_percentages         | Dict[str, float] | Observed percentage share of each component over the rolling window, measured by reward magnitude.
expected_percentages     | Dict[str, float] | Target percentage distribution (normalized, sums to 100).
imbalance_report         | Dict[str, Dict]  | Per-component breakdown. Each inner dict has keys: real, expected, difference, abs_difference, status ("balanced"/"imbalanced"), severity ("ok"/"warning"/"critical"), recommendation (string), unexpected (bool).
suggested_reward_weights | Dict[str, float] | Recommended multipliers to rebalance components. Values > 1.0 mean increase that component's weight; < 1.0 means decrease. Clamped to [0.1, 5.0]. A value of 5.0 signals reward hacking (component completely absent).
episode_count            | int              | Number of steps (or episodes, for the log-based API) included in the analysis.
sources_found            | List[str]        | Sorted list of component names seen in the real data.
severity                 | str              | Overall severity: "ok", "warning", or "critical". Elevated to the worst component severity.
unexpected_sources       | List[str]        | Components present in the real data but absent from expected. These are flagged but not penalized.

Severity thresholds

Severity | Condition
ok       | abs_difference ≤ tolerance
warning  | tolerance < abs_difference ≤ 3 × tolerance
critical | abs_difference > 3 × tolerance, or component completely absent
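The thresholds above amount to a small classification function. A sketch of that rule (illustrative only; the library computes this internally):

```python
def classify_severity(abs_difference, tolerance=5.0):
    """Map a percentage-point deviation to a severity, per the documented thresholds."""
    if abs_difference <= tolerance:
        return "ok"
    if abs_difference <= 3 * tolerance:
        return "warning"
    return "critical"

print(classify_severity(4.2))   # ok
print(classify_severity(12.0))  # warning
print(classify_severity(17.7))  # critical
```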

Log-based API Free

For post-hoc analysis of training log files rather than live in-loop monitoring.

Module-level convenience functions

rg.parse_logs(raw_text) → List[EpisodeData]

Parse a multi-episode training log from a raw string. Each episode block must start with a header matching Ep <N> | <STATUS> | reward=<value>. Returns a list of EpisodeData objects.

episodes = rg.parse_logs(open("training.log").read())
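The documented `Ep <N> | <STATUS> | reward=<value>` header shape can be matched with a pattern like the one below. The sample line and regex are illustrative assumptions; the library's actual parser may be more permissive:

```python
import re

# Matches headers of the form "Ep <N> | <STATUS> | reward=<value>"
HEADER = re.compile(r"Ep\s+(\d+)\s*\|\s*(\w+)\s*\|\s*reward=(-?\d+(?:\.\d+)?)")

line = "Ep 12 | SUCCESS | reward=3.75"
m = HEADER.match(line)
assert m is not None
ep_num, status, reward = int(m.group(1)), m.group(2), float(m.group(3))
print(ep_num, status, reward)  # 12 SUCCESS 3.75
```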

rg.analyze_balance(parsed_data, expected_percentages) → AnalysisResult

Analyze balance between observed and expected distributions across a list of episodes.

Parameter            | Type              | Description
parsed_data          | List[EpisodeData] | Output of parse_logs().
expected_percentages | Dict[str, float]  | Target percentage per source. Need not sum to 100; normalized automatically.
episodes = rg.parse_logs(open("training.log").read())
result = rg.analyze_balance(episodes, {"task": 60, "safety": 40})
rg.RewardGuard().print_analysis_report(result)

rg.recommend_weights(real_percentages, expected_percentages) → Dict[str, float]

Standalone function to compute suggested weight multipliers from two distributions, without needing episode data.
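One plausible sketch of this rule, assuming a simple expected/real ratio clamped to [0.1, 5.0] with 5.0 for a fully absent component, as documented under AnalysisResult. The library's actual multipliers appear more conservative than a raw ratio, so treat this as illustrative only:

```python
def recommend_weights(real_percentages, expected_percentages):
    """Illustrative multiplier rule: expected/real, clamped to [0.1, 5.0]."""
    weights = {}
    for comp, expected in expected_percentages.items():
        real = real_percentages.get(comp, 0.0)
        if real == 0.0:
            weights[comp] = 5.0  # component completely absent: reward-hacking signal
        else:
            weights[comp] = min(max(expected / real, 0.1), 5.0)
    return weights

print(recommend_weights({"task": 0.0, "safety": 100.0}, {"task": 60, "safety": 40}))
# {'task': 5.0, 'safety': 0.4}
```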

RewardGuard class

The log-based facade that wraps LogParser and RewardAnalyzer. Useful when you need stateful parsing across multiple calls.

rg_instance = rg.RewardGuard(tolerance=5.0)
episodes = rg_instance.parse_logs(raw_text)
result = rg_instance.analyze_balance(episodes, {"task": 60, "safety": 40})
rg_instance.print_analysis_report(result)

plot_balance() Free

Render a two-panel matplotlib figure from an AnalysisResult. Requires pip install matplotlib.

Left panel: grouped horizontal bar chart showing real vs expected percentage per component, bars colored by severity. Right panel: suggested weight multipliers with a dashed "no change" line at 1.0.

from rewardguard import plot_balance

plot_balance(monitor.check())

# Save headlessly (no display)
plot_balance(monitor.check(), save_path="report.png", show=False)
Parameter | Type           | Default  | Description
result    | AnalysisResult | required | Output of monitor.check() or analyze_balance().
title     | str or None    | None     | Optional chart title. Defaults to a summary string including step count and severity.
save_path | str or None    | None     | If provided, save the figure to this path (e.g. "report.png").
show      | bool           | True     | Call plt.show() after rendering. Set to False when saving headlessly in a CI system.

AutoMonitor Premium

Drop-in superset of Monitor. Every free-tier method works unchanged. Extends it with baseline learning, z-score detection, a continuous alignment score, automatic weight correction, framework callbacks, and export/persistence.

from rewardguard_premium import AutoMonitor

Constructor

Accepts all Monitor parameters plus the following:

Parameter             | Type                      | Default | Description
baseline_steps        | int                       | 300     | Steps collected before z-score detection activates. During warm-up, step() returns None.
z_threshold           | float or Dict[str, float] | 2.5     | Z-score threshold at which the alignment score equals 0.5 and the component is flagged. Pass a single float to apply the same threshold to all components, or a dict of {"component": threshold} for per-component sensitivity.
sigmoid_steepness     | float                     | 1.2     | Controls how sharply the 0–1 alignment score drops around z_threshold.
auto_correct          | bool                      | True    | If True, automatically adjust weights when a component is flagged and the confidence window is satisfied.
correction_rate       | float                     | 0.2     | Fraction of the required correction applied per auto-correct call. Lower values give smoother but slower convergence.
correction_rate_decay | float                     | 0.0     | Amount by which correction_rate is reduced after each correction. Set > 0 to make corrections progressively more conservative over time.
min_confidence_steps  | int                       | 50      | Minimum post-baseline steps before the first automatic correction is allowed.
drift_window          | int                       | 30      | Number of recent snapshots used to estimate drift velocity (slope of the alignment score).
starvation_window     | int                       | 20      | Consecutive steps a component must be near-zero to trigger a starvation alert.
starvation_threshold  | float                     | 1.0     | Absolute value below which a component's reward is considered "near-zero" for starvation detection.
callbacks             | List[Callable]            | []      | Callables invoked with the AlignmentSnapshot after each post-baseline step. See Callbacks.
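The documented behavior of z_threshold and sigmoid_steepness (score equals 0.5 exactly at the threshold, with steepness controlling how sharply it drops) is consistent with a sigmoid of the following shape. The exact formula is an assumption, not the package's verified implementation:

```python
import math

def alignment_score(max_z, z_threshold=2.5, steepness=1.2):
    """Sigmoid mapping of the maximum per-component z-score to [0, 1].
    Equals 0.5 at z == z_threshold, per the documented behavior."""
    return 1.0 / (1.0 + math.exp(steepness * (abs(max_z) - z_threshold)))

print(alignment_score(2.5))  # 0.5 exactly at the threshold
print(alignment_score(0.0))  # close to 1.0: well aligned
print(alignment_score(6.0))  # close to 0.0: strongly flagged
```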

monitor.step(rewards) → AlignmentSnapshot | None

Same signature as the free step(). During the first baseline_steps steps, returns None — the monitor is learning the baseline distribution. After warm-up, returns an AlignmentSnapshot every call.

for step_idx in range(total_steps):
    action = policy.act(state)
    next_state, _, done, info = env.step(action)

    snapshot = monitor.step({
        "task": info["task_reward"],
        "safety": info["safety_reward"],
    })

    # snapshot is None during baseline warm-up
    if snapshot is not None:
        if snapshot.flag == "critical":
            env.set_reward_weights(monitor.weights)

    state = next_state if not done else env.reset()

Properties

Property             | Type                    | Description
weights              | Dict[str, float]        | Current reward weight multipliers. Start at 1.0 and drift as auto-correction runs. Always read this property fresh, never a stale copy, when applying weights to your environment.
alignment_score      | float                   | Most recent alignment score in [0, 1]. Shortcut for monitor.snapshots[-1].alignment_score. Returns 1.0 during warm-up, since no deviation has been detected yet.
is_baseline_complete | bool                    | True once the baseline warm-up window has been filled and z-score detection is active.
snapshots            | List[AlignmentSnapshot] | All alignment snapshots produced since the baseline warm-up completed. The full per-step audit trail.
step_count           | int                     | Inherited from Monitor. Total steps recorded.

AlignmentSnapshot Premium

A point-in-time alignment measurement produced by AutoMonitor.step() after the baseline warm-up completes. Timestamped by global step index.

Field               | Type             | Description
step                | int              | Global step index when this snapshot was taken.
alignment_score     | float            | 0.0 (fully misaligned) to 1.0 (fully aligned). Sigmoid-mapped from the maximum per-component z-score.
component_ratios    | Dict[str, float] | Rolling-window percentage share of each component at this step.
z_scores            | Dict[str, float] | Per-component deviation from the learned baseline, in standard deviations. Values above z_threshold trigger flagging.
drift_velocity      | float            | Slope of alignment_score over the last drift_window snapshots. Negative means a worsening trend; positive means recovering.
flag                | str              | "ok" / "warning" / "critical", based on the maximum z-score vs. the threshold.
corrections_applied | Dict[str, float] | Mapping of component → new weight for any automatic corrections applied at this step. Empty dict if no correction was made.
starvation_alerts   | List[str]        | Components that have been near-zero for starvation_window consecutive steps; a strong reward-hacking signal.
# Iterate the correction history after training
for snap in monitor.snapshots:
    if snap.corrections_applied:
        print(
            f"step={snap.step:6d} score={snap.alignment_score:.3f} "
            f"flag={snap.flag:<8s} corrections={snap.corrections_applied}"
        )
Example output
step=  4120 score=0.431 flag=warning  corrections={'safety': 1.0460}
step=  5300 score=0.318 flag=critical corrections={'safety': 1.0955, 'task': 0.9740}
step=  6480 score=0.402 flag=critical corrections={'safety': 1.1243}
step=  9400 score=0.693 flag=warning  corrections={'safety': 1.0412}
step= 11040 score=0.812 flag=ok       corrections={}
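drift_velocity is described as the slope of the alignment score over the last drift_window snapshots. A least-squares sketch of that estimate follows; the library's actual estimator is an assumption:

```python
def drift_velocity(scores):
    """Least-squares slope of recent alignment scores vs. step index."""
    n = len(scores)
    if n < 2:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

print(drift_velocity([0.9, 0.8, 0.7, 0.6]))  # ≈ -0.1 per step: worsening
```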

snapshot.to_dict()

Serialize to a plain dict suitable for JSON encoding.

Exports Premium

Three methods give you the complete per-step record for downstream analysis, dashboards, or CI artifacts.

monitor.to_csv(path=None) → str

Returns the full snapshot history as a CSV string. Optionally writes to disk. Columns: step, alignment_score, flag, drift_velocity, starvation_alerts, then ratio_<comp> and z_<comp> for every component.

csv_str = monitor.to_csv("audit_run_42.csv")
CSV preview
step,alignment_score,flag,drift_velocity,starvation_alerts,ratio_safety,z_safety,ratio_task,z_task
301,0.983241,ok,0.000000,,29.84,-0.1234,70.16,0.1234
302,0.981002,ok,0.000000,,30.12,-0.0981,69.88,0.0981
...
412,0.431008,warning,-0.002341,,47.23,2.6801,52.77,-2.6801
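The audit CSV can be consumed with the standard library alone. A sketch of filtering the documented columns; the inline sample rows here are hypothetical:

```python
import csv
import io

sample = """step,alignment_score,flag,drift_velocity,starvation_alerts,ratio_safety,z_safety,ratio_task,z_task
301,0.983241,ok,0.000000,,29.84,-0.1234,70.16,0.1234
412,0.431008,warning,-0.002341,,47.23,2.6801,52.77,-2.6801
"""

# Keep only rows where the monitor raised a flag
flagged = [
    row for row in csv.DictReader(io.StringIO(sample))
    if row["flag"] != "ok"
]
print([(r["step"], r["alignment_score"]) for r in flagged])
# [('412', '0.431008')]
```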

monitor.to_json(path=None) → str

Returns full state as a JSON string — including baseline statistics, current weights, and all snapshots. Optionally writes to disk.

json_str = monitor.to_json("audit_run_42.json")

monitor.print_report()

Prints the free-tier balance table plus the premium alignment state: current alignment score, drift velocity, current weights, and per-component z-scores.

Save & Resume Premium

Long runs can be checkpointed and resumed. The saved state includes baseline statistics, all snapshots, and the current weights — a resumed run continues seamlessly from where it left off.

# Save at any checkpoint
monitor.save("run_42_state.json")

# Resume in a new process
monitor = AutoMonitor.load("run_42_state.json")
print(monitor.step_count)      # picks up where it left off
print(monitor.weights)         # previously learned weights
print(len(monitor.snapshots))  # all historical snapshots intact
Override on load

Pass any constructor keyword argument to AutoMonitor.load() to override a saved setting — for example AutoMonitor.load("state.json", auto_correct=False) to replay the saved history in read-only mode.

Callbacks Premium

Pass a list of callables to the callbacks constructor parameter. Each callback is invoked with the AlignmentSnapshot after every post-baseline step. Three built-in factory functions are provided.

Weights & Biases

import wandb
from rewardguard_premium import AutoMonitor, make_wandb_callback

wandb.init(project="my-rl-run")

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    callbacks=[make_wandb_callback()],
)

Logs rewardguard/alignment_score, rewardguard/drift_velocity, rewardguard/ratio/<comp>, and rewardguard/z_score/<comp> at each step.

TensorBoard

from torch.utils.tensorboard import SummaryWriter
from rewardguard_premium import AutoMonitor, make_tensorboard_callback

writer = SummaryWriter("runs/my_run")

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    callbacks=[make_tensorboard_callback(writer)],
)

Stable-Baselines3

The SB3 callback reads info["reward_components"] from each environment step. Your environment must include this key.

from stable_baselines3 import PPO
from rewardguard_premium import AutoMonitor, make_sb3_callback

monitor = AutoMonitor(expected={"task": 0.7, "safety": 0.3})
cb = make_sb3_callback(monitor)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=5_000_000, callback=cb)

monitor.to_csv("sb3_run.csv")

Custom callback

Any callable that accepts an AlignmentSnapshot works.

def my_callback(snapshot):
    if snapshot.corrections_applied:
        my_logger.info(
            "step=%d corrections=%s",
            snapshot.step,
            snapshot.corrections_applied,
        )

monitor = AutoMonitor(
    expected={"task": 0.7, "safety": 0.3},
    callbacks=[my_callback],
)

plot_session() Premium

Render a multi-panel timeline figure from a completed AutoMonitor session. Requires pip install matplotlib.

from rewardguard_premium import plot_session

plot_session(monitor)

# Save headlessly
plot_session(monitor, save_path="session_report.png", show=False)

CI/CD Integration

Fail a training run automatically when reward hacking is detected above a severity threshold. The AnalysisResult.severity string is the simplest gate; for more precision use the alignment score from the premium tier.

Free — severity gate

result = monitor.check()
if result.severity == "critical":
    raise RuntimeError("Training aborted: reward hacking detected")

Premium — alignment score gate

# Check after each evaluation epoch
if monitor.snapshots:
    latest = monitor.snapshots[-1]
    if latest.alignment_score < 0.4:
        monitor.to_csv("ci_artifact_failed_run.csv")
        raise RuntimeError(
            f"Training aborted: alignment score "
            f"{latest.alignment_score:.2f} below threshold"
        )

Authentication Premium

The premium package authenticates with your RewardGuard account at runtime — not at import time. This allows offline imports, type-checking, and unit tests without credentials.

Option 1 — Interactive sign-in (recommended)

Run once before using the package. The session is saved and refreshed automatically.

rewardguard-premium login

Option 2 — API token (CI / automated environments)

Generate a token on your dashboard and export it in your CI environment. No password is stored or transmitted.

export REWARDGUARD_API_TOKEN='your-api-token-here'

Option 3 — Environment credentials

export REWARDGUARD_EMAIL='you@example.com'
export REWARDGUARD_PASSWORD='...'
Offline grace period

If the package cannot reach the server, it runs in offline mode for up to 24 hours from the last successful verification. After that, a LicenseError is raised on the next AutoMonitor instantiation.

Checking status / signing out

rewardguard-premium status   # show current session
rewardguard-premium logout   # sign out and remove session
from rewardguard_premium import clear_session

clear_session()  # clear the in-process cache and remove the session file