RewardGuard Documentation
API reference for the rewardguard (free) and rewardguard-premium packages. Both packages detect reward component imbalances in reinforcement learning training loops. The premium package adds statistical detection, automatic weight correction, and a full per-step audit trail.
Free:
- Rolling-window balance analysis
- Per-component imbalance detection
- Suggested weight multipliers
- Log-file post-hoc analysis
- Matplotlib visualization
Premium:
- Statistical z-score detection
- Continuous 0–1 alignment score
- Automatic reward weight correction
- Full timestamped correction log
- CSV / JSON export, save & resume
- WandB, TensorBoard, SB3 callbacks
Installation
Free package
Available on PyPI; no license is required.
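A typical install, assuming the package name on PyPI matches the project name:

```shell
pip install rewardguard
```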
Premium package
Install via PyPI after purchasing at rewardguard.dev/premium. The package authenticates with your RewardGuard account at runtime — no extra index needed.
The premium package depends on the free package. Import premium features from rewardguard_premium.
After installing, sign in once with rewardguard-premium login. Your session is saved to ~/.rewardguard/session.json and refreshed automatically — no sign-in needed on subsequent runs.
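The install-and-sign-in sequence described above:

```shell
pip install rewardguard-premium
rewardguard-premium login   # one-time; session saved to ~/.rewardguard/session.json
```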
Quick Start
Free — live in-loop monitoring
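A minimal in-loop sketch, assuming Monitor is exported from the top-level rewardguard package; env and its step_components() hook are hypothetical stand-ins for your own environment:

```python
from rewardguard import Monitor

# Target mix: task 75%, safety 25% (relative weights are normalized automatically).
monitor = Monitor(expected={"task": 3, "safety": 1}, tolerance=5.0, window=200)

for step in range(1, 1001):
    rewards = env.step_components()   # hypothetical: one value per reward component
    monitor.step(rewards)

    if step % 200 == 0:
        result = monitor.check()      # rolling-window analysis; does not modify state
        if result.severity != "ok":
            monitor.print_report()    # formatted balance table on stdout
```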
Premium — auto-correction
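A sketch of the premium auto-correction loop, assuming AutoMonitor is imported from rewardguard_premium as documented; env.step_components() and env.set_reward_weights() are hypothetical environment hooks:

```python
from rewardguard_premium import AutoMonitor

monitor = AutoMonitor(
    expected={"task": 3, "safety": 1},
    baseline_steps=300,   # warm-up before z-score detection activates
    auto_correct=True,
)

for _ in range(5000):
    rewards = env.step_components()
    snapshot = monitor.step(rewards)      # None during warm-up

    # Always read the live weights property when applying corrections.
    env.set_reward_weights(monitor.weights)

    if snapshot is not None and snapshot.flag == "critical":
        monitor.print_report()
```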
Monitor Free
The primary live in-loop API. Zero external dependencies. Drop one monitor.step() call inside your training loop — call check() or print_report() whenever you want an analysis.
Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| expected | Dict[str, float] | required | Target distribution of reward components. Values are relative weights — they are normalized to percentages automatically. Example: {"task": 3, "safety": 1} → task 75%, safety 25%. |
| tolerance | float | 5.0 | Percentage-point tolerance before a component is flagged as imbalanced (±pp). A difference of ≤ tolerance is "ok"; ≤ 3× tolerance is "warning"; > 3× is "critical". |
| window | int | 200 | Number of recent steps used for rolling analysis. Older steps are retained in memory up to max_history. |
| max_history | int | 100 000 | Hard cap on stored steps. Implemented as a deque — oldest entries are evicted in O(1) once the cap is reached. |
All expected weights must be ≥ 0 and finite, and must sum to a positive number. tolerance and window must each be > 0. Violations raise ValueError at construction time.
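The normalization and validation the constructor performs can be sketched as follows (a reimplementation for illustration, not the package's actual code):

```python
def normalize_expected(expected):
    """Convert relative weights to a percentage distribution summing to 100."""
    if any(w < 0 or w != w or w == float("inf") for w in expected.values()):
        raise ValueError("expected weights must be >= 0 and finite")
    total = sum(expected.values())
    if total <= 0:
        raise ValueError("expected weights must sum to a positive number")
    return {name: 100.0 * w / total for name, w in expected.items()}
```

For example, normalize_expected({"task": 3, "safety": 1}) yields {"task": 75.0, "safety": 25.0}, matching the constructor table above.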
monitor.step(rewards, episode_done=False)
Record reward components for one environment step. Call once per step, every step.
| Parameter | Type | Description |
|---|---|---|
| rewards | Dict[str, float] | Component values for this step, e.g. {"task": 1.2, "safety": -0.1}. All values must be finite — NaN or Inf raises ValueError immediately. |
| episode_done | bool | Unused in the free tier; accepted so code written against the premium AutoMonitor API works unchanged when downgraded. |
monitor.check() → AnalysisResult
Compute a balance analysis over the current rolling window. Returns an AnalysisResult. Does not modify state — safe to call as often as needed.
Raises ValueError if no steps have been recorded yet.
monitor.print_report()
Print a formatted balance table to stdout. Equivalent to calling check() and passing the result to RewardGuard().print_analysis_report().
monitor.reset()
Clear all accumulated history without changing the configuration. step_count is reset to 0. Useful between distinct training phases.
Properties
| Property | Type | Description |
|---|---|---|
| step_count | int | Total steps recorded since creation or last reset. |
| expected | Dict[str, float] | Normalized target percentage distribution (sums to 100). Read-only copy. |
AnalysisResult Free
Returned by monitor.check() and analyze_balance(). A dataclass — all fields are read-only after construction.
| Field | Type | Description |
|---|---|---|
| real_percentages | Dict[str, float] | Observed percentage share of each component over the rolling window, measured by reward magnitude. |
| expected_percentages | Dict[str, float] | Target percentage distribution (normalized, sums to 100). |
| imbalance_report | Dict[str, Dict] | Per-component breakdown. Each inner dict has keys: real, expected, difference, abs_difference, status ("balanced"/"imbalanced"), severity ("ok"/"warning"/"critical"), recommendation (string), unexpected (bool). |
| suggested_reward_weights | Dict[str, float] | Recommended multipliers to rebalance components. Values > 1.0 mean increase that component's weight; < 1.0 means decrease. Clamped to [0.1, 5.0]. A value of 5.0 signals reward hacking (component completely absent). |
| episode_count | int | Number of steps (or episodes, for the log-based API) included in the analysis. |
| sources_found | List[str] | Sorted list of component names seen in the real data. |
| severity | str | Overall severity: "ok", "warning", or "critical". Elevated to the worst component severity. |
| unexpected_sources | List[str] | Components present in real data but absent from expected. These are flagged but not penalized. |
Severity thresholds
| Severity | Condition |
|---|---|
| ok | abs_difference ≤ tolerance |
| warning | tolerance < abs_difference ≤ 3 × tolerance |
| critical | abs_difference > 3 × tolerance, or component completely absent |
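The thresholds above map onto a simple classification, sketched here for illustration (the absent-component case, which is always critical, is handled separately by the analyzer):

```python
def severity(abs_difference, tolerance):
    """Classify a per-component deviation in percentage points."""
    if abs_difference <= tolerance:
        return "ok"
    if abs_difference <= 3 * tolerance:
        return "warning"
    return "critical"
```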
Log-based API Free
For post-hoc analysis of training log files rather than live in-loop monitoring.
Module-level convenience functions
rg.parse_logs(raw_text) → List[EpisodeData]
Parse a multi-episode training log from a raw string. Each episode block must start with a header matching Ep <N> | <STATUS> | reward=<value>. Returns a list of EpisodeData objects.
rg.analyze_balance(parsed_data, expected_percentages) → AnalysisResult
Analyze balance between observed and expected distributions across a list of episodes.
| Parameter | Type | Description |
|---|---|---|
| parsed_data | List[EpisodeData] | Output of parse_logs(). |
| expected_percentages | Dict[str, float] | Target percentage per source. Need not sum to 100 — will be normalized. |
rg.recommend_weights(real_percentages, expected_percentages) → Dict[str, float]
Standalone function to compute suggested weight multipliers from two distributions, without needing episode data.
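Putting the three module-level functions together, a post-hoc analysis sketch (training.log is a hypothetical log file in the documented header format):

```python
import rewardguard as rg

with open("training.log") as f:
    episodes = rg.parse_logs(f.read())   # headers: Ep <N> | <STATUS> | reward=<value>

result = rg.analyze_balance(episodes, {"task": 60, "safety": 40})
print(result.severity, result.sources_found)

# Multipliers can also be computed directly from two distributions.
weights = rg.recommend_weights(result.real_percentages, result.expected_percentages)
```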
RewardGuard class
The log-based facade that wraps LogParser and RewardAnalyzer. Useful when you need stateful parsing across multiple calls.
plot_balance() Free
Render a two-panel matplotlib figure from an AnalysisResult. Requires pip install matplotlib.
Left panel: grouped horizontal bar chart showing real vs expected percentage per component, bars colored by severity. Right panel: suggested weight multipliers with a dashed "no change" line at 1.0.
| Parameter | Type | Default | Description |
|---|---|---|---|
| result | AnalysisResult | required | Output of monitor.check() or analyze_balance(). |
| title | str \| None | None | Optional chart title. Defaults to a summary string including step count and severity. |
| save_path | str \| None | None | If provided, save the figure to this path (e.g. "report.png"). |
| show | bool | True | Call plt.show() after rendering. Set to False when saving headlessly in a CI system. |
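A headless-save sketch, assuming plot_balance is exported from the top-level package and monitor is an existing Monitor with recorded steps:

```python
from rewardguard import plot_balance

result = monitor.check()
plot_balance(result, title="Reward balance", save_path="report.png", show=False)
```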
AutoMonitor Premium
Drop-in superset of Monitor: every free-tier method works unchanged, and it adds baseline learning, z-score detection, a continuous alignment score, automatic weight correction, framework callbacks, and export/persistence.
Constructor
Accepts all Monitor parameters plus the following:
| Parameter | Type | Default | Description |
|---|---|---|---|
| baseline_steps | int | 300 | Steps collected before z-score detection activates. During warm-up, step() returns None. |
| z_threshold | float \| Dict[str, float] | 2.5 | Z-score threshold at which the alignment score equals 0.5 and the component is flagged. Pass a single float to apply the same threshold to all components, or a dict of {"component": threshold} for per-component sensitivity. |
| sigmoid_steepness | float | 1.2 | Controls how sharply the 0–1 alignment score drops around z_threshold. |
| auto_correct | bool | True | If True, automatically adjust weights when a component is flagged and the confidence window is satisfied. |
| correction_rate | float | 0.2 | Fraction of the required correction applied per auto-correct call. Lower → smoother but slower convergence. |
| correction_rate_decay | float | 0.0 | Amount by which correction_rate is reduced after each correction. Set > 0 to make corrections progressively more conservative over time. |
| min_confidence_steps | int | 50 | Minimum post-baseline steps before the first automatic correction is allowed. |
| drift_window | int | 30 | Number of recent snapshots used to estimate drift velocity (slope of alignment score). |
| starvation_window | int | 20 | Consecutive steps a component must be near-zero to trigger a starvation alert. |
| starvation_threshold | float | 1.0 | Absolute value below which a component's reward is considered "near-zero" for starvation detection purposes. |
| callbacks | List[Callable] | [] | Callables invoked with the AlignmentSnapshot after each post-baseline step. See Callbacks. |
monitor.step(rewards, episode_done=False) → AlignmentSnapshot | None
Same signature as the free step(). During the first baseline_steps steps, returns None — the monitor is learning the baseline distribution. After warm-up, returns an AlignmentSnapshot every call.
Properties
| Property | Type | Description |
|---|---|---|
| weights | Dict[str, float] | Current reward weight multipliers. Start at 1.0; drift as auto-correction runs. Always read this — never a stale copy — when applying weights to your environment. |
| alignment_score | float | Most recent alignment score in [0, 1]. Shortcut for monitor.snapshots[-1].alignment_score. Returns 1.0 during warm-up since no deviation has been detected yet. |
| is_baseline_complete | bool | True once the baseline warm-up window has been filled and z-score detection is active. |
| snapshots | List[AlignmentSnapshot] | All alignment snapshots produced since the baseline warm-up completed. The full per-step audit trail. |
| step_count | int | Inherited from Monitor. Total steps recorded. |
AlignmentSnapshot Premium
A point-in-time alignment measurement produced by AutoMonitor.step() after the baseline warm-up completes. Timestamped by global step index.
| Field | Type | Description |
|---|---|---|
| step | int | Global step index when this snapshot was taken. |
| alignment_score | float | 0.0 (fully misaligned) → 1.0 (fully aligned). Sigmoid-mapped from the maximum per-component z-score. |
| component_ratios | Dict[str, float] | Rolling-window percentage share of each component at this step. |
| z_scores | Dict[str, float] | Per-component deviation from the learned baseline in standard deviations. Values > z_threshold trigger flagging. |
| drift_velocity | float | Slope of alignment_score over the last drift_window snapshots. Negative = worsening trend; positive = recovering. |
| flag | str | "ok" / "warning" / "critical" based on max z-score vs threshold. |
| corrections_applied | Dict[str, float] | Mapping of component → new weight for any automatic corrections applied at this step. Empty dict if no correction was made. |
| starvation_alerts | List[str] | Components that have been near-zero for starvation_window consecutive steps — a strong reward-hacking signal. |
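One way the z-score-to-alignment mapping could look, consistent with the documented behavior (score exactly 0.5 at z_threshold, dropping more sharply as sigmoid_steepness grows); the exact functional form is an assumption, not the package's published formula:

```python
import math

def alignment_score(max_z, z_threshold=2.5, sigmoid_steepness=1.2):
    """Map the worst per-component z-score to a 0-1 alignment score."""
    return 1.0 / (1.0 + math.exp(sigmoid_steepness * (max_z - z_threshold)))
```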
snapshot.to_dict()
Serialize to a plain dict suitable for JSON encoding.
Exports Premium
Three methods give you the complete per-step record for downstream analysis, dashboards, or CI artifacts.
monitor.to_csv(path=None) → str
Returns the full snapshot history as a CSV string. Optionally writes to disk. Columns: step, alignment_score, flag, drift_velocity, starvation_alerts, then ratio_<comp> and z_<comp> for every component.
monitor.to_json(path=None) → str
Returns full state as a JSON string — including baseline statistics, current weights, and all snapshots. Optionally writes to disk.
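Typical end-of-run export, assuming monitor is a post-warm-up AutoMonitor:

```python
csv_text = monitor.to_csv("audit.csv")     # writes to disk and returns the CSV string
state_json = monitor.to_json("state.json") # baseline stats, weights, all snapshots
```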
monitor.print_report()
Prints the free-tier balance table plus the premium alignment state: current alignment score, drift velocity, current weights, and per-component z-scores.
Save & Resume Premium
Long runs can be checkpointed and resumed. The saved state includes baseline statistics, all snapshots, and the current weights — a resumed run continues seamlessly from where it left off.
Pass any constructor keyword argument to AutoMonitor.load() to override a saved setting — for example AutoMonitor.load("state.json", auto_correct=False) to replay the saved history in read-only mode.
Callbacks Premium
Pass a list of callables to the callbacks constructor parameter. Each callback is invoked with the AlignmentSnapshot after every post-baseline step. Three built-in factory functions are provided.
Weights & Biases
Logs rewardguard/alignment_score, rewardguard/drift_velocity, rewardguard/ratio/<comp>, and rewardguard/z_score/<comp> at each step.
TensorBoard
Stable-Baselines3
The SB3 callback reads info["reward_components"] from each environment step. Your environment must include this key.
Custom callback
Any callable that accepts an AlignmentSnapshot works.
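For example, a sketch of a callback that surfaces critical flags (all snapshot fields used here are documented above):

```python
def alert_on_critical(snapshot):
    """Log a warning whenever a snapshot is flagged critical."""
    if snapshot.flag == "critical":
        print(f"step {snapshot.step}: alignment={snapshot.alignment_score:.3f} "
              f"starved={snapshot.starvation_alerts}")

# Passed at construction time:
# monitor = AutoMonitor(expected={...}, callbacks=[alert_on_critical])
```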
plot_session() Premium
Render a multi-panel timeline figure from a completed AutoMonitor session. Requires pip install matplotlib.
CI/CD Integration
Fail a training run automatically when reward hacking is detected above a severity threshold. The AnalysisResult.severity string is the simplest gate; for more precision use the alignment score from the premium tier.
Free — severity gate
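A minimal gate script sketch using the documented free-tier API; the training loop itself is elided:

```python
import sys
from rewardguard import Monitor

monitor = Monitor(expected={"task": 3, "safety": 1})
# ... training loop calling monitor.step(rewards) ...

result = monitor.check()
if result.severity == "critical":
    print("RewardGuard: critical reward imbalance, failing the run")
    sys.exit(1)
```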
Premium — alignment score gate
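The premium gate can use the continuous score instead; the 0.8 cutoff below is an arbitrary example threshold, not a package default:

```python
import sys
from rewardguard_premium import AutoMonitor

monitor = AutoMonitor(expected={"task": 3, "safety": 1}, auto_correct=False)
# ... training loop calling monitor.step(rewards) ...

# alignment_score is 1.0 during warm-up, so guard on baseline completion.
if monitor.is_baseline_complete and monitor.alignment_score < 0.8:
    print(f"RewardGuard: alignment {monitor.alignment_score:.2f} below gate")
    sys.exit(1)
```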
Authentication Premium
The premium package authenticates with your RewardGuard account at runtime — not at import time. This allows offline imports, type-checking, and unit tests without credentials.
Option 1 — Interactive sign-in (recommended)
Run once before using the package. The session is saved and refreshed automatically.
Option 2 — API token (CI / automated environments)
Generate a token on your dashboard and export it in your CI environment. No password is stored or transmitted.
Option 3 — Environment credentials
Offline mode
If the package cannot reach the server, it runs in offline mode for up to 24 hours from the last successful verification. After that, a LicenseError is raised on the next AutoMonitor instantiation.