When we started building RewardGuard, the first question we had to answer was: what's the business model? The answer felt obvious at first — charge for the tool. Build something useful, put it behind a subscription, grow the company.
The more we thought about it, the less comfortable we were with that answer. The problem RewardGuard addresses — reward hacking and misalignment in RL systems — isn't just a problem for well-funded labs. It's a problem for every student training an agent in a gym environment, every startup iterating on a fine-tuned model, every researcher who can't afford enterprise tooling. If we kept the detection layer behind a paywall, those people would keep training blind.
So we made a different call: the detection engine is MIT-licensed and free forever. What we sell is the layer on top of it.
What "Open-Sourcing the Detection Layer" Actually Means
RewardGuard is two things in one package. There's the detection layer — the component that ingests reward signals, computes component ratios, applies statistical baselines, and flags anomalies. And there's the action layer — the component that interprets those flags, recommends specific parameter changes, and in premium mode, applies corrections automatically.
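To make the detection layer's job concrete, here is a minimal sketch of its core loop — compute each component's share of the total reward, then flag shares that drift far from a rolling statistical baseline. The class and function names here are illustrative only, not RewardGuard's actual API:

```python
from collections import deque
import statistics

def component_ratios(components):
    """Each component's share of the total absolute reward."""
    total = sum(abs(v) for v in components.values()) or 1.0
    return {name: abs(v) / total for name, v in components.items()}

class RatioAnomalyDetector:
    """Flags components whose reward share drifts far from a rolling baseline."""

    def __init__(self, window=100, z_threshold=3.0, warmup=10):
        self.window = window
        self.z_threshold = z_threshold
        self.warmup = warmup
        self.history = {}  # component name -> deque of recent shares

    def update(self, components):
        """Record one step's component rewards; return names flagged as anomalous."""
        flags = []
        for name, share in component_ratios(components).items():
            hist = self.history.setdefault(name, deque(maxlen=self.window))
            if len(hist) >= self.warmup:
                mean = statistics.fmean(hist)
                std = statistics.pstdev(hist)
                if std > 0 and abs(share - mean) / std > self.z_threshold:
                    flags.append(name)
            hist.append(share)
        return flags
```

The key idea is that a component suddenly dominating (or vanishing from) the reward signal is a classic symptom of reward hacking, and a rolling z-score on reward shares catches it without any environment-specific knowledge.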
The detection layer is open source. Anyone can install it, read the source, modify it, contribute to it, and use it in any project — commercial or otherwise — under the MIT license. The action layer, the auto-correction engine, the advanced reporting interface, and the API integrations are what we charge for.
You should always be able to see what your agent is doing. Detection is a transparency tool — it belongs in the commons. Automated correction and production-grade tooling are where we build a business.
What's Free vs. What's Premium
To be concrete about the boundary:
| Feature | Free (MIT) | Premium |
|---|---|---|
| Component reward logging | ✓ | ✓ |
| Ratio computation & anomaly detection | ✓ | ✓ |
| Training run reports (JSON/text) | ✓ | ✓ |
| Threshold configuration | ✓ | ✓ |
| CI/CD integration hooks | ✓ | ✓ |
| Guided rebalancing suggestions | — | ✓ |
| Auto-correction engine | — | ✓ |
| Live training dashboard | — | ✓ |
| Multi-run comparison | — | ✓ |
| Priority support & SLA | — | ✓ |
The free tier gives you everything you need to know what's going wrong. The premium tier gives you the tools to fix it faster and at scale.
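As an illustration of how two free-tier features — training-run reports and CI/CD hooks — can fit together, here is a sketch of a build gate that fails when a run report contains flagged steps. The JSON schema shown (a `steps` list with per-step `anomalies`) is an assumption for illustration, not RewardGuard's actual report format:

```python
import json

def ci_gate(report_path, max_flagged_steps=0):
    """Return a nonzero exit code if a run report has too many flagged steps."""
    with open(report_path) as f:
        report = json.load(f)
    flagged = [s for s in report["steps"] if s.get("anomalies")]
    if len(flagged) > max_flagged_steps:
        first = flagged[0]
        print(f"{len(flagged)} flagged steps; first at step "
              f"{first['step']}: {first['anomalies']}")
        return 1
    return 0
```

Wired into CI, a gate like this turns reward-signal anomalies into failing builds, the same way a coverage threshold turns untested code into failing builds.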
The Practical Argument for Openness
There's a pragmatic reason to open-source detection beyond the idealistic one: open-source tools get better faster. The RL research community includes thousands of people who are deeply familiar with the failure modes we're trying to detect. Some of them will use the free tool and find edge cases we never thought of. Some will contribute detection heuristics for environments we haven't built test cases for. Some will file issues that turn into features.
A closed detection engine is a detection engine that only knows about the failure modes its paid users encountered. An open detection engine knows about every failure mode anyone encountered — and the community actively helps fix the gaps.
The result is a better free tool and, because the premium tier depends on the same detection layer, a better premium product too.
What We're Not Open-Sourcing, and Why
The auto-correction engine is not open source. This isn't about protecting trade secrets — it's about incentive structures. If the auto-correction engine were MIT-licensed, there would be no reason to pay for premium, which means there would be no revenue to fund continued development of either tier. An open-source project that can't sustain itself eventually stagnates.
We've seen this pattern in developer tooling: companies that try to open-source everything often end up with great free tools that get abandoned when funding runs out. The open-core model — free tier for individual use, paid tier for production teams — is how you build something that can keep improving.
We want the detection layer to be the standard instrumentation for RL training workflows, the way coverage tools became standard for software testing. That only happens if it's free and accessible.
How to Contribute
The free package is on GitHub under the RewardGuard organization. Contributions are welcome in a few areas:
- New detection heuristics: If you've encountered a reward hacking pattern that existing ratio analysis doesn't catch, open an issue with a minimal reproducing example. A good bug report here is more valuable than code — we can build the detector once we understand the failure mode.
- Environment integrations: RewardGuard works with any RL loop by design, but first-class integrations with specific environments (Gymnasium, PettingZoo, custom sim environments) reduce the setup friction significantly.
- Documentation and tutorials: Much of the technical content on this blog exists to bridge the gap between "what does RewardGuard detect" and "why does this matter" for people new to reward hacking. More examples from real environments are always useful.
- Bug reports: Edge cases in threshold computation, unexpected behavior with sparse reward environments, issues with specific Python versions — all of this helps.
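To give a feel for what an environment integration contribution might look like, here is a framework-agnostic sketch: a wrapper that pulls per-component rewards out of an env's info dict and attaches anomaly flags on the way back. The `reward_components` key, the detector interface, and every name below are hypothetical, not RewardGuard's actual API:

```python
class ShareCapDetector:
    """Toy detector: flags any component contributing more than `cap` of total |reward|."""

    def __init__(self, cap=0.9):
        self.cap = cap

    def update(self, components):
        total = sum(abs(v) for v in components.values()) or 1.0
        return [name for name, v in components.items() if abs(v) / total > self.cap]

class ComponentRewardMonitor:
    """Wraps any env-like object whose step() returns (obs, reward, done, info)."""

    def __init__(self, env, detector):
        self.env = env
        self.detector = detector

    def reset(self, *args, **kwargs):
        return self.env.reset(*args, **kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Fall back to the scalar reward if the env exposes no component breakdown.
        components = info.get("reward_components", {"total": reward})
        flags = self.detector.update(components)
        if flags:
            info["reward_anomalies"] = flags
        return obs, reward, done, info
```

A first-class Gymnasium or PettingZoo integration would follow the same shape — intercept `step()`, read the component breakdown, annotate the info dict — just matching that framework's wrapper conventions and step signature.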
The decision to open-source the detection layer was easy once we framed it correctly. Monitoring and transparency in AI training shouldn't be a premium feature. If you're training an RL agent anywhere — in a research lab, a startup, a class project — you should be able to see what your reward signal is actually doing. That's what the free tier is for. Everything else is how we make sure it's still here in five years.