Capafy
Honest Failure Guard

Gives AI agents a structured stop-and-declare protocol for four failure states — low confidence, collapsed premise, exhausted search, and incomplete evidence — so agents label what they don't know instead of fabricating what they can't verify.

The Problem

Your AI agent doesn't know the answer.

So it gives you one anyway.

It doesn't flag that it's uncertain. It doesn't tell you the sources ran dry. It doesn't mention that the premise you gave it was broken. It generates fluent, confident-sounding output from zero verified knowledge, and from the outside you have no way to tell the difference.

This is not a rare edge case. It is the default behavior of AI agents under pressure to produce output.


How It Works

This skill installs a structured stop-and-declare protocol for four specific failure states. When an agent hits one of these states, it stops presenting output as if it were certain, labels the failure explicitly, and waits for human direction.

The agent does not self-recover by lowering its standards. It does not proceed with fabricated confidence. It does not treat stopping as failure — stopping honestly is the correct behavior.
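The stop-and-wait behavior described above can be sketched as ordinary control flow. This is a minimal Python sketch, not the skill's implementation: `HonestFailure` and `run_task` are hypothetical names, and it assumes a task is a list of steps. A step that hits a failure state raises instead of returning a guess, and the runner stops and hands the declaration back rather than retrying with lower standards.

```python
class HonestFailure(Exception):
    """Raised to halt a task when the agent hits one of the four failure states."""

    def __init__(self, state: str, details: dict[str, str], to_proceed: str) -> None:
        super().__init__(state)
        self.state = state
        self.details = details
        self.to_proceed = to_proceed

    def declaration(self) -> str:
        # Label line first, then the supporting details, then the path forward.
        lines = [f"[{self.state}]"]
        lines += [f"{key}: {value}" for key, value in self.details.items()]
        lines.append(f"To proceed: {self.to_proceed}")
        return "\n".join(lines)


def run_task(steps):
    """Run steps in order; on an HonestFailure, stop and return the declaration.

    Steps after the failing one never run, and partial results are not
    returned as if they were a finished answer.
    """
    results = []
    for step in steps:
        try:
            results.append(step())
        except HonestFailure as failure:
            # Stop here and wait for human direction; no self-recovery.
            return {"status": "halted", "declaration": failure.declaration()}
    return {"status": "done", "results": results}
```

Raising an exception is one design choice among several; its appeal is that no later step can accidentally build on the failed one.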


The Four Failure States

State 1 — Low Confidence

Trigger: The agent's internal assessment of a claim's reliability is low — sources are absent, training data is the only basis, or the information is time-sensitive without verification.

What the agent does:

  • Assigns a confidence score on a 1–5 scale, where a higher score means weaker support for the claim
  • At score 4+: issues an abstention declaration
  • Labels the claim with [SPECULATION] or [UNVERIFIABLE]
  • Does not present the claim as fact

Output format:

[Confidence: 4/5 — abstaining]
Reason: Primary sources unavailable. Training data basis only.
What I can offer: [description of what is verifiable]
To proceed: [what the human would need to provide or confirm]

A score of 4 or 5 does not mean the agent refuses to help. It means the agent is honest about the epistemic state of its output and hands the decision back to the human.
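A minimal sketch of the State 1 rule, assuming the score has already been assigned (the text does not specify how the agent computes it). The threshold of 4 comes from the text; the function name `low_confidence_declaration` and the extra `Claim:` line carrying the label are hypothetical additions. Because the text only defines behavior at 4 and above, lower scores return `None` here.

```python
ABSTAIN_THRESHOLD = 4  # from the text: "At score 4+: issues an abstention declaration"
ALLOWED_LABELS = ("[SPECULATION]", "[UNVERIFIABLE]")


def low_confidence_declaration(claim, score, label, reason, can_offer, to_proceed):
    """Return a State 1 declaration, or None when the score is below the threshold."""
    if not 1 <= score <= 5:
        raise ValueError("score must be on the 1-5 scale")
    if score < ABSTAIN_THRESHOLD:
        return None  # behavior below 4 is not specified by the protocol
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}")
    return "\n".join([
        f"[Confidence: {score}/5 \u2014 abstaining]",  # \u2014 matches the format above
        f"Reason: {reason}",
        # Hypothetical extra line: the claim always carries its label, never bare.
        f"Claim: {label} {claim}",
        f"What I can offer: {can_offer}",
        f"To proceed: {to_proceed}",
    ])
```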

State 2 — Premise Collapse

Trigger: A core assumption that the task depends on turns out to be false, ambiguous, or contradicted by new information.

What the agent does:

  • Stops work immediately
  • Reports which premise collapsed and why
  • Maps the downstream impact (what parts of the task are now invalid)
  • Does not continue building on a broken foundation

Output format:

[Premise collapse detected]
Collapsed premise: [what was assumed]
Evidence of collapse: [what contradicts it]
Impact: [which parts of the task are affected]
To proceed: [what needs to be redefined]

Continuing to work after a premise collapse — hoping the output will still be useful — produces compounding errors that are expensive to untangle.
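"Maps the downstream impact" can be read as a graph walk. The sketch below assumes the agent records, for each task step, which premises or earlier steps it depends on; that dependency map and the function names are hypothetical. It inverts the map and does a breadth-first search from the collapsed premise to find every step that is now invalid, directly or transitively.

```python
from collections import deque


def downstream_impact(depends_on: dict[str, set[str]], collapsed: str) -> set[str]:
    """Return every step that depends, directly or transitively, on `collapsed`."""
    # Invert the graph: for each premise or step, which steps consume it?
    consumers: dict[str, set[str]] = {}
    for step, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, set()).add(step)
    affected: set[str] = set()
    queue = deque([collapsed])
    while queue:
        node = queue.popleft()
        for step in consumers.get(node, set()):
            if step not in affected:
                affected.add(step)
                queue.append(step)
    return affected


def premise_collapse_report(depends_on, premise, evidence, to_proceed):
    impact = sorted(downstream_impact(depends_on, premise))
    return "\n".join([
        "[Premise collapse detected]",
        f"Collapsed premise: {premise}",
        f"Evidence of collapse: {evidence}",
        f"Impact: {', '.join(impact) if impact else 'no dependent steps recorded'}",
        f"To proceed: {to_proceed}",
    ])
```

Steps that do not depend on the collapsed premise stay out of the impact list, which keeps the report focused on what actually needs redoing.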

State 3 — Search Exhaustion

Trigger: Multiple search attempts across distinct query families have returned zero valid, verifiable sources for a critical claim.

What the agent does:

  • Reports the exhaustion state
  • Lists the queries that were attempted
  • Applies [UNVERIFIABLE] to the affected claim
  • Does not fabricate a source to fill the gap

Output format:

[Search exhausted]
Claim: [what could not be verified]
Queries attempted: [list]
Result: [UNVERIFIABLE] — no confirming source found
To proceed: [alternative approach or human input needed]

An [UNVERIFIABLE] label is not a failure of the agent. It is an honest accounting of what the available tools could and could not confirm.
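A sketch of the State 3 check, assuming the agent has a `search(query)` function and an `is_valid(source)` verifier; both are hypothetical stand-ins, as is the cutoff `min_families=2` for how many distinct query families count as "multiple". The key property is that the only sources ever returned are ones the search produced and the verifier accepted.

```python
from typing import Callable, Iterable


def check_search_exhaustion(
    claim: str,
    query_families: dict[str, list[str]],
    search: Callable[[str], Iterable[str]],
    is_valid: Callable[[str], bool],
    to_proceed: str,
    min_families: int = 2,  # assumption: what "multiple families" means
) -> dict:
    attempted: list[str] = []
    families_tried = 0
    for queries in query_families.values():
        if queries:
            families_tried += 1
        for query in queries:
            attempted.append(query)
            # Keep only sources that pass verification; nothing is invented here.
            valid = [source for source in search(query) if is_valid(source)]
            if valid:
                return {"status": "verified", "sources": valid, "queries": attempted}
    if families_tried < min_families:
        # Too few distinct query families tried to declare exhaustion yet.
        return {"status": "keep_searching", "queries": attempted}
    report = "\n".join([
        "[Search exhausted]",
        f"Claim: {claim}",
        f"Queries attempted: {'; '.join(attempted)}",
        "Result: [UNVERIFIABLE] \u2014 no confirming source found",
        f"To proceed: {to_proceed}",
    ])
    return {"status": "exhausted", "queries": attempted, "report": report}
```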

State 4 — Incomplete Evidence

Trigger: Sources exist but fall below the quality threshold the claim requires: they are too old, too indirect, or do not cover critical evidence categories.

What the agent does:

  • Reports which evidence categories are missing
  • Flags affected claims with [STALE_DATA⚠️] or applies a confidence reduction
  • Does not proceed to high-confidence conclusions from low-quality evidence

Output format:

[Evidence incomplete]
Missing: [category — e.g., "primary source for X", "post-2023 data on Y"]
Available: [what was found and its limitations]
Confidence reduction applied: [affected claims]
To proceed: [what additional evidence would resolve this]
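A sketch of the State 4 check under several assumptions not stated in the text: each claim maps to one required evidence category, "too old" means older than a caller-chosen `max_age_days`, and the rule for choosing between the two flags is hypothetical (stale-only evidence gets [STALE_DATA⚠️], no evidence at all gets a confidence reduction). `Source` and `incomplete_evidence_report` are illustrative names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Source:
    category: str     # evidence category this source covers
    published: date
    description: str


def incomplete_evidence_report(claims, sources, max_age_days, today, to_proceed):
    """`claims` maps claim -> required category. Returns None if all are covered."""
    fresh, stale, available = set(), set(), []
    for source in sources:
        is_stale = (today - source.published).days > max_age_days
        (stale if is_stale else fresh).add(source.category)
        note = ", stale" if is_stale else ""
        available.append(f"{source.description} ({source.published.isoformat()}{note})")
    missing = sorted({cat for cat in claims.values() if cat not in fresh})
    if not missing:
        return None
    flagged = []
    for claim, category in claims.items():
        if category in fresh:
            continue
        # Hypothetical rule: stale-only evidence -> STALE flag; none -> reduced confidence.
        if category in stale:
            flagged.append(f"[STALE_DATA⚠️] {claim}")
        else:
            flagged.append(f"(confidence reduced) {claim}")
    return "\n".join([
        "[Evidence incomplete]",
        f"Missing: {', '.join(missing)}",
        f"Available: {'; '.join(available) or 'nothing'}",
        f"Confidence reduction applied: {'; '.join(flagged)}",
        f"To proceed: {to_proceed}",
    ])
```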

Human Override

Every failure state ends with a clear statement of what the human would need to provide or confirm to allow the agent to continue.

The human can:

  • Provide the missing information directly
  • Accept the labeled uncertainty and instruct the agent to proceed with caveats
  • Redefine the premise or scope
  • Confirm that a lower evidence standard is acceptable for this task

The agent does not self-override. "I'll try my best anyway" is not an honest failure response — it is the behavior this skill is designed to replace.
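The override rules above amount to a gate that only an explicit human decision can open. A minimal sketch, assuming the four options listed are the only valid ones; `Override` and `FailureGate` are hypothetical names, and in a real system `human_decides` would need to be wired to human input only, since nothing in Python itself stops the agent from calling it.

```python
from enum import Enum


class Override(Enum):
    PROVIDE_INFORMATION = "provide the missing information"
    ACCEPT_WITH_CAVEATS = "accept the labeled uncertainty, proceed with caveats"
    REDEFINE_SCOPE = "redefine the premise or scope"
    LOWER_EVIDENCE_STANDARD = "confirm a lower evidence standard is acceptable"


class FailureGate:
    """Blocks continuation after a declared failure until a human decides."""

    def __init__(self, declaration: str) -> None:
        self.declaration = declaration
        self.decision = None
        self.payload = ""

    def human_decides(self, decision: Override, payload: str = "") -> None:
        # Only one of the four explicit options unlocks the gate.
        if not isinstance(decision, Override):
            raise TypeError("only an explicit human Override can unlock the gate")
        needs_payload = (Override.PROVIDE_INFORMATION, Override.REDEFINE_SCOPE)
        if decision in needs_payload and not payload:
            raise ValueError(f"{decision.name} requires the new information or scope")
        self.decision = decision
        self.payload = payload

    def may_continue(self) -> bool:
        return self.decision is not None
```

Usage: a free-form string such as "I'll try my best anyway" is rejected, so the gate stays closed until one of the listed options is chosen.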


Before / After

Before:

"The company was founded in 2018 and currently has approximately 3,400 employees across 12 offices."
[None of this was verified. The agent produced plausible-sounding numbers from training data.]

After (with Honest Failure Guard):

[Confidence: 4/5 — abstaining]
Founding year: unverified. Employee count: [UNVERIFIABLE] — search returned no current primary source.
What I can confirm: [company exists, general industry context]
To proceed: provide a current source or confirm that approximate figures are acceptable.


Hard Rules

  1. Fluent output is not evidence of accuracy. Confidence in tone does not correspond to confidence in facts. The agent's output quality check is epistemic, not stylistic.
  2. Abstention is a valid and correct response. Stopping honestly when evidence is insufficient is not a failure — it is the behavior that prevents downstream harm from fabricated output.
  3. Premise collapse stops work immediately. Building on a broken premise produces compounding errors. The agent stops, maps the damage, and waits.
  4. [UNVERIFIABLE] is mandatory when search is exhausted. The agent does not invent a source, cite a likely-existing URL without accessing it, or lower the definition of "found" to mean "probably exists."
  5. Every failure state ends with a clear path forward. The agent does not just stop — it tells the human what would be needed to continue. A dead end without a map is not a useful output.
  6. Human override is required to proceed past a declared failure. The agent does not self-recover by redefining the failure condition. Only explicit human direction unlocks continuation.
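Rule 4 can be partly enforced mechanically: a citation survives only if the agent actually accessed that URL during the task. The sketch assumes a hypothetical inline citation syntax, `[source: URL]`, and a set of accessed URLs kept by the agent's tooling; neither is defined by the skill. Unaccessed citations are replaced with [UNVERIFIABLE] rather than kept on the grounds that they probably exist.

```python
import re

# Hypothetical citation syntax: "[source: https://...]"
CITATION = re.compile(r"\[source: (?P<url>[^\]]+)\]")


def enforce_accessed_citations(text: str, accessed_urls: set[str]) -> tuple[str, list[str]]:
    """Replace citations to URLs that were never accessed with [UNVERIFIABLE]."""
    rejected: list[str] = []

    def check(match: re.Match) -> str:
        url = match.group("url").strip()
        if url in accessed_urls:
            return match.group(0)  # accessed: keep the citation as written
        rejected.append(url)
        return "[UNVERIFIABLE]"

    return CITATION.sub(check, text), rejected
```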