HCS-25 (Signal Family): SimpleMath / SimpleScience Evals (Informative)

This document complements HCS-25 by describing a concrete, minimal evaluation methodology used by some ecosystems to populate “baseline correctness” trust signals.

The goal is not to exhaustively benchmark intelligence. The goal is to detect:

subjects that cannot follow simple instructions reliably,
endpoints that time out or fail to respond, and
endpoints that respond but are consistently incorrect or unparseable.

Identifier guidanceDirect link to Identifier guidance

Simple eval signals SHOULD follow the HCS-25 signal identifier namespace pattern (see Signal Identifier Namespacing).

Implementations typically store per-run details (question id, status, timestamps) and also store an aggregated scalar score used by trust adapters.

SimpleMathDirect link to SimpleMath

Prompt format (recommended)Direct link to Prompt format (recommended)

Implementations SHOULD use a prompt that reduces ambiguity and makes automatic grading robust. For example:

Instruction: “Answer with just the number.”
A single arithmetic expression.

Example:

Answer with just the number.

What is 37 + 58?

Question generation (recommended)Direct link to Question generation (recommended)

Implementations SHOULD generate a random arithmetic question per evaluation run. Common choices include:

addition (two integers, e.g. 10–100)
subtraction (two integers, e.g. 10–100 minus 1–50)
multiplication (small integers, e.g. 2–12)

Implementations SHOULD store a stable questionId describing the operation and operands (e.g., math:add:37+58).

GradingDirect link to Grading

Implementations SHOULD grade by extracting the first numeric token from the response and comparing it to the expected answer.

Recommended outcomes:

correct → score = 100
wrong → score = 0
unparseable (no numeric token) → score = 0
timeout / missing / error → score = 0

Stored signal fieldsDirect link to Stored signal fields

Implementations SHOULD store:

questionId
expected (optional; can be omitted if sensitive)
response (optional; can be omitted/redacted)
status and score
sessionId / conversationId (optional; helps correlate debugging)

SimpleScienceDirect link to SimpleScience

Prompt format (recommended)Direct link to Prompt format (recommended)

Implementations SHOULD use a small multiple-choice question set so grading is deterministic and language-agnostic.

Recommended prompt structure:

A question stem
Four choices labeled A, B, C, D
Instruction: “Answer with just A, B, C, or D.”

Example:

Answer with just A, B, C, or D.

Which gas do plants primarily absorb during photosynthesis?
A) Oxygen
B) Carbon dioxide
C) Nitrogen
D) Helium

Question bankDirect link to Question bank

Implementations SHOULD use a curated, small question bank of basic science facts (chemistry, biology, physics, earth science). Each question MUST have:

stable questionId
prompt text
exactly four options
an unambiguous correct choice A|B|C|D

Implementations SHOULD randomly sample a question per evaluation run.

GradingDirect link to Grading

Implementations SHOULD grade by extracting a single-letter answer from the response:

If a standalone A|B|C|D token exists, use it.
Optionally, fall back to matching an option string when the response contains the full option text.

Recommended outcomes:

correct → score = 100
wrong → score = 0
unparseable → score = 0
timeout / missing / error → score = 0

Time budgets and retriesDirect link to Time budgets and retries

Implementations SHOULD:

set a time budget per eval run (e.g., 20–30s for synchronous chat endpoints);
treat timeouts as a scored failure (0) unless the ecosystem explicitly chooses not to penalize missingness (see HCS-25 contribution modes).

Implementations MAY retry transient failures, but SHOULD cap retries to avoid amplifying load.

Anti-gaming guidance (informative)Direct link to Anti-gaming guidance (informative)

Simple evals are easy to game. Implementations SHOULD:

rotate operands/questions,
avoid publishing the exact full question bank if gaming becomes an issue (publish hashes/ids instead),
add a small randomized “instruction compliance” check (e.g., “answer with just the number”) to reduce prompt-injection style failures.

Identifier guidanceDirect link to Identifier guidance​

SimpleMathDirect link to SimpleMath​

Prompt format (recommended)Direct link to Prompt format (recommended)​

Question generation (recommended)Direct link to Question generation (recommended)​

GradingDirect link to Grading​

Stored signal fieldsDirect link to Stored signal fields​

SimpleScienceDirect link to SimpleScience​

Prompt format (recommended)Direct link to Prompt format (recommended)​

Question bankDirect link to Question bank​

GradingDirect link to Grading​

Time budgets and retriesDirect link to Time budgets and retries​

Anti-gaming guidance (informative)Direct link to Anti-gaming guidance (informative)​

Identifier guidanceDirect link to Identifier guidance

SimpleMathDirect link to SimpleMath

Prompt format (recommended)Direct link to Prompt format (recommended)

Question generation (recommended)Direct link to Question generation (recommended)

GradingDirect link to Grading

Stored signal fieldsDirect link to Stored signal fields

SimpleScienceDirect link to SimpleScience

Prompt format (recommended)Direct link to Prompt format (recommended)

Question bankDirect link to Question bank

GradingDirect link to Grading

Time budgets and retriesDirect link to Time budgets and retries

Anti-gaming guidance (informative)Direct link to Anti-gaming guidance (informative)