Signals for 30 May 2026

Ten selected signals on agent evals, scientific benchmarks, multi-agent coherence, AI workflows and domain models.

Gram: Assessing sabotage propensities via automated alignment auditing

arXiv reasoning / agents / evals

Gram introduces automated alignment auditing for sabotage propensity in agentic coding and research deployments. This matters because serious agent rollout needs behavior-level evaluation, not just output review.

#agent #evals #research-evals

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

arXiv reasoning / agents / evals

ProjectionBench evaluates scientific hypothesis generation as information is progressively revealed. This is useful because real discovery work is uncertain and incremental, not a static benchmark lookup.

#evals #research-evals #systems-framing

The deadly Ebola outbreak is proving difficult to control

MIT Technology Review AI

MIT Technology Review covers the operational difficulty of controlling a Bundibugyo virus outbreak. The selected angle is operational readiness: detection, coordination and feedback loops matter more than isolated capability.

#builder #evals #tooling-runtime

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

arXiv reasoning / agents / evals

This paper formalizes how multi-component LLM agents can be locally coherent but globally incoherent. It is directly relevant to workflow design, runtime checks and repair mechanisms for composed agent systems.

#agent #builder #research-evals #systems-framing

What happens when companies become too AI-pilled?

TechCrunch AI

TechCrunch discusses Aaron Levie's warning that executives deciding AI can replace jobs often understand those jobs poorly. The useful signal is the need for workflow understanding before automation claims.

#agent #agentic-workflows #evals

Does your CEO have AI psychosis? Aaron Levie thinks most of them do.

TechCrunch AI

This podcast covers the same critique from Aaron Levie: AI replacement narratives often skip the actual work analysis. It supports a problem-first implementation position.

#agent #agentic-workflows #evals

New review paper argues code is how AI agents think and act, not just what they produce

The Decoder

The Decoder summarizes a review paper arguing that tools, memory, tests and permission boundaries are the layer that turns a model into an agent. Useful frame: model plus harness equals agent.

#agent #agentic-workflows

Cognition's Scott Wu says AI coding agents shouldn't replace humans

TechCrunch AI

Cognition positions Devin as a coding agent that works with human programmers rather than replacing them. This points to delegation, review and workflow ownership as the mature agent pattern.

#agent #agentic-workflows

OpenAI is giving away its life sciences AI model to help governments prepare for the next pandemic

The Decoder

OpenAI is giving governments access to GPT-Rosalind, its life sciences model, through the Rosalind Biodefense program. The practical signal is domain AI as public infrastructure, where governance and evaluation matter as much as capability.

#evals #research-evals

PATH's agentic AI solutions could accelerate enterprise adoption

Google News AI Adoption

Google News surfaced an MSN item about PATH's agentic AI solutions and enterprise adoption. It is the weakest selected signal, but it shows the current market language Bart should pressure-test: what work, permissions and accountability actually change?

#agent #agentic-workflows #implementation