Signals for 2026-06-17

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

arXiv reasoning / agents / evals

This is relevant because serious AI implementation stands or falls on evaluation, reliability, and an understanding of new failure modes.

#agent #evals #implementation #research-evals

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

arXiv reasoning / agents / evals

This is relevant because serious AI implementation stands or falls on evaluation, reliability, and an understanding of new failure modes.

#agent #builder #implementation #research-evals

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

arXiv reasoning / agents / evals

This is relevant because serious AI implementation stands or falls on evaluation, reliability, and an understanding of new failure modes.

#agent #evals #research-evals #systems-framing

Frontier post-training recipe review with Finbarr Timbers

Interconnects

This is relevant because serious AI implementation stands or falls on evaluation, reliability, and an understanding of new failure modes.

#evals #research-evals

— a still that plays

Simon Willison

This is relevant because the builder layer around AI is becoming more concrete: tools, runtimes, and development workflows increasingly determine where the real leverage lies.

#tooling-runtime

Microsoft's Copilot Cowork moves to usage-based billing and may tap DeepSeek

The Decoder

This is relevant because serious AI implementation stands or falls on evaluation, reliability, and an understanding of new failure modes.

#evals #research-evals

Berlin court rules Google's AI Overviews are just a new search format, not original content

The Decoder

This is relevant because serious AI implementation stands or falls on evaluation, reliability, and an understanding of new failure modes.

#evals #research-evals

Anthropic backs off unpopular billing overhaul as price war with OpenAI looms

The Decoder

This is relevant because it shows where durable value in the AI stack may settle once the hype fades.

#agent #builder #market-strategy

Exclusive eBook: How AI is becoming the next military advisor

MIT Technology Review AI

This is relevant because model choice is increasingly an architecture question of cost, context, latency, and control.

#models-architecture

Monday.com (NASDAQ: MNDY) Launches AI Work Platform To Drive Workflow Automation And Sustain Revenue Growth - foreignpolicyjournal.com

Google News AI Lab Watch

This is relevant because adoption only counts once AI visibly lands in daily processes and operating models.

#agent #implementation-adoption #systems-framing