AI Intelligence Briefing
Sunday, March 8, 2026
📋 EXECUTIVE SUMMARY
Top 5 Stories:
- OpenAI GPT-5.4 Launches: Native Computer Use + Excel Integration - Computer-using agent scores 75.0% on OSWorld-Verified, topping the 72.4% human baseline; new Finance suite integrates with Excel/Sheets; $17.50 per 1M tokens (US)
- Microsoft Phi-4-reasoning-vision-15B: When to Think, When Not To - 15B model matches 32B-class rivals on 1/5th the training data; mixed reasoning (20% traces / 80% direct); open-weight release (US)
- Google Gemini 3.1 Flash-Lite: 1/8th Pro Cost, 2.5× Faster - Lightning-fast inference (363 tok/s); dynamic thinking levels; $1.75 per 1M total vs $14 for Pro (US)
- Black Forest Labs Self-Flow: 2.8× Faster Training, No External Teachers - Self-supervised learning replaces CLIP/DINOv2; 50× faster convergence vs vanilla; open inference code released (Open Source)
- OpenAI Robotics Head Resigns Over Pentagon Deal - Caitlin Kalinowski quits citing warrantless surveillance concerns and "lethal autonomy without human authorization" (US)
Key Themes: The week closes with three massive model releases showcasing divergent efficiency strategies—OpenAI doubling down on computer-using agents and enterprise integrations, Microsoft proving small models can reason selectively, and Google racing to the bottom on cost. Meanwhile, Black Forest Labs rewrites training fundamentals by eliminating external dependencies, and OpenAI faces internal backlash over its Pentagon contract expanding into controversial territory.
Geographic Coverage: United States (4 stories), Open Source (1 story)
Next 24h Watch: Will other labs release computer-use capabilities? How will enterprises respond to GPT-5.4's pricing vs Gemini Flash-Lite? Can Self-Flow training be independently reproduced? What happens to OpenAI's robotics division post-resignation?
STORY 1: 🤖 AGENTIC AI & WORKFLOWS - OpenAI GPT-5.4 Launches: Native Computer Use + Excel Integration
Why it matters: OpenAI released GPT-5.4 (Thinking and Pro variants), its first general-purpose model with native, state-of-the-art computer-use capabilities. The model can navigate desktops using screenshots plus keyboard/mouse commands, achieving 75.0% success on OSWorld-Verified (human baseline: 72.4%) and 89.3% on BrowseComp (web browsing). Beyond agentic capabilities, OpenAI launched ChatGPT for Excel and Google Sheets (beta) with direct cell integration, plus partnerships with FactSet, MSCI, Third Bridge, and Moody's for enterprise finance workflows. On an internal investment banking benchmark, GPT-5.4 jumped from 43.7% (GPT-5) to 88.0%. The model also introduces tool search, reducing token usage by 47% on tasks with 36 MCP servers by retrieving tool definitions on-demand rather than loading all upfront. Available via API/Codex at $2.50/$15 per 1M tokens (Thinking) and $30/$180 (Pro).
The Gist:
- Native computer use: screenshot-driven desktop/web navigation, Playwright integration
- OSWorld-Verified: 75.0% (GPT-5.4) vs 47.3% (GPT-5.2) vs 72.4% (human baseline)
- BrowseComp: 89.3% (GPT-5.4 Pro) vs 72.3% (GPT-5.2) — new SOTA
- Tool search: 47% token reduction on 250-task MCP Atlas benchmark with 36 servers
- Finance suite: Excel/Sheets cell integration, FactSet/MSCI/Third Bridge/Moody's partnerships
- Internal banking benchmark: 43.7% → 88.0% (GPT-5 → GPT-5.4 Thinking)
- Pricing: $17.50 total per 1M tokens (GPT-5.4 Thinking), $210 total (Pro)
- 33% fewer false claims on user-flagged factual errors vs GPT-5.2
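GPT-5.4's tool search amounts to a two-step lookup: query a lightweight index of tool names and descriptions, then load full JSON schemas only for the matches, instead of sending all 36 servers' definitions up front. OpenAI has not published the mechanism; the Python sketch below is a hypothetical keyword-index version in which the registry, `search_tools`, and `load_definitions` are all invented for illustration.

```python
# Hypothetical tool registry: full JSON schemas can be large, so only
# name + description participate in the searchable index.
TOOLS = {
    "get_stock_price": {
        "description": "Fetch the latest price for a ticker symbol.",
        "schema": {"type": "object",
                   "properties": {"ticker": {"type": "string"}}},
    },
    "send_email": {
        "description": "Send an email to a recipient.",
        "schema": {"type": "object",
                   "properties": {"to": {"type": "string"},
                                  "body": {"type": "string"}}},
    },
}

def search_tools(query, registry=TOOLS):
    # Cheap keyword match over names and descriptions; a production
    # system would likely use embeddings or BM25 instead.
    terms = query.lower().split()
    return [name for name, meta in registry.items()
            if any(t in (name + " " + meta["description"]).lower()
                   for t in terms)]

def load_definitions(names, registry=TOOLS):
    # Only the matched tools get their full schema sent to the model,
    # which is where the reported 47% token reduction would come from.
    return {n: registry[n]["schema"] for n in names}

hits = search_tools("stock price")
defs = load_definitions(hits)
```

The token savings scale with registry size: the index cost stays roughly constant while the schemas loaded per task stay proportional to the tools actually used.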
STORY 2: 🧠 FRONTIER MODELS - Microsoft Phi-4-reasoning-vision-15B: When to Think, When Not To
Why it matters: Microsoft released Phi-4-reasoning-vision-15B, a 15-billion-parameter multimodal model trained on just 200 billion multimodal tokens (1/5th of rivals like Qwen3-VL and Gemma3, which consumed 1+ trillion). The model's key innovation is a "mixed reasoning and non-reasoning" architecture: 20% of the training data includes explicit reasoning traces while the remaining 80% consists of direct responses, teaching the model to decide on its own when a problem warrants extended reasoning and when a fast answer suffices.
The Gist:
- 15B params, trained on ~200B multimodal tokens (1/5th of Qwen3-VL, Gemma3 at 1T+)
- Mixed reasoning: 20% explicit reasoning traces, 80% direct responses
- Model autonomously decides when to reason vs. respond fast
- Benchmarks: 84.8 AI2D, 83.3 ChartQA, 75.2 MathVista, 88.2 ScreenSpot v2, 54.3 MMMU
- Trails Qwen3-VL-32B but competitive with similarly-sized models (Qwen3-VL-8B, Kimi-VL-A3B)
- SigLIP-2 NaFlex vision encoder, up to 3,600 image tokens (~720p native resolution)
- Open-weight release: Microsoft Foundry, HuggingFace, GitHub; fine-tuning code + benchmark logs public
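The 20/80 split above can be pictured as a data-mixing step at corpus construction time. Microsoft has not described its pipeline; this is a minimal Python sketch under invented conventions (the `<think>` tag format, `make_sample`, and `build_mix` are all illustrative assumptions).

```python
import random

random.seed(0)

def make_sample(question, answer, trace=None):
    # A reasoning sample wraps the answer in an explicit trace; a
    # non-reasoning sample maps the question straight to the answer.
    # The <think> tag format is invented for illustration.
    if trace is not None:
        return {"prompt": question,
                "target": f"<think>{trace}</think>{answer}"}
    return {"prompt": question, "target": answer}

def build_mix(pairs, reasoning_ratio=0.2):
    # Roughly 20% of samples carry explicit reasoning traces and 80%
    # are direct responses, mirroring the reported training split.
    return [
        make_sample(q, a, t) if random.random() < reasoning_ratio
        else make_sample(q, a)
        for q, a, t in pairs
    ]

pairs = [(f"q{i}", f"a{i}", "step 1 ... step n") for i in range(1000)]
mix = build_mix(pairs)
trace_frac = sum("<think>" in s["target"] for s in mix) / len(mix)
```

Training on both target formats is what lets the deployed model emit a trace only when one is needed, rather than reasoning on every query.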
STORY 3: 💰 AI ECONOMICS & BUSINESS MODELS - Google Gemini 3.1 Flash-Lite: 1/8th Pro Cost, 2.5× Faster
Why it matters: Google released Gemini 3.1 Flash-Lite, the most cost-efficient model in the Gemini 3 series, priced at $0.25 input / $1.50 output per 1M tokens—1/8th the cost of Gemini 3.1 Pro ($2/$12) and cheaper than its predecessor Gemini 2.5 Flash ($0.30/$3.00). The model delivers 2.5× faster time to first token and 45% faster overall output speed (363 tok/s vs 249 tok/s) compared to 2.5 Flash. It introduces "thinking levels" that let developers modulate reasoning intensity dynamically—dial down for speed on simple tasks, dial up for complex logic. Benchmarks show 86.9% on GPQA Diamond, 76.8% on MMMU-Pro, 88.9% on MMMLU, and an Elo score of 1432 on Arena.ai. Early testers report 100% consistency in item tagging (Whering), 97% structured output compliance (HubX), and 20% higher success rates with 60% faster inference (Latitude).
The Gist:
- Pricing: $0.25 input, $1.50 output per 1M tokens ($1.75 total) — 1/8th cost of 3.1 Pro ($14 total ≤200K)
- 2.5× faster time to first token vs Gemini 2.5 Flash; 45% faster output (363 tok/s vs 249 tok/s)
- Dynamic "thinking levels": modulate reasoning intensity per task (speed vs. depth trade-off)
- Benchmarks: 86.9% GPQA Diamond, 76.8% MMMU-Pro, 88.9% MMMLU, 72.0% LiveCodeBench, 73.2% CharXiv Reasoning
- Elo score: 1432 on Arena.ai Leaderboard (competitive with much larger models)
- Early adopter results: 100% tagging consistency (Whering), 97% structured output (HubX), 20% success lift + 60% speed gain (Latitude)
- Available via Google AI Studio and Vertex AI (preview status)
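The "thinking levels" trade-off above implies per-request routing logic on the developer side: cheap heuristics pick a level before the call goes out. Google's actual parameter names are not given in the article, so everything in this Python sketch (the `thinking_level` field, the model id, the heuristic markers) is an assumption.

```python
# Hypothetical heuristic for choosing a "thinking level" per request.
# The parameter name `thinking_level` and the model string are
# assumptions; the article describes the feature, not the API.
LEVELS = ("low", "medium", "high")

def pick_thinking_level(task: str) -> str:
    # Dial up for multi-step logic, down for simple extraction/tagging.
    complex_markers = ("prove", "plan", "debug", "derive")
    if any(m in task.lower() for m in complex_markers):
        return "high"
    if len(task.split()) > 40:
        return "medium"
    return "low"

def build_request(task: str) -> dict:
    return {
        "model": "gemini-3.1-flash-lite",  # assumed model id
        "contents": task,
        "thinking_level": pick_thinking_level(task),
    }
```

A tagging workload like Whering's would run almost entirely at the low level, which is where the 2.5× time-to-first-token advantage matters most.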
STORY 4: 🌐 OPEN SOURCE AI - Black Forest Labs Self-Flow: 2.8× Faster Training, No External Teachers
Why it matters: Black Forest Labs (makers of FLUX) released Self-Flow, a self-supervised flow matching framework that eliminates the need for external semantic encoders like CLIP or DINOv2—addressing the "teacher bottleneck" where model scaling no longer improves results because frozen external encoders hit their limits. Self-Flow uses Dual-Timestep Scheduling: the student sees heavily corrupted data while the teacher (an EMA version of itself) sees cleaner data, forcing the student to predict what its cleaner self is seeing. This "self-distillation" approach converges 2.8× faster than REPA and doesn't plateau as compute scales. Trained on 200M images, 6M videos, and 2M audio-video pairs, a 4B parameter model achieved superior FID (3.61 vs 3.92), FVD (47.81 vs 49.59), and FAD (145.65 vs 148.87) scores. The framework also enables joint video-audio synthesis and improves robotic task success rates in SIMPLER simulations. Inference code released on GitHub for ImageNet 256×256 generation.
The Gist:
- Eliminates external encoders (CLIP, DINOv2) — solves "teacher bottleneck" where scaling model size stops helping
- Dual-Timestep Scheduling: student sees corrupted data, teacher (EMA of itself) sees cleaner data
- Student predicts what its cleaner self sees → self-distillation, deep internal semantic understanding
- Converges 2.8× faster than REPA (143K vs 400K steps to baseline); ~50× faster than vanilla (7M → 143K steps)
- 4B param model trained on 200M images, 6M videos, 2M audio-video pairs
- Benchmarks: FID 3.61 (vs REPA 3.92), FVD 47.81 (vs 49.59), FAD 145.65 (vs 148.87)
- Enables joint video-audio synthesis, improved typography/text rendering, reduced temporal artifacts
- Robotics: 675M param version achieved higher success rates on RT-1 "Open and Place" tasks in SIMPLER
- Inference code released: https://github.com/black-forest-labs/Self-Flow/
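Dual-Timestep Scheduling can be sketched as a training loop in which the student and its EMA teacher receive the same sample at different corruption levels. The numpy toy below is not BFL's implementation: a 4-element weight vector stands in for the network, and the corruption schedule, EMA decay, and learning rate are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, t, noise):
    # Flow-matching interpolation: t=0 is clean data, t=1 is pure noise.
    return (1 - t) * x + t * noise

def ema_update(teacher, student, decay=0.99):
    # The teacher is an exponential moving average of the student,
    # so no frozen external encoder (CLIP/DINOv2) is involved.
    return decay * teacher + (1 - decay) * student

# Toy linear "model" weights standing in for a real network.
student_w = rng.normal(size=4)
teacher_w = student_w.copy()
x = rng.normal(size=4)

for step in range(200):
    noise = rng.normal(size=4)
    # Dual-Timestep Scheduling: the student always sees a more heavily
    # corrupted view than its EMA teacher (t_student > t_teacher).
    t_teacher = rng.uniform(0.0, 0.5)
    t_student = rng.uniform(t_teacher + 0.25, 1.0)
    x_student = corrupt(x, t_student, noise)
    x_teacher = corrupt(x, t_teacher, noise)
    # Self-distillation: the student regresses the teacher's cleaner
    # view; a real model would match features, not raw activations.
    pred = student_w * x_student
    target = teacher_w * x_teacher
    grad = (pred - target) * x_student
    student_w = student_w - 0.05 * grad
    teacher_w = ema_update(teacher_w, student_w)
```

The invariant to notice is that t_student > t_teacher holds on every step, forcing the student to predict a view cleaner than its own input, which is the claimed source of the semantic signal that external teachers used to provide.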
STORY 5: 🏢 IT TRANSFORMATION & ENTERPRISE AI - OpenAI Robotics Head Resigns Over Pentagon Deal
Why it matters: Caitlin Kalinowski, OpenAI's head of robotics, resigned over the company's expanded Pentagon contract, stating on X that the agreement didn't do enough to protect Americans from warrantless surveillance and that granting AI "lethal autonomy without human authorization" was a line that "deserved more deliberation." Her resignation follows OpenAI's announcement of a new contract with the Department of Defense, which sparked internal backlash and external protests. The resignation raises questions about OpenAI's ability to retain top talent as it pursues military contracts, particularly in robotics and embodied AI—a domain where safety and ethical considerations around autonomous weapons systems are acutely sensitive. Kalinowski's departure comes amid broader tension between OpenAI's stated mission of beneficial AGI and its increasingly close ties to military and intelligence agencies.
The Gist:
- Caitlin Kalinowski (OpenAI head of robotics) resigned March 7, 2026
- Posted on X: Pentagon deal insufficient to protect against warrantless surveillance
- Criticized "lethal autonomy without human authorization" as needing more deliberation
- Follows OpenAI's recent expanded contract with Department of Defense
- Resignation signals internal dissent over military partnerships
- Raises talent retention concerns as OpenAI pursues defense contracts
- Context: OpenAI facing external protests (QuitGPT campaign, 1.5M+ participants)
- Broader tension: AGI safety mission vs. military/intelligence partnerships
Sources: OpenAI Blog, Microsoft Research, Google AI Blog, Black Forest Labs Research, The Verge, VentureBeat
Next Briefing: Monday, March 9th, 2026 at 08:00 EST