GPT-5.5

3 articles


Cloudflare `wrangler deploy --temporary` for AI agents
cloudflare, cloudflare-workers, wrangler, ai-agents, agentic-ai, +10

On 2026-06-19, Cloudflare shipped `wrangler deploy --temporary`, a CLI flag that provisions a temporary Cloudflare account, deploys a Worker to a workers.dev URL, and prints a claim URL, with no human in the loop, no API token, and no OAuth. The temporary account expires after 60 minutes unless the user claims it via the URL. The same day, the Cloudflare developer documentation page 'Claim deployments (temporary accounts)' documented the full flow, the supported-products table, and the abuse-prevention posture. On 2026-06-21, Simon Willison independently confirmed the flow with GPT-5.5 xhigh in Codex Desktop, redeploying a redirect-resolver Worker end-to-end. Wrangler 4.102.0 or later is required. The supported products and limits are narrow and explicit: Workers, Workers Static Assets (≤1,000 files, ≤5 MiB each), Workers KV, D1 (one database, ≤100 MB), Durable Objects, Hyperdrive (≤2 configs, ≤10 connections), Queues (≤10), and SSL/TLS. This is a Cloudflare product feature, not an industry standard.
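As a rough illustration of the limits quoted above, an agent could pre-check a static-assets directory before attempting a temporary deploy. This is a minimal sketch, not part of Wrangler: the helper names `check_static_assets` and `claim_deadline` are hypothetical, and the only figures taken from the summary are the 1,000-file cap, the 5 MiB per-file cap, and the 60-minute claim window.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Limits as quoted in the summary for Workers Static Assets on temporary accounts.
MAX_FILES = 1_000
MAX_FILE_BYTES = 5 * 1024 * 1024  # 5 MiB
CLAIM_WINDOW = timedelta(minutes=60)


def check_static_assets(root):
    """Return a list of limit violations for the directory `root` (hypothetical helper)."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files exceeds the {MAX_FILES}-file limit")
    for p in files:
        size = p.stat().st_size
        if size > MAX_FILE_BYTES:
            problems.append(f"{p} is {size} bytes, over the 5 MiB per-file limit")
    return problems


def claim_deadline(deployed_at):
    """Return the time by which the account must be claimed, assuming a 60-minute window."""
    return deployed_at + CLAIM_WINDOW
```

An empty list from `check_static_assets` means the directory fits within the two stated limits; it says nothing about the other products in the table.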

LifeSciBench: GPT-Rosalind 36.1%, artifact gap 17pts
openai, lifesci-bench, lifescibench, life-sciences, benchmark, +14

On 2026-06-17, OpenAI published LifeSciBench, a 750-task, 1,062-artifact, 19,020-criterion life-sciences evaluation built with 173 PhD-level scientists and 453 independent reviewers. GPT-Rosalind reports a 36.1% exact pass rate vs 25.7% for GPT-5.5, with the largest gains in Scientific Communication (56.3% → 71.1%, n=9) and Translation (36.8% → 57.7%). The under-reported finding is the artifact-handling gap: GPT-Rosalind drops from 45.1% on text-only tasks to 28.1% on tasks with artifacts or URLs, a 17-percentage-point drop. Design/Optimization (30.7%) and Analysis (30.3%) barely moved. LifeSciBench is self-reported by the model owner, no third-party reproduction exists, and access to GPT-Rosalind is gated by a request form. The article leads with the artifact gap, preserves all five load-bearing caveats from the brief, and does not invent head-to-head comparisons against GeneBench or BixBench.
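The gaps quoted above are percentage-point differences, which are easy to confuse with relative changes. A small sketch of the arithmetic, using only the pass rates from the summary (the function names are illustrative):

```python
def point_gap(a, b):
    """Difference between two pass rates, in percentage points."""
    return round(a - b, 1)


def relative_change(new, old):
    """Change from `old` to `new`, as a percentage of `old`."""
    return round((new - old) / old * 100, 1)


# GPT-Rosalind: text-only tasks vs tasks with artifacts or URLs.
print(point_gap(45.1, 28.1))        # 17.0 percentage points
print(relative_change(28.1, 45.1))  # -37.7 (percent, relative to text-only)

# GPT-Rosalind vs GPT-5.5, overall exact pass rate.
print(point_gap(36.1, 25.7))        # 10.4 percentage points
print(relative_change(36.1, 25.7))  # 40.5 (percent, relative to GPT-5.5)
```

Read this way, the artifact gap (17.0 points) is larger than the overall lead over GPT-5.5 (10.4 points).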

NVIDIA ENPIRE: real-robot coding agents hit 99% pass@8
nvidia, cmu, uc-berkeley, enpire, robotics, +17

On 2026-06-16, NVIDIA GEAR, CMU LeCAR Lab, and UC Berkeley published ENPIRE, a four-module harness (Environment, Policy Improvement, Rollout, Evolution) that puts coding agents (Codex with GPT-5.5, Claude Code with Opus 4.7, Kimi Code with Kimi K2.6) in a fully automatic closed loop on real robots, with auto-reset and auto-verify. The team reports 99% pass@8 across five hard manipulation tasks (Push-T, Pin Insertion, Tie Zip-tie, GPU Insertion, Cut Zip-tie), scaling results for team sizes of 1, 4, and 8, and two new multi-agent physical-autoresearch efficiency metrics: Mean Robot Utilization (MRU) and Mean Token Utilization (MTU). The 99% figure reflects the team's emergent retry-and-recovery capability, not best-of-8 sampling; a heuristic-policy baseline reports 0% coverage in 43–73 steps. The harness code is not yet open-sourced as of 2026-06-19.
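For contrast, the sampling-based reading of pass@k that the summary says does not apply here is usually computed with the unbiased estimator common in code-generation benchmarks: given n independent samples of which c succeed, the probability that at least one of k drawn samples succeeds. The sketch below shows that conventional definition only; it is not ENPIRE's metric, and the example numbers are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Conventional pass@k estimate: chance that k samples drawn from n include a success.

    n: total independent samples, c: successful samples, k: samples drawn.
    This is NOT how the ENPIRE summary says its 99% figure was obtained.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any draw of k includes a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative values only: 10 samples, 1 success.
print(pass_at_k(10, 1, 1))  # roughly 0.1
print(pass_at_k(10, 1, 8))  # roughly 0.8
```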