LifeSciBench: GPT-Rosalind 36.1%, artifact gap 17pts

Image: OpenAI / LifeSciBench announcement (June 17, 2026)

On 2026-06-17, OpenAI published LifeSciBench, a 750-task, 1,062-artifact, 19,020-criterion life-sciences evaluation written and reviewed by 173 PhD-level scientists with biotech and pharma experience and validated against feedback from 453 independent expert reviewers (OpenAI, Introducing LifeSciBench, 2026-06-17). The headline result is that GPT-Rosalind, a new OpenAI life-sciences model the announcement introduces on a “request access” basis, reaches a 36.1% exact pass rate, compared with 25.7% for GPT-5.5 (OpenAI, Introducing LifeSciBench, 2026-06-17). The under-reported finding is the one this article leads with: GPT-Rosalind’s pass rate drops from 45.1% on text-only tasks to 28.1% on tasks with artifacts or URLs, a 17-percentage-point drop (OpenAI, Introducing LifeSciBench, 2026-06-17). The honest reading is that frontier models are getting better at talking about life-sciences work than at doing it on real lab data. Three caveats lead the article: LifeSciBench is a self-report by the model owner, no third-party reproduction exists as of 2026-06-20, and the highest reported workflow pass rate (Scientific Communication, 56.3% → 71.1%) rests on nine tasks.

What it is

The benchmark. LifeSciBench contains 750 expert-authored tasks across seven biological domains and seven workflows (OpenAI, Introducing LifeSciBench, 2026-06-17). Together, the tasks ship with 1,062 attached artifacts (figures, PDFs, tables, sequence files, structure or chemical files, and web references), and each task is graded against its own rubric that breaks the expected response into specific scientific claims, calculations, decisions, and justifications. Across the benchmark, those rubrics contain 19,020 criteria, an average of about 25 per task, developed by the same expert scientists who wrote the tasks (OpenAI, Introducing LifeSciBench, 2026-06-17). Tasks were validated by 453 independent expert reviewers who were not involved in writing them; of those, 97% hold a PhD or equivalent doctorate, with an average of 12 years of field experience and 14 peer-reviewed publications, and 88% have received at least one award or fellowship (OpenAI, Introducing LifeSciBench, 2026-06-17). Reviewer agreement exceeded 96% in every category: real-world relevance, scientific reasoning, scientific grounding, and overall usefulness (OpenAI, Introducing LifeSciBench, 2026-06-17).

The seven workflows. OpenAI’s taxonomy is the kind of vocabulary biotech R&D leaders can use, and it is a genuine contribution. The seven workflows, as named on the announcement page, are: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication (OpenAI, Introducing LifeSciBench, 2026-06-17). The taxonomy was built by surveying practicing life scientists about the workflows they use most often in applied research, then grouping their responses into the seven categories. Tasks are structured like a request a scientist might give to a knowledgeable collaborator: a scientific prompt, any relevant context or artifacts, and a free-response answer.

Task complexity. The benchmark is designed to reflect the complexity of life-sciences work, not a multiple-choice recall test. 79% of tasks require multiple reasoning or decision-making steps, with an average of four steps per task, and 53% of tasks require models to interpret or synthesize information from at least one artifact (OpenAI, Introducing LifeSciBench, 2026-06-17). Each task goes through an average of six self-directed automated review cycles and at least two rounds of expert review, with reviewer agreement anchored at ≥90% per task in the relevant domain (OpenAI, Introducing LifeSciBench, 2026-06-17).

The grading surface, and why it matters. LifeSciBench reports two complementary metrics. Pass rate is the percentage of tasks on which a model meets the task-level success threshold of 70% on the rubric. Score is the average rubric reward, giving partial credit for individual criteria even when the full task is not solved (OpenAI, Introducing LifeSciBench, 2026-06-17). The two together are the load-bearing distinction in this article: the 36.1% / 25.7% headline counts how often a response clears the 70% rubric threshold, not how often it gives a clean “right answer,” and the rubric-reward number is the more honest capability signal because it captures partial credit. On tasks requiring expert-useful or actionable outputs, GPT-Rosalind scores 44.7% vs 29.1% for GPT-5.5; on tasks requiring uncertainty and caveat handling, GPT-Rosalind scores 44.8% vs 29.3% for GPT-5.5 (OpenAI, Introducing LifeSciBench, 2026-06-17). The improvement is concentrated in the “useful” axis, not the “fully solved” axis.
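The difference between the two metrics is easier to see on a toy example. The sketch below is not OpenAI's grading harness: the `TaskResult` type, the task identifiers, and the unweighted criteria counts are hypothetical, and real rubrics may weight criteria differently. Only the 70% threshold comes from the announcement.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # task-level success threshold stated in the announcement


@dataclass
class TaskResult:
    task_id: str         # hypothetical identifier
    criteria_met: int    # rubric criteria the response satisfied
    criteria_total: int  # rubric criteria defined for the task

    @property
    def reward(self) -> float:
        # Assumption: unweighted fraction of criteria met.
        return self.criteria_met / self.criteria_total


def pass_rate(results):
    """Share of tasks whose rubric reward clears the threshold."""
    return sum(r.reward >= PASS_THRESHOLD for r in results) / len(results)


def mean_score(results):
    """Average rubric reward, which keeps partial credit."""
    return sum(r.reward for r in results) / len(results)


# Made-up results for four tasks with 25 criteria each.
results = [
    TaskResult("t1", 20, 25),  # reward 0.80, passes
    TaskResult("t2", 17, 25),  # reward 0.68, just misses the threshold
    TaskResult("t3", 5, 25),   # reward 0.20, fails
    TaskResult("t4", 25, 25),  # reward 1.00, passes
]

print(f"pass rate: {pass_rate(results):.2f}")
print(f"score:     {mean_score(results):.2f}")
```

In this toy data the near miss on `t2` contributes nothing to the pass rate but most of its credit to the score, which is why the two numbers can diverge.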

The two models compared. OpenAI’s announcement compares GPT-5.5 and GPT-Rosalind, with the latter introduced on a “Request access” basis at the bottom of the announcement page (OpenAI, Introducing LifeSciBench, 2026-06-17). The article uses the model names exactly as the announcement does and does not describe GPT-Rosalind as a “release” — the framing in the article is “introduced” / “previewed” pending the access form.

The artifact-handling gap (the under-reported finding)

The single most useful number in the announcement is buried in the “Where AI systems still fall short” section. GPT-Rosalind’s pass rate drops from 45.1% on text-only tasks to 28.1% on tasks with artifacts or URLs, a 17-percentage-point drop. GPT-5.5 drops in the same direction, from 29.9% to 21.9%, an 8-point drop (OpenAI, Introducing LifeSciBench, 2026-06-17). The pattern is consistent in direction but not in size: both models are better on the text-only subset than on the artifact-heavy subset, and the leading model (GPT-Rosalind) loses more than twice as many points as the trailing one. OpenAI’s own analysis confirms that frontier models struggle at extracting information from complex figures or large sequence files and integrating that information into the final answer (OpenAI, Introducing LifeSciBench, 2026-06-17).
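The artifact gap can be expressed both in percentage points and relative to each model's text-only pass rate. A minimal sketch using only the four pass rates reported above; the `drop` helper is a hypothetical name introduced for illustration.

```python
def drop(text_only: float, with_artifacts: float):
    """Return (absolute drop in points, drop relative to the text-only rate)."""
    absolute = text_only - with_artifacts
    relative = absolute / text_only
    return absolute, relative


# Pass rates (%) as reported in the announcement: (text-only, artifacts or URLs).
reported = {
    "GPT-Rosalind": (45.1, 28.1),
    "GPT-5.5": (29.9, 21.9),
}

for model, (text_only, with_artifacts) in reported.items():
    absolute, relative = drop(text_only, with_artifacts)
    print(f"{model}: {absolute:.1f} points, {relative:.1%} of its text-only rate")
```

On these figures GPT-Rosalind gives up about 17.0 points (roughly 38% of its text-only rate) and GPT-5.5 about 8.0 points (roughly 27%), so the stronger model shows the larger gap on both measures.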

For AI builders shipping agents that touch real lab data (PDFs of papers, sequence files, structure files, gel images, instrument exports), this is the number to plan around. The 36.1% headline averages over text-only and artifact tasks, and a pass only requires clearing the 70% rubric threshold, not full rubric credit. On tasks with artifacts or URLs the reported pass rate is 28.1%, and real research data, with all its format noise, is more likely to land near that figure than near the headline. Any “AI for life-sciences R&D” product that does not instrument the artifact-handling subset specifically is benchmarking against the wrong number.

The hardest workflows — barely moved. Two workflows are the load-bearing weakness. Design, Optimization, & Prediction sits at 30.7% for GPT-Rosalind, and Analysis sits at 30.3% — both within a few points of GPT-5.5’s text-only baseline (OpenAI, Introducing LifeSciBench, 2026-06-17). The improvement is concentrated in the talking-about workflows (Scientific Communication, Translation), not the designing-and-analyzing workflows. For biotech R&D leads, that is the most actionable signal: frontier models are getting better at producing expert-facing prose and at bench-to-bedside translation, and not measurably better at the design-and-analysis work that the bench actually needs.

Exact-output brittleness. The same pattern shows up on output format. Tasks requiring exact sequence, structure, or construct-level outputs have lower pass rates: GPT-Rosalind reaches 14.8% on numeric tasks and 24.0% on sequence or structure outputs; construct-generation is 27.3% (OpenAI, Introducing LifeSciBench, 2026-06-17). The announcement is explicit that some of this is grading surface — small differences in calculation or formatting can push a response under threshold — but it is also explicit that the failures are scientifically meaningful, because many life-sciences workflows require outputs that are exact enough to be used directly, such as in CRISPR/HDR donor design or siRNA design.

Partial-credit pattern. Models often get part of the way there without fully solving the task. In roughly 14% of tasks, models earned substantial rubric credit despite failing the exact-pass threshold; for GPT-Rosalind, 109 tasks had pass rates below 20% while still earning at least 50% rubric reward (OpenAI, Introducing LifeSciBench, 2026-06-17). The honest read: models can identify relevant evidence or produce a plausible partial answer, but still fail because they miss a key constraint, use the wrong evidence, make an incomplete calculation, or do not connect their reasoning to a scientifically useful final decision.
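The partial-credit pattern describes tasks that rarely clear the threshold yet still earn substantial rubric reward. The sketch below shows one way such tasks could be flagged. It assumes each task is run several times and that a per-task pass rate is the share of runs clearing 70%, which the phrase “pass rates below 20%” suggests but the announcement summary does not spell out. The `near_misses` helper and all data are hypothetical.

```python
PASS_THRESHOLD = 0.70  # task-level success threshold from the announcement


def near_misses(runs_by_task, max_pass_rate=0.20, min_reward=0.50):
    """Return task ids with a low per-task pass rate but high mean rubric reward."""
    flagged = []
    for task_id, rewards in runs_by_task.items():
        task_pass_rate = sum(r >= PASS_THRESHOLD for r in rewards) / len(rewards)
        mean_reward = sum(rewards) / len(rewards)
        if task_pass_rate < max_pass_rate and mean_reward >= min_reward:
            flagged.append(task_id)
    return flagged


# Made-up rubric rewards for four runs of three tasks.
runs_by_task = {
    "a": [0.65, 0.60, 0.68, 0.55],  # never passes, but mean reward is 0.62
    "b": [0.90, 0.80, 0.75, 0.72],  # passes every run
    "c": [0.10, 0.20, 0.15, 0.05],  # never passes and earns little credit
}

print(near_misses(runs_by_task))
```

Only task `a` is flagged here: it is the kind of case where a response is mostly right but misses a constraint or calculation that keeps it under the threshold.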

Where the gains actually are

The largest gains are in the scientific synthesis and structured-interpretation workflows. The highest reported pass rate is in Scientific Communication (56.3% → 71.1%, a 14.8-point gain), but the announcement flags that this category is small (n=9) and should be interpreted cautiously (OpenAI, Introducing LifeSciBench, 2026-06-17). This article does not generalize “AI is getting better at scientific communication” from n=9; it repeats the page’s caveat in the same paragraph.

The larger gain in percentage points is in Translation, the bench-to-bedside process of drug development, which rises from 36.8% for GPT-5.5 to 57.7% for GPT-Rosalind, a 20.9-point gain (OpenAI, Introducing LifeSciBench, 2026-06-17). Translation is a much larger category (no n=9 caveat), and the gain is the strongest clean signal in the announcement: frontier models are improving rapidly on the ability to connect preclinical evidence to clinical implications. For readers building tools for regulatory submissions, IND-enabling study design, or clinical narrative drafting, that is a directional signal worth tracking, but it is a self-reported directional signal, and this article makes that explicit.
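The n=9 caveat on Scientific Communication can be made concrete with a rough uncertainty estimate. The sketch below computes a Wilson score interval for a proportion. It treats the nine tasks as independent pass/fail trials, which is a simplifying assumption and not something the announcement states (the reported rate may be averaged over repeated runs), so the result is illustrative only.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a proportion p_hat observed on n trials."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Reported GPT-Rosalind pass rate on Scientific Communication, with n=9 tasks.
low, high = wilson_interval(0.711, 9)
print(f"n=9:   {low:.1%} to {high:.1%}")

# Same rate on a hypothetical 900-task category, for comparison.
low_big, high_big = wilson_interval(0.711, 900)
print(f"n=900: {low_big:.1%} to {high_big:.1%}")
```

Under that assumption the 95% interval at n=9 spans roughly 39% to 90%, wide enough to contain the GPT-5.5 figure of 56.3%, whereas the same rate on 900 tasks would be pinned down to within a few points.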

The rubric-level numbers reinforce the same direction. On tasks requiring expert-useful or actionable outputs, GPT-Rosalind is +15.6 points above GPT-5.5; on tasks requiring uncertainty and caveat handling, +15.5 points (OpenAI, Introducing LifeSciBench, 2026-06-17). OpenAI’s reading is that models are most useful when the task has a clear evidence boundary and calls for structured scientific judgment (OpenAI, Introducing LifeSciBench, 2026-06-17). That is the cleanest take in the announcement, and it is the one the article preserves.

Why it matters

The first industry life-sciences benchmark, not the first life-sciences benchmark. Prior life-sciences AI benchmarks, including but not limited to GeneBench and BixBench, exist, but they typically focus on narrow domains, isolated skills, or structured question formats with clean reference answers, per OpenAI’s framing on the announcement page (OpenAI, Introducing LifeSciBench, 2026-06-17). AI Newsroom has not independently verified the methodology of GeneBench or BixBench as of 2026-06-20, and the LifeSciBench announcement does not position itself head-to-head against those benchmarks (OpenAI, Introducing LifeSciBench, 2026-06-17). The honest framing: LifeSciBench is a new entry into a small set of life-sciences AI benchmarks, distinct in being built explicitly around the workflows biotech and pharma R&D leads say they actually use, and the only head-to-head numbers in this article are GPT-Rosalind vs GPT-5.5, both OpenAI models, both run by OpenAI.

The 19,020-criterion rubric is the real contribution. A response can reach the correct high-level conclusion and still be judged incomplete if it overlooks a key assay limitation, fails to bring up a biologically consequential nuance, or does not format the answer the way a scientist would expect. Conversely, a partial response can contain high-quality reasoning even if it does not fully solve the task (OpenAI, Introducing LifeSciBench, 2026-06-17). For anyone building AI-for-science products, this is the first benchmark that asks “is the model useful in a research meeting?” rather than “did the model get the right answer?”, and the rubric-reward score (44.7% vs 29.1% on expert-useful outputs) is a more useful number than the pass rate.

A taxonomy biotech R&D leads can use. The seven-workflow taxonomy (evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, scientific communication) maps more cleanly onto how a biotech R&D team talks about its own work than a benchmark scored on biology-recall accuracy ever could. For R&D leads, the immediate value of LifeSciBench is not the 36.1% number but a common vocabulary with their model-evaluation partners about which of seven workflow types a given tool is and is not good at.

What to watch

Six follow-up signals to track over the next quarter:

  1. Public release of GPT-Rosalind access pathways. The announcement ends with a “Request access” form, not a release date (OpenAI, Introducing LifeSciBench, 2026-06-17). Watch for whether access opens to named institutions, broadens to API, or remains gated. The article does not describe GPT-Rosalind as “available” — the framing is “introduced” / “previewed.”
  2. Third-party reproduction of the 36.1% / 25.7% delta. The benchmark is published by the model owner; no public code or task data is referenced on the announcement as of 2026-06-20 (OpenAI, Introducing LifeSciBench, 2026-06-17). Watch for arXiv replications, academic lab reproductions, or community runs on subsets of the 750 tasks.
  3. AI Chemist and Deployment Simulation follow-ups. The same week, OpenAI published “A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry” (2026-06-17) and “Predicting model behavior before release by simulating deployment” (2026-06-16) (OpenAI, AI Chemist, 2026-06-17; OpenAI, Deployment Simulation, 2026-06-16). Both are linked from the LifeSciBench announcement page under “Keep reading.” Watch for a unified deployment-study paper that connects the three — LifeSciBench (evaluation), AI Chemist (real wet-lab task), Deployment Simulation (model-behavior prediction) — into a single research-program claim.
  4. A “Design, Optimization, & Prediction” leaderboard update. This is the workflow with the most room to move and the most direct biotech value. Watch for whether subsequent model versions (GPT-5.6, GPT-Rosalind updates, third-party open models) move the 30.7% number in either direction.
  5. Independent artifact-handling evaluations. The 17-percentage-point drop is the most useful number in the announcement for AI builders, and it is also the number most likely to vary by artifact type. Watch for evaluations that break the artifact subset into figures, PDFs, sequence files, structure files, and web references separately — the announcement’s “A more detailed analysis” sentence promises a deeper look but does not link to one as of 2026-06-20 (OpenAI, Introducing LifeSciBench, 2026-06-17).
  6. Head-to-head runs against non-OpenAI models. The announcement only compares two OpenAI models. Watch for academic or community runs that put Claude Opus 4.8, Gemini 3.x, Kimi K3.x, or open-weight life-sciences models against the same 750 tasks once the rubric or task data is public.

Risks and caveats

Five load-bearing caveats, each of which belongs in the body rather than in a footnote:

  1. Self-report by the model owner. OpenAI is both the benchmark publisher and the model owner. Treat the 36.1% / 25.7% as a self-reported delta until a third party runs the same evaluation. The preprint PDF is the only methodology disclosure available; no public code or task data is referenced in the announcement as of 2026-06-20 (OpenAI, Introducing LifeSciBench, 2026-06-17; OpenAI, LifeSciBench preprint, 2026-06-17). The article does not call the 36.1% a community-validated result.
  2. The 36.1% pass rate is measured against a 70% rubric threshold, not a clean “right answer.” Tasks are graded on whether the model meets the task-level success threshold of 70% on the rubric (OpenAI, Introducing LifeSciBench, 2026-06-17). The full rubric-reward number (which captures partial credit) is the more honest capability signal, and the gap between the two is what shows GPT-Rosalind’s main improvement is in “useful” outputs rather than “fully solved” tasks.
  3. Scientific Communication n=9. The highest reported workflow pass rate rests on 9 tasks, not 90 or 900. This article does not generalize “AI is getting better at scientific communication” from n=9, and quotes the page’s own caveat: “this category is small (n=9), so it should be interpreted cautiously” (OpenAI, Introducing LifeSciBench, 2026-06-17).
  4. Artifact handling is the load-bearing weakness. GPT-Rosalind’s pass rate drops 17 percentage points when tasks include figures, PDFs, sequence files, structure files, or web references, from 45.1% on text-only tasks to 28.1% on tasks with artifacts or URLs (OpenAI, Introducing LifeSciBench, 2026-06-17). This is the under-reported finding and the one most useful to AI builders shipping agents that touch real lab data, which is why this article leads with it.
  5. Real research is iterative; LifeSciBench is not. The page’s own “Limitations & what’s next” section is explicit: “Strong performance on LifeSciBench should therefore be interpreted as evidence of realistic task-level capability, not as a direct measure of downstream research impact” (OpenAI, Introducing LifeSciBench, 2026-06-17). The next step OpenAI names is deployment studies in live research workflows — that is the proof point the benchmark cannot supply.

Practical advice for builders

ML engineers and AI-product owners evaluating life-sciences AI claims. Treat the 36.1% headline as an optimistic estimate for real lab data, not a floor. The artifact-handling gap (45.1% → 28.1%) means any agent reading real lab data (PDFs of papers, sequence files, structure files, gel images, instrument exports) should be expected to perform well below its text-only numbers. If your evaluation only tests text-only subsets of a life-sciences benchmark, you are measuring the upper bound. If you cannot get a model above 30% on the artifact-heavy subsets, you are not yet in the regime where the model is useful on real R&D work.
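One way to avoid measuring only the upper bound is to report pass rates per artifact type rather than a single aggregate. The sketch below is a hypothetical in-house evaluation summary, not part of LifeSciBench: the stratum labels, the `pass_rate_by_stratum` helper, and the rewards are made up, and only the 70% threshold follows the announcement.

```python
from collections import defaultdict


def pass_rate_by_stratum(records, threshold=0.70):
    """Compute the pass rate separately for each artifact stratum.

    records: iterable of (stratum_label, rubric_reward) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, reward in records:
        totals[stratum] += 1
        passes[stratum] += reward >= threshold  # True counts as 1
    return {stratum: passes[stratum] / totals[stratum] for stratum in totals}


# Made-up per-task rewards tagged with the kind of input the task required.
records = [
    ("text_only", 0.82),
    ("text_only", 0.74),
    ("text_only", 0.40),
    ("pdf", 0.71),
    ("pdf", 0.30),
    ("sequence_file", 0.55),
    ("sequence_file", 0.20),
]

for stratum, rate in pass_rate_by_stratum(records).items():
    print(f"{stratum}: {rate:.0%}")
```

Reporting the strata side by side makes it visible when a healthy overall number is carried by the text-only tasks.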

Life-sciences R&D leads sizing up frontier-model capabilities. The rubric-reward gap (29.1% → 44.7% on expert-useful outputs) is a more useful number than the pass rate: it tells you whether the model is producing something a scientist can actually use, not whether the model hit a 70% rubric threshold on a single pass. The Translation gain (36.8% → 57.7%) is a directional signal for bench-to-bedside tooling, and the Design/Optimization (30.7%) and Analysis (30.3%) numbers are the most honest “this is not solved yet” signal in the announcement.

Evaluators and procurement teams. The benchmark is owned and published by the model maker. Reproduce before relying on it. As of 2026-06-20, no public code or task data is referenced in the announcement, and the LifeSciBench preprint PDF is the only methodology disclosure (OpenAI, Introducing LifeSciBench, 2026-06-17; OpenAI, LifeSciBench preprint, 2026-06-17). The right first check for any “AI life-scientist” vendor claim is the same one any frontier-model evaluation has to answer: which benchmark, and is the harness public?

Verdict

OpenAI’s first life-sciences evaluation built with working industry scientists shows that frontier models are improving rapidly at scientific prose and at bench-to-bedside translation, and not measurably at the design-and-analysis work that the bench actually needs (OpenAI, Introducing LifeSciBench, 2026-06-17). The headline 36.1% / 25.7% delta is real, the rubric-reward gain on expert-useful outputs (+15.6 points) is the more useful number, and the 17-percentage-point drop on artifact-heavy tasks is the number to plan around. LifeSciBench is a self-report by the model owner, no third-party reproduction exists as of 2026-06-20, GPT-Rosalind access is gated by a request form, and the highest reported workflow pass rate (Scientific Communication) rests on nine tasks. This article stays inside the announcement; the next step is deployment studies in live research workflows, which OpenAI names as the proof point the benchmark cannot supply.