OpenAI ships ChatGPT health update; o3 workflow adds 4.8% diagnostic yield in unsolved rare-disease cases

On June 18, 2026, OpenAI published two health stories on the same day. The first is a consumer ChatGPT product and evaluation update built on GPT-5.5 Instant that OpenAI reports as rated higher than physician-written responses on a 3,500-response physician panel, with a 71% drop in flagged factuality issues on production health traffic over the last two months (OpenAI, June 18, 2026). The second is a peer-reviewed NEJM AI paper in which researchers from Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard, and OpenAI used OpenAI o3 Deep Research to reanalyze 376 previously unsolved rare-disease cases, surface candidate diagnoses for 18 of them, and report an additional diagnostic yield of 4.8% after expert review under the ACMG/AMP framework and CLIA-certified clinical confirmation (OpenAI, June 18, 2026; NEJM AI, DOI 10.1056/AIcs2501343). The two stories are linked but distinct: the ChatGPT change is a product/eval update for general health questions on a free-tier model; the NEJM AI paper is a retrospective, physician-supervised research workflow on rare-disease reanalysis in which the model never diagnoses a patient and every finding passes through expert review and clinical-laboratory confirmation. Two lead caveats: neither story is evidence that patients or clinicians should use ChatGPT to diagnose disease, and the NEJM AI study is retrospective, on heterogeneous cohorts, with unblinded reviewers and no measurement of time saved or false-positive burden.
What it is
Story A — ChatGPT product/evaluation update. GPT-5.5 Instant is now the default model for free users in ChatGPT, and on OpenAI’s hardest health evaluations, including HealthBench Professional, OpenAI reports it reaches performance comparable to its frontier Thinking models (OpenAI, June 18, 2026). The headline numbers, in OpenAI’s framing:
- 230 million people a week use ChatGPT for health and wellness questions, OpenAI says — making sense of health information, understanding lab results, preparing for appointments, navigating insurance, building healthier habits, and figuring out what to ask next.
- 3,500 reviewed responses. Physicians wrote responses with unlimited time and internet access; a separate panel of physicians then compared those physician-written responses with model responses on accuracy, communication, completeness, instruction following, and health-decision helpfulness. OpenAI’s report: “GPT-5.5 Instant responses were rated higher than physician-written and older model responses across criteria in this evaluation.” The 3,500-response set, the criteria, and the panel are OpenAI’s own.
- A network of 260+ physicians across 60 countries, 49 languages, and 26 medical specialties who have reviewed more than 700,000 example model responses to date, feeding the rubrics and evaluation criteria behind the claim.
- A 71% drop in flagged factuality issues on production health traffic over the last two months, measured by OpenAI’s privacy-preserving monitors on billions of weekly health messages. The 71% number is OpenAI’s report on OpenAI’s traffic.
- The HealthBench evaluation family (released 2025) and the HealthBench Professional extension (introduced 2026) are the open benchmarks the post points to. The supporting clinical products are ChatGPT for Clinicians (free for verified U.S. clinicians, launched April 22, 2026) and OpenAI for Healthcare (introduced earlier in 2026) (OpenAI, April 22, 2026).
Story B — NEJM AI study. Researchers applied OpenAI o3 Deep Research to 376 previously unsolved cases that had already been through multiple commercial or institutional pipelines and multidisciplinary team review at Boston Children’s Hospital’s Manton Center. The de-identified packet per case was standardized Human Phenotype Ontology (HPO) terms, occasional clinician notes and descriptive clinical diagnosis, age and gender metadata, and a filtered variant table covering rarity, predicted protein effect, ClinVar classification, and signal quality across family members — usually child plus both biological parents. The workflow acted as an explanation-first reasoning layer on top of existing genomic pipelines: instead of returning a ranked gene, the model had to connect clinical features, inheritance pattern, variant evidence, and the scientific literature into a justification a human reviewer could interrogate. Researchers then reviewed outputs using the ACMG/AMP framework — at least two reviewers, disagreements resolved by consensus, model output never treated as a diagnosis — and a finding counted as a diagnosis only after CLIA-certified laboratory confirmation and clinical team return to the family (OpenAI, June 18, 2026).
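The per-case packet described above can be sketched as a small schema. This is a minimal illustration, not the study's actual data format: every class name, field name, and role key below is hypothetical, and only the categories of information (HPO terms, optional notes, age and gender, a filtered variant table with family-member signal quality) come from the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VariantRow:
    """One row of the filtered variant table (all field names are hypothetical)."""
    gene: str
    variant: str                            # variant description as supplied by the pipeline
    population_frequency: float             # rarity
    predicted_protein_effect: str           # e.g. "frameshift", "missense"
    clinvar_classification: Optional[str]   # None when ClinVar has no entry
    # Signal quality per family member, keyed by an assumed role label.
    signal_quality: dict = field(default_factory=dict)


@dataclass
class CasePacket:
    """De-identified input for one case; carries no direct identifiers."""
    case_id: str                            # study-local pseudonymous ID (assumption)
    hpo_terms: list                         # standardized HPO IDs, e.g. "HP:0001250"
    age_years: Optional[float] = None
    gender: Optional[str] = None
    clinician_notes: Optional[str] = None         # only occasionally present
    descriptive_diagnosis: Optional[str] = None   # only occasionally present
    variants: list = field(default_factory=list)

    def is_trio(self) -> bool:
        """True when every variant row has signal for the child and both parents."""
        roles = {"proband", "mother", "father"}
        return bool(self.variants) and all(
            roles.issubset(v.signal_quality) for v in self.variants
        )
```

Keeping the optional fields explicitly optional mirrors the source's point that notes and descriptive diagnoses were only occasionally available, so any downstream reasoning step has to tolerate their absence.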
Before the unsolved cases, the team validated the workflow on solved ones. It recovered the correct gene and variant in duplicate runs for 48 of 51 cases across a variety of rare conditions, returned the correct diagnosis in duplicate runs for 45 of 57 neuromuscular cases, and named the correct gene in every case and both disease-causing alleles in 12 of 15 cases in a long-read genome set. The model’s self-reported confidence scores tracked with correctness: mean minimum of 85.6 for consistently correct calls versus 42.1 for incorrect or unknown calls, on previously solved cases. The scores are not calibrated probabilities and were not used as a substitute for evidence; they guided reviewers to focus on the most promising candidates.
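One way to picture how such scores could guide reviewer attention is a simple triage over duplicate runs. This is a sketch under stated assumptions: it assumes "minimum" means the lower of the scores across a case's duplicate runs, that scores run from 0 to 100, and that agreement between runs is ranked ahead of the score. None of these details, nor the names `RunResult` and `triage`, come from the source, and the ordering only prioritizes review; it decides nothing.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    case_id: str
    gene: str          # top candidate gene named in this run
    confidence: float  # model self-reported score, assumed 0-100, not a probability


def triage(runs):
    """Return (case_id, runs_agree, min_confidence) tuples, most promising first."""
    by_case = {}
    for r in runs:
        by_case.setdefault(r.case_id, []).append(r)

    rows = []
    for case_id, case_runs in by_case.items():
        runs_agree = len({r.gene for r in case_runs}) == 1
        min_conf = min(r.confidence for r in case_runs)
        rows.append((case_id, runs_agree, min_conf))

    # Cases where duplicate runs agree come first, then higher minimum score.
    rows.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return rows
```

Every case still goes to human reviewers in this sketch; the sort changes only the order in which they look.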
The results on the unsolved cases, by cohort:
| Cohort | Cases | Diagnoses surfaced | Yield |
|---|---|---|---|
| Neurodevelopmental | 100 | 10 | 10.0% |
| Neuromuscular disease | 61 | 4 | 6.6% |
| Sudden unexpected death in pediatrics | 200 | 2 | 1.0% |
| Early psychosis | 15 | 2 | 13.3% |
| Total | 376 | 18 | 4.8% |
The early psychosis cohort is small and the percentage has a wide confidence interval, and yield in general reflects how likely each cohort is to have a single-gene explanation. Seven of the 18 diagnoses were rediscoveries — diagnoses established outside the local research workflow but absent from the record the team reviewed. The OpenAI page is explicit: “the variants were already listed as pathogenic or likely pathogenic in public databases, highlighting the operational challenge of synthesizing information across data sources.” That is a data-integration finding, not a model-capability finding (OpenAI, June 18, 2026).
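The yields in the table, and the remark about the wide interval for the smallest cohort, can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 95% Wilson score interval; the source does not say which interval, if any, the study reports. The counts are taken from the table above.

```python
import math

# (diagnoses, cases) per cohort, copied from the table above.
COHORTS = {
    "Neurodevelopmental": (10, 100),
    "Neuromuscular disease": (4, 61),
    "Sudden unexpected death in pediatrics": (2, 200),
    "Early psychosis": (2, 15),
}


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


total_k = sum(k for k, _ in COHORTS.values())
total_n = sum(n for _, n in COHORTS.values())

for name, (k, n) in list(COHORTS.items()) + [("Total", (total_k, total_n))]:
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {k / n:.1%} (95% Wilson {low:.1%} to {high:.1%})")
```

Under this assumption the early-psychosis interval runs from roughly 4% to 38%, which is why the 13.3% point estimate should not carry much weight on its own.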
Two worked examples from the same page, useful for the workflow, not the headline number: in an early-psychosis case the model inferred a structural event (a 22q11.2 deletion associated with DiGeorge syndrome) that was not in the input data and that follow-up sequencing later confirmed. In a neurodevelopmental case, the model highlighted an 11-amino-acid deletion in S1PR1 in a person with vitiligo and proposed “a possible novel mechanistic explanation” for the vitiligo, which the post itself flags as requiring “additional experimental validation.” A neuromuscular case study is the human-story paragraph the article should carry: Kyra was diagnosed with a form of myofibrillar myopathy linked to a frameshift variant in HSPB8 after a near-20-year diagnostic journey. Frame it as one case among the four neuromuscular diagnoses, not a generalization.
Why it matters
Three reasons.
1. Two OpenAI health stories in one day is itself a signal. The product/evaluation update and the peer-reviewed study landed on the same day, on the same news index. One is a consumer ChatGPT improvement (rated higher than physician-written responses on OpenAI’s own 3,500-response panel, a 71% drop in factuality flags on OpenAI’s own production health traffic, GPT-5.5 Instant now the default model for free users); the other is a peer-reviewed research study with a named children’s hospital, a top medical journal, an NEJM AI DOI, and a defined workflow. Together they describe a broader OpenAI health push than “we added a feature.”
2. The NEJM AI study is the under-reported story. Most coverage of AI in medicine is forward-looking (“AI will help doctors”). This is a specific, named, peer-reviewed result with a defined workflow (o3 Deep Research + the standard ACMG/AMP framework + CLIA-certified confirmation), a defined cohort (376 unsolved cases that had already been through expert pipelines), and a defined yield (4.8% additional diagnoses; 7 of the 18 were rediscoveries whose variants were already listed as pathogenic or likely pathogenic in public databases). The article should lead with the workflow and the yield, not with “AI is changing medicine” (OpenAI, June 18, 2026; NEJM AI, DOI 10.1056/AIcs2501343).
3. The clinical / regulatory boundary is the load-bearing caveat. The OpenAI page is explicit: “This research is not evidence that patients, clinicians, or customers should use OpenAI models to diagnose disease or make medical decisions. It does not describe or endorse an intended customer use of OpenAI o3 Deep Research, ChatGPT, or any other OpenAI product for diagnosis. The model did not diagnose any participant; physicians and other qualified clinical experts made every diagnosis through established review, testing, and clinical-confirmation processes.” The Limitations section is equally explicit: the study was retrospective, the cohorts were heterogeneous, reviewers were not blinded to model confidence, and the researchers did not measure time saved, cost, clinician effort, false-positive workload, or changes in care (OpenAI, June 18, 2026). Any article that drops the clinical/regulatory boundary overclaims.
What to watch
- Prospective, multi-center studies comparing LLM-assisted reanalysis with standard practice on diagnostic yield, time to a candidate, clinician effort, false-positive burden, cost, and effects on care. This is the study’s own call to action in the “What comes next” section, and it is the right next data point (OpenAI, June 18, 2026).
- The OpenAI Foundation grant to the Manton Center to develop a “platform-agnostic, low-cost genetics AI copilot” for clinical teams. The grant shifts the next stage of the work to the Manton Center itself, with OpenAI in a supporting role; the field is watching whether the workflow generalizes outside the original research team.
- Regulatory clarity, especially FDA. The article must not claim FDA clearance. The NEJM AI study is a research study, not a cleared device; ChatGPT is a general-purpose consumer product, not a medical device. OpenAI does not claim FDA clearance on either page.
- HIPAA / privacy disclosures for ChatGPT health use generally. The OpenAI page says the study used de-identified information with no protected health information utilized or transmitted outside approved environments. That is a study-level statement, not a general ChatGPT-for-health HIPAA statement. ChatGPT for Clinicians and OpenAI for Healthcare carry their own BAA / HIPAA posture (OpenAI, April 22, 2026); the consumer ChatGPT for general health questions does not.
- Independent reproduction of the 3,500-response physician-panel results. OpenAI’s claim that GPT-5.5 Instant responses were rated higher than physician-written responses is on OpenAI’s own panel, OpenAI’s own 3,500-response set, OpenAI’s own evaluation criteria. Treat the claim as OpenAI’s report until independent reproduction.
- Competitive context. Google’s AMIE and Med-PaLM 2, Hippocratic AI, and other clinical AI efforts have published their own health evaluation results. The article is about the two OpenAI stories; a single short competitive-context paragraph is enough.
- Generalization of the 4.8% yield beyond the four Boston Children’s cohorts. The early-psychosis cohort is small with a wide confidence interval; the sudden unexpected death in pediatrics cohort had the lowest yield at 1.0%; the neurodevelopmental cohort at 10.0% carries most of the absolute weight. A prospective replication on a different population is the right next test.
Risks and caveats
- The model did not diagnose any patient. The OpenAI o3 Deep Research workflow produced evidence-linked candidate explanations for expert review. Every diagnosis passed through physician review using the ACMG/AMP framework, additional testing, and CLIA-certified clinical confirmation. Quote the OpenAI page verbatim: “The model did not diagnose any participant; physicians and other qualified clinical experts made every diagnosis through established review, testing, and clinical-confirmation processes.” This is the lead caveat for any reader who walks away thinking “ChatGPT diagnoses rare diseases” (OpenAI, June 18, 2026).
- The 4.8% yield is on previously unsolved cases that had already been through expert pipelines. It is not a 4.8% miss rate, not a 4.8% general-population rate, and not a 4.8% rate for any new case. Quote the OpenAI page’s framing: “That rate is modest but meaningful in this population because previous expert reviews had not resolved the cases. Similar reanalysis studies report single-digit gains in heavily reviewed cases; higher yields usually come from studies containing new cases or well-known disorders awaiting genetic confirmation.” (OpenAI, June 18, 2026).
- The 7 of 18 “rediscoveries” are operationally important. Seven of the 18 diagnoses had been established outside the local research workflow but were absent from the local record. The operational challenge is data integration across data sources, not a model capability gap. Quote the OpenAI page: “the variants were already listed as pathogenic or likely pathogenic in public databases, highlighting the operational challenge of synthesizing information across data sources.” The article must say so (OpenAI, June 18, 2026).
- The ChatGPT product/evaluation story is not the same as the NEJM AI study. GPT-5.5 Instant on the consumer ChatGPT product is a separate artifact from OpenAI o3 Deep Research on the research workflow. The article must not conflate the two, and must not let a reader walk away thinking “ChatGPT can diagnose.” The 3,500-response physician panel is on consumer ChatGPT; the 376-case reanalysis is on o3 Deep Research under expert review (OpenAI, “Improving health intelligence in ChatGPT,” June 18, 2026; OpenAI, “Using AI to help physicians diagnose rare genetic diseases affecting children,” June 18, 2026).
- The “rated higher than physician-written responses” claim is on OpenAI’s own panel, OpenAI’s own 3,500-response set, OpenAI’s own evaluation criteria. Treat the claim as OpenAI’s report until independent reproduction. The same caveat applies to the 71% drop in flagged factuality issues, which is measured on OpenAI’s own production-traffic monitors. The article must say both explicitly (OpenAI, June 18, 2026).
- The study is retrospective on heterogeneous cohorts with unblinded reviewers. OpenAI’s Limitations section names the constraints: the cohorts were heterogeneous; reviewers were not blinded to model confidence; the researchers did not measure time saved, cost, clinician effort, false-positive workload, or changes in care; the model was not tested on structural variants, repeat expansions, deep-intronic changes, or mosaicism. “Large language models can misread context or produce plausible explanations that fail upon closer inspection.” (OpenAI, June 18, 2026).
- No FDA clearance, no general HIPAA statement for ChatGPT. The NEJM AI study is a research study, not a cleared device. ChatGPT is a general-purpose consumer product, not a medical device. The de-identification statement is a study-level claim, not a general ChatGPT-for-health HIPAA statement. ChatGPT for Clinicians and OpenAI for Healthcare carry their own BAA / HIPAA posture that does not extend to the consumer product by default.
- The early-psychosis 13.3% cohort is small. Fifteen cases, two diagnoses, a wide confidence interval. Do not lead with the 13.3% number. The 4.8% total and the neurodevelopmental 10.0% cohort are the load-bearing figures.
- The S1PR1 / vitiligo hypothesis is not a discovery. The OpenAI page itself calls it “a possible novel mechanistic explanation” that “requires additional experimental validation.” The article must use that wording. Same for the HSPB8 and CDK13 phenotype-expansion signals in the neuromuscular cohort, which the page describes as needing more cases and laboratory work.
- Kyra’s case is one of four neuromuscular diagnoses, not a generalization. The near-20-year diagnostic journey is the human story worth carrying, framed as one case among the four neuromuscular diagnoses, not as evidence that o3 Deep Research routinely solves 20-year diagnostic journeys.
- No source-prose copying. The OpenAI page and the study are summarized in original English. Verbatim text is limited to short, attributed quotations from the OpenAI page, chiefly the two highest-value clinical-boundary quotes: the model-did-not-diagnose sentence and the retrospective-yield framing sentence. All other claims are paraphrased.
Practical advice for builders
If you are a clinical-genomics builder or rare-disease researcher. The contribution is the workflow architecture, not the headline number: an explanation-first reasoning layer on top of existing genomic pipelines, ACMG/AMP review with at least two reviewers and consensus resolution, and CLIA-certified confirmation. The de-identified packet — HPO terms, occasional clinician notes, age and gender metadata, and a filtered variant table with rarity, predicted protein effect, ClinVar classification, and family-member signal quality — is a useful reference schema. The retrospective workflow does not generalize automatically to prospective clinical care; that generalization is the next research stage, and the article should not promise it.
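The review gating in that architecture can be sketched as a small status function. This is an illustration under assumptions, not the study's implementation: the status names, the `Candidate` fields, and the way consensus is recorded are all hypothetical. What it takes from the description is the ordering: at least two reviewers, consensus when they disagree, and no diagnosis before CLIA-certified confirmation and return by the clinical team.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    CANDIDATE = "candidate"                  # model output only, never a diagnosis
    NEEDS_CONSENSUS = "needs_consensus"      # reviewers disagree, not yet resolved
    REJECTED = "rejected"
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    DIAGNOSIS = "diagnosis"


@dataclass
class Candidate:
    case_id: str
    gene: str
    reviewer_votes: dict = field(default_factory=dict)  # reviewer -> supports (bool)
    consensus_supports: Optional[bool] = None           # set only after disagreement
    clia_confirmed: bool = False
    returned_by_clinical_team: bool = False


def status(c: Candidate, min_reviewers: int = 2) -> Status:
    """Derive where a candidate sits in the (hypothetical) review pipeline."""
    if len(c.reviewer_votes) < min_reviewers:
        return Status.CANDIDATE
    votes = set(c.reviewer_votes.values())
    if len(votes) > 1:
        if c.consensus_supports is None:
            return Status.NEEDS_CONSENSUS
        supported = c.consensus_supports
    else:
        supported = votes.pop()
    if not supported:
        return Status.REJECTED
    if c.clia_confirmed and c.returned_by_clinical_team:
        return Status.DIAGNOSIS
    return Status.AWAITING_CONFIRMATION
```

The point of the design is that no path reaches `DIAGNOSIS` from model output alone; the only route runs through reviewer support and laboratory confirmation.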
If you are an AI-for-health product builder. The consumer ChatGPT product change is a different artifact from the NEJM AI research workflow. The article’s value is in distinguishing product/evaluation claims (which can be summarized with the source-citation caveat) from peer-reviewed clinical research (which deserves the ACMG/AMP-style explanation). The OpenAI Foundation grant to the Manton Center for a platform-agnostic, low-cost genetics AI copilot is a useful signal of where the research pipeline is going; the GPT-Rosalind work on variant effects and protein structure is a separate, parallel direction that the OpenAI page mentions but does not test in this study (OpenAI, June 3, 2026). Do not conflate them.
If you are an operator, payer, or clinical-IT leader evaluating AI-in-medicine claims. The right first check is the clinical / regulatory boundary. Was the system studied retrospectively or prospectively? Did the model make the diagnosis, or did physicians, with the model producing reviewable hypotheses? Is there a CLIA-certified laboratory confirmation step? Was the evaluation blinded? The article should help the reader run that check on any future “AI helps doctors diagnose X” claim from a frontier lab.
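The four questions above can be kept as a reusable checklist. A minimal sketch, with hypothetical field names and labels; the example values for the NEJM AI study are the ones this article already reports (retrospective design, clinician-made diagnoses, CLIA-certified confirmation, unblinded reviewers).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimCheck:
    """Answers to the boundary questions; None means the source does not say."""
    prospective: Optional[bool] = None
    clinicians_made_diagnosis: Optional[bool] = None
    clia_confirmation: Optional[bool] = None
    blinded_evaluation: Optional[bool] = None

    def open_questions(self):
        """List the checks that are unanswered or answered 'no'."""
        labels = {
            "prospective": "studied prospectively",
            "clinicians_made_diagnosis": "clinicians, not the model, made the diagnosis",
            "clia_confirmation": "CLIA-certified laboratory confirmation",
            "blinded_evaluation": "blinded evaluation",
        }
        return [label for name, label in labels.items()
                if getattr(self, name) is not True]


# The NEJM AI study as described in this article.
nejm_ai_study = ClaimCheck(
    prospective=False,
    clinicians_made_diagnosis=True,
    clia_confirmation=True,
    blinded_evaluation=False,
)
print(nejm_ai_study.open_questions())
```

Treating an unanswered question the same as a "no" is a deliberate choice in this sketch: a claim that does not state its study design should not get the benefit of the doubt.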
Verdict
Two stories in one day from one lab, with two separate artifacts, and one load-bearing clinical boundary.
OpenAI shipped two health stories on June 18, 2026. The first is a consumer ChatGPT product/evaluation update on GPT-5.5 Instant that OpenAI reports as rated higher than physician-written responses on a 3,500-response physician panel, with a 71% drop in flagged factuality issues on production health traffic over the last two months, and 260+ physicians across 60 countries, 49 languages, and 26 medical specialties behind the rubric. The second is a peer-reviewed NEJM AI paper in which researchers used OpenAI o3 Deep Research to reanalyze 376 previously unsolved rare-disease cases at Boston Children’s Hospital’s Manton Center for Orphan Disease Research, surfaced candidate diagnoses for 18 of them, and reported a 4.8% additional yield after expert review under the ACMG/AMP framework and CLIA-certified clinical confirmation. Seven of the 18 were rediscoveries: diagnoses established outside the local research workflow whose variants were already listed as pathogenic or likely pathogenic in public databases. The clinical boundary is the load-bearing caveat: the model produced reviewable hypotheses, physicians made every diagnosis, and the study is retrospective on heterogeneous cohorts with unblinded reviewers. For AI Newsroom’s primary reader, the right takeaway is the workflow (explanation-first reasoning on top of existing genomic pipelines, expert review using ACMG/AMP, CLIA-certified confirmation), not the headline number. The ChatGPT product change is a separate artifact, and OpenAI’s product claims deserve the same independent-reproduction caveat that any vendor health AI claim deserves.
Sources live-verified on 2026-06-20.
- OpenAI, “Using AI to help physicians diagnose rare genetic diseases affecting children” (Applied AI, June 18, 2026) — primary source for the NEJM AI study, the four cohorts, the 4.8% yield, the 7 of 18 rediscoveries, the clinical-boundary sentence, the Limitations section, the Catherine Brownstein and Alan Beggs quotes, the OpenAI Foundation grant, and the GPT-Rosalind reference.
- OpenAI, “Improving health intelligence in ChatGPT” (Product, June 18, 2026) — live URL gated by Cloudflare bot protection from the agent runtime; Wayback Machine snapshot 20260618220358 used. Primary source for GPT-5.5 Instant, HealthBench and HealthBench Professional, 230 million weekly health users, the 3,500-response physician panel, the 260+ physicians / 60 countries / 49 languages / 26 specialties network, the 700,000+ reviewed responses, the 71% drop in flagged factuality issues, and the ChatGPT for Clinicians / OpenAI for Healthcare references.
- NEJM AI study abstract, “Using AI to help physicians diagnose rare genetic diseases affecting children” (DOI 10.1056/AIcs2501343, June 18, 2026) — the journal page returned 403 to the agent webfetch tool on 2026-06-20; the DOI is on the OpenAI page, and the article body relies on the OpenAI page for the study’s design and yield. The DOI is included as the primary citation for the study itself; readers who need the journal page can resolve the DOI directly.
- OpenAI, “Introducing HealthBench” (May 12, 2025) — the open health evaluation behind the GPT-5.5 Instant claims: 5,000 conversations, 48,562 rubric criteria, 262 physicians across 60 countries, 49 languages, 26 specialties, the HealthBench Consensus (3,671 examples) and HealthBench Hard (1,000 examples) variants.
- OpenAI, “Making ChatGPT better for clinicians” (Product, April 22, 2026) — ChatGPT for Clinicians (free for verified U.S. clinicians) and the HealthBench Professional extension; the 6,924-conversation physician-advisor test and the 99.6% physician-rated safe-and-accurate rate.
- OpenAI, HealthBench Professional open benchmark PDF (June 18, 2026) — the supporting benchmark document linked from the ChatGPT for Clinicians post.
- OpenAI, “Introducing new capabilities to GPT-Rosalind” (Product, June 3, 2026) — the life-sciences model mentioned in the NEJM AI study as a separate, parallel direction; not tested in the 376-case study.
- OpenAI, Health Blueprint (responsible integration recommendations in U.S. healthcare, 2026) — OpenAI’s own recommendations for responsible AI integration in U.S. healthcare, released alongside ChatGPT for Clinicians.