June 25, 2026

P&ID Extraction Benchmark Test

By Corvic AI

P&ID-to-XML Extraction Benchmark: Comparative Evaluation of Anthropic Claude, ChatGPT, and Corvic AI

Corvic AI Research Internal Technical Benchmark Report — Unit 300, Shipping and Transport College (STC) June 2026

ABSTRACT

This study evaluates three state-of-the-art LLM systems, Anthropic Claude (Opus 4.8 High), OpenAI ChatGPT (GPT-5.5 High), and Corvic AI (Opus 4.8 via workflow), on their ability to convert colour-annotated P&ID drawings into a client-specified XML schema. Two sheets from Unit 300 of the Shipping and Transport College (STC, document 1002452, drawn by Vicoma bv) were used. Each system received the same P&ID image and a reference XML template in a single pass. Outputs were scored blindly across three benchmark categories and 15 granular sub-categories. Corvic ranked first on both sheets, averaging 91.0% client format compliance, 83.5% equipment+instrument+piping accuracy, and 80.5% full usable XML quality. Anthropic ranked second (87.0%, 63.8%, 63.0%) and ChatGPT third (83.5%, 60.0%, 58.5%). Processing times were approximately 2 min (ChatGPT), 5 min (Corvic via deployed workflow), and 9 min (Anthropic). The results demonstrate that workflow-level orchestration substantially outperforms direct single-pass LLM access on detailed engineering extraction tasks.

Keywords: P&ID extraction, XML digitisation, LLM benchmarking, process engineering, Corvic, Claude, ChatGPT.

1. INTRODUCTION

Piping and Instrumentation Diagrams (P&IDs) are the authoritative source documents for process plant design, encoding equipment identities, instrument loops, pipe specifications, valve positions, connection topology, and colour-coded subsystem annotations. Despite their importance, P&IDs are typically delivered as raster or vector PDF files drawn to proprietary conventions, making programmatic ingestion into engineering data systems a persistent bottleneck.

Recent advances in large language models (LLMs) have demonstrated broad capability in structured information extraction from complex visual and textual inputs. This study reports a controlled experiment in which three frontier AI systems were tasked with converting the same colour-annotated P&ID sheets into a structured XML format specified by the client. The experiment was designed to answer two practical engineering questions: (1) how accurately can current LLMs extract P&ID content into a fixed schema, and (2) how do the systems compare on extraction quality, completeness, and processing speed?

2.1 Source Documents

The test corpus comprised two colour-annotated P&ID sheets for Unit 300 of the Shipping and Transport College (STC) facility, drawn by Vicoma bv (document number 1002452, CAD file 1002452_D0002_sh2.dwg, revision D). The drawings depict a kalkmelk (Dutch for milk of lime) recirculation and calcium carbonate filtration system. Key equipment includes: E-301 (heat exchanger, 3.5 barg/250°C), C-301 (CQ degasser, 155×1850 mm), C-302 (CQ absorber, 155×1850 mm), V-311A/B (kalkmelk vat, 305×720 mm), S-301A/B (CaCO3 filter, 125×260 mm), and P-303A/B (circulation pumps, 2.5 dm³/min, 2 W). Colour highlights mark process subsystems: blue for major vessel groups, green for the kalkmelk circulation loop, red for LP steam/condensate lines, and yellow for the ISA-style instrument bubbles that appear throughout.

2.2 Target XML Schema

Each system was required to produce output conforming to the following client-specified schema:

The schema requires spatial geometry (pixel-coordinate bounding boxes), source-target connection topology, ordered waypoints, instrument mapping, valve enumeration, line IDs, nozzle references, highlight annotations, and off-page connector tags.
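The client schema itself is not reproduced in this report, so the following is only an illustrative sketch of how the listed requirements could map onto XML elements. Every tag name, attribute name, coordinate, and identifier other than the equipment tags E-301 and C-301 is a hypothetical placeholder, and the connection shown is not read from the drawing.

```python
import xml.etree.ElementTree as ET


def build_sample_sheet() -> ET.Element:
    """Build a tiny, made-up document covering each schema requirement once."""
    root = ET.Element("PID", sheet="2")

    # Spatial geometry: a pixel-coordinate bounding box per object.
    equip = ET.SubElement(root, "Equipment", id="E-301", type="heat_exchanger")
    ET.SubElement(equip, "BBox", x="100", y="200", width="80", height="40")
    ET.SubElement(equip, "Nozzle", id="N1")  # nozzle reference

    # Line ID plus source-target connection topology with ordered waypoints.
    line = ET.SubElement(root, "Line", id="L-001")
    conn = ET.SubElement(
        line, "Connection", source="E-301", sourceNozzle="N1", target="C-301"
    )
    for order, (x, y) in enumerate([(180, 220), (260, 220), (260, 300)], start=1):
        ET.SubElement(conn, "Waypoint", order=str(order), x=str(x), y=str(y))

    # Instrument mapping, valve enumeration, highlights, off-page connectors.
    ET.SubElement(root, "Instrument", id="TI-001", attachedTo="E-301")
    ET.SubElement(root, "Valve", id="V-001", line="L-001")
    ET.SubElement(root, "Highlight", colour="red", target="L-001")
    ET.SubElement(root, "OffPageConnector", tag="OPC-1", line="L-001")
    return root


sample_root = build_sample_sheet()
print(ET.tostring(sample_root, encoding="unicode"))
```

A structure along these lines keeps geometry, topology, and annotation in separate elements, which is one way to make each of the 15 scoring sub-categories checkable on its own.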

2.3 Systems Under Evaluation

Table 1. System identifiers, model versions, and file naming conventions.

2.4 Evaluation Protocol

Each system received an identical prompt (the P&ID image plus a reference XML template) in a single pass, with instructions to convert the drawing into separate downloadable XML files. Scoring was conducted blindly by two independent AI evaluators (one Claude model and one GPT-4-class model) with system identifiers withheld. Scores were assigned as estimated percentage ranges across 15 evaluation categories, with the midpoint of each range used as the point estimate. Three top-level benchmark scores were derived: (i) Client Format Compliance, (ii) Equip.+Instr.+Piping, and (iii) Full Usable XML Quality.
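The range-to-midpoint step can be sketched as follows. The 2–8% range is the ChatGPT Sheet 2 valve-detection range quoted in Section 4.2; the function names are assumptions, and the rule used to combine the 15 categories into the three top-level scores is not stated in this report, so it is not modelled here.

```python
def midpoint(low: float, high: float) -> float:
    """Return the midpoint of a scored percentage range."""
    if low > high:
        raise ValueError(f"invalid range: {low} > {high}")
    return (low + high) / 2


def point_estimates(ranges: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Convert per-category (low, high) ranges into midpoint estimates."""
    return {category: midpoint(lo, hi) for category, (lo, hi) in ranges.items()}


# Range taken from Section 4.2 (ChatGPT, Sheet 2, valve detection).
estimates = point_estimates({"valve_detection": (2.0, 8.0)})
print(estimates)  # {'valve_detection': 5.0}
```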

One caveat applies to the Sheet 1 ChatGPT output: the file CPT-Colored1(1).xml was found to contain Sheet 2-style content. The Sheet 1 ChatGPT scores are included for completeness but are marked with asterisks and should be interpreted cautiously.

3. RESULTS

3.1 Top-Level Benchmark Scores

Summary scores for both sheets are shown in Figure 1 and tabulated in Table 2. Corvic ranked first on both sheets across all three benchmark categories. On Sheet 2 (the primary clean comparison), Corvic scored 91.0% client format compliance, 84.5% equip./instr./piping accuracy, and 82.0% full usable XML quality. The Equip.+Instr.+Piping gap was most pronounced: Corvic led by 16.5 pp over Anthropic and 24.5 pp over ChatGPT.

3.2 Granular Category Breakdown

Table 3. Detailed per-category scored ranges (%) for Sheet 2 (primary benchmark).

3.3 Coverage Profiles

3.4 Object Count Comparison

Table 4 reports raw object counts from Sheet 2 XML files. Corvic extracted 92 valve objects versus 9 (Anthropic) and 2 (ChatGPT), and 55 connections versus 18 and 11 respectively, reflecting substantially greater capture of pipe routing topology.
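Raw counts of this kind can be obtained by tallying element tags in each output file. The sketch below assumes the valves and connections appear as elements named Valve and Connection; the actual tag names in the client schema are not given in this report, and the sample XML is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_objects(xml_text: str, tags=("Valve", "Connection")) -> Counter:
    """Count how many elements with each of the given tag names occur."""
    root = ET.fromstring(xml_text)
    counts = Counter({tag: 0 for tag in tags})
    for elem in root.iter():  # walks the root and all descendants
        if elem.tag in counts:
            counts[elem.tag] += 1
    return counts


# Invented sample, not taken from any of the benchmark outputs.
SAMPLE = """
<PID>
  <Valve id="V-001"/>
  <Valve id="V-002"/>
  <Line id="L-001">
    <Connection source="E-301" target="C-301"/>
  </Line>
</PID>
"""

sample_counts = count_objects(SAMPLE)
print(dict(sample_counts))  # {'Valve': 2, 'Connection': 1}
```

Counts like these measure how much was extracted, not whether it is correct, so they complement the scored categories rather than replace them.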

Table 4. Raw object counts from extracted XML files for Sheet 2.

3.5 Processing Time

3.6 Cross-Sheet Consistency and Average Performance

Figure 5 summarises cross-sheet averages and shows the Full Usable XML trajectory from Sheet 1 to Sheet 2. Corvic was stable at 91.0% client format compliance on both sheets, and its Full Usable XML score improved from 79.0% to 82.0% (+3 pp). Anthropic improved from 60.0% to 66.0% (+6 pp), suggesting drawing-dependent performance. ChatGPT was stable at 58.5% on both sheets.
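The cross-sheet means and the margin quoted in the conclusion follow directly from the per-sheet Full Usable XML scores above. A minimal check of that arithmetic, using only figures stated in this section (variable names are illustrative):

```python
# Full Usable XML scores (%) as (Sheet 1, Sheet 2), taken from Section 3.6.
full_usable = {
    "Corvic": (79.0, 82.0),
    "Anthropic": (60.0, 66.0),
    "ChatGPT": (58.5, 58.5),
}

# Mean across the two sheets and the Sheet 1 -> Sheet 2 change in pp.
means = {name: (s1 + s2) / 2 for name, (s1, s2) in full_usable.items()}
deltas = {name: s2 - s1 for name, (s1, s2) in full_usable.items()}

# Margin of Corvic over the best of the other systems, in percentage points.
runner_up = max(score for name, score in means.items() if name != "Corvic")
margin_pp = means["Corvic"] - runner_up

print(means)      # {'Corvic': 80.5, 'Anthropic': 63.0, 'ChatGPT': 58.5}
print(deltas)     # {'Corvic': 3.0, 'Anthropic': 6.0, 'ChatGPT': 0.0}
print(margin_pp)  # 17.5
```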

Table 5. Mean scores (%) averaged across Sheet 1 and Sheet 2.

4. DISCUSSION

4.1 Why Corvic Outperforms Direct Model Access

The most notable finding is that Corvic, sharing the same base model as the Anthropic system, substantially outperforms direct Claude access on all three benchmark categories. This isolates the contribution of the workflow layer: structured extraction prompting, schema enforcement, format post-processing, and iterative parsing stages collectively improve output completeness far beyond what a single-pass prompt delivers. The valve detection gap is illustrative: Corvic extracted 92 valves from Sheet 2 versus Anthropic’s 9, using the same underlying model.
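To make the idea of schema enforcement concrete, the sketch below shows one generic form such a stage could take: checking that extracted elements carry the attributes a schema requires, so that incomplete objects can be flagged or sent back for another parsing pass. This is an assumption-laden illustration, not a description of Corvic's pipeline; the tag names, required attributes, and sample XML are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical required attributes per element type.
REQUIRED_ATTRS = {
    "Connection": ("source", "target"),
    "Valve": ("id", "line"),
}


def find_schema_violations(xml_text: str) -> list[str]:
    """Return one message per required attribute that is missing or empty."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        for attr in REQUIRED_ATTRS.get(elem.tag, ()):
            if not elem.get(attr):
                problems.append(f"<{elem.tag}> missing '{attr}'")
    return problems


# Invented sample: the connection has no target and the valve has no line.
DRAFT = """
<PID>
  <Connection source="E-301"/>
  <Valve id="V-001"/>
</PID>
"""

for problem in find_schema_violations(DRAFT):
    print(problem)
```

A single-pass prompt has no opportunity to act on such a report, whereas a multi-stage workflow can feed it into a repair step.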

4.2 Anthropic vs. ChatGPT

Anthropic ranked second with a different failure mode from ChatGPT. Anthropic produced cleaner bounding-box geometry and better from/to topology, suggesting a tendency toward structural fidelity over coverage breadth. ChatGPT produced equally readable XML structure but was consistently sparser; its valve detection on Sheet 2 (2–8%) effectively constitutes a failure on that sub-task. Neither system is production-ready for detailed P&ID extraction without substantial augmentation.

4.3 Processing Time Trade-offs

ChatGPT’s speed advantage (~2 min) does not compensate for its quality shortfall. Corvic’s roughly 5-minute pipeline time is practical for production and about 4 minutes shorter than Claude’s roughly 9 minutes. Crucially, Corvic’s time reflects a fully deployed, repeatable workflow that can scale to multi-sheet drawings without per-sheet user intervention.

4.4 Limitations

Limitations include: (1) this benchmark uses two sheets from one facility; generalisation to other drawing styles requires further testing; (2) scoring was conducted by AI evaluators rather than human experts against a ground-truth XML file; (3) the Sheet 1 ChatGPT anomaly reduced clean comparative data points; (4) processing times are estimated and may vary with system load.

5. CONCLUSION

This study demonstrates that purpose-built AI workflows substantially outperform direct single-pass use of frontier LLMs on a detailed engineering extraction task. Corvic AI ranked first on both P&ID sheets across all three benchmark dimensions, achieving 80.5% average Full Usable XML quality, a 17.5 pp margin over the next best system. Valve detection, instrument mapping, piping recall, and highlight extraction are particularly challenging for direct prompt-based approaches. Workflow-level orchestration is currently the most effective mechanism for closing this gap.

For organisations seeking to digitise P&ID libraries at scale, deploying a structured AI workflow rather than relying on chat-based model access is likely to produce materially better output quality with comparable or superior processing throughput.

