Purpose of experiment
Use ChatGPT as a thinking partner to create a first draft of an acclimatisation lab setup for Anapoly AI Labs and to compare output quality between two models (o3 and 4o).
Author and date
Alec Fearon, 24 June 2025
Participants
- Alec Fearon – experiment lead
- ChatGPT model o3 – reasoning model
- ChatGPT model 4o – comparison model
Lab configuration and setup
Interaction took place entirely in the ChatGPT Document Workbench. The initial prompt was duplicated into two branches, each tied to a specific model. All reference files (conceptual framework, lab note structure) were pre‑uploaded to the project space.
Preamble
Alec wanted a concise, critical first draft to stimulate team discussion. The exercise also served as a live test of whether o3’s “reasoning” advantage produced materially better drafts than the general‑purpose 4o model.
Procedure
- Alec issued the same initial prompt in two branches, one running on o3 and the other on 4o.
- Each model produced a single draft lab setup; model 4o was not used beyond this initial output.
- Alec compared the drafts and judged o3’s version clearer and more usable.
- Alec then asked o3 to write a lab note using the standard structure; o3 misinterpreted the brief and drafted a note describing the future lab rather than the chat session.
- Alec clarified the requirement: the lab note should document this chat session.
- This note was created to fulfil the clarified brief.
Findings
Model o3 delivered a structured, audience‑appropriate draft that mapped cleanly to the conceptual framework.
Model 4o’s output was markedly inferior: it was longer, less focused, and ignored some of the constraints (tone, brevity).
Branching is a quick way to compare model behaviour without leaving the chat environment.
Discussion of findings
The reasoning bias in o3 appears helpful for tasks needing structure and adherence to user tone; 4o may still suit other contexts but underperformed here. Clear instructions and a shared reference framework improved both models’ relevance by sharply narrowing the space of acceptable answers, reducing guesswork, and aligning the generated structure with Anapoly’s nine‑component conceptual model. In practice, both drafts mirrored the framework’s headings and language: o3 reproduced them cleanly, while 4o, though weaker overall, broadly followed the required sections.
While editing the discussion of findings under Alec’s guidance, the AI (ChatGPT o3) referred to earlier ad‑hoc tests that were an invention on its part; no such tests had taken place.
Conclusions
o3’s structured reasoning wins – For tasks demanding tight alignment with a predefined framework and disciplined tone, o3 delivered a coherent nine‑section outline with under 5 % irrelevant content. 4o missed two framework elements and introduced roughly 20 % filler.
Branch testing is low‑cost, high‑yield – Running the same prompt in parallel added about three minutes but produced decisive evidence for model selection. This side‑by‑side method is worth standardising as a quick QA step.
Scaffolding curbs hallucination – The explicit conceptual framework and tone constraints kept both models on track, suggesting that well‑built prompt scaffolding is a primary driver of reliability regardless of model choice, although the invented ad‑hoc tests noted above show it does not eliminate hallucination entirely.
Productivity impact – The refined o3 draft is immediately usable for team critique, saving a substantial amount of time on manual outline work and letting facilitators focus on higher‑order thinking.
Recommendations
Keep using o3 for first‑pass structured drafts until 4o catches up in tone control.
Continue branch testing when working on different types of task, so that the best model can be chosen for the task in hand; a scripted version of this comparison is sketched after this list.
Log model choice and outcome in future lab notes for transparency.
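For teams that want to make branch testing a routine QA step outside the chat interface, the sketch below shows one possible way to script a side‑by‑side comparison. It was not part of this session: the model identifiers, prompt text, and output filenames are assumptions, and it presumes access to the OpenAI Python API with an API key set in the environment.

```python
# Hypothetical sketch (not part of the session): send the same prompt to two
# models via the OpenAI API and save both drafts for side-by-side review.
# Model identifiers, prompt text, and filenames are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Draft a first-pass acclimatisation lab setup for Anapoly AI Labs, "
    "following the attached conceptual framework; keep the tone concise and critical."
)
MODELS = ["o3", "gpt-4o"]  # assumed API names for the two models compared

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    draft = response.choices[0].message.content
    # Save each draft to its own file for manual comparison.
    with open(f"draft_{model}.md", "w", encoding="utf-8") as f:
        f.write(draft)
    print(f"{model}: {len(draft.split())} words")
```

Reviewing the two saved drafts side by side would give the same evidence base for model selection as the in‑chat branching used in this session.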
Tags
lab‑setup, model‑comparison, chat‑session, AI‑tools
Glossary
o3 – OpenAI reasoning model used in this session.
4o – OpenAI general‑purpose model (GPT‑4o) used for comparison.
Branching – duplicating a prompt to test different AI models side by side.
Document Workbench – ChatGPT interface where canvas documents are edited.