LLMs for creative writing

For those of you who don’t know this about me, I consider myself a huge bookworm. I read around 5 novels a month, and I’m not even counting short stories. Space operas, sci-fi and … romance ;) are my drugs of choice.

So I wanted to know what the research actually says about using large language models to write fiction. I went looking through arXiv, the journals, Reddit and Substack in search of facts about how authors use generative technology to write. What follows is the honest version of that, including the parts where the evidence is thin.

The academic evidence is strong and surprisingly consistent on one thing - LLMs flatten creative output across people. It is also strong on the ethics and copyright fight, because that part is happening in courtrooms and has a paper trail. It is much weaker on what the majority of us would consider “low-hanging efficiency gains” - helping authors edit, brainstorm and draft their work.

Getting a general sense of the volume of LLM use for creative writing

Let’s start with the obvious question - which models are actually being used, and how good are they at fiction? We’ll look at the proprietary models first for one simple reason - they track the data we need.

“How People Use ChatGPT” (NBER WP 34255) sheds some light on this, albeit looking only at the OpenAI models. Anthropic’s Economic Index, updated on 26 Jun, 2026, shares the equivalent data for Claude models. To make sense of that data we also need the size of the global user base for each of the two.

After wrangling the data a bit we can infer the global number of users leveraging the mainstream LLMs for writing fiction.

Model vendor	Fiction share of usage (± est.)	Estimated global user base	Fiction writers as a share of all assistant users, reach-weighted (± est.)	Estimated number of fiction writers
OpenAI	2% (~1.5 - 2.6%)	~1B monthly active	~1.9% (~1.5 - 2.6%)	about 20M people (15 - 26M)
Anthropic	6.3% (~4.5 - 8.8%)	~30M monthly active, consumer only - API and enterprise not counted	~0.18% (~0.07 - 0.45%)	about 2M people (0.8 - 5M)

The last two columns deserve an explanation. A 6.3% share (for Anthropic) sounds like the bigger number until you remember it sits on a user base around thirty times smaller. Once you weight by reach the ranking flips, and OpenAI ends up with roughly ten times as many people writing fiction as Anthropic.
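The reach-weighting behind those columns is plain arithmetic, and a minimal Python sketch makes it easy to re-run with other inputs. The point estimates are copied from the table above; the names are mine, the ± ranges are left out because the table does not say how they were derived, and the sketch treats share of usage as share of users, as the table appears to.

```python
# Point estimates copied from the table above (not independently verified).
VENDORS = {
    "OpenAI": {"fiction_share": 0.02, "monthly_users": 1_000_000_000},
    "Anthropic": {"fiction_share": 0.063, "monthly_users": 30_000_000},
}


def fiction_writers(vendor: dict) -> float:
    """Estimated people writing fiction = share of usage x monthly active users."""
    return vendor["fiction_share"] * vendor["monthly_users"]


def reach_weighted_share(name: str) -> float:
    """One vendor's fiction writers as a share of ALL users across both vendors."""
    total_users = sum(v["monthly_users"] for v in VENDORS.values())
    return fiction_writers(VENDORS[name]) / total_users


if __name__ == "__main__":
    for name, vendor in VENDORS.items():
        writers = fiction_writers(vendor)
        print(f"{name}: ~{writers / 1e6:.1f}M writers, "
              f"{reach_weighted_share(name):.2%} of all assistant users")
```

With these inputs the sketch gives about 20M writers for OpenAI and about 1.9M for Anthropic, which is the roughly ten-to-one ratio described above.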

Look at the number again - that’s approx. 22M users who write fiction. To put that in perspective:

Source	People writing fiction, per month
ChatGPT + Claude, fiction (our estimate)	~22M
Wattpad, active writers (~10% of users)	~2.5M
NaNoWriMo, peak participation (Nov 2022)	~400K
Royal Road, original web-fiction authors	~23K

The number of people using LLMs to write fiction is roughly 10x the number of active writers on Wattpad, a repository of amateur novels, and roughly 1000x the number of original web-fiction authors on Royal Road! Let’s hold that thought for a moment.
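To see where the multiples come from, the ratios can be computed directly from the comparison table. A small Python sketch using only the figures above (the dictionary and function names are illustrative):

```python
# Monthly figures copied from the comparison table above.
LLM_FICTION_WRITERS = 22_000_000

COMMUNITIES = {
    "Wattpad active writers": 2_500_000,
    "NaNoWriMo peak participation": 400_000,
    "Royal Road authors": 23_000,
}


def multiples(baseline: int, others: dict) -> dict:
    """How many times larger the baseline is than each other group."""
    return {name: baseline / count for name, count in others.items()}


if __name__ == "__main__":
    for name, ratio in multiples(LLM_FICTION_WRITERS, COMMUNITIES).items():
        print(f"{name}: ~{ratio:.0f}x")
```

This works out to roughly 9x for Wattpad, 55x for NaNoWriMo and just under 1000x for Royal Road.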

The open source models

Open weights are the one place I have to leave a blank. They ship no usage telemetry, and the single large dataset that does exist - OpenRouter’s hundred-trillion-token study - folds creative writing into one “roleplay” bucket alongside games, interactive fiction and adult content, so prose authoring cannot be separated out.

The tools that would isolate it, the dedicated fiction apps like NovelAI, Sudowrite and Novelcrafter, are all multi-model, so even their user counts cannot tell you how much runs on an open model rather than on Claude or GPT. I can’t put a number on how many people write fiction with open-source models - the capability is clearly there, but the usage data is simply missing.

How good are the models at (creative) writing

The Confederacy of Models paper, published in 2023, evaluated LLMs on creative writing (precisely what we’re interested in).

The task was for the LLM to generate an end-to-end story given a simple prompt: Write an epic narration of a single combat between Ignatius J. Reilly and a pterodactyl, in the style of John Kennedy Toole. The same prompt was given to a few human participants. It was also humans who scored all stories, and here’s how they graded them:

Model	score (%)
GPT-4	80.2
Claude 1.2	74.4
Human baseline	70.1
GPT-3.5 / ChatGPT	63.0
Vicuna	59.0
Bard	48.2

We’re omitting most of the models tested in the paper to get the point across more clearly - two of the leading LLMs at the time beat humans.

Most recently, Hemingway-bench revived the attempt to evaluate how LLMs handle creative writing. Published in February 2026, it is a leaderboard powered by expert human writers who spend hours, not seconds, evaluating frontier models on real writing tasks across the creative, business, and everyday spectrum.

I encourage you to give their blog a read. We’re just going to focus on scores here.

Model	Elo ranking
Gemini 3 Flash	1111
Gemini 3 Pro	1101
Claude Opus 4.5	1072
GPT 5.2 Chat	1057
Kimi K2.5	1050
GPT 5.2	1012
QWEN 3 Max	1001
Grok 4.1 Fast Reasoning	986
Llama 4 Maverick Instruct	885
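
For readers unfamiliar with Elo, a rating gap can be translated into an expected head-to-head preference rate. The sketch below uses the standard logistic Elo formula with a 400-point scale. Whether Hemingway-bench uses exactly this scale is my assumption, so treat the percentages as a rough guide only.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation that A is preferred over B.

    The 400-point scale is the chess convention; it is an ASSUMPTION
    that the leaderboard above uses the same one.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings copied from the table above.
    print(f"{expected_win_rate(1111, 1101):.3f}")  # Gemini 3 Flash vs Gemini 3 Pro
    print(f"{expected_win_rate(1111, 885):.3f}")   # Gemini 3 Flash vs Llama 4 Maverick
```

Under that assumption, the 10-point gap at the top of the table is close to a coin flip (about 51%), while the 226-point gap between the first and last rows corresponds to roughly 79%.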

I found a few other benchmarks published between 2024 and 2025 - I encourage you to take a look.

WritingBench, published in 2025, offers a different perspective on writing. It focuses on tasks that are more mechanical in nature, such as formatting, translation and constraining the output (e.g. making sure the text is around 30 minutes long). The benchmark uses an LLM-as-a-judge evaluator and does not compare the results against humans.

On these more “mechanical” tasks the OpenAI models didn’t do as well as Claude. Note that these were also different models (GPT-4o and o1-preview, as opposed to the GPT-4 tested in the Confederacy of Models paper).

Model	Score (0..10)
Claude 3.7 (thinking)	7.91
Claude 3.7	7.85
DeepSeek-R1	7.7
Qwen-Max	7.16
o1-preview	6.89
GPT-4o	6.81
Qwen-2.5-72B	6.4
DeepSeek-V3	6.35
LongWriter-9B	6.27
Gemini 1.5 Pro	6.21
Mistral Large	6.0
Qwen-2.5-7B	5.64
Llama-3.3-70B	5.05
Llama-3.1-8B	4.42
Suri	3.2

Direct LLM prompting vs Agentic Harness vs Application

You can get a model to help you write in three ways - prompt it directly in a chat box, drive it through an agentic harness (coding tools like Claude Code, Codex or Gemini CLI, repurposed for prose), or use a purpose-built writing application. Direct prompting is by far the largest and was covered earlier - the ~22M monthly fiction writers are overwhelmingly people typing into ChatGPT and Claude. The rest of this section covers the other two.

I found a few popular applications used for creative writing.

App	Focus	Model(s) behind it	Monthly reach (creative writing)
Sudowrite	Prose-first novel assistant	Own Muse fiction model, plus Claude (Sonnet), GPT, Gemini	~400K visits/mo
Novelcrafter	Configurable workspace with a story “codex”	Bring-your-own-key via OpenRouter - mainly Claude, GPT, Gemini (Grok / DeepSeek for NSFW)	~453K visits/mo
NovelAI	Story + image generation, minimal filters	Own models (Kayra / Erato)	~4.3M visits/mo (incl. image gen)
Squibler	AI book and screenplay writer	Not publicly disclosed (GPT-class)	~695K visits/mo (~20K active writers)
Laterpress	Structure-driven fiction editor	GPT (OpenAI) and Claude (Anthropic) only	no data found
ProWritingAid	AI-assisted editing and critique	Undisclosed third-party LLMs	no data found

As for agentic harnesses, I found three open-source projects - Claude Book, Claude-Code-Novel-Writer and creative-writing-skills - with ~450 GitHub stars combined.

My personal experience

I am in the process of writing 2 books - both novels. In my process, I write the story myself and use generative AI for smaller tasks: critique, correcting my grammar, and styling the language (e.g. write this as a Gen Z-er would).

I use Claude 4.8 through the Claude Code harness with “effort” set to “Harder”. Overall - it SUUUUCKS so much.

Critique

I tried it a few times and gave up - it flags very generic issues in my writing, and those were ALWAYS things I had done on purpose to achieve a particular narrative or stylistic effect.

Grammar correction

Hats off - this works VERY well

Styling

This is effectively optimization under multiple constraints, and Claude sucks sooo badly at it. First of all, it cannot “deconstruct” or “analyse” the existing stylistic constraints. Even when I explicitly ask it to do that, and then feed the results back in so it can check the text against its own metrics - in a fully autoregressive manner - it finds a lot of discrepancies, proving it did a bad job.

Adding style modifications also works like a sieve - some parts of the text will be changed, others won’t.
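
One way to put a number on the sieve effect is to compare the text before and after a style pass, paragraph by paragraph, and count how many paragraphs came back untouched. This is a minimal sketch using Python’s standard `difflib`. The function name, the blank-line paragraph split and the 0.98 similarity threshold are all my own illustrative choices.

```python
from difflib import SequenceMatcher


def untouched_paragraphs(before: str, after: str, threshold: float = 0.98) -> list[int]:
    """Return indexes of paragraphs the style pass left (almost) unchanged.

    Assumes the rewrite keeps the same number of paragraphs, separated by
    blank lines. The threshold is an arbitrary illustrative value.
    """
    old = [p.strip() for p in before.split("\n\n") if p.strip()]
    new = [p.strip() for p in after.split("\n\n") if p.strip()]
    if len(old) != len(new):
        raise ValueError("paragraph counts differ; align the texts first")
    return [
        i for i, (a, b) in enumerate(zip(old, new))
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]


if __name__ == "__main__":
    original = "She walked to the door.\n\nHe said nothing at all."
    restyled = "She lowkey drifted over to the door, no cap.\n\nHe said nothing at all."
    print(untouched_paragraphs(original, restyled))
```

In the toy example the second paragraph comes back identical, so it is the one reported as having slipped through.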

Discussion

I have purposely steered clear of the ethical and legal implications of using LLMs for creative writing, trying to focus on the technology. If you would like me to discuss that - please add a comment.

It’s clear that LLMs are used in this field, but based on the data and my personal experience, the one thing I can conclude is that the existing models and apps have a long way to go before they can actually be useful for assisted writing.

As for the task of end-to-end writing (creative writing), the benchmarks seem to show the models are capable of doing it quite well, better than humans in fact. I would however argue that this work would fall short under constraints. Specifically, the benchmarks seem to use a single prompt and evaluate the model’s answer to it. They didn’t try sending additional prompts that modify the existing text - possibly many times in a row, until mode collapse sets in.
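
A multi-turn test of that kind could look roughly like the sketch below: apply a series of revision instructions one after another and, after each round, re-check every constraint introduced so far. Everything here is hypothetical, including the `Revise` callable (a stand-in for whatever model call you use) and the toy constraints. None of the benchmarks cited above is known to work this way.

```python
from typing import Callable

# Hypothetical stand-in for a model call: (current_text, instruction) -> new_text.
Revise = Callable[[str, str], str]
# A constraint is an instruction plus a checker that says whether the text obeys it.
Constraint = tuple[str, Callable[[str], bool]]


def rounds_survived(text: str, constraints: list[Constraint], revise: Revise) -> int:
    """Apply constraints one per round; stop when any earlier one is broken.

    Returns the number of rounds after which ALL constraints so far still held.
    """
    for round_no, (instruction, _) in enumerate(constraints, start=1):
        text = revise(text, instruction)
        if not all(check(text) for _, check in constraints[:round_no]):
            return round_no - 1
    return len(constraints)


if __name__ == "__main__":
    # Toy "model" that forgets earlier edits: it only obeys the latest instruction.
    def forgetful(text: str, instruction: str) -> str:
        return "THE DRAGON SLEPT." if instruction == "uppercase" else "the dragon slept"

    checks: list[Constraint] = [
        ("uppercase", str.isupper),
        ("no full stop", lambda t: "." not in t),
    ]
    print(rounds_survived("The dragon slept.", checks, forgetful))
```

The toy model obeys the first instruction but drops it as soon as the second arrives, so it survives only one round. A real harness would need checkers that can judge stylistic constraints, which is the hard part.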

As always, I invite you to comment and subscribe. Thank you for reading.

References

22M estimate - derived from “How People Use ChatGPT” (NBER WP 34255) and Anthropic’s Economic Index
Wattpad - active-writer share (~10% of users) and monthly writer counts, via Wattpad statistics
Open-weights usage - the 100T-token report folds creative writing into a “roleplay” bucket alongside games and adult content, via OpenRouter State of AI
Confederacy of Models - 2023 human-scored creative-writing eval where GPT-4 (80.2) and Claude 1.2 (74.4) beat the human baseline (70.1), via arXiv 2310.08433
Hemingway-bench - expert-human-scored writing leaderboard (Feb 2026), via Surge AI
Other writing benchmarks (2024-2026) - Creativity Index, NoveltyBench, and EQ-Bench Creative Writing
WritingBench - mechanical, constraint-driven writing tasks scored by an LLM-as-a-judge (2025), via arXiv 2503.05244
Sudowrite - ~$1.8M ARR via GetLatka, and ~400K monthly visits via Similarweb
Novelcrafter - ~330K monthly visits (Jan 2026) via Semrush, and a 2-person team selling into 114 countries via Paddle case study
NovelAI - ~4.3M monthly visits (Apr 2026, incl. image generation), via Similarweb, and 10K+ subscribers / 40K registered (2021), via CoreWeave case study
Squibler - ~695K monthly visits and 20K+ active writers (2026), via Semrush
App models - Sudowrite prose modes (Muse / Claude / GPT), via Sudowrite docs. Novelcrafter via OpenRouter and NSFW models. Laterpress (OpenAI + Anthropic only), via Introducing Laterpress AI
OpenAI Codex - 5M+ weekly users (Jun 2026), ~20% non-developers, via Codex statistics
Claude Code - ~$2.5B annualised run rate, 71% primary-tool share among AI-agent developers, via Claude Code statistics
Agentic harness for fiction - open-source frameworks at low-hundreds of GitHub stars combined: Claude Book (94), Claude-Code-Novel-Writer (71), creative-writing-skills (286), as of Jun 2026