Ouroboros LLMs and their impending entropy problem


Ask me what my principal concern is with large language models and procedural art generators, and I’ll say attribution. I’ve mentioned it a few times, but I find it galling that commercial AIs require data to be trained against, yet the creators of said data are never asked, acknowledged, or compensated. Ask why, and you’re met with silence, contempt, or dubious comparisons made to the camera, or something.

But for those who care not for social responsibility, another issue looms. I talked with a few of you on Mastodon early this year about LLMs feeding on the output of other LLMs, and what effect that might have on their quality. I’m starting to see more discussion around this, and it raises interesting questions about the tech.

The first batch of plagiarism-as-a-service tools was trained against human-generated data. Granted, there’s always been procedurally-generated stuff on the Internet, but it was probably easy enough to filter out. Mediocre but plausibly human-sounding chatbot output now abounds, and it’s only a matter of time before it constitutes the bulk of the modern web. It’s a dim thing to be excited about, but don’t tell that to the latest shipload of charlatans who gave up peddling blockchained tulips.

More to the point though, as any cryptographer will tell you, this reduction in entropy is a disaster if your model requires original thought.
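
For the curious, here’s a toy sketch of what that loss of entropy looks like. It’s emphatically not how a real LLM is trained: I’m assuming a made-up “model” that’s nothing more than a word-frequency table, which writes a small corpus, then gets retrained on only that corpus, over and over. The names `simulate` and `shannon_entropy`, and all the parameter values, are hypothetical choices for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def simulate(vocab_size=50, samples_per_gen=100, generations=20, seed=0):
    """Retrain a toy frequency-table 'model' on its own output.

    Assumption: generation zero is a uniform distribution over the
    vocabulary, standing in for varied human-written text.
    Returns the entropy of the model at each generation.
    """
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    probs = [1.0 / vocab_size] * vocab_size
    entropies = [shannon_entropy(probs)]

    for _ in range(generations):
        # The current model "writes" a corpus by sampling words.
        corpus = rng.choices(vocab, weights=probs, k=samples_per_gen)
        # The next model is fitted to that corpus and nothing else.
        counts = Counter(corpus)
        probs = [counts[word] / samples_per_gen for word in vocab]
        entropies.append(shannon_entropy(probs))

    return entropies


if __name__ == "__main__":
    for generation, bits in enumerate(simulate()):
        print(f"generation {generation:2d}: {bits:.3f} bits")
```

Any word that fails to appear in one generation’s corpus has a probability of zero from then on, so the vocabulary can only shrink. The entropy can wobble between generations, but it never climbs back above where it started, and it tends to drift downward.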

This leaves LLM engineers in a bit of a pickle. Either they continue training against data created in the before times, which will lead to a reduction in timeliness and relevance; or they work out a way to tell human creativity and LLM pollution apart. Think of an electric air filter powered by a coal plant: the more you burn, the more filters you need, which will burn more, and so on.

This arms race will be won by the company that can deliver tech to tell human and LLM output apart. Whether or not that ends up being possible, it may be moot for another reason.

What dismays me about these tools, besides the obvious ethical questions, is that they’re actively usurping creativity. You could argue that there’s not much loss in replacing human-generated boilerplate with the output of an LLM, but they’re already pushing out human art and culture at a depressing clip. Technical talking heads are oddly proud to admit they can’t tell human and generated material apart; or as I like to call it, the “I can’t believe it’s not butter” defence. As long as uncritical people like that exist, expect more creative people to be sidelined as generated material is deemed sufficient.

You’d have to be naive to the point of being an LLM advocate to think this won’t have a perpetual chilling effect on the arts, writing, and other creative endeavours. This is a literal tragedy of the commons, and in an ironic twist, will end up further accelerating the widening ratio of generated to human content.

I could be wrong; there may always be enough dreams, love, and creativity to exploit at a financially-viable scale. But it’d certainly be karma for these LLMs to choke on the exhaust of their own hubris eventually.

Author bio and support


Ruben Schade is a technical writer and infrastructure architect in Sydney, Australia who refers to himself in the third person. Hi!

The site is powered by Hugo, FreeBSD, and OpenZFS on OrionVM, everyone’s favourite bespoke cloud infrastructure provider.

If you found this post helpful or entertaining, you can shout me a coffee or send a comment. Thanks ☺️.