2024/05/15
Many of you have reached out to ask if we’ll be implementing the new models that OpenAI announced yesterday. To help answer that, we decided to share this blog post we’ve been working on to explain our process of evaluating new AI models for AI Dungeon.
Since the early days of Large Language Models (LLMs), we’ve worked hard to use the most advanced models in the world for AI Dungeon. We’ve seen incredible advances in the power of these models, especially in the past 6 months. We’d like to share more about how we think about AI Models at AI Dungeon, including the entire lifecycle of selection, evaluation, deployment, and retirement. This should answer some questions we’ve seen in the community about the decisions we make and what you can expect from models in AI Dungeon in the future.
AI Dungeon was born when our founder, Nick Walton, saw the launch of OpenAI’s GPT2 model and wondered if it could become a dynamic storyteller (just like in Dungeons and Dragons).
Spoiler: it worked! 🎉
A hackathon prototype turned into an infinite AI-powered game unlike anything before. From the very first few days, the cost of running an AI-powered game became readily apparent. The first version of AI Dungeon cost $10,000/day to run (so much that the university hosting the first version had to shut it down after 3 days!). Thus began our constant quest to identify and implement affordable and capable AI models so that anyone could play AI Dungeon.
The first public version of AI Dungeon (in December 2019) was powered by GPT-2. Later, we switched to using GPT-3 through OpenAI (in 2020). While it was exciting to be using the state-of-the-art AI tech at that time, unlike today, there were essentially no other competitive AI models, commercial or open source. We couldn’t just switch models if issues arose (and they certainly did). When you asked us for cheaper options or unlimited play, we didn’t have the leverage to advocate for lower costs for you since there was no competition creating price pressure.
But that was all about to change. New open-source and commercial models entered the market, and we explored them as they became available. The open-source GPT-J (summer of 2021) and GPT-NeoX were promising, and AI21’s Jurassic models (Fall of 2021) were explored over the next few years. Fast-forward to today, there are hundreds of models and model variants. AI Dungeon is uniquely positioned to leverage new advances in AI from various providers at scale.
Given the number of available models, picking which models to check out can be tricky. Evaluating and deploying models takes time. Here are some of the ways we think about this process:
We’ve made a few choices that impact how we handle AI model work in AI Dungeon. Together, these allow us to give you the best role-play experience we can.
At first, we evaluated every model that launched. Early providers included OpenAI, Cohere, AI21, and Eleuther. Lately, we haven’t been able to keep up with the rate of new models being launched. Here’s one visualization of just how the AI Model space has accelerated.
We’re selective about which models to evaluate. We base that decision on information we source from the AI community on X, LLM leaderboards, our technology partners, and members of our AI Dungeon community.
When a model piques our interest and seems like it could be worth exploring (when it could have a desirable combo of cost/latency/quality/etc), we do some light exploration around feasibility and desirability. If there’s a playground where we can test the model, we’ll play around a bit ourselves to see what we think. We also talk to our current providers to see how/when they may offer a model at scale.
If everything seems positive, we move into our model evaluation process.
Once we’ve identified a model we are interested in, then the real evaluation starts. Here are the steps we take to verify if a model is worth offering to you in AI Dungeon:
At any step of the process, we may decide to stop evaluation. Most models don’t make it through our evaluation process to become an offered model on AI Dungeon.
Once we commit to offering a model on AI Dungeon, we then figure out the best way to run it at scale. With private models we often can only run them with the creator of the model (like AI21’s models). For open-source models, we can choose between running on rented hardware or using other providers that run LLMs as a service (which is our preference). By optimizing our model deployment costs we’re able to deliver better AI to users for the same price.
We also have an alert system and series of dashboards that show us the number of requests, average context in and out, latency profiling (average request time, max request time), and estimated cost. This lets us keep our AI models running smoothly and quickly respond to any issues that come up.
Given the complexity of models, it’s sometimes necessary to retire models that are no longer adding much value to the community. While it would be nice to offer every AI model perpetually, maintaining models takes time and development resources away from other improvements on AI Dungeon, including new AI models and systems.
Because of that, we need to balance the value a model provides against the cost of maintaining it (especially in developer time). We’re guessing most of you are no longer pining for the good old days of GPT-2 😉.
Before deciding to retire a model, we consider usage, tech advances (i.e., instruction-based models), latency, uptime, stability, error rates, costs, player feedback, and the general state of models in AI Dungeon (i.e., how many do we have for each tier).
Each model is unique, like an ice cream flavor. Taking away your favorite flavor can be frustrating, especially if that model does things that other models don’t (like mid-sentence completion). We hope there’s solace in the fact that when models are retired, the recovered development resources are reinvested into better models and new features that make AI Dungeon a better experience for you.
Today here’s the % breakdown of model usage for various models:
MythoMax 73%
TieFighter 17.8%
Mixtral 8.8%
Griffin 0.4%
Mixtral 79%
MythoMax 8%
WizardLM 8x22B 5%
TieFighter 4%
Llama 3 70B 2%
Dragon 1%
GPT-4-Turbo 0.5%
ChatGPT 0.4%
Griffin 0.01%
You’ll notice a few things. MythoMax is our most popular model, even capturing some use from paid players who have access to all the models. Mixtral is the clear favorite for premium players.
Because of the advances in tech as well as low usage, we will be retiring Griffin, Dragon, and ChatGPT models on May 31st, 2024. Griffin, while it’s served us well, has exceptionally low usage, the worst uptime of all our models, and a high rate of errors. It requires more developer maintenance than all other models we offer. Dragon and ChatGPT also have lower usage now. Retiring models enables us to focus on other product work including additional model improvements, bug fixing, and building new features.
GPT-4-Turbo is somewhat of an outlier. Despite its moralizing, it’s one of the best story writing models available. Players who use it love it! While its usage rate is low relative to other models, it’s actually well represented for a model only available to Legend and Mythic tiers, though it’s clear players still favor Llama 3 70B and WizardLM 8x22B. We are evaluating the recently announced GPT-4o as a potential replacement for GPT-4 Turbo which could mean offering higher context lengths. Although venture-funded OpenAI says they’ll offer limited use of GPT-4o for free through their own ChatGPT client, it will not be a free model for API users (like AI Dungeon), so it will still be a premium model for us. First, though, it needs to pass the evaluations we’ve outlined above.
This was a deeper peek into our approach to models than we’ve ever given. We hope it’s clear that we spend a lot of time thinking about which models we can offer to you and how to provide them best.
Thank you to all who have given feedback on our AI models. We will continue to communicate as much as we can about models and planned model improvements. It’s exciting to realize AI will only get better from here. The past few months have shown us just how fast things can change. And we’re excited to explore with you how much better role play can be as AI keeps improving.