It’s been more than two years since the release of ChatGPT ushered in the era of widespread use of Large Language Models (LLMs). When I was working in Natural Language Processing in 2018 I used to think things were moving fast because there seemed to be big releases every month – now it feels like they are nearly every day!
As we close out 2024, I wanted to step back and reflect on the developments that I think will have the biggest impact in the years ahead. I’ll touch on:
- Multimodal models: the ability for models to take in not just text, but images, video, and audio – and output in those formats as well
- Long context models: models can now consider more information at a time
- Agents: the ability for models to use tools and take independent action to accomplish a task
- Advances in AI Code Assistants: these tools now do much more than answer questions about code, and can create working prototypes with little guidance
- Computer use: the ability for LLMs to interact with computers and the web like people do
While there were many other noteworthy developments, these are the ones that I think will have the broadest impact in the coming months.
Multimodal Models
Data comes in many forms, so when ChatGPT was originally released and could only handle text, people had to create workarounds. No more: the leading LLMs can all handle a variety of inputs, making them much more useful to end users. In many cases you don’t even need to use OCR to extract text from images or documents, because the models handle typed and handwritten text surprisingly well.
The impact of this is that you can upload documents, images, or video and ask questions directly. For example, you can upload a video, ask “what did the man in the red shirt do?”, and get an answer back from the model.
OpenAI’s Advanced Voice Mode, in which the user speaks to the model and gets spoken responses (in an often-eerily-human voice), opens new ways of interacting with the models. I use it for brainstorming sessions, but my daughter loves “alternating stories” where she and ChatGPT take turns moving the plot of a story along (if you have a kid around in the 6-10 range I suggest trying it out!).
Long-Context Models
The amount of information that an LLM can consider at one time is called the “context window”. When GPT-4 was released in 2023, it could only work with a few thousand words at a time (roughly 8,000 tokens). As powerful as the model was, this short context limited its usefulness. Since then, context windows have grown to as many as 2 million tokens in Google’s Gemini 1.5 Pro, an increase of roughly 250x in a little over a year. Open-source models are not far behind.
This enhancement enables new tasks to be completed quickly and cheaply by LLMs. For example, with Gemini 1.5 Pro a user can upload hundreds of pages of documents and ask for summaries and insights. I recently uploaded the National Intelligence annual reports from 2015, 2019, and 2023 to Gemini. In about 30 seconds, I not only had summaries of the key points from each year, but also insight into what had changed in the reports from 2015 to 2023, something that would have taken me hours to do by myself. The implications of this for intelligence analysis are significant.
If you want feedback on a workflow, you can record a video of it and ask the LLM how to improve it. Lastly, this longer context will also enable models to recall more of a conversation with a user, allowing for greater personalization across a range of applications.
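To make the context-window numbers above concrete, here is a rough back-of-the-envelope sketch. It assumes about 0.75 words per token (a common rule of thumb for English, not an exact figure), uses GPT-4's 8,192-token launch window as the baseline, and uses made-up report sizes. The function names are hypothetical, not part of any vendor's API.

```python
# Rough check of whether a set of documents fits in a model's context window.
# Assumption: ~0.75 words per token, a rule of thumb for English text;
# real tokenizers vary by model and language.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Estimate a token count from a word count using the rule of thumb."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_counts: list[int], context_window: int,
                    reserve_for_reply: int = 4_000) -> bool:
    """Return True if all documents plus room for a reply fit in the window."""
    total = sum(estimate_tokens(w) for w in word_counts)
    return total + reserve_for_reply <= context_window


# Three hypothetical 40-page reports at ~500 words per page.
reports = [40 * 500] * 3

print(fits_in_context(reports, context_window=8_192))      # False
print(fits_in_context(reports, context_window=2_000_000))  # True
print(round(2_000_000 / 8_192))  # 244, i.e. roughly the 250x quoted above
```

Under these assumptions, the three reports come to about 80,000 tokens: far too large for the earlier window, but a small fraction of a 2-million-token one.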
Agents
I’d be remiss if I didn’t talk about “agents”, which seems to be the trendiest AI term of the latter half of 2024. Despite its popularity, people seem to have trouble agreeing on a definition. For this conversation, I’ll define agents as LLMs that can operate independently for some period of time to accomplish a goal.
Agents enable LLMs to specialize in given tasks, which can greatly improve the quality of their output. For example, if you ask a single model to write a story, it will do a decent job. However, if you have a team of agents (one for planning, another for character development, and others for writing and editing), you’ll get much richer results.
While companies are still figuring out how to make the best use of agents, agents are starting to be integrated into a range of software, from Microsoft Office to code editors.
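The story-writing team described above can be sketched as a simple hand-off pipeline. This is a minimal illustration rather than how any particular agent framework works: the role prompts, `run_story_team`, and `fake_llm` are all hypothetical, and `fake_llm` stands in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns model text.
LLM = Callable[[str], str]

# Role instructions mirroring the team described above (illustrative wording).
ROLES = [
    ("planner", "Outline a plot for this story idea."),
    ("character developer", "Add character details to this outline."),
    ("writer", "Write the story from these notes."),
    ("editor", "Edit this draft for clarity and pacing."),
]


def run_story_team(idea: str, llm: LLM) -> str:
    """Pass the work through each role in turn; each agent sees the prior output."""
    work = idea
    for name, instruction in ROLES:
        prompt = f"You are the {name}. {instruction}\n\n{work}"
        work = llm(prompt)
    return work


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: just reports which role was asked."""
    role = prompt.split(".")[0].removeprefix("You are the ")
    return f"[{role} output]"


print(run_story_team("A lighthouse keeper finds a map.", fake_llm))
# [editor output]
```

The design choice worth noting is that each role gets a narrow instruction and only the previous role's output, which is one simple way to get the specialization described above.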
AI Code Assistants
AI Code Assistants now do so much more than answer questions or make simple suggestions. As much as I’ve been using GenAI offerings from a range of companies over the years, the recent releases from companies including Cursor, Codeium, Replit, and Microsoft make me feel like I’m using some magical incantation to create working code.
In minutes people can create working apps, using just natural language, that lead to real productivity gains. If you have some programming experience and a little more time, you can create nearly production-ready apps. In my experience I’m easily 3x-5x more productive than I otherwise would be, especially since I’m not coding daily.
Computer Use
Until recently, LLMs have been restricted to chat apps and interacting with tools via APIs. However, Anthropic (one of OpenAI’s biggest competitors) announced computer use in October 2024. It’s rumored that OpenAI will announce similar features soon, and many startups are working on this.
Computer use enables LLMs to interact with computers and the web like a human would, by moving the mouse and typing in search boxes. This greatly increases their possible utility, but also the corresponding risk. When combined with agents, an LLM enabled with computer use could set off “on its own” to perform tasks on the web, such as going to a search engine to find a document, downloading the document, summarizing it for you, and creating a visualization to explain it.
People are just beginning to explore how to make use of this new functionality in a safe way, but it opens a world of possibilities as we look ahead to 2025.
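One way to picture a computer-use agent with a safety check is a loop that asks the model for one action at a time and pauses for human approval on risky steps. The sketch below is purely illustrative: the `Action` type, the action names, and the approval policy are assumptions made for this example, not Anthropic's API or any vendor's, and a scripted stand-in replaces the real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "download", "done" (hypothetical names)
    target: str = ""   # e.g. a button label or a text field
    text: str = ""     # text to type, if any


# Actions a human must approve before they run (an illustrative policy).
NEEDS_APPROVAL = {"download", "submit"}


def run_task(next_action: Callable[[list[Action]], Action],
             approve: Callable[[Action], bool],
             max_steps: int = 10) -> list[Action]:
    """Ask the model for one action at a time until it says it is done."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action.kind == "done":
            break
        if action.kind in NEEDS_APPROVAL and not approve(action):
            break  # stop rather than take a risky step without consent
        history.append(action)  # a real system would execute the action here
    return history


# Scripted stand-in for the model: search for a document, then download it.
SCRIPT = [
    Action("click", target="search box"),
    Action("type", target="search box", text="annual report"),
    Action("download", target="first result"),
    Action("done"),
]


def scripted_model(history: list[Action]) -> Action:
    return SCRIPT[len(history)]


steps = run_task(scripted_model, approve=lambda action: True)
print([a.kind for a in steps])  # ['click', 'type', 'download']
```

If the approver declines the download, the loop stops after the first two steps. The step limit and the approval gate are two simple guardrails; real deployments would need considerably more.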
Conclusion
At Redhorse we are exploring how to use these tools to improve our internal operations as well as better support our customers. By using these tools on a daily basis, we are able to provide better advice and real solutions, ranging from more insightful analytics to more efficient software development.
In a future blog post I’ll turn my attention to what I think 2025 holds for AI and the use cases those developments will enable.