Over the past month, we have talked to more than 30 RAG builders in many different verticals from all over the world about their use cases and pain points. Here is what we learned.
As companies explore the potential of AI for their enterprises, they are focused on building RAGs on top of their own data. The most common use case among the builders we talked to is the Expert Chatbot, i.e. a chatbot specialized to provide information on an expert domain. These promise to make employees more efficient by expediting their information-seeking tasks.
The second most common use case was using LLMs not as a conversational UI, but to solve one specific employee task, for example automatically distilling reports with a specific structure from a basket of input sources.
Other use cases include extracting structured information from documents, and website chatbots that provide customers and visitors with high-level information sourced from the site.
More than half of the builders we talked to mentioned that quality is a big issue and potentially a show-stopper for their launch. The ones for whom quality was not top of mind were all working on informational web chat-agents, for which “nobody expects them to always give the right answer”.
When asked about specific pain points, chunking and retrieval were mentioned most often. Developers struggle to find the right strategy for chunking their data, adding metadata, and finally getting the relevant chunks retrieved for a query. It’s a tedious task and often it’s unclear what’s needed to solve the problem: maybe the most relevant chunk is not even close to the query in embedding space, or answering a particular question requires reasoning across many different chunks, or the relevant context is cut off during chunking and the LLM cannot answer the question correctly. Successful developers are designing their own data-specific chunking algorithms and annotating the chunks with extracted metadata. None of the people we’ve talked to have attempted to fine-tune their own embeddings, but many report that over-retrieving and reranking helps.
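To make the last point concrete, here is a minimal sketch of the over-retrieve-then-rerank pattern. The `vector_search` helper is a stand-in for whatever vector store you use, and the cross-encoder model name is just one commonly used reranker from the sentence-transformers library; none of this is any particular builder's code.

```python
# Sketch of over-retrieve + rerank; vector_search is a hypothetical placeholder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, vector_search, k_final: int = 5, k_over: int = 50):
    # Over-retrieve: pull far more candidates than we intend to pass to the LLM.
    candidates = vector_search(query, top_k=k_over)  # list of {"text": ..., "metadata": ...}
    # Rerank with a cross-encoder that reads query and chunk together,
    # which usually judges relevance better than raw embedding distance.
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:k_final]]
```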
The second most important pain point was Evaluations. Developers spend a significant amount of time manually evaluating the output of their RAGs for different configurations and debugging errors by hand, often anecdotally with a handful of queries. This makes their experiment and development cycles error-prone, arduous and slow. They are looking for tooling that helps them systematically evaluate their system so they can run several experiments in parallel and do an informed hill-climb on output quality. Like foresight ;)
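The kind of loop they want looks roughly like the sketch below: a fixed eval set, a grid of configurations, and a score per configuration. `run_rag` and `grade_answer` are placeholders for your own pipeline and whatever scoring you use (exact match, LLM-as-judge, human labels, ...).

```python
# Minimal sketch of a systematic eval loop over RAG configurations.
import itertools, statistics

eval_set = [
    {"query": "What is our parental leave policy?", "reference": "..."},
    # ... a few dozen representative queries with expected answers
]

configs = [
    {"chunk_size": cs, "top_k": k}
    for cs, k in itertools.product([256, 512, 1024], [3, 5, 10])
]

def evaluate(config):
    # run_rag and grade_answer are hypothetical stand-ins for your pipeline.
    scores = [grade_answer(run_rag(ex["query"], **config), ex["reference"])
              for ex in eval_set]
    return statistics.mean(scores)

results = sorted(((evaluate(c), c) for c in configs),
                 key=lambda t: t[0], reverse=True)
print("best config:", results[0])
```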
The third pain point was Input Parsing: a lot of data comes in the form of PDFs, tables or infographics. This information needs to be made available to the LLM so it can reason about it; one example is a column chart with yearly revenue bars. But even simpler things, like extracting titles and sections from documents to augment the chunks so they carry the right context, are hard to solve. The bar for image understanding and structure extraction from documents is much lower today than it used to be: LLMs handle semi-structured output gracefully and don’t require, e.g., end-to-end key-value or JSON extraction. Still, developers are struggling with bad parses that confuse the generative part of their RAG.
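The "carry the section title into the chunk" idea can be as simple as the sketch below. It assumes markdown-style "#" headings purely for illustration; real PDFs and scans need a proper parser first.

```python
# Illustrative sketch: prepend the enclosing section title to each chunk so the
# LLM sees the context a bare paragraph would otherwise lose.
def chunk_with_sections(text: str, max_chars: int = 800):
    chunks, buffer = [], []
    current_section = "Document"

    def flush():
        if buffer:
            chunks.append({"section": current_section,
                           "text": f"{current_section}\n" + "\n".join(buffer)})
            buffer.clear()

    for line in text.splitlines():
        if line.startswith("#"):          # new section: close the previous chunk
            flush()
            current_section = line.lstrip("# ").strip()
        else:
            buffer.append(line)
            if sum(len(l) for l in buffer) > max_chars:
                flush()
    flush()
    return chunks
```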
Prompting came up at #4. It is a pain, but developers have accepted it and found ways to deal with it. Few of them spend much time on the prompts themselves: usually they start with an initial prompt and quickly realize that fixing the issues mentioned above promises a much higher quality gain for the overall system than gold-plating the prompt.
With the rise of larger and better-aligned models, hallucinations are no longer one of the main concerns. The new large models behave well and follow the prompt quite rigorously, so when instructed to only use the provided context, they obediently do so.
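The "only use the context" instruction is typically just a template along these lines; the wording below is illustrative, not a recommendation from any of the builders we spoke with.

```python
# A typical grounded-generation prompt template (wording is illustrative).
GROUNDED_PROMPT = """You are an assistant answering questions about internal documents.
Use ONLY the context below. If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
"""
```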
Of the 30-odd people we’ve talked to that are building RAGs, only two have a RAG in live production. Both of these companies realized early that building an LLM app at production grade requires significant engineering and quality work, and they have been willing to invest in it.
Most of the other builders are still in the exploration phase of creating prototypes; some are being tested by a small group of select users, others have already been “paused”. Many of them are still trying to find a good trade-off between quality, value and cost. The stakeholders who asked for these systems are wary of launching an AI, publicly or even internally, that might make more news for its embarrassing behavior than for the time it saves its users.
Some are now testing in beta with a slightly larger audience, but their hand is still steady on the handbrake in case things go south. The apps in this bucket are mostly question-answering / informational chatbots on websites. For those, the quality bar is usually not as high as for expert systems that threaten to replace humans, who will make sure to scrutinize them thoroughly before accepting them.
There is no doubt that LLMs will be employed more and more to automate processes and help with information-seeking tasks. But adoption is going slower than expected because, despite the amazing capabilities of the LLMs themselves, RAGs still require quite a bit of work to get right. Specifically, easier ways to help developers with chunking and retrieval are needed. Additionally, the decades-old problem of parsing input reaches a new level today, as the demand for processing unstructured input is propelled by the adoption of AI on proprietary data.
With the LLM we have a great engine, but we can only use it if we build a good car around it.