I was recently invited by kaiko.ai to talk about the evaluation of LLMs in the Oncology / Multimodal Applications track of AMLD 24 in Lausanne. In my talk, I walked through building a real RAG with information about cancer and outlined the common problems that come up along the way, and how to fix them. Given the positive feedback from the audience, I decided to share the whole process with the readers of the fore ai blog.
Presentation at AMLD 24 in Lausanne.
When ChatGPT was released one and a half years ago, the potential of large language models was immediately apparent to everyone. Yet despite how easy these models are to use, one and a half years later we still don't encounter many specialized LLM-based products in the wild.
Why is that?
Building LLM apps is easy
The LLMs out there are generic enough that they can answer specific questions when given the right context and prompt, without any fine-tuning necessary. The nitty-gritty details of model training and hyperparameter tuning have been abstracted away nicely in developer offerings like LlamaIndex, LangChain, and many more.
Today, building an expert for your own data takes literally 5 lines of code:
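A minimal sketch using LlamaIndex's standard starter API looks roughly like this (the data directory and the question are placeholders, and imports differ slightly between library versions):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Read, chunk, embed and index every document in ./data.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask a question against the indexed corpus.
query_engine = index.as_query_engine()
print(query_engine.query("What are the risk factors for head and neck cancer?"))
```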
Building LLM apps is ~~easy~~ hard
Behind the scenes, a RAG (Retrieval-Augmented Generation) pipeline is constructed. The abstraction layer reads the documents from the directory, chunks them up, and indexes them using an embedding API. The index is passed to a query engine which, at inference time, embeds the query, does a nearest-neighbor lookup in the index, retrieves the documents, and puts them into a prompt that is sent off to an LLM API, which then generates the final answer.
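Conceptually, the query path can be sketched as follows; `embed`, `index.nearest_neighbors`, and `llm.complete` are hypothetical stand-ins for whatever embedding API, vector store, and LLM API your framework wires together:

```python
def answer(query: str, index, llm, embed, top_k: int = 4) -> str:
    """Sketch of a RAG query path: embed, retrieve, prompt, generate."""
    # Embed the query with the same model that was used to index the chunks.
    query_vector = embed(query)
    # Nearest-neighbor lookup in the vector index retrieves the most similar chunks.
    chunks = index.nearest_neighbors(query_vector, k=top_k)
    # Stuff the retrieved chunks into a prompt...
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    # ...and let the LLM generate the final answer.
    return llm.complete(prompt)
```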
Easy enough.
There's a lot of complexity in this underlying layer, and lots of little things can go wrong and pile up until the final answer is presented to the user. Unfortunately, the classic engineering toolchain of debuggers and tests doesn't help much in this process. New tools are needed, and I will describe what they look like in this post.
Case Study: A RAG that knows about Cancer
To demonstrate one potential application of LLMs in the oncology sector, I decided to quickly build a RAG that was fed with basic information about cancer types, symptoms, and risk factors. This isn't meant to be used for medical purposes; it's a quick shot from the hip to see how well we can do by building a RAG from off-the-shelf components, but also to show how hard it is to build a good one.
I scraped Wikipedia's List of cancer types and Category:Cancer pages with recursion level 1 and dumped all docs into a RAG using the code shown above. It's amazing how easily and quickly a system like that can be built.
Most answers are pretty good, but not all:
Example: Instabilities from Questions
Here’s an example of how a small change in the query can give rise to very different responses:
Adding just the question mark at the end changes the response. While exchanging '12%' for 'approximately 12%' may not sound like a big difference, it changes the perceived confidence in the underlying data. This is likely an unwanted side effect of such a small perturbation of the input.
Confusing 'those diagnosed' with 'those who survive one year after diagnosis' is a much worse error, as we shall see below.
Example: Anecdotal Retrieval
Semantically equivalent queries may lead to slightly different embeddings, which in turn can lead to a different set of retrieved documents. We call this 'Anecdotal Retrieval': documents are retrieved like a few random anecdotes rather than comprehensively. As a consequence, the LLM does not have access to all relevant documents, but will emit a response with high confidence nonetheless:
In this example, two different sets of documents (though with a large overlap) are retrieved for the two semantically equivalent queries. This, in turn, causes the LLM to reply with a different response: 'Ovarian cancer' is omitted from the second response.
Example: Takin’ it easy with the nuances
Nuances can be quite important, and a small reformulation can lead to a semantically very different meaning. Let’s revisit the example from above:
The baseline for the 12% is 'diagnosed', not 'have survived one year after diagnosis'; the survival rate for those who survive one year after diagnosis is closer to 50%. This sort of error is really bad and needs to be detected and handled.
So how do we fix it?
A RAG, or any other LLM app, has many hyperparameters that can be tweaked (the knobs in the RAG box above): chunk size, embedding model, ranker, prompt, and so on. There is usually no explicit way to find the optimal parameters, so they need to be found iteratively, similar to how the weights of the underlying models are determined via gradient descent.
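To make this concrete, here are a few of those knobs expressed in LlamaIndex terms; the values are arbitrary starting points rather than recommendations, and parameter names may vary between library versions:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

# Knob: chunking - how documents are split before embedding.
Settings.chunk_size = 512
Settings.chunk_overlap = 64
# Further knobs - embedding model, reranker, prompt template - are configured similarly.

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Knob: retrieval - how many nearest-neighbor chunks end up in the prompt.
query_engine = index.as_query_engine(similarity_top_k=4)
```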
The process is roughly:
- Build an evaluation set
- Run an evaluation of your RAG
- Compute scores for the output of your RAG
- Look at worst offenders and determine the headroom
- Implement a fix for your headroom and tweak your parameters
- Iterate: Go to step 2
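A minimal harness for steps 2-4, assuming a `query_engine` like the one above and a hypothetical `score_answer` metric function, could look like this:

```python
import csv

# eval_set.csv with columns "question" and "reference_answer" (file name is a placeholder).
with open("eval_set.csv") as f:
    eval_set = list(csv.DictReader(f))

results = []
for row in eval_set:
    candidate = str(query_engine.query(row["question"]))
    # score_answer stands in for whichever metric you use (groundedness,
    # similarity to the reference answer, ...) and returns a value in [0, 1].
    score = score_answer(candidate, row["reference_answer"])
    results.append({"question": row["question"], "answer": candidate, "score": score})

# Look at the worst offenders first to spot systematic issues.
results.sort(key=lambda r: r["score"])
for r in results[:10]:
    print(f'{r["score"]:.2f}  {r["question"]}')
```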
Build an eval set
Your eval set is a set of questions and their associated ground-truth answers (sometimes questions alone suffice as well). This is the most important step: be prepared to spend some time on it, because it defines the final quality of your LLM app.
Make sure the eval set covers the whole scope of your product: it should contain generic as well as detailed questions, and it should probe different parts of your corpus. Remember: you only have eyes on your system where your eval set is. Ideally, such a set contains upwards of 100 question-answer pairs.
It can be quite tedious to create these eval sets, so there are some tricks to bootstrap your set:
- You can let GPT-4, or any other large model out there, create a question-answer set automatically from your corpus (see the sketch after this list).
- If your system is live already, you can sample your set from live traffic - this is most useful because it 100% reflects the usage of your product. It has the downside that it doesn’t contain ground truth answers though.
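As a sketch of the first trick, you can prompt a strong model with each chunk of your corpus and ask it for question-answer pairs; the model name and the prompt below are illustrative, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

def generate_qa_pairs(chunk_text: str, n: int = 3) -> str:
    """Ask a strong LLM to propose eval questions (and answers) grounded in one chunk."""
    prompt = (
        f"Here is an excerpt from a document:\n\n{chunk_text}\n\n"
        f"Write {n} question-answer pairs that can be answered using only this excerpt. "
        "Format: one 'Q: ... A: ...' pair per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any sufficiently strong model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Generated pairs should still be skimmed by a human, since the model occasionally produces questions that the excerpt does not actually answer.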
Run an evaluation and compute metrics for your RAG
In the next step, the RAG is run against the eval set, and every candidate answer is scored: it can be compared to the reference answer, or checked against other metrics.
For example, Groundedness checks whether the answer is based on the corpus and the corpus alone.
Here it is at work:
In this example, the context does not contain any information about the risk factors of Head and Neck cancer directly, but it does talk about the risk factors for the more specific Laryngeal cancer. Since it also contains the fact that Laryngeal cancer is a subtype of Head and Neck cancer, the risk factors listed for the former are also risk factors for the latter. Hence, the metric attributes a high groundedness score to these risk factors, because they are entailed in the context. Conversely, heavy smoking may intuitively be a correct risk factor, but since it is not mentioned in the corpus, it gets a low score.
Find the headroom and tweak your system
Once we have a report of all question-answer pairs and their corresponding scores for the metrics we care about, we can sort them by score and look at the top-performing and worst-offending queries. The main idea here is to find clusters of systematic issues rather than edge cases. Such a systematic issue could, for example, be:
- Most queries in Spanish are broken because irrelevant documents are retrieved
- Many responses include additional practical information that is not explicitly mentioned in the corpus
- Some responses are answering with irrelevant information
From each of these issues you can derive a hypothesis about what is causing it and tweak your system to mitigate the problem. E.g.
- Broken Spanish queries due to retrieval => your embedding model is not multilingual
- Additional non-grounded information in the answer => the prompt is not strict enough about only using the provided context for the response.
- Irrelevant information in the output => too many irrelevant documents are retrieved; improve the embedding, or retrieve fewer documents.
Today, these metrics are implemented using LLMs themselves or specialized smaller models.
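For illustration only (this is not fore ai's implementation), an LLM-as-judge groundedness check can be sketched roughly like this:

```python
from openai import OpenAI

client = OpenAI()

def groundedness(answer: str, context: str) -> float:
    """Rough LLM-as-judge check: how well is the answer supported by the context alone?"""
    prompt = (
        "Rate from 0 to 10 how well the ANSWER is supported by the CONTEXT alone. "
        "Statements that are not entailed by the context must lower the score. "
        "Reply with a single integer.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # the judge model is an arbitrary choice here
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip()) / 10
```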
Iterate
Once you have your fixed candidate, run a new eval and watch your metrics go up and to the right!
It is advisable to continuously run metrics on your live traffic. Many things may happen over time: your embedding provider changes the underlying model so your index no longer matches, your query mix shifts, and so on.
If you run metrics on a daily basis, you will notice issues very quickly!
Conclusion
Building RAGs is fun, but even with AI there is no free lunch: getting your system to high quality requires work. We have outlined some of the common problems that may occur, and how to detect and fix them. We are building a tool suite to help you with all of the above steps; please sign up for beta testing on foreai.co.
Sign up for the closed beta if you’d like to try out fore ai's product foresight 🚀