Fabian Roth
Apr 3, 2024
Building RAGs is fun, but even with AI there is no free lunch: getting your system to high quality takes work. We have outlined some of the common problems that may occur, and how to detect and fix them.
The skill set required to progress on these tasks is different from that of the classic software engineer: the path through this space is akin to finding an optimal solution in physics or biology. There are weak signals here and there, some reasoning frameworks to go by, and a lot of measuring and experimentation. It feels much more like research than engineering. At Google, a special breed of engineers does this kind of work: “Quality Engineers”. Their job is to build tools, develop metrics, collect and annotate evaluation and training sets, and finally write a little bit of code, run an experiment, check whether it improves the metrics on their datasets, and iterate. As in research, measurement and experimentation are the central pieces of the whole process. Builders of AI systems need to exchange their imperative coding mindsets and tools for metrics and evals, and adapt to this new process.
Having seen the potential of AI in the “It works!” demos, and realizing that there is a lot of work between a demo and a production system, we foresee that in 2024 companies will need to start climbing that hill of AI accuracy. They should decide on their first use case, focus their resources, and dedicate an engineering team to tackle the problem with determination. It will take more than a few hack days. Hill-climbing accuracy can be arduous: it is an iterative process that requires deep analysis, perseverance, and love for the final product.
These are the essential elements needed to improve your AI system in a systematic manner:
Evaluation sets are hand-curated collections of query/response pairs that demonstrate the ideal behavior of the system. Today, RAG and LLM application developers usually have a favorite handful of queries that they check anecdotally. This is a dangerous whack-a-mole approach: while fixing one problem on one query, you may well introduce a new problem on another query without noticing. To improve a system systematically and deterministically, you will need a much larger evaluation set (or evalset). Collecting these evalsets is tedious work and requires close collaboration with the product owners; actual usage logs are another good source of inspiration. In a sense, if language is the medium, these evaluation sets define the product.
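To make the idea concrete, here is a minimal sketch of what an evalset could look like on disk: a versioned JSONL file of query/ideal-response pairs. The format, the field names, and the `EvalExample` / `load_evalset` helpers are hypothetical illustrations, not the tooling described later in this post.

```python
# evalset.py - a minimal, hypothetical evalset format (illustrative sketch only)
import json
from dataclasses import dataclass

@dataclass
class EvalExample:
    query: str           # the user query as it would reach the system
    ideal_response: str  # hand-curated "gold" answer, agreed with product owners
    source: str          # provenance, e.g. "usage_logs" or "curated"

def load_evalset(path: str) -> list[EvalExample]:
    """Load an evalset stored as one JSON object per line (JSONL)."""
    with open(path, encoding="utf-8") as f:
        return [EvalExample(**json.loads(line)) for line in f if line.strip()]

# Example line in the JSONL file:
# {"query": "What is our refund policy?",
#  "ideal_response": "Refunds are accepted within 30 days of purchase...",
#  "source": "usage_logs"}
```

Keeping the evalset in a plain, versioned file like this makes it easy to review changes with product owners and to grow it from real usage logs over time.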
Given an example from the evalset, metrics measure how far the system's response is from the ideal one. Today, after running their favorite queries, developers manually inspect the new outputs and judge intuitively which is better. This results in noisy signals and biased decisions. Furthermore, manually checking results takes a long time and leads to slow development cycles. Given a metric and an evalset, developers get almost immediate feedback and can even run grid searches on training sets to find the optimal configuration for their system.
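As a hedged sketch of what this looks like in practice: a metric is just a function that scores a system response against the ideal one, and an evaluation run averages it over the evalset. The lexical-overlap metric below and the `evaluate` / `run_system` names are hypothetical; a real setup would more likely use embedding similarity or an LLM judge.

```python
# evaluate.py - scoring a system against an evalset (illustrative sketch only)
from statistics import mean
from typing import Callable

def token_f1(response: str, ideal: str) -> float:
    """A simple lexical metric: F1 over the sets of whitespace tokens."""
    pred, gold = response.lower().split(), ideal.lower().split()
    if not pred or not gold:
        return 0.0
    overlap = len(set(pred) & set(gold))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate(run_system: Callable[[str], str], evalset, metric=token_f1) -> float:
    """Run every evalset query through the system and average the metric.

    `evalset` is any iterable of objects with .query and .ideal_response
    attributes (e.g. the EvalExample records from the earlier sketch).
    """
    return mean(metric(run_system(ex.query), ex.ideal_response) for ex in evalset)

# Usage: compare two configurations and keep the one with the higher score.
# baseline  = evaluate(rag_v1.answer, load_evalset("evalset.jsonl"))
# candidate = evaluate(rag_v2.answer, load_evalset("evalset.jsonl"))
```

The key point is not the particular metric but the workflow: a single number per experiment turns "which version feels better?" into a decision you can make in minutes, or automate in a grid search.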
Once the quality of an experiment has been assessed using the metrics and evalsets, we need tools to inspect the wins and losses, observe the internal workings of the system, and answer questions like: Why does this particular query no longer perform well? Why was this document retrieved? Why did it disregard this fact? Why is it answering in German?
Maybe it’s a bug in the code, maybe it’s an edge case we need to include, or maybe it’s a silly threshold on a signal that, if adjusted, will trigger another problem elsewhere. Getting to the bottom of this informs the next set of actions to improve quality.
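One lightweight way to make those questions answerable is to record a trace of each pipeline stage alongside the final answer, so the losses from an evaluation run can be inspected afterwards. The `QueryTrace` structure and `worst_losses` helper below are a hypothetical sketch under that assumption, not any actual product format.

```python
# trace.py - recording intermediate RAG steps for loss analysis (illustrative sketch only)
from dataclasses import dataclass, field

@dataclass
class QueryTrace:
    query: str
    retrieved_docs: list[dict] = field(default_factory=list)  # doc id, retrieval score, snippet
    prompt: str = ""            # the final prompt sent to the LLM
    response: str = ""          # the generated answer
    metric_score: float = 0.0   # score against the ideal response

def worst_losses(traces: list[QueryTrace], n: int = 10) -> list[QueryTrace]:
    """Return the n lowest-scoring traces - the first place to look when a metric drops."""
    return sorted(traces, key=lambda t: t.metric_score)[:n]

# Inspecting a loss: did retrieval miss the right document, or did the model
# ignore a document that was there? The trace answers that before you touch the code.
```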
If engineers engage in this hill-climb, great things will happen: as the evaluation metrics wind their way up, the usage metrics will follow soon after!
Here at fore ai, such a toolchain is in the making. We call it "foresight", and it will offer a set of initial metrics, the functionality to build and curate your evalsets, and the ability to track your system's performance over time.