
LLMs’ relationship to their own truth

Explore the complex relationship between LLMs and their internal truths in fact extraction tasks: how models trade off faithfulness to the provided context against correctness, and what this implies for AI applications.


Stanford scientists recently published a very interesting finding: “How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs’ internal prior” (arxiv). The paper describes how the facts stated in the context of the prompt compete against the internal knowledge of an LLM. LLMs carry a lot of internal knowledge, but for domain-specific tasks we often want to leverage their reasoning and language capabilities without tapping into that existing knowledge. All statements in the output should be grounded entirely in the provided context, rather than in the parametric knowledge stored in the model’s weights and introduced during training.

In this blog post we investigate how this plays out for the task of fact extraction. Fact extraction is a good task for this investigation because it allows us to easily inspect and compare the outputs - something that is much harder for a task like summarization.

The Task: Fact Extraction

We are interested in extracting facts in a standardized way from a sentence. The task is as follows:

Given a statement, extract atomic facts and list them in a bullet list.

This is the prompt:

This is a statement "%s": Extract a list of non-overlapping facts from which the full statement can be derived. The output should only contain the enumeration of facts and nothing else.

Statement:

Water, freezing at 0 degrees and boiling at 100 degrees Celsius, is the source of life on earth.


Here’s an example output using Llama3-8B:

  1. Water freezes at 0 degrees Celsius.
  2. Water boils at 100 degrees Celsius.
  3. Water is the source of life on earth.
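
For readers who want to reproduce this setup, here is a minimal sketch of the extraction call, assuming an OpenAI-compatible Python client; the prompt template is the one shown above, while the client, model name, and parameters are illustrative rather than the exact setup we used.

  # Minimal sketch of the extraction call. The prompt template is the one above;
  # the OpenAI client, model name, and parameters are illustrative assumptions.
  from openai import OpenAI

  PROMPT_TEMPLATE = (
      'This is a statement "%s": Extract a list of non-overlapping facts from which '
      "the full statement can be derived. The output should only contain the "
      "enumeration of facts and nothing else."
  )

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def extract_facts(statement: str, model: str = "gpt-4") -> str:
      """Send the fact-extraction prompt for one statement and return the raw answer."""
      response = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": PROMPT_TEMPLATE % statement}],
          temperature=0,  # keep outputs stable so answers are easy to compare
      )
      return response.choices[0].message.content

  print(extract_facts(
      "Water, freezing at 0 degrees and boiling at 100 degrees Celsius, "
      "is the source of life on earth."
  ))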

Now, what happens when we start distorting the facts? Will the model respond faithfully, grounded in the distorted facts, or fall back to the internal truth it was trained on? The answer is (as always): it depends.

Here’s an example where it obeys the instructions even though they are wrong:

Statement:

Water, freezing at 0 degrees and boiling at 100 degrees Fahrenheit, is the source of life on earth.

Answer:

  1. Water freezes at 0 degrees Fahrenheit.
  2. Water boils at 100 degrees Fahrenheit.
  3. Water is the source of life on earth.

Here’s an example where it falls back on its parametric memory instead of faithfully repeating what’s in the context:

Statement:

Water, freezing at 0 degrees and boiling at 1000 degrees Celsius, is the source of life on earth.

Answer:

  1. Water freezes at 0 degrees Celsius.
  2. Water boils at 100 degrees Celsius.
  3. Water is the source of life on earth.

So is there any rule for when a model reverts to its internal parametric knowledge versus the facts provided in the context? The paper states that if the model’s internal prior is low (the model is not certain about something), the probability of it repeating an incorrect fact from the context is higher. However, there is also the other angle of how well a model is able to faithfully follow instructions, even if it “knows” the output would be factually incorrect.

To get a feeling for this, we tried a number of syntactic distortions (spelling, choice of words) as well as semantic distortions (change of value, change of unit) on a single statement to see in which cases the LLM would fall back on its parametric knowledge and correct the statements.

In particular, we were interested in understanding how often it would correct the incorrect boiling temperature of water back to 100 degrees Celsius.

We iterated over the following variations:

  • Temperatures: 10, 100, 1000, 10_000, 100_000
  • Boiling: “boiling at”, “evaporating at”
  • Unit: Celsius, Celcius, Fahrenheit (note the misspelling in Celcius)
  • Life: life, live (note the misspelling in live)

Iterating over the cross-product of all these variations (a sketch of the iteration follows the list), we checked four models:

  • GPT4
  • GPT3.5
  • Llama3-8B-Instruct
  • Qwen1.5-7B-Chat
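
To make this concrete, here is a rough sketch of how the variations can be generated and fed through the extraction call from the earlier sketch; the substitution values come from the lists above, while the model identifiers and the extract_facts helper are assumptions for illustration.

  # Sketch of generating all statement variations (values from the lists above).
  # The model identifiers and the extract_facts() helper are illustrative assumptions.
  from itertools import product

  STATEMENT_TEMPLATE = (
      "Water, freezing at 0 degrees and {boiling} {temperature} degrees {unit}, "
      "is the source of {life} on earth."
  )

  temperatures = [10, 100, 1000, 10_000, 100_000]
  boiling_words = ["boiling at", "evaporating at"]
  units = ["Celsius", "Celcius", "Fahrenheit"]  # 'Celcius' is the deliberate misspelling
  life_words = ["life", "live"]                 # 'live' is the deliberate misspelling

  models = ["gpt-4", "gpt-3.5-turbo", "llama3-8b-instruct", "qwen1.5-7b-chat"]

  answers = {}
  for temperature, boiling, unit, life in product(temperatures, boiling_words, units, life_words):
      statement = STATEMENT_TEMPLATE.format(
          boiling=boiling, temperature=temperature, unit=unit, life=life
      )
      for model in models:
          answers[(statement, model)] = extract_facts(statement, model=model)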

Results

The cross-product of the above substitutions yields 60 variations of the statement. Of those 60 variations, 8 are factually correct (modulo spelling mistakes). All models we tried extracted the correct facts from those 8 variations (although Qwen mixed some internal knowledge into its extractions; see below). We then checked how many of the remaining 52 statements were corrected by the model and how many were faithfully repeated.

All results are in this spreadsheet.
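
As a rough illustration of how such outputs can be scored (not necessarily how the spreadsheet was produced), a simple string check on the extracted boiling fact distinguishes faithful repetitions from corrections:

  # Hypothetical scoring sketch, not the exact evaluation behind the spreadsheet:
  # an answer counts as "faithful" if it repeats the boiling temperature and unit
  # from the input, and as "corrected" if it reverts to 100 degrees Celsius.
  # Real answers vary in wording (e.g. "has a boiling point of"), so a robust
  # check would need more normalization than this.
  def is_faithful(answer: str, temperature: int, unit: str) -> bool:
      return f"boils at {temperature} degrees {unit}".lower() in answer.lower()

  def is_corrected(answer: str) -> bool:
      return "boils at 100 degrees celsius" in answer.lower()

  answer = (
      "1. Water freezes at 0 degrees Celsius.\n"
      "2. Water boils at 100 degrees Celsius.\n"
      "3. Water is the source of life on earth."
  )
  print(is_faithful(answer, 1000, "Celsius"))  # False: the distorted value was not kept
  print(is_corrected(answer))                  # True: the model reverted to its prior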

GPT4 follows instructions faithfully, even if they propose wrong facts.

51 out of 52 faithful (1 corrected)

GPT4 emitted facts faithfully from the distorted statements in all but a single case, namely:

Statement:

Water, freezing at 0 degrees and boiling at 1000 degrees Celsius, is the source of live on earth.

Answer:

  1. Water is the source of life on earth.
  2. Water freezes at 0 degrees Celsius.
  3. Water boils at 100 degrees Celsius.

Note that there is a misspelling of live vs life in this sentence, which seems to have triggered the correction of the temperature. In all other cases, even in the correctly spelled version “Water, freezing at 0 degrees and boiling at 1000 degrees Celsius, is the source of life on earth”, GPT4 would faithfully emit the incorrect facts from the statement.

The interesting part here is that a misspelling somewhere in the sentence can trigger a fact correction in a very different part of the sentence.

GPT3.5 doesn’t like to lie.

19 out of 52 faithful (33 corrected)

GPT3.5 is the opposite: it corrected most of the facts. Interestingly, it is inclined to respond more faithfully to the context when the statement uses Fahrenheit rather than Celsius. According to the paper, this could be related to the fact that the model is generally less certain (has lower priors) about the boiling temperature on the Fahrenheit scale and thus allows inconsistent facts from the context to make their way into the output. For example:

Statement:

Water, freezing at 0 degrees and boiling at 10000 degrees Celsius, is the source of life on earth.

Answer:

  1. Water freezes at 0 degrees Celsius.
  2. Water boils at 100 degrees Celsius.
  3. Water is the source of life on earth.

versus

Statement:

Water, freezing at 0 degrees and boiling at 10000 degrees Fahrenheit, is the source of life on earth.

Answer:

  1. Water freezes at 0 degrees Fahrenheit.
  2. Water boils at 10000 degrees Fahrenheit.
  3. Water is the source of life on earth.

Llama3 is more familiar with simple terms.

39 out of 52 faithful (13 corrected)

Llama3 fairly faithfully repeats the incorrect facts in its output when the boiling temperature is below 100 or when the distortion is a unit change between Fahrenheit and Celsius. Once the boiling temperature reaches 1000 or more, it starts correcting it, but only when the statement says ‘boiling at’ rather than ‘evaporating at’. As with GPT3.5 above, the strong prior that water boils at 100 degrees Celsius influences the model’s output on facts that are close by.

Statement:

Water, freezing at 0 degrees and boiling at 1000 degrees Celsius, is the source of life on earth.

Answer:

  1. Water freezes at 0 degrees Celsius.
  2. Water boils at 100 degrees Celsius.
  3. Water is the source of life on earth.

vs

Statement:

Water, freezing at 0 degrees and evaporating at 1000 degrees Celsius, is the source of life on earth.

Answer:

  1. Water freezes at 0 degrees Celsius.
  2. Water boils at 1000 degrees Celsius.
  3. Water is the source of life on earth.

Interestingly, if the temperature becomes too large (100,000), the model doesn’t attempt to correct it anymore. 

There is also one example where the misspelling ‘Celcius’ (vs ‘Celsius’) triggered a correction, similar to what we saw with GPT4.

Qwen1.5 is very sensitive to input variations and likes to brag about what it knows.

45 out of 52 “faithful” (7 corrected)

Qwen, while responding faithfully to many of the incorrect facts, generally adds quite a bit of information not present in the statement, an unwanted property for this task. It also seems to get quite confused about internal versus external knowledge when challenged with inconsistent facts (e.g. claiming that 10 degrees Fahrenheit is 100 degrees Celsius). Furthermore, small semantic or syntactic variations in the input can have a very large effect on the output.

Example:

Statement:

Water, freezing at 0 degrees and boiling at 10 degrees Fahrenheit, is the source of live on earth.

Answer:

  1. Water has a freezing point of 0 degrees Celsius.
  2. Water has a boiling point of 100 degrees Celsius (when converted to Fahrenheit, this is 10 degrees Fahrenheit).
  3. Water is essential for life on Earth.
  4. The temperature conditions mentioned (freezing and boiling) are specific properties of water.

Switching “boiling” to “evaporating”, Qwen attempts to correct the statement, but does so erroneously, claiming that the boiling point is 100 degrees Fahrenheit rather than the 10 degrees Fahrenheit given in the statement:

Statement:

Water, freezing at 0 degrees and evaporating at 10 degrees Fahrenheit, is the source of live on earth.

Answer:

  1. Water freezes at 0 degrees Celsius.
  2. Water evaporates at 100 degrees Fahrenheit (not 10 degrees Fahrenheit, as mentioned in the statement).
  3. Water is a fundamental component for life on Earth.
  4. The temperature conditions for water's transformation between these two states (freezing and evaporation) are crucial for the planet's habitability.

In the second example, Qwen adds internal knowledge about the relationship between the two states being essential for the habitability of our planet. While this may be true, it cannot be derived directly from the statement above.

Conclusion

LLMs have a complicated relationship to their internal truths

We have seen that different LLMs have different relationships to their internal truths: some prefer to obey the externally provided instructions (almost) perfectly, without letting them compete against their internal knowledge, while others prefer not to say things that are untrue, even when instructed to do so.

The models can fail in faithfulness in seemingly random and hard-to-predict situations, often depending on a single small semantic or syntactic change in the input. For the model, it is an internal trade-off between the priors it learned during training, the alignment fine-tuning through which it learned to obey instructions, and the goal of predicting the most probable continuation of the text, given the prompt. Many different factors play into what the answer will look like. Engineering the prompt will certainly help, but unfortunately there doesn’t seem to be a generic way of guaranteeing faithfulness, or its rival, correctness of the output.

To be fair, the task is somewhat ill-defined. While we want the model to be faithful to the incorrect statements about the boiling temperature of water, we don’t mind when it (unfaithfully) corrects spelling mistakes, such as ‘Celcius’ -> ‘Celsius’ or ‘live’ -> ‘life’. How should an LLM know that one is OK to correct, but not the other? We, of course, expect it to know this from the cultural intelligence it has acquired during training. But all these nuances and subtle expectations are biases that are useful in one setting and problematic in another. In a more general, less extreme setting, the statements might deviate only marginally from the internal knowledge of the model, and their “correction” to internally consistent statements will be both hard to detect and a subtle, unwanted source of bias - something we should continuously remind ourselves of.

Of course, in this post we have only tested and demonstrated our point with a single statement. This is an anecdotal study, so we should be careful about drawing far-reaching conclusions from it, but we think it is still illustrative of the general problem at hand.

What does this mean for us?

For us, as builders of LLM evaluation metrics, this implies several things:

  • For evaluation metrics: Fact extraction lies at the core of LLM evaluation. Off-the-shelf models need to be further fine-tuned for the task of fact extraction so that they prioritize faithfulness over correctness with respect to their internally trained truth.
  • For LLM apps: Conflicting truths between externally provided context and internal knowledge should be regarded as a significant source of hallucinations. The more knowledge an LLM has internally about a particular topic, the more likely it is to alter the facts to align with that knowledge. This has large implications in domains like health or law, where a lot of non-specific and potentially untrue knowledge is available on the open internet on which these models have been trained. Figuring out how to prompt the model to disregard its internal knowledge will be an important investment for app builders; one possible starting point is sketched below.
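
As one such starting point (untested here and purely illustrative, not a validated fix), the extraction prompt could be extended with an explicit instruction to treat the context as the only source of truth:

  # Illustrative prompt variant (an assumption, not validated in this post) that
  # explicitly asks the model to disregard its parametric knowledge.
  GROUNDED_PROMPT_TEMPLATE = (
      'This is a statement "%s": Extract a list of non-overlapping facts from which '
      "the full statement can be derived. Treat the statement as the only source of "
      "truth: do not correct, complete, or verify it against your own knowledge, even "
      "if it contradicts what you believe to be true. The output should only contain "
      "the enumeration of facts and nothing else."
  )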
