There is huge excitement about ChatGPT and other large generative language models that produce fluent and human-like texts in English and other human languages. But these models have one big drawback, which is that their texts can be factually incorrect (hallucination) and also leave out key information (omission).
In our chapter for The Oxford Handbook of Lying, we look at hallucinations, omissions, and other aspects of “lying” in computer-generated texts. We conclude that these problems are probably inevitable.
Omissions are inevitable because a computer system cannot cram all possibly-relevant information into a text that is short enough to be actually read. In the context of summarising medical information for doctors, for example, the computer system has access to a huge amount of patient data, but it does not know (and arguably cannot know) what will be most relevant to doctors.
Hallucinations are inevitable because of flaws in computer systems, regardless of the type of system. Systems which are explicitly programmed will suffer from software bugs (like all software systems). Systems which are trained on data, such as ChatGPT and other systems in the Deep Learning tradition, “hallucinate” even more. This happens for a variety of reasons. Perhaps most obviously, these systems suffer from flawed data (e.g., any system which learns from the Internet will be exposed to a lot of false information about vaccines, conspiracy theories, etc.). And even if a data-oriented system could be trained solely on bona fide texts that contain no falsehoods, its reliance on probabilistic methods will mean that word combinations that are very common on the Internet may also be produced in situations where they result in false information.
Suppose, for example, that on the Internet the word “coughing” is often followed by “… and sneezing.” A data-oriented system may then falsely describe a patient as “coughing and sneezing” when they cough without sneezing. Problems of this kind are an important focus for researchers working on generative language models. Where this research will lead is still uncertain; the best one can say is that we can try to reduce the impact of these issues, but we have no idea how to eliminate them completely.
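The “coughing and sneezing” effect can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or any real system works: it is a toy bigram model that always picks the most frequent next word. The corpus and the function names `train_bigrams` and `continue_greedily` are invented for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration (not real data).
corpus = [
    "the patient is coughing and sneezing",
    "he kept coughing and sneezing",
    "she was coughing and sneezing",
    "the patient is coughing badly",
]


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_greedily(counts, start, max_words=2):
    """Extend `start` by repeatedly choosing the most frequent follower."""
    words = [start]
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # no known continuation for this word
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


counts = train_bigrams(corpus)

# The model knows nothing about the actual patient; it only knows
# which word combinations were common in its training text.
description = continue_greedily(counts, "coughing")
print(description)
```

Even though every sentence in the toy corpus could be true of the patient it originally described, the model attaches “and sneezing” to any mention of coughing, whether or not the patient in question sneezes.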
The above focuses on unintentional-but-unavoidable problems. There are also cases where a computer system arguably should hallucinate or omit information. An obvious example is generating marketing material, where omitting negative information about a product is expected. A more subtle example, which we have seen in our own work, is when information is potentially harmful and it is in users’ best interests to hide or distort it. For example, if a computer system is summarising information about sick babies for friends and family members, it probably should not tell an elderly grandmother with a heart condition that the baby may die, since this could trigger a heart attack.
Now that the factual accuracy of computer-generated text draws so much attention from society as a whole, the research community is starting to realise more clearly than before that we have only a limited understanding of what it means to speak the truth. In particular, we do not know how to measure the extent of (un)truthfulness in a given text.
To see what we mean, suppose two different language models answer a user’s question in two different ways, by generating two different answer texts. To compare these systems’ performance, we would need a “score card” that allowed us to objectively score the two texts for factual correctness, using a variety of rubrics. Such a score card would allow us to record how often each type of error occurs in a given text, and aggregate the result into an overall truthfulness score for that text. Of particular importance would be the weighting of errors: large errors (e.g., a temperature reading that is very far from the actual temperature) should weigh more heavily than small ones, key facts should weigh more heavily than side issues, and errors that are genuinely misleading should weigh more heavily than typos that readers can correct by themselves. Essentially, the score card would work like a fair school teacher who marks pupils’ papers.
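To make the idea concrete, here is a minimal sketch of what the bookkeeping behind such a score card might look like. Everything in it is an assumption made for illustration: the `Error` fields, the multipliers in `error_weight`, and the formula in `truthfulness_score` are arbitrary placeholders, not a validated method. Choosing those numbers in a principled way is exactly the open problem.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Error:
    kind: str         # hypothetical error type, e.g. "wrong_number" or "typo"
    magnitude: float  # 0.0 (negligible) to 1.0 (very far from the truth)
    key_fact: bool    # does the error concern a key fact or a side issue?
    misleading: bool  # would a reader be genuinely misled?


def error_weight(error):
    """Weight one error; the multipliers are arbitrary placeholders."""
    weight = error.magnitude
    if error.key_fact:
        weight *= 2.0
    if error.misleading:
        weight *= 2.0
    return weight


def error_counts(errors):
    """Record how often each type of error occurs in a text."""
    return Counter(error.kind for error in errors)


def truthfulness_score(errors):
    """Aggregate weighted errors into a score in (0, 1]; 1.0 means no errors."""
    penalty = sum(error_weight(error) for error in errors)
    return 1.0 / (1.0 + penalty)


# Two hypothetical answer texts, annotated by human evaluators.
errors_a = [Error("typo", 0.1, key_fact=False, misleading=False)]
errors_b = [Error("wrong_number", 0.9, key_fact=True, misleading=True)]

score_a = truthfulness_score(errors_a)
score_b = truthfulness_score(errors_b)
print(error_counts(errors_b), score_a > score_b)
```

The sketch ranks the text with a harmless typo above the text with a large, misleading error about a key fact, as a fair marker would. Whether the gap between the two scores reflects the real impact on readers is something the placeholder weights cannot tell us.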
We have developed protocols for human evaluators to find factual errors in generated texts, as have other researchers, but we cannot yet create a score card as described above because we cannot assess the impact of individual errors.
What is needed, we believe, is a new strand of linguistically informed research to tease out all the different parameters of “lying” in a manner that can inform the above-mentioned score cards, and that may one day be implemented in a reliable fact-checking protocol or algorithm. Until that time, those of us who are trying to assess the truthfulness of ChatGPT will be groping in the dark.