NTNU, exit report: Copilot through the lens of data protection

Appendix

Here you can read a little more about how large language models and Retrieval-Augmented Generation (RAG) work.

Large language models

Large Language Models (LLMs), such as the Generative Pre-trained Transformer (GPT), are machine learning models that have been trained on very large quantities of text data. These models process and generate text by using the contexts learned from the training data to predict the next word in a text or in an answer to a question. They are used in a variety of applications, such as chatbots, text generation and language analysis.

LLMs are based on neural networks. They do not store languages and words as text, but as numerical representations called vectors, which effectively describe very complex relationships between language elements. An LLM does not ‘understand’ language in a human sense, but models the relationship between language elements based on how language is used by people.
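As a purely illustrative sketch of what such numerical representations look like, the example below uses small, hand-made three-dimensional vectors and cosine similarity. The words and numbers are invented for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy, hand-made 'embeddings' for illustration only.
# Real LLMs learn vectors with hundreds or thousands of dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.9, 0.2]),
    "bread": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely two vectors point in the same direction (1.0 = identical)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words lie close together in the vector space, unrelated words further apart.
print(cosine_similarity(vectors["king"], vectors["queen"]))  # high (~0.99)
print(cosine_similarity(vectors["king"], vectors["bread"]))  # low (~0.30)
```

It is this kind of geometric closeness, learned from how words are actually used, that lets the model capture relationships between language elements.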

Because LLMs are so articulate, it is easy to perceive them as knowledge models, but they are not. They are models of language itself and of how it is used in practice. In other words, today’s LLMs do not have built-in knowledge of disciplines such as law, chemistry, physics, philosophy and mathematics. At the same time, language by its nature reflects information about the world around us. LLMs are trained on large amounts of text that contain (incidental) information about different topics, and different types of information are thus often reflected in the language on which the model is trained. The technology is developing rapidly, and LLMs combined with knowledge models are being tested and will increasingly become available for general use, with the potential for more accurate and reliable answers.

Challenges: hallucination and the underlying data

An important challenge associated with LLMs is the phenomenon known as ‘hallucination’: the model generates text that is linguistically correct but contains incorrect or fictitious information. LLMs function as advanced statistical models without a built-in understanding of the facts. They have a certain randomness built in (so that they can provide varied and/or alternative formulations of answers), but they lack mechanisms for assessing the truth of the content, which can lead to generated text that appears credible but is actually incorrect.
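The built-in randomness can be sketched as a sampling step: the model turns scores for candidate next words into probabilities and draws one of them. The candidate words, scores and temperature values below are invented for illustration; a real model samples over tens of thousands of tokens, but the principle is the same, in that probabilities rather than an assessment of truth decide what is written.

```python
import numpy as np

rng = np.random.default_rng()

# Invented raw scores (logits) for a handful of candidate next words.
candidates = ["Paris", "France", "Lyon", "bananas"]
logits = np.array([4.0, 2.5, 1.5, -2.0])

def sample_next_word(temperature: float) -> str:
    """Turn scores into probabilities (softmax) and draw one word at random."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return str(rng.choice(candidates, p=probs))

# Low temperature: almost always the most likely word.
print([sample_next_word(temperature=0.2) for _ in range(5)])
# Higher temperature: more varied wording, occasionally an unlikely (possibly wrong) word.
print([sample_next_word(temperature=1.5) for _ in range(5)])
```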

In addition to hallucination, errors may occur due to misconceptions in the training data. If a widespread error or misconception is present in the data, the model may repeat or reinforce this error. For example, if many sources in the training material incorrectly claim something, the model is likely to reflect this as if it were correct. This can be problematic when models are used in situations that require a high degree of precision or professional accuracy.

In practice, however, many answers will be good and relevant, given that the vast majority of the texts the models are trained on contain mostly correct information, and because the linguistic context in many cases also carries the relevant facts.

Linguistic and cultural challenges

Most major suppliers’ LLMs are primarily trained on English-language data, which means they often generate better answers in English. Efforts are under way to adapt language models to national languages, but will they also be able to adapt them to national cultures? It is important to be aware of this when using LLMs, because they also reflect cultural context. A predominance of English and American text sources in the training material means that the generated text is also influenced by and reflects British and American culture.

The largest and most dominant LLMs, such as OpenAI’s models, are created, trained and operated by large, private American companies. Microsoft uses OpenAI models to provide LLM services on its Azure platform, with adjustments and adaptations to its own products such as M365 Copilot.

Unlike ‘classic’ AI/machine learning systems that are mainly modelled and trained for specific purposes, LLMs can be used for many different, unspecified tasks. They are therefore also referred to as foundation models. This makes them very useful, but at the same time challenging in terms of ensuring accuracy, relevance and responsible use.

Adaptation of LLMs

M365 Copilot uses several techniques to customise the product, including Knowledge Graphs and Retrieval-Augmented Generation (RAG). The purpose of RAG is to control the quality of text-based answers by retrieving selected and up-to-date information from internal information sources before an answer is generated. This additional information is vectorised and indexed in the same numerical format that the foundation model works with.
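As a simplified sketch of the indexing step, the snippet below turns a few invented internal documents into vectors and stores them in a small in-memory index. The embed() function is a crude placeholder based on letter frequencies; a real deployment would use a learned embedding model and a dedicated vector index or search service.

```python
import numpy as np
from string import ascii_lowercase

def embed(text: str) -> np.ndarray:
    """Crude placeholder for a real embedding model: a normalised
    letter-frequency vector. Real systems use learned embeddings."""
    text = text.lower()
    vec = np.array([text.count(c) for c in ascii_lowercase], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Invented internal documents are vectorised and stored alongside their text,
# in the same numerical format that the retrieval step will later search.
documents = [
    "Travel expenses must be registered within 30 days.",
    "The data protection officer can be contacted by e-mail.",
    "Exam results are published in the student portal.",
]
index = [(embed(doc), doc) for doc in documents]
print(f"Indexed {len(index)} documents as {index[0][0].size}-dimensional vectors")
```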

RAG comprises three main components:

  1. Retrieval: The model searches for information in a database or external sources. This is similar to how traditional search engines work, but RAG uses what are known as semantic search methods to find relevant information based on linguistic context instead of keywords.
  2. Augmentation: The information collected is used to enrich the LLM’s answers. This makes the answers more precise and fact-based compared with foundation models that rely primarily on pre-trained data alone.
  3. Generation: After the relevant information has been collected, answers containing additional information from steps 1 and 2 are generated using the foundation model itself.

RAG’s semantic searches are based on the linguistic context of a question (the prompt). This allows the system to retrieve information that is relevant even if the exact words do not match, which makes searches more flexible and more relevant to the user.
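To show how the three steps fit together for a single question, here is a minimal, self-contained sketch. The documents, the embed() placeholder (letter frequencies, which do not capture meaning the way a learned embedding does) and the generate() stub are all invented; in a real system the final step would be a call to the underlying foundation model with the augmented prompt.

```python
import numpy as np
from string import ascii_lowercase

def embed(text: str) -> np.ndarray:
    """Crude placeholder for a real embedding model (letter frequencies)."""
    text = text.lower()
    vec = np.array([text.count(c) for c in ascii_lowercase], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Step 1 (Retrieval): find the indexed documents whose vectors lie closest to the question.
documents = [
    "Travel expenses must be registered within 30 days.",
    "The data protection officer can be contacted by e-mail.",
    "Exam results are published in the student portal.",
]
index = [(embed(doc), doc) for doc in documents]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: float(q @ pair[0]), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# Step 2 (Augmentation): put the retrieved text into the prompt as context.
def build_prompt(question: str, context: list[str]) -> str:
    sources = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the sources below.\n\nSources:\n{sources}\n\nQuestion: {question}"

# Step 3 (Generation): the augmented prompt is sent to the foundation model.
def generate(prompt: str) -> str:
    # Placeholder for the actual LLM call (e.g. a chat completion endpoint).
    return f"[Answer generated from a prompt of {len(prompt)} characters]"

question = "When do I have to register my travel expenses?"
answer = generate(build_prompt(question, retrieve(question)))
print(answer)
```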

RAG has other features:

  • Updated knowledge: Unlike an LLM that can only draw on its own static training data, RAG can retrieve in-house information and information from new sources that the organisation has vetted itself, providing more accurate answers.
  • Flexibility: The system can perceive complex intentions and contexts in questions, even if not all the keywords are present.
  • Accuracy: By specifying and controlling what information is used as the basis for answers, RAG reduces the risk of errors or ‘hallucinations’ compared to answers produced by the foundation model alone.

Applications for RAG

The purpose of RAG is to give greater control over what information is included in answers from LLMs. It also enables some adaptation to specific subject domains, increasing the likelihood that the information presented is relevant and correct. However, this requires very good control of what information is included in the RAG model. General LLM solutions such as M365 Copilot are better able to tie answers to the organisation’s own information, but still depend on the quality of that information. Unclassified, old, outdated or incorrect information in internal sources will negatively affect quality.

Although RAG can improve LLMs’ ability to provide answers based on the organisation’s own information, there are also challenges associated with implementation, including monitoring the quality of answers. In addition, there are technical challenges related to scaling and performance when these models are used on a large scale, including that many additional operations can lead to longer response times.

LLMs such as GPT represent an important technological innovation in text generation, but have limitations when it comes to updated and fact-based information. However, the RAG system, if implemented correctly, can increase the quality of answers by using in-house information.