The Norwegian Police University College, exit report: PrevBOT

Technology: The Tsetlin Machine

The Norwegian Police University College envisages building PrevBOT on the Tsetlin Machine (TM). A key strength of the TM is that it offers better explainability than neural networks. In a project like PrevBOT, where people are to be categorised as potential abusers based on (in most cases) open, lawful, online communication, it is important to be able to understand why the tool reaches its conclusions.

The Tsetlin Machine is a machine learning algorithm designed by the Norwegian researcher Ole-Christoffer Granmo in 2018. Granmo is a professor of computer science at the University of Agder (UiA) and has further developed the Tsetlin Machine with colleagues. As a relatively new machine learning method, it is the subject of ongoing research aimed at exploring and optimising its applications and performance. Like all machine learning models, the Tsetlin Machine depends on the quality and representativeness of its training data.

Tsetlin Machines

The Tsetlin Machine is not a type of neural network. It is an algorithm based on reinforcement learning and propositional logic, and it is suited to classification and decision-making tasks where both interpretability and accuracy are important. Propositional logic is an algebraic method that classifies sentences or statements as true or false using logical operations such as ‘and’, ‘or’, ‘not’ and ‘if–then’.
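
To make this concrete, here is a minimal Python sketch of propositional classification; the formula and variable names are invented for illustration:

def statement(a: bool, b: bool, c: bool) -> bool:
    # An arbitrary example formula: (a AND b) OR (NOT c)
    return (a and b) or (not c)

print(statement(True, True, True))   # True
print(statement(False, True, True))  # False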

The Tsetlin Machine learns through reinforcement learning and learning automata. Reinforcement learning means that the model is rewarded or penalised based on the outcome of the actions it takes, while learning automata make decisions based on previous experiences, which serve as guidelines for current actions.
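
As a hedged sketch of this mechanism, the following Python class implements the classic two-action Tsetlin automaton with 2n memory states (the class and method names are our own):

class TsetlinAutomaton:
    # States 1..n select action 0 ('exclude'); states n+1..2n select
    # action 1 ('include'). Rewards push the state deeper into the
    # current half; penalties push it towards the other half.

    def __init__(self, n: int = 100):
        self.n = n
        self.state = n  # start at the boundary, on the action-0 side

    def action(self) -> int:
        return 0 if self.state <= self.n else 1

    def reward(self) -> None:
        # Reinforce the current action: move away from the boundary.
        if self.action() == 0:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.n, self.state + 1)

    def penalize(self) -> None:
        # Weaken the current action: move towards the other half.
        self.state += 1 if self.action() == 0 else -1

ta = TsetlinAutomaton(n=3)
print(ta.action())  # 0 (starts on the 'exclude' side)
ta.penalize()
print(ta.action())  # 1 (one penalty flips it across the boundary)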

The Tsetlin Machine builds its decisions from logical clauses, which makes it possible to see how each individual clause affects the decision-making process. This approach makes the Tsetlin Machine suitable for applications in which interpretability plays an important role.

Tsetlin Machines versus neural networks

Neural networks (deep learning models) require large datasets and substantial computational resources for training. The Tsetlin Machine needs fewer computational resources than complex neural networks. Research from 2020 indicates that the Tsetlin Machine is more energy efficient, using 5.8 times less energy than comparable neural networks.

Neural networks are suitable for tasks such as prediction and image and speech recognition, identifying complex patterns and relationships in data. The Tsetlin Machine is suitable for certain types of classification problems where interpretability is important. The Tsetlin Machine uses propositional logic in decision-making. It consists of a collection of Tsetlin automata that represent logical rules. Each Tsetlin automaton has a weighted decision that is adjusted based on the learning process. The weighting determines the extent to which a specific characteristic or pattern affects the decision. This provides a higher degree of understanding because the use of logical rules enables decisions to be traced back to the individual clauses.
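
A hedged sketch of how such weighted clause votes could be combined into a decision (the numbers and the simple sign-of-the-sum scheme below are illustrative assumptions):

def classify(clause_outputs, weights, polarities):
    # Each clause outputs 0 or 1. Positive-polarity clauses vote for the
    # class, negative-polarity clauses against; the sign of the weighted
    # sum gives the decision.
    class_sum = sum(o * w * p for o, w, p in zip(clause_outputs, weights, polarities))
    return 1 if class_sum >= 0 else 0

# Three illustrative clauses: two vote for the class (+1), one against (-1).
print(classify([1, 0, 1], weights=[2.0, 1.0, 1.5], polarities=[+1, +1, -1]))  # 1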

Neural networks are inspired by the human brain and consist of many layers of artificial neurons connected through numerous nodes and weights. They are often considered ‘black boxes’ because their complexity makes it difficult to understand how they reach their decisions.

Neural networks may also inadvertently amplify and maintain biases in the training data. If the training data contains biased or discriminatory information, the model can learn and reproduce such biases in the output it generates. This can lead to unintended consequences and reinforce prejudice.

The transparency of the Tsetlin Machine means it can be examined for bias, which can then be removed from the model by modifying the propositional logic directly, rather than indirectly through changes to the data or post-training adjustments. This suggests that the model is easier to correct.

The Tsetlin Machine learns to associate words with concepts and uses words in logical form to understand a concept. An important component of this process is the use of conjunctive clauses: logical expressions that combine two or more conditions, based on features being present or absent in the input data, and that evaluate to true or false.

For example: ‘I will only go to the beach if it’s sunny and if I get time off work’. Here, ‘if it’s sunny’ and ‘if I get time off work’ are conjunctive conditions that must be met at the same time for the person to decide to go to the beach. Such clauses identify patterns in the input data by creating conditions that must hold simultaneously, and they are then used to build the decision-making rules that form the basis for classification. The ability to handle complex conditions makes the Tsetlin Machine suitable for determining whether or not input data belong to a specific class.
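
The beach example can be written as a conjunctive clause in a few lines of Python (purely illustrative):

def go_to_beach(is_sunny: bool, has_time_off: bool) -> bool:
    # A conjunctive clause: every condition must hold simultaneously.
    return is_sunny and has_time_off

print(go_to_beach(True, True))   # True  -> go to the beach
print(go_to_beach(True, False))  # False -> stay home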

The workflow of the Tsetlin Machine in the PrevBOT project

PrevBOT aims to develop a transparent language model that can classify the presence of grooming in a conversation. The first step is to give the algorithm general training in the Norwegian language. This gives the algorithm a solid understanding of the language and reduces its dependence on the potentially limited datasets available when it is later trained on grooming language; if training were confined to this narrow topic alone, there would be a risk of having too few examples. Another important reason is that a general understanding of a language lays the foundation for developing specialised skills in a more comprehensive manner. To train the algorithm to master Norwegian in general, large Norwegian datasets are needed (available from the Norwegian Language Bank at the National Library). This can also be compared to pre-training in large language models.

Norwegian grooming

Experience from Norwegian criminal cases shows that the abuser and the child communicate in Norwegian. The technology must therefore be based on Norwegian text data, and ensuring a sufficient amount of such data is a prerequisite for developing the AI model. At the time of writing for the sandbox project, it is uncertain whether this prerequisite is met, but it may be fulfilled over time if the method proves appropriate.

Once the algorithm’s language skills have reached a sufficient level, the second step is to train it to become a specialist in grooming language classification. Having acquired basic Norwegian skills, the algorithm can learn word context and the relevance of each word within grooming language. This enables the algorithm to master the language at a general level before starting the specific task of grooming detection.

The text in chat logs from criminal cases plays an important role. The examples must be very specific and precise, and should be labelled by an experienced domain expert in the field of grooming. Based on the general Norwegian language training and knowledge of grooming language classification, the algorithm will be able to recognise grooming conversations in Norwegian. A more detailed description of steps one and two follows below.

[Figure: From chat log to algorithm (fact box)]

Step one: train the algorithm in Norwegian

First, the project must develop Tsetlin Machine-based autoencoders that autonomously perform word embedding on large Norwegian datasets, producing a representation for each word.

The Tsetlin Machine uses principles from propositional logic and logical clauses to make decisions. The figure below shows an example of the results (arrows) of propositional logic embedding using the Tsetlin Machine in a small English dataset. The Tsetlin Machine uses these clauses to build decision-making rules that form the basis for classification.

As illustrated, the results show that the words are correlated with other words through clauses. If we take the word ‘heart’ as an example, we see that it is related to ‘woman’ and ‘love’, and also associated with ‘went’ and ‘hospital’. This example shows that the word has different meanings depending on the context. It indicates that the Tsetlin Machine embedding has the capacity to learn and establish sensible correlations between words. These properties lay the foundation for better explainability and perhaps also manual adjustment.

[Figure: Word classification (fact box)]
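
As a simplified, hedged illustration of the input side of such embedding, a word can be represented by the words it co-occurs with. The tiny corpus below is invented, and the actual project envisages Tsetlin Machine autoencoders rather than this raw co-occurrence counting:

from collections import defaultdict

corpus = [
    "the woman loves him with all her heart",
    "he went to the hospital with a weak heart",
]

context = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for w in words:
        context[w].update(x for x in words if x != w)

# 'heart' co-occurs with both 'woman'/'loves' and 'went'/'hospital',
# mirroring the context-dependent meanings discussed above.
print(sorted(context["heart"]))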

Step two: classify grooming language

The training data must contain examples of text labelled either grooming or non-grooming. A selection of relevant rules, whether specific words, phrases or text structure, is essential to provide the algorithm with the necessary information. The algorithm identifies grooming conversations by analysing the language and recognising patterns or indicators associated with the risk of grooming. Positive examples (grooming) and negative examples (non-grooming) are used to adjust the weighting of the clauses.

The examples should in theory be an integral part of the algorithm’s rules and be used during training to help the algorithm understand what characterises grooming conversations. The training data, labelled grooming or non-grooming, are used to develop and adjust the rules the algorithm applies to identify grooming conversations. As the algorithm trains, it analyses the labelled examples to learn patterns and indicators related to grooming. By comparing the properties of positive (grooming) and negative (non-grooming) examples, it gradually adjusts the weighting of the rules or clauses used for classification. This may involve assigning more weight to words or phrase structures associated with grooming, and less weight to those that are not. The word embedding from step one can be used for the classification.
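
A hedged sketch of this training step, assuming the open-source pyTsetlinMachine package and random placeholder data in place of real labelled chat logs (the hyperparameters are illustrative, not tuned):

import numpy as np
from pyTsetlinMachine.tm import MultiClassTsetlinMachine

# Placeholder data: 200 chats encoded as 500 binary bag-of-words features,
# labelled 1 (grooming) or 0 (non-grooming).
X_train = np.random.randint(2, size=(200, 500)).astype(np.uint32)
y_train = np.random.randint(2, size=200).astype(np.uint32)

# Arguments: number of clauses, voting threshold T, specificity s.
tm = MultiClassTsetlinMachine(100, 15, 3.9)
tm.fit(X_train, y_train, epochs=50)

X_new = np.random.randint(2, size=(5, 500)).astype(np.uint32)
print(tm.predict(X_new))  # 1 = flagged as a potential grooming conversation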

The combination of supervised learning and reinforcement learning involves repeated adjustment of the conjunctive clauses. The adjustment is normally automatic and based on previous decisions. During training, the algorithm learns to adjust the weights to recognise patterns and make correct classifications. A fully trained model is expected not only to classify text as a potential grooming conversation, but also to be interpretable due to the transparent nature of the algorithm. The interpretation is based on the clauses of a trained Tsetlin Machine model. The clauses consist of logical rules that effectively describe whether the language indicates grooming or not. The rules that apply to a given input sentence can be obtained from the clauses that were activated, and these rules can then be used to explain the algorithm’s decision.
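
A self-contained, hedged sketch of how an activated clause could be turned into such an explanation; the clause and the words in it are invented, not taken from a trained PrevBOT model:

# A clause as a list of (word, negated) literals over word-presence features.
clause = [("age", False), ("meet", False), ("parents", True)]

def explain(clause, present_words):
    parts = []
    for word, negated in clause:
        holds = (word not in present_words) if negated else (word in present_words)
        if not holds:
            return None  # the clause was not activated for this input
        parts.append(f"NOT '{word}'" if negated else f"'{word}'")
    return " AND ".join(parts)

print(explain(clause, {"age", "meet", "hello"}))
# 'age' AND 'meet' AND NOT 'parents' -> the rule behind the flag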

Simplified overview

  1. Data collection

    Collect Norwegian text from open Norwegian sources (National Library) and chat logs from criminal cases (grooming conversations between potential victims and potential abusers) to form datasets. The datasets should contain varied examples with both positive examples (grooming conversations) and negative examples (non-grooming).
  2. Data preparation

    Structuring the data to fit the Tsetlin Machine, e.g. representing text data by means of vector representations (vectorisation of words). Bag-of-words (BOW) representations (binarisation of words) can also be used (see the sketch after this list).
  3. Goal

    Identify relevant properties in the text that distinguish grooming chats from non-grooming chats, such as the use of specific words, contextual nuances/clues, sentence structures or a tone typical of grooming behaviour.
  4. Training

    Structured data are used for training. During training, the Tsetlin automata adjust their internal parameters to recognise patterns that are characteristic of grooming conversations. This involves adapting logical rules that take into account word choice, context and other relevant factors, specific words, expressions or patterns associated with grooming.
  5. Decision-making

    After training, the algorithm should be able to analyse and make decisions about whether text data contains indications of grooming.
  6. Feedback and fine-tuning

    The results are assessed to reduce false positives and negatives. The model is periodically adjusted based on feedback to improve accuracy over time. This may include new data, fine-tuning rules or introducing new rules to deal with changing patterns.
  7. Implementation

    Real-time detection to report suspected grooming patterns. The Tsetlin Machine predicts the probability of an online chat containing elements of grooming.
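
A minimal sketch of the data preparation in step 2, using scikit-learn’s CountVectorizer with binary=True to produce the 0/1 word-presence features a Tsetlin Machine operates on; the example chat lines are invented:

from sklearn.feature_extraction.text import CountVectorizer

chats = [
    "hvor gammel er du",         # invented example: "how old are you"
    "skal vi ses etter skolen",  # invented example: "shall we meet after school"
]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(chats).toarray()  # rows: chats; columns: 0/1 word presence

print(vectorizer.get_feature_names_out())
print(X)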