Legal: Data flow in the PrevBOT project
In this chapter, we will approach the legality of this specific project, starting with an exercise from which ‘everyone’ can learn: gaining an overview of the data flow in the project.
In order to conduct a legal analysis, it is essential to gain an overview of the data flow in the project. In connection with the development of the algorithm in the PrevBOT project, two main groups of data are processed:
- One group is retrieved from publicly available datasets from the National Library of Norway. The purpose of processing this data is to train the AI model in the Norwegian language.
The Data Protection Authority understands that this language training will take place within the work package at the Centre for Artificial Intelligence Research (CAIR)/University of Agder (UiA) as part of the PrevBOT project. The PHS itself assumes that the use of this data will not raise any data protection issues. Owing to the scope of the project, we have excluded this group of data from the report.
- The second group of data consists of information from confidential chat logs obtained from Norwegian criminal cases (criminal case data).
The chat logs used as evidence in criminal cases consist of transcripts of chat conversations between a perpetrator and victim in which grooming has taken place. A small number of relevant cases were identified in the pre-project ‘Nettprat’ (online chats). For more information about the specific collection of data from chat logs, see page 20 of the report from the Nettprat project.
The chat logs may contain a variety of personal data, depending on what the participants in the conversation share. It is also conceivable that the logs include metadata that itself constitutes personal data.
Based on the information in the chat logs, the algorithm may be able to capture personal data, even if that data is not explicitly part of the training data. For example, it is conceivable that the algorithm could capture a person’s textual fingerprint, which can often be considered personal data. In such cases, it may be possible – at least theoretically – to re-identify a person with a certain degree of probability, even if no directly identifying personal data is included. The PHS states that such identification requires the existence of a reference database with textual fingerprints. According to the information provided, PrevBOT will not have this function and such a database will not be created.
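To make the notion of a textual fingerprint more concrete, the sketch below shows one simple way such a fingerprint could be built: a character trigram frequency profile compared with cosine similarity. The example is purely illustrative and hypothetical; as noted above, PrevBOT will not have such a function, and no reference database will be created.

```python
# Minimal, hypothetical sketch of a "textual fingerprint": a character
# trigram frequency profile. Texts by the same author tend to have more
# similar profiles than texts by different authors. Illustration only.
from collections import Counter
import math

def fingerprint(text: str, n: int = 3) -> Counter:
    """Profile a text as counts of overlapping character n-grams."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (1.0 = identical profile)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two short chat lines; longer texts give far more stable profiles.
print(cosine_similarity(fingerprint("hei, hvordan går det i dag?"),
                        fingerprint("hei, hvordan har du det i dag?")))
```

Real stylometric fingerprints draw on far richer features (function-word frequencies, punctuation habits, spelling variants), which is precisely why removing directly identifying information alone does not guarantee that a text cannot be traced back to its author.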
The personal data processed may pertain to the following categories of data subjects:
- The victim in a criminal case
- The perpetrator in a criminal case
- Third parties who may be mentioned in a chat conversation
When processing chat logs, it is conceivable that the following processing activities related to personal data may be conducted:
- Disclosure of chat logs from various police districts to the Police IT Unit
- Removal (‘cleansing’) of personal data from chat logs at the Police IT Unit
- Disclosure of chat logs from the Police IT Unit to CAIR/UiA (provided that the personal data is not completely anonymised)
- Data preparation/structuring at CAIR/UiA (provided that the personal data is not completely anonymised)
- Training of the algorithm at CAIR/UiA (provided that the personal data is not completely anonymised)
- Analysis at CAIR/UiA (provided that the personal data is not completely anonymised)
The Police IT Unit (PIT) has a supporting role in the project and receives a copy of the chat logs directly from the local police districts. PIT ensures that the confidential chat logs are stored securely and are not exposed to anyone other than those with lawful access to the data. Before the chat logs are made available to CAIR at UiA, PIT must remove identifying information about the perpetrator and the victim; the PHS considers that this information is in any case not relevant to the project. At PIT, the chat logs must also be machine cleansed, so that names, addresses, phone numbers and any other directly identifying information are redacted and replaced with ‘XX’.
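The report does not describe how the machine cleansing at PIT is implemented. As a purely hypothetical illustration of the principle, a rule-based pass over a chat log might look like the sketch below; the patterns and the name list are invented for the example, and production-grade redaction would require considerably more robust methods (for example trained named-entity recognition).

```python
import re

# Hypothetical redaction rules; real cleansing would need far more
# robust detection (e.g. NER models and curated name registries).
KNOWN_NAMES = ["Ola", "Kari", "Nordmann"]  # invented example list
PATTERNS = [
    re.compile(r"\b\d{8}\b"),                    # 8-digit phone numbers (assumed format)
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # e-mail addresses
    re.compile(r"\b[A-ZÆØÅ][a-zæøå]+(?:gata|gaten|veien|vegen) \d{1,4}\b"),  # crude street addresses
    re.compile(r"\b(?:" + "|".join(KNOWN_NAMES) + r")\b"),  # listed names
]

def cleanse(line: str) -> str:
    """Replace every match of each pattern with 'XX'."""
    for pattern in PATTERNS:
        line = pattern.sub("XX", line)
    return line

print(cleanse("Ola Nordmann her, ring 12345678 eller skriv til kari@example.com"))
# -> "XX XX her, ring XX eller skriv til XX"
```

A sketch like this also illustrates why caution is warranted: anything the rules miss, such as nicknames, dialect words or indirect identifiers, survives the cleansing, so the output cannot automatically be treated as anonymous.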
Personal data legislation does not apply to anonymous data. Data is considered anonymous when individuals in the dataset can no longer be identified by any means reasonably likely to be used.
About anonymisation
There are many pitfalls when anonymising data, and the Data Protection Authority generally considers it challenging to anonymise personal data with certainty. It is therefore important to undertake thorough risk assessments and to employ sound anonymisation techniques before treating data as anonymous.
According to the PHS’s plans, chat logs will be anonymised before they are processed by CAIR at UiA. On that basis, personal data will only be processed within the PrevBOT project from the time the chat logs are made available until anonymisation takes place.
If no personal data is processed during the development phase of PrevBOT, the data protection regulations will not apply. Information from criminal cases may thus be processed in the research project without being restricted by data protection regulations, provided that the resulting data is considered anonymous under the General Data Protection Regulation.
The way forward – a secondary discussion
The Data Protection Authority acknowledges that there is a risk that personal data may be processed in the PrevBOT project. In order to proceed with the legal analysis, the Data Protection Authority therefore assumes that the PrevBOT project will process personal data in the aforementioned processing activities. A large part of the following is thus a secondary discussion and is intended as a guide.
When personal data is processed for research purposes, a number of conditions must be met. The data controller must consider several different factors to ascertain whether an activity constitutes the processing of personal data for research purposes. It is important to note that even if the processing is found to be for research purposes, the requirements of the GDPR must still be upheld. The Data Protection Authority has a general concern that an overly broad interpretation of the concept of research could lead to misuse in this particular situation.