GPT-NL founder: 'First legally compliant language model'
New collaboration with dozens of news companies accelerates GPT-NL.
Published on July 28, 2025
Our DATA+ expert, Elcke Vels, explores AI, cyber security, and Dutch innovation. Her "What if..." column imagines bold scenarios beyond the norm.
GPT-NL aims to be a responsible alternative to ChatGPT, Grok, and similar tools. The model is currently under development and is intended for use by large companies and government agencies such as the Public Prosecution Service. The language model, which is built on high-quality, legally obtained Dutch data, will now also be trained with data from the news media. TNO researcher Selmar Smit, one of the founders of GPT-NL: “This is the biggest milestone so far.”
Although large language models (LLMs) such as ChatGPT are widely used for simple tasks, companies and government agencies sometimes need a language model for more sensitive documents, including government information and police reports. Concerns are growing about the legal and ethical issues surrounding the use of LLMs for such tasks: most models originate from foreign big tech companies, which operate outside our jurisdiction and legislative oversight.
With the arrival of GPT-NL, an initiative of the non-profit organizations TNO, NFI, and SURF, funded by RVO of the Ministry of Economic Affairs, the Netherlands will soon have its own language model. GPT-NL is the first large-scale Dutch AI language model trained exclusively on legally obtained data.
Media joins in: a world first
The recent collaboration with dozens of news companies gives the project a big push in the right direction. The members of NDP Nieuwsmedia are making a significant portion of their archive of news articles from over 30 national and regional news titles available to train the language model further. The ANP press agency has also joined the initiative. This is the first time that news publishers have collaborated in this way with an organization developing an AI model.
In a single step, this is expected to double the amount of high-quality Dutch data on which the model is trained. The publishers' data gives the model access to more than 20 billion “tokens” of articles, covering topics such as politics, economics, healthcare, and science. Tokens are the small pieces of text (words, parts of words, or punctuation marks) that an AI model processes to understand language.
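To make the notion of a token concrete, here is a minimal sketch in Python. GPT-NL's own tokenizer has not been published, so the snippet uses the open-source tiktoken library and its general-purpose cl100k_base encoding purely as a stand-in; actual token counts for GPT-NL will depend on its own tokenizer.

```python
# Minimal sketch of what a "token" is. GPT-NL's tokenizer is not public,
# so this uses the open-source tiktoken library (cl100k_base encoding)
# only as an illustration of how text is split into pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentence = "De nieuwsarchieven leveren ruim twintig miljard tokens op."
token_ids = enc.encode(sentence)

print(len(token_ids))                        # how many tokens the sentence costs
print([enc.decode([t]) for t in token_ids])  # the pieces: words, word parts, punctuation
```

For a Dutch sentence like the one above, such an encoding typically produces somewhat more tokens than words, because less common Dutch words are split into several smaller pieces.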
Stefan Heijdendael, strategic AI advisor at NDP Nieuwsmedia: "The Netherlands wants to play a leading role in the European AI race. For example, substantial investments are being made in a supercomputer for AI, and the Netherlands wants to move toward ‘responsible AI,’ which involves working with systems that hallucinate as little as possible, are trained on quality data, and take privacy requirements and intellectual property into account. GPT-NL is the result of this ambition. Source data is crucial in this regard. A comprehensive dataset makes an indispensable contribution to this."
News articles add not only language but also world knowledge to AI models. However, models such as Grok and ChatGPT currently use this knowledge without permission or compensation: the archives of news publishers are being scraped en masse. That is a problem. Heijdendael continues: "Journalism is not free. Our members pay €400 million a year in journalists' salaries alone. If we are not careful, journalism will be outcompeted by AI systems built on the work of our own journalists. NDP Nieuwsmedia believes that AI innovation should not lead to news provision by news organizations being replaced by that of tech companies."
Smit also sees the collaboration with news companies as a milestone. “It's a world first. Of course, other AI parties have agreements with one or two newspapers, but we are doing this on a large scale, with all the major news organizations in the Netherlands. And in a way that allows them to share in our revenues.”
Not in line with legislation
The language model is undoubtedly an ethically responsible one. But to be honest, GPT-NL does not (yet) match the level of GPT-4. Smit: “We are aiming for something comparable to GPT-3.5 — so we are a few years behind.”
If you develop a completely new system, rather than simply copying the internet, you end up with a model that performs slightly less well.
Nevertheless, Smit believes it is necessary to develop the model. “It's quite strange: large companies and governments are currently using models that are not in line with our legislation. If that legislation were actually enforced, we would suddenly be pretty much the only party allowed on the Dutch market.” Moreover, the quality of GPT-NL is improving. “We realize that being responsible alone is not enough. That's why large organizations can fine-tune the model to suit their specific tasks. We believe that this will ultimately bring it to the same level of performance.”
Next year is the year
GPT-NL will continue to be trained intensively until the end of October. In this phase, the model is simply learning how language works—writing sentences, composing stories, and recognizing patterns. “We are currently ‘babysitting’, as we call it,” says Smit. “We are training the language model on a large scale, but occasionally something goes wrong. Then we have to reset the system and get everything up and running again.”
This phase is expected to be completed by the end of the year. It will be followed by extensive testing of so-called guardrails: does the model behave properly, and does it avoid giving inappropriate or dangerous answers? “If the model passes these tests, we hope that the first parties will be able to test GPT-NL early next year,” concludes Smit.
Heijdendael also hopes that GPT-NL will soon be used on a large scale, including in politics. “If we want AI to be future-proof, The Hague and Brussels need to take action now, for example by aligning their own AI applications with privacy and copyright laws. GPT-NL will soon make that possible.”