Waterfall: A New Watermarking Method to Protect Copyright in the World of LLMs

17 March 2025

In December 2023, the New York Times brought a landmark lawsuit against ChatGPT maker OpenAI and its biggest backer Microsoft, alleging that they had used millions of its articles without permission to train the massively popular chatbot. The case, which is ongoing, marked the first time a major American media outlet had sued an AI platform for copyright infringement. It also set a precedent for more than a dozen companies and individuals to follow suit — reflecting increasing tensions over the unauthorised use of published work to train AI technologies.

Parties with ill intentions could potentially use ChatGPT and other large language models (LLMs), including open-source ones that they can run on their own computers, to plagiarise millions of articles very quickly with just the click of a few buttons. “There’s a really big problem about intellectual property (IP) protection, and other data issues such as privacy,” says Gregory Lau, a PhD student at NUS Computing’s GLOW.AI lab. Led by Associate Professor Bryan Low, the lab focuses on developing various types of AI techniques, including those that can be applied to LLMs.

To help address data provenance issues such as IP protection, GLOW.AI’s researchers have invented a special text watermarking method — called Waterfall, short for Watermarking Framework Applying Large Language Models — which they say performs better than existing, state-of-the-art techniques. The team described their work in a paper published last November, and made their code freely available online.

 

Taking a Different Approach

Digital watermarking — the process of embedding a code, pattern, or some other unique identifier into content such as videos, photos, and text to prove ownership — can offer “some form of assurance” against copyright infringement, says Lau, who co-led the Waterfall project together with fellow PhD student Niu Xinyuan. “Without it, you stand the risk of not being able to quickly scan through a large corpus of text to detect plagiarism and prove it.”

To be effective, an ideal watermark should possess certain key characteristics: it should be robust against modifications such as paraphrasing or conversion to a different form; general enough to be applied to a wide range of formats (including normal text and code); and scalable enough to support millions of users at a reasonable computational cost. Additionally, a good digital watermark should be impossible to detect without the right key or password, says Lau. “You don’t want an adversary to be able to quickly know that the text has been watermarked and try to break it.”

Existing watermarking methods, however, often fall short in one or more areas. For example, some watermarks are added by altering the text or pixels ever so slightly, while others are easily removed once they pass through an LLM’s training process.

Many methods are also model-centric, says Niu, whereby the main aim is to protect output generated by the AI platform itself, rather than the input data per se. This allows, for instance, a teacher to determine if a student penned her own essay, or relied on ChatGPT instead. “Model watermarking is typically used to differentiate between human-written text versus AI-generated text,” he explains.

But this tends to focus on the perspectives of big tech firms and the benefits to them, rather than those of the people who produce the content that’s used for training LLMs, says Low.

His team therefore took a different approach to watermarking — a data-centric one, focused on protecting the data sources themselves.

 

A Novel Approach

Waterfall consists of several novel techniques. “There are a few key innovations,” says Lau. “For a start, we’re the first to use LLM paraphrasing as a method to do text watermarking, instead of just perceiving it as a tool for plagiarism.”

In the traditional approach to text watermarking, synonyms for certain words in the original text are generated and used to encode signals. For example, ‘big cat’ may be mapped to ‘big feline’ or ‘large cat.’ The specific combination of synonyms used encodes the watermark.
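The synonym approach can be sketched in a few lines of Python. This is a toy illustration, not Waterfall’s code: the synonym table, key derivation, and function names are all hypothetical. Each word with a synonym pair carries one bit of a key-derived bitstring, and verification checks how many choices match.

```python
import hashlib

# Toy synonym table: each slot offers two interchangeable variants,
# so each substitution can encode one bit of the watermark.
SYNONYMS = {
    "big": ["big", "large"],
    "cat": ["cat", "feline"],
    "fast": ["fast", "quick"],
}

def key_bits(key: str, n: int) -> list[int]:
    """Derive n pseudorandom bits from a secret key."""
    digest = hashlib.sha256(key.encode()).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(n)]

def embed(text: str, key: str) -> str:
    """Replace each word that has synonyms with the variant
    selected by the next key bit."""
    words = text.split()
    slots = [w for w in words if w in SYNONYMS]
    bits = key_bits(key, len(slots))
    out, i = [], 0
    for w in words:
        if w in SYNONYMS:
            out.append(SYNONYMS[w][bits[i]])
            i += 1
        else:
            out.append(w)
    return " ".join(out)

def verify(text: str, key: str) -> float:
    """Fraction of synonym slots matching the key's choices;
    ~0.5 is expected by chance, 1.0 for watermarked text."""
    variants = {v: j for _, vs in SYNONYMS.items()
                for j, v in enumerate(vs)}
    hits = [variants[w] for w in text.split() if w in variants]
    if not hits:
        return 0.0
    bits = key_bits(key, len(hits))
    return sum(h == b for h, b in zip(hits, bits)) / len(hits)
```

The limitation Niu describes below follows directly: with only a handful of synonym slots per passage, the number of distinct watermarks that can be embedded is small.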

The researchers’ first novel step was to tap the power of LLMs to go beyond replacing individual words, paraphrasing entire sentences and more. “An LLM can completely reorder, break, or fuse sentences while preserving semantic content,” they write in their paper.

For instance, the sentence ‘I ate the pineapple tart’ may be reworded to ‘The pineapple tart was eaten by me’ or ‘I consumed the pineapple baked good.’

LLM paraphrasing offers many advantages. “For synonym watermarking, you can only replace so many synonyms within the passage,” explains Niu. “But in our case, we have the rephrasing and reshuffling of sentences, so we end up with a lot more combinations and we can support a lot more different types of representations while conveying the same meaning. This allows us to support a lot more watermarks and data owners.”

“We’ve also added other ingredients to our watermarking,” says Lau. One is embedding the signal into every single word — a process called n-gram watermarking. “When you do this, you increase the chances of detecting whether there’s any plagiarism, as it provides some defenses against adversaries who could try to modify words to remove the watermark,” he says. This may force adversaries to adjust the original text so much that it destroys the value of the IP within and defeats the purpose of even plagiarising it.
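One well-known way of conditioning a signal on every word, from the broader watermarking literature, is the “green list” construction: a secret key plus the preceding word pseudorandomly splits the vocabulary in half, and watermarked text favours words from its context’s half. The sketch below illustrates that general n-gram idea only — it is an assumption for illustration, not Waterfall’s exact design.

```python
import hashlib
import random

def green_set(prev_word: str, key: str, vocab: list[str]) -> set[str]:
    """Pseudorandomly split the vocabulary in half, seeded by the
    secret key and the previous word (a bigram context)."""
    seed = hashlib.sha256((key + "|" + prev_word).encode()).hexdigest()
    rng = random.Random(seed)
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    return set(shuffled[: len(shuffled) // 2])

def score(words: list[str], key: str, vocab: list[str]) -> float:
    """Fraction of words falling in their context's green set.
    Unwatermarked text scores near 0.5; watermarked text scores
    well above that chance level."""
    hits = sum(w in green_set(prev, w_key := key, vocab)
               for prev, w in zip(words, words[1:]))
    return hits / max(len(words) - 1, 1)
```

Because every word contributes a small statistical bias rather than any one word carrying the mark, an adversary must rewrite much of the text to dilute the signal — which is exactly the trade-off Lau describes.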

Another thing the team did was to embed “a wavy signal, which is a bit like sound waves of different frequencies,” explains Niu, who says they borrowed the concept from signal processing, a field of engineering and applied mathematics. “This helps to improve the computational efficiency of the verification process. It also ensures the watermarks of different parties don’t interfere with one another.”
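The signal-processing concept can be illustrated with plain cosines: give each party an integer frequency, superpose the carriers, and recover any one party’s contribution by correlating against its carrier. Because integer-frequency cosines over a full window are mutually orthogonal, the parties’ signals do not interfere. This is a hypothetical sketch of the orthogonality idea, not the paper’s actual construction.

```python
import math

N = 256  # number of positions in the analysis window

def carrier(freq: int) -> list[float]:
    """Cosine carrier at an integer frequency; integer frequencies
    over a full window are mutually orthogonal."""
    return [math.cos(2 * math.pi * freq * t / N) for t in range(N)]

def superpose(freqs: list[int]) -> list[float]:
    """Sum the carriers of several parties into one signal."""
    sigs = [carrier(f) for f in freqs]
    return [sum(s[t] for s in sigs) for t in range(N)]

def strength(signal: list[float], freq: int) -> float:
    """Correlate the signal with one party's carrier; orthogonality
    means other parties' carriers contribute essentially zero."""
    c = carrier(freq)
    return 2 / N * sum(x * y for x, y in zip(signal, c))
```

Checking one frequency at a time keeps verification cheap, and a party whose frequency is absent from the signal correlates to roughly zero.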

It is this novel combination of techniques that makes Waterfall surprisingly effective in achieving robust verifiability, says Low. When tested against five different threat scenarios and different AI models, Waterfall “performed really well,” he says. Moreover, compared with state-of-the-art text watermarking methods, Waterfall demonstrates better scalability (protecting up to billions — rather than hundreds — of users), requires lower computational cost, and is more versatile (working across different text types and languages, including different coding languages).

The team say their work offers a change in perspective. “People often think of LLMs as infringing on intellectual property rights, but they can also be used to protect IP,” says Lau. “While AI has harmful effects, it also has the potential to benefit society at large — we would like to encourage more people to consider such useful applications of AI.”
