So, do we all use Leaky Language Models (LLMs) now?

December 1, 2023


Suchakra Sharma

Chief Scientist at Privado.ai

The most interesting research I read this week is a preprint by Nasr, Carlini et al. in which they discuss how easy it is to extract training data from production LLMs such as GPT. Given how ubiquitous LLMs are becoming, I began wondering whether, a few years down the line, this will be the buffer overflow of the modern internet age. Buffer overflows are still the building blocks of major exploit chains - they are fundamental - all you need is flawed logic and basic algebra, and they are literally everywhere. With LLMs now being embedded everywhere as well - in all enterprises and tuned with custom data - will these be the attack surfaces to extract private data from?

Membership inference attacks

As discussed by the authors, generative image models like Stable Diffusion have been prompted to reproduce training set data as is. A near-identical face of a real person used in training was regenerated successfully. GPT-2 has been shown to divulge information when sent empty strings (“”) or long strings of text such as “aaaaaaaaa” repeatedly until the model starts to generate training data. The root of such “tricks” is the membership inference attack discussed in 2017 (arXiv:1610.05820), where it was observed that machine learning models often behave differently on the data they were trained on versus the data they “see” for the first time - and exploits can be built upon that difference. While reading this paper, I recalled that in early 2020, while exploring and trying such attacks on GPT-2, I was able to get the model to disclose partial PII by simply giving it coercive prompts in Bengali mixed with random input tokens. Within 3-4 prompts, I was able to get it to divulge a name, an address and possibly some gas utility information attached to the person.

The generated information was partially correct: the name and part of the address matched a real person (it's anonymized here). Even this early in the LLM game, it was becoming clear that training data can be replayed back to a large extent. With more recent models, I was able to get llama2-7B running locally to disclose 2019 financial statements from Lions Club International (the response looked like data scraped from a public PDF) by fuzzing prompts. And this is just me playing around with LLMs in the evening. Now imagine how APTs operate to exfiltrate data - millions of dollars of resources dedicated to developing specialised attack chains, C2 machines etc.
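The membership inference idea above can be illustrated with a toy example. This is only a sketch under loud assumptions: a tiny character-bigram model stands in for an LLM, the training strings are made up, and the threshold of 4.4 is picked by hand for this toy data. The function names (`train_bigram`, `avg_nll`, `is_member`) are hypothetical. The point is just that text the model has seen gets a lower average loss than text it has not.

```python
import math
from collections import Counter, defaultdict

def train_bigram(texts):
    """Count character bigrams; '^' marks the start of each text."""
    counts = defaultdict(Counter)
    for text in texts:
        padded = "^" + text
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def avg_nll(counts, text, vocab_size=128):
    """Average negative log-likelihood per character, with add-one smoothing."""
    padded = "^" + text
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        row = counts.get(prev, Counter())
        prob = (row[nxt] + 1) / (sum(row.values()) + vocab_size)
        total -= math.log(prob)
    return total / len(text)

def is_member(counts, text, threshold):
    """Guess 'seen in training' when the loss is below the threshold."""
    return avg_nll(counts, text) < threshold

# Made-up training data, standing in for a real training set.
training_set = [
    "alice lives at 12 elm street",
    "bob pays the gas bill monthly",
]
model = train_bigram(training_set)

seen = "alice lives at 12 elm street"
unseen = "zed owns a quiet blue yacht"
print(avg_nll(model, seen), avg_nll(model, unseen))
print(is_member(model, seen, threshold=4.4), is_member(model, unseen, threshold=4.4))
```

Real attacks calibrate the threshold on reference data or shadow models rather than picking it by hand, but the signal they exploit is the same loss gap.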

Word-Replay Attack

The crazy part about this paper is that while earlier attacks on models showed limited replayability and a low success ratio (Pythia or LLaMA emit memorised data less than 1% of the time), this work was tested directly on production ChatGPT, where the attack made the model emit memorised data about 150x more often than it does under normal use! This means word-replay attacks like this can now actually be weaponized. In one case shown by the researchers, all they did was ask the model to repeat “poem” over and over, and they were able to get real PII.
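From a defender's side, one way to look at such responses is to find the point where the model stops repeating the word and starts emitting something else. The following is a minimal sketch, not the authors' tooling: `split_at_divergence` is a hypothetical helper, and the sample response is fabricated placeholder text, not real model output.

```python
import re

def split_at_divergence(response: str, word: str):
    """Return (number of leading repeats, text after the model stops repeating)."""
    tokens = response.split()
    target = word.lower()
    i = 0
    # Strip punctuation so "poem," and "Poem." still count as repeats.
    while i < len(tokens) and re.sub(r"\W+", "", tokens[i]).lower() == target:
        i += 1
    return i, " ".join(tokens[i:])

# Fabricated placeholder response, for illustration only.
sample = "poem poem, poem Poem. Jane Doe, 555-0100"
repeats, tail = split_at_divergence(sample, "poem")
print(repeats, repr(tail))
```

The tail is the part worth inspecting: a response-side filter could route it through a PII detector or a memorisation check before it reaches the user.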

🤯 Is this real PII?

We know that LLMs can hallucinate text when pushed really hard. So how do we know this is real PII? Well, in earlier attacks on non-production LLMs, folks manually searched Google to verify whether the replayed data matched a real PII record. This time, however, the authors actually downloaded 10TB of internet data and created an index. They found the delta of data pre-ChatGPT and post-ChatGPT training. Any string from the replay attack that they found in this delta is treated as being in the training data. The authors have developed multiple test methodologies to deduce the authenticity of PII. It is safe to say, I am quite alarmed now.
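The verification step above boils down to checking whether generated text overlaps verbatim with an index of known data. Here is a minimal sketch of that idea, assuming a plain in-memory set of word n-grams as the index; this is a simple stand-in and not the index structure the authors built. `build_index` and `longest_verbatim_run` are hypothetical names, and the corpus is a made-up sentence.

```python
def build_index(corpus_docs, n=5):
    """Index every run of n consecutive words found in the corpus."""
    index = set()
    for doc in corpus_docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index

def longest_verbatim_run(text, index, n=5):
    """Approximate length (in words) of the longest stretch of text found in the index.

    Consecutive matching windows are chained together, so a run stitched from
    two different documents would also be counted; fine for a sketch.
    """
    tokens = text.split()
    best = run = 0
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in index:
            run = n if run == 0 else run + 1
            best = max(best, run)
        else:
            run = 0
    return best

corpus = ["the quick brown fox jumps over the lazy dog near the river"]
index = build_index(corpus)

replayed = "he said the quick brown fox jumps over the lazy cat"
novel = "completely novel sentence with no overlap at all here"
print(longest_verbatim_run(replayed, index))  # 8
print(longest_verbatim_run(novel, index))     # 0
```

A long verbatim run is strong evidence of memorisation rather than hallucination, since a model is very unlikely to invent many consecutive words that happen to match an existing document.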

The current strategy of attackers who want to exfiltrate and steal data has been to build traditional exploit chains - explore public-facing applications and find vulnerabilities until you are able to reach some S3 bucket and read its data. Recent data breaches such as Okta have shown a drastic impact on the privacy ecosystem. However, given how ubiquitous and publicly available LLMs are now, it will be interesting to see how attackers try to reproduce real data from the “zip file of the internet”.

Impact on Privacy

In their 2022 preprint, Brown et al. tried to understand what it means for a language model to preserve privacy. One of the interesting points from the paper is the discussion of how extremely hard it is to redact information from a training dataset even when we intend to do that. An example of this is a pair of conversations - one between Alice & Bob and another between Bob & Charlie. We see that even when we redact messages from Alice, the contextual information about Alice that Bob and Charlie have in their messages can be enough to reproduce Alice’s private information.
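The Alice, Bob and Charlie example can be made concrete with a few lines of code. This is an illustrative sketch with made-up messages, not something taken from the paper: `redact_sender` and `find_leaks` are hypothetical helpers, and dropping every message a person wrote is just one simple form of redaction.

```python
# Made-up conversations: Alice -> Bob, then Bob -> Charlie.
messages = [
    {"sender": "Alice", "recipient": "Bob",
     "text": "I am moving to 12 Elm Street next week."},
    {"sender": "Bob", "recipient": "Charlie",
     "text": "Alice told me she is moving to 12 Elm Street."},
]

def redact_sender(messages, sender):
    """Drop every message written by the given sender."""
    return [m for m in messages if m["sender"] != sender]

def find_leaks(messages, secret):
    """Return the messages that still contain the secret string."""
    return [m for m in messages if secret.lower() in m["text"].lower()]

remaining = redact_sender(messages, "Alice")
leaks = find_leaks(remaining, "12 Elm Street")
for m in leaks:
    print(f'{m["sender"]} -> {m["recipient"]}: {m["text"]}')
```

Alice's own message is gone, yet her address is still in the dataset because Bob repeated it. A model trained on `remaining` can still learn it.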

LLMs have no information about the privacy context. Until now, data sanitization has been put forward as one of the approaches to give some semblance of privacy - identify PII in training sets, then anonymize it or use differential privacy while training. But as we see above, this is extremely hard to do, given that context can be inferred by LLMs even with sufficient redaction. Point redaction of private data is possible, but it won’t always lead to LLMs embedding privacy. Another approach in use is adding privacy layers to the requests and responses of LLMs, but as we see, bypassing the protection layers in LLMs is becoming quite easy as well. Recently, Yuan et al. observed that even in production GPT-4, you can create a temporary cipher while conversing with the LLM and use it to extract sensitive information that was intended to be blocked in the responses.


The authors show around 60% privacy-specific unsafety in the responses with this approach. This means you can coerce the LLM into giving out private information most of the time. If these types of attacks become commoditized, we run into an interesting situation where transparency about what datasets were used and how training occurred for public LLMs becomes important. While on the traditional internet unintended exposure of data could be corrected, it is almost impossible to detect and correct this in LLMs. There is simply no raw data - it is embedded in a model now.

It is unclear what the implications of this would be for custom fine-tuned or custom-trained LLMs used internally in enterprises, but it is sufficiently clear that those are now also attack vectors for exfiltrating enterprise data. The way to preserve data is to not leak it in the first place. If your training systems don’t pass PHI/PII data from one place to another, it doesn’t reach LLMs. The behaviour of what to do with data is defined in code, and it is in code where the truth about data lies. If you want to understand how to build privacy-respecting code, ping me.