Hacking Large Language Models (LLMs)

What is an LLM

An LLM is a model trained on a large amount of data to predict and generate text, as well as understand words and sentences in context.

Some definitions of LLM lingo

Transformer: A deep learning architecture based on the ‘attention’ mechanism that determines the importance of each component (e.g., words) in a sequence (e.g., a sentence) relative to others within the same sequence. Instead of reading all words one by one, it will read all words and try to understand the relationship between each word. Words are converted into vectors through a lookup from a word embedding table. Transformers perform self-attention, and are much faster and better at understanding large texts. Transformers are the backbone for LLMs today, such as ChatGPT (Generative Pre-trained Transformer). They will process sequences in parallel, reducing the training time compared to RNNs.

Word embedding: Dense vector representation of a word (dense meaning that it contains useful information, capturing the meaning and relationships between words). In a graph, words that are close to each other (like similar meanings) are closer to each other in the vector space (in a graph, remember vectors?). Words are converted into numbers and set in a high-dimensional space (can be hundreds, thousands, etc.) where each word has multiple coordinates. This numerical representation helps the computer understand the semantic relationship of words and how they can be related.

RNN (Recurrent Neural Network): This type of neural network is designed to process data sequentially, where the order of words is important. RNNs use past information to influence future decisions. Basically, it will make smart predictions based on previous words that it has remember. Unfortunately, RNNs suffered from short-term memory loss, making it difficult to remember long sentences. An example would be ““I grew up in France and speak fluent __.” The correct answer (French) depends on remembering “France,” which appeared in a word earlier in the sentence. A simple RNN might forget “France” by the time it reaches the blank and make a random guess based on smart predictions (e.g., “English”). The RNN can’t hold onto that information because its memory is too weak. This was solved by introducing the LSTM architecture.

LSTM (Long short-term memory): A sequence-based type of neural network, processing every word one by one, and that will keep what it considers important details and discard what it considers non-important details. From the example above, it would consider keeping the “France” detail, in order to make a more accurate prediction. This is useful for longer sentences, in order not to “forget” what was previously said. It works with cells and gates. It uses a memory gate which stores information, a “forget” gate that will discard details, an input gate which takes new information, and an output gate which decides what information to output. Used mostly in the past for chatbots and speech recognition (like Siri for example).

Neural network: Machine learning process called deep learning which teaches computers to process data inspired by how the human brain works. It learns from its mistakes to improve and “do better next time”.

So, what is a LLM and how does it work

A large language model (LLM) is a type of neural network built on a transformer architecture which will process and generate human language. It tokenizes text into smaller units (words or subwords) and converts them into word embeddings. These numerical values capture semantic relationships, and then generates an answer based on the context and relationship between the tokens.

Transformers, unlike RNN or LSTM models which process text sequentially, will use self-attention mechanisms to analyze all tokens in parallel which enables them to relate words to each other in a specific context. Through massive training on data, LLMs will learn patterns and generate new a new text, token by token, predicting the most likely word based on everything it has captured so far.

Geeking out a bit (you can skip)

How does it process your input

When typing a question to an LLM, for example ChatGPT, it first converts all our words into tokens which are then going to be embedded into numerical representations.
ChatGPT will process these tokens using attention mechanism, tokenization and word embedding to understand the context of our question. This will help it come up with the most relevant and accurate answer it can generate based on the massive amount of data it was trained on.

How does it provide a response

The words that it will generate for an answer don’t come from a database. It is built token by token (word by word) using a process called probabilistic word prediction.
This means that it will generate a response based on the user’s question, and start generating a first word. Then, it will generate a second word based on the first word, and the user’s question. It does this in a loop until it builds a proper sentence that actually makes sense with what the user was asking. It uses probability to generate the next word, as part of a “most likely” option. Now the way it assigns probability to words, is by assigning a score to every possible next word. It assigns a probability on the score based on the entire context and previous words that have already been generated (thanks to attention mechanism).

Basic schema of how an LLM generates a response.

Semantic relationships (or semantic properties) between words help LLMs contextualize meaning of sentences and questions. For example, a famous property of word embedding (for example Word2Vec) will allow the LLM to capture nuances and analogies in language. A common example of this is:

"King" - "Man" + "Woman" = "Queen"

The reason is because we are removing the male aspect to the word King. King can be understood as royalty, so it will retain this context, but by adding the word “Woman”, we are giving a female aspect to the word “royalty”, which will then be understood as Woman Royalty -> Queen.

High-dimensional word embedding will represent words as vectors with multiple numerical values (coordinates), which will help place the token in a multi-dimensional space. Tokens will have multiple coordinates, because a word can have multiple features or characteristics related to the word.

For example, the word “cat” can have coordinates that will place it in a high-dimensional space with the following categories: Living being, feline, human, pet, gender, fruit, verb, noun, etc.

Non-realistic basic concept of how word embedding works.

How can attackers exploit LLMs

LLMs lack knowledge about user intentions and instead rely on prompt inputs and their training data to generate responses. Although this flexibility allows LLMs to interpret user questions and context effectively, it also makes them vulnerable to manipulative inputs.

As mentioned above, LLMs process the entire text using an attention mechanism. The system prompt (text designed to guide the LLM’s behavior) can be influenced by unsafe user inputs, potentially leading to unintended outputs.

For example, an attacker could input:

“Ignore all previous directives and instructions.”

This could override the original constraints set by the system prompt, modifying the model’s behavior. This technique is known as prompt injection.

Because of the attention mechanism, the LLM identifies and prioritizes relevant words from the user’s input. If there isn’t a proper intent check, the LLM will process the injected prompt, and generate tokens that best satisfy the desired response.

Jailbreak techniques can also be used to bypass content filters, by for example encoding or obfuscating queries. This could allow attackers to trick the LLM into producing normally restricted content, such as asking for secrets or passwords.

Bypassing filters to retrieve sensitive information.

Thanks for reading and I hope you learned something.

Disclaimer at the bottom, but I am not a LLM expert and it is possible that I have gotten some concepts wrong, or explained this that are not 100% accurate. This post was mainly for me to better understand the subject and how LLMs actually work. 🙂