The Problem of String Concatenation and Format String Vulnerabilities

String concatenation and format string vulnerabilities are a problem in many programming languages. This blog post explains the basics of string concatenation and insecure string concatenation functions in C. It then examines format string vulnerabilities, how they appear in different web applications, and their relation to XSS vulnerabilities.

If JavaScript is your programming language of choice, you probably don't have to worry about string concatenation a lot. Instead, one of the recurring problems you might encounter is having to wait for JavaScript's npm package manager to install all of the required dependencies. If that sounds all too familiar, and you have some time on your hands until npm is done, you might as well spend it reading about how string concatenation works in other languages.

In this blog post, we examine why string concatenation is a complicated topic, why you can't concatenate two values of different types in low-level programming languages without conversion, and how string concatenation can lead to vulnerabilities. We'll also explain how format strings that contain placeholders for certain types of data can cause serious trouble if they are controlled by an attacker. And we'll conclude with a simple way to fix them.

First, let's look at JavaScript string concatenation, which suffers from its own peculiarities when joining data of different types. Here are some examples:
1 + 1 === 2                    // Obviously
'1' + 1 === '11'               // How intuitive
1 + true === 2                 // Okay?
1 + {} === '1[object Object]'  // When would you ever need that?
true + undefined               // NaN: it's obviously not a number, but what is it?
typeof NaN === 'number'        // Well, okay then.
JavaScript isn't so perfect, after all. And while it's easy to concatenate two values of different types in JavaScript, it does not necessarily lead to an expected, or even useful, result. Some confusion stems from the fact that, in JavaScript, the concatenation operator is the same as that for addition, and so it must decide whether to concatenate the values or add them.

Let's take a look at another loosely-typed language, like PHP. How does it handle concatenation? Well, in PHP you don't use the plus operator for this. Instead, you use a dot, as shown:
"The number one: " . 1   // Works as expected
"10" . 1 === "101"       // This just literally appends a 1 to the string "10"
"10" . 1.0 + "1.0"       // Guess what this one does!
Obviously, this is (float) 102. You concatenate the string "10" with the float 1.0, which results in the string "101". Then you add the string "1.0" to the string "101" and you end up with (float) 102. See? It's not confusing at all. (This describes PHP 7 and earlier, where the dot and the plus operator have the same precedence and are evaluated from left to right. Since PHP 8, the plus operator binds more tightly, so the same expression yields the string "102".)

And yes, you can add two strings together and end up with a value of the type float.

Back To The Basics of String Concatenation

We could talk about the problems with PHP's type system and how it leads to real-life vulnerabilities (see Detailed Explanation of PHP Type Juggling Vulnerabilities), but let's just admit that string concatenation in JavaScript and PHP is really convenient. Do you know what happens if you try the same thing in a low-level language like C? If you try to add a float to a string, C thinks you are out of your mind!

C does not know what you are trying to achieve here, and the compiler rejects the addition with an error message that names the two operand types involved, char * and double. C wouldn't even let you add two strings together to concatenate them, let alone values of differing data types. Here, strings are just a sequence of characters with a NULL byte at the end. Usually, there is a pointer to the first byte of the string. The char * in the error message refers to that pointer. In JavaScript, you conveniently don't have to worry about NULL bytes or pointers. Strings are also immutable in JavaScript, so you can't just change the data in them. Instead, you need to create a new string if you want to change data, as the old one stays untouched.

Suppose a variable x holds the word 'String' and we try to change it to 'Strong' by assigning a new character to x[3]. It doesn't work: the assignment is silently ignored (strict mode raises a TypeError instead), and the x variable hasn't changed. This may seem limiting, but it's actually a pretty nice feature, as you may know if you've ever worked with Objects in JavaScript (keyword: deep copy). But let's get back to our C error message.

Concatenating Different Data Types is Hard

The double in the error message refers to so-called "double-precision floating-point numbers". For those who don't know, floating-point numbers are stored in memory as a cryptic sequence of bytes that doesn't appear to make sense. The reason for this is that they are usually stored in a binary form of scientific notation, in the format defined in IEEE 754 (see the step-by-step tutorial Binary Fractions and Floating Point!). It's not a mandatory read. What's important here is that the bytes look confusing and don't contain any recognizable part of the original floating-point number.

How would C join a string of characters with the weird-looking mix of bytes that is the floating-point number, while preserving the meaning of both? The simple answer is: it can't. If you appended the floating-point number to the end of the string and moved the NULL byte accordingly, C would simply treat these bytes as text and would try to print garbage or characters that can't be displayed. This would be unthinkable in JavaScript, but it's a common problem in C programming. So, in order to add the floating-point number to a string, C needs to convert its IEEE 754 format to a human-readable string representation first. For this, the C library provides the *printf family of functions.
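To make the "cryptic sequence of bytes" concrete, here is a minimal sketch. It is written in Python rather than C purely for brevity; Python's standard struct module packs a float into the same IEEE 754 double layout. The value 1.5 and the variable names are illustrative assumptions, not taken from the original example.

```python
import struct

value = 1.5  # illustrative value, not from the original post

# The eight raw bytes of the IEEE 754 double, big-endian for readability.
raw_bytes = struct.pack(">d", value)

# Treating those bytes as text yields nothing that resembles "1.5".
as_text = raw_bytes.decode("latin-1")

# A printf-style conversion produces the human-readable form instead.
formatted = "Value: %f" % value

print(raw_bytes.hex())  # hex dump of the in-memory representation
print(repr(as_text))    # mostly unprintable characters
print(formatted)
```

The hex dump shows the sign, exponent and mantissa bits, none of which look like the digits of the number. Only the explicit conversion step turns the value into text that can safely be joined with a string.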

Insecure C Functions

When you want to put data of any type into a string, or concatenate strings together, you can use the printf family of functions. You need to pass a format string that contains placeholders that allow you to specify the type of the data, among other things. Let's look at an example: a small program that does nothing but call printf("This is %s.", "a string").

If you compile this code and run it, it will simply print 'This is a string.'. As you may have guessed, %s is the placeholder for strings, but there are placeholders for other data types as well, such as integers, floating-point numbers or even single characters. You can also print the data in hex representation. However, there is a catch: if you use printf incorrectly, it is completely insecure. Before we get to that, let's ask a simpler question. How are you supposed to concatenate two strings in C? Many of us would consult Stack Overflow, because it seems like a straightforward question about a common problem. Well, of course it isn't!
  • The top-voted answer uses strcat, a function that doesn't check any size at all, which is always a bad thing in C, as it can lead to buffer overflows if you aren't careful. A comment below the question suggests using strlcat, which seems to be a safer version of strncat, which people assume to be a more secure version of strcat. But then someone else suggests using strcat_s. This seems to be yet another more secure version of strcat. That's not really straightforward, is it?
  • Someone else on Stack Overflow likes to use snprintf, which is "a Big no no" according to a comment, although apparently _snprintf is the insecure one. But another post, called Stop using strncpy already!, mentions that snprintf is okay to use. If you were not already familiar with Stack Overflow or C, this thread perfectly illustrates the problems with both.

Format String Vulnerabilities

I've already mentioned that *printf functions are dangerous if used incorrectly. But how can a function whose sole purpose is to format output lead to exploits that possibly result in arbitrary code execution? A detailed answer is beyond the scope of this blog post, but here is an overview. Local variables and function arguments are stored in a special place in memory, the stack. For a 32-bit x86 Linux binary, this usually means that once you call a function, it will grab its parameters directly from the stack. So what happens if you have a printf call with format specifiers but no matching arguments, as below?
printf("%x%x%x%x%x");
It will simply grab whatever data is on the stack and print it in hex format. This may include stack and return addresses, stack cookies (a security mechanism that aims to prevent buffer overflow exploitation), the content of variables and function parameters, and everything else that is immensely useful for an attacker. So if the format string in printf is user-controllable, that's incredibly dangerous. Also, if you use enough format specifiers, printf will eventually reach a part of the stack that holds user-controllable input, such as the format string itself. From here it gets worse. You can then supply an address anywhere in memory and read data from it with %s. Additionally, if %n is enabled, you can write to an address of your choosing: %n stores the number of bytes that have been printed so far. That doesn't sound like much, but it could allow attackers to overwrite return addresses and redirect the code flow to their advantage.

Do Web Applications Contain Format String Vulnerabilities?

Format strings are not only available in C. Here are some other languages in which you will also find them used for web applications.

PHP

If you are familiar with PHP, you might know that it also has a printf function. However, PHP will check whether there are more format specifiers than function parameters (with a few exceptions). You would usually only get a compiler warning if you used printf in an insecure way in C. PHP would simply abort execution of the script.

Ruby

Ruby is similar to PHP. When you use Ruby's printf function, it compares the number of arguments to the number of format specifiers. If there are more specifiers than arguments, execution halts.

Perl

Perl is different. Perl will happily accept as many format specifiers as you pass to it, but the result won't be very useful. You can, however, use the %n specifier to overwrite variables with the number of characters that have already been printed. Here is a code sample:
$str = "This is a string";
printf("AAAAAAAAAAAAAAAAAAAAAAAA%n\n",$str);
print($str); # this is 24 now, as there were 24 'A's printed
There is a further, not-so-obvious problem. The Perl comparison table is an almost unparalleled atrocity, due to the sheer number of unexpected comparison results. It is similar to PHP's, and you need to use the eq operator instead of == in order to do a more or less strict comparison. So instead of 1 == 1 you would write 1 eq 1. For most people, == is the obvious way to go and more convenient than typing eq in every comparison. If we look at the Perl table, we can see that the integer 0, when compared with == to any string that doesn't begin with a number, returns a match. Now look at the following code:
$databasePassword = "secret pass"; #unknown to attacker
$password = "A password"; #user supplied string
# ... some more code ...
printf("%n <somehow user controlled format string>", $password); #writes integer 0 into $password variable
# ... some more code ...
if($databasePassword == $password) { # this will match!
        print("Password matches");
} else {
        print("Password does not match");
}
This code will print "Password matches". The problem with this is that you don't even need an integer to pass the check. Even the string "0" would work, which is ridiculous. (PHP has similar problems, but it at least needs an integer, which you can't pass as a GET or POST parameter.) This method is still useful in cases where it lets you change the value of input that might not be user-controllable. This can lead to potentially exploitable behaviour later in the code. If you want to learn more about other insecure behaviour in Perl, see The Perl Jam 2.

Lua

Did you know that people write web applications in Lua? There is even a Lua Apache module for that specific purpose. I've seen Lua based web applications in routers, for example. This is probably due to Lua's extremely small size, which makes it suitable for use in routers where disk space is tight. Lua is also available as a scripting language for interface customization in World of Warcraft. It seems that the next logical step is to use it to write web applications. It also has a string.format function similar to printf. It supports only a limited set of format specifiers and %n is not one of them. In addition, it checks whether the number of arguments matches the number of format specifiers. If it doesn't match, an error occurs.

Java

There is a System.out.printf function in Java. Like most other languages, it checks whether the number of arguments matches the number of format specifiers. Again, there is a %n specifier, but it doesn't do what you might expect. For some reason, it will print the appropriate line separator for the platform it's running on. That's confusing if you're coming from C, but you can't expect compatibility with Java's format strings, even though both functions have the same name.

Python

Python is a very interesting case because it has two different kinds of commonly used methods of string formatting. The PyFormat website is dedicated to string formatting in Python, deeming Python's own documentation to be "too theoretic and technical". First of all, there is the older way which uses the format specifiers we already know. It looks like this:
print("This is %s." % "a string")
As you can see, this uses a format string followed by a percent sign. The arguments are written after the sign. If there is more than one argument, you need to use a tuple. Here, %n is not supported. Also, the number of arguments is compared to the number of format specifiers, and Python raises an error if they don't match.

But there is also a newer way of doing string formatting. You use it by calling the built-in .format method on a string.
print("{} is {} years old".format("Alice", 42))
It even allows you to access properties of passed objects from within format strings. This can be convenient, as seen in this example.
print("{person[name]} is {person[age]} years old".format(person={"name":"Alice","age":42}))
Both of them print ‘Alice is 42 years old’, as you would expect. They don't really require format specifiers, because Python will automatically convert the arguments into a proper string representation most of the time. The second method, however, might lead to an information disclosure vulnerability. A blog post called Be Careful with Python's New-Style String Format describes this issue well. Basically, depending on the data you pass, an attacker can read sensitive information, going far beyond your intention. Let's take a look at an example.
API_KEY = "1a2b3c4d5e6f"

class Person:
    def __init__(self):
        """This is the Person class"""
        self.name = "Alice"
        self.age = "42"

print("{person.name} is {person.age} years old".format(person=Person()))
While this seems harmless at first, the API_KEY variable could easily be printed if an attacker could control the format string here. But how does it work? First of all, it's important to know that the person object contains more than the name and age attributes we set. It also has the __init__ method, which is directly accessible. Python calls it automatically when you instantiate the Person class, but it's still a user-defined function that is available from within the format string. So what can we do here? We can't call functions from within format strings. We can, however, access attributes. In Python, functions have some specific attributes that you generally don't need to use, for example __name__, which holds the name of the function. However, there is another useful attribute here, called __globals__. The documentation describes it like this:
‘A reference to the dictionary that holds the function’s global variables — the global namespace of the module in which the function was defined.’
That is interesting because by definition, the API_KEY variable is in the global namespace of our __init__ function. This leaves us with the following format string.
"API_KEY: {person.__init__.__globals__[API_KEY]}".format(person=Person())
And that in turn results in the following output:
API_KEY: 1a2b3c4d5e6f
There are other such attributes, for example __doc__, which holds the documentation string of the function (possibly yielding some useful information). There are also others for file names. The full list is provided in the Python documentation. I'm not familiar enough with Python to say whether the fix provided in the blog post is suitable or not, but it resembles the blacklist methodology that I'd recommend avoiding. If you absolutely need to provide format strings to users, you might want to use the old format. Generally, though, I would recommend trying to avoid user input in format strings.
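The difference between the two formatting styles can be sketched as follows. This is a minimal illustration under stated assumptions, not a vetted mitigation: the helper names render_new_style and render_old_style are hypothetical, and the API_KEY value is the same placeholder used in the example above.

```python
API_KEY = "1a2b3c4d5e6f"  # same placeholder secret as in the example above


class Person:
    def __init__(self):
        self.name = "Alice"
        self.age = 42


def render_new_style(user_format, person):
    """New-style formatting: the format string may traverse attributes."""
    return user_format.format(person=person)


def render_old_style(user_format, person):
    """Old-style formatting: only the keys of the mapping are reachable."""
    return user_format % {"name": person.name, "age": person.age}


alice = Person()

# An attacker-controlled new-style format string reaches the module globals.
leaked = render_new_style("{person.__init__.__globals__[API_KEY]}", alice)

# The old style can only look up the keys we explicitly handed over.
harmless = render_old_style("%(name)s is %(age)s years old", alice)

try:
    render_old_style("%(__init__)s", alice)
    lookup_failed = False
except KeyError:
    # There is no "__init__" key in the mapping, so nothing is disclosed.
    lookup_failed = True

print(leaked)
print(harmless)
print(lookup_failed)
```

Even with the old style, a user-supplied format string can still raise exceptions, so keeping user input out of the format string entirely remains the safer choice.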

JavaScript

Generally, JavaScript does not need any format strings. You are expected to implement them yourself with the built-in replace function, which is not at all frustrating and overly complicated. There are external libraries that imitate them, but since the total number of JavaScript libraries is vast, checking each of them for unexpected behaviour is an impossible task. Nonetheless, JavaScript provides you with a way to format output without replacements and concatenation. You can use so-called template literals. As the name suggests, you can't dynamically create such a string during runtime, except if you use a function like eval, but I would strongly advise against that. Here is an example:
const name = "Alice";
const age = 42;
console.log(`${name} is ${age} years old`)
You use backticks, writing the variable name between curly braces after a dollar sign: ${var}. You can also add functions. Since there is no easy way to give users the ability to define their own template literals without allowing them to execute arbitrary JavaScript code, I want to point out that template literals are the single best thing in JavaScript when it comes to bypassing blacklist filters. Why? Such a filter could remove all "(" and ")" characters and therefore make it pretty hard to execute JavaScript code. It's still possible with the onerror event handler and the throw keyword, for example, but template literals are more convenient, depending on the functions you want to execute. You simply write the template literal after the function name and it will be executed:
alert`some popup message`
This is why blacklists are never a good idea.

Format Strings and XSS

There is another threat regarding XSS vulnerabilities and format strings. If the output that is generated with a format string function is vulnerable to XSS, this should be fixed as soon as possible. There is a safety net for most users though, except those that use Firefox on anything other than iOS. Most versions of Google Chrome, Safari, IE and Edge have a built-in XSS filter that is regularly updated. The problem is in the way these filters work. They compare the user input with the generated output that is sent back by the server. If the two are similar enough and contain dangerous tags or parameters, the page will not be rendered or the dangerous input is removed. These filters have become better over time, but they still aren't foolproof. In particular, if the input is processed on the server side and gets changed, the filters will not detect the vulnerability. In a situation such as the example below, Chrome will currently not detect the XSS vulnerability, and will execute the JavaScript code.
//server side code
printf(user_input, "Alice", 42);
URL:
https://example.com/?user_input=<iframe src="javascript:alert(1)||%s">
The output will be <iframe src="javascript:alert(1)||Alice"> and an alert popup window will appear.
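The same scenario can be sketched in Python to show why the response no longer matches the request. This is a minimal illustration, not the server code from the example: the function names render_vulnerable and render_safe and the surrounding paragraph markup are hypothetical, while html.escape is part of Python's standard library.

```python
import html


def render_vulnerable(user_input):
    # The user-supplied value is used as the format string itself,
    # so the server fills in the attacker's %s placeholder.
    return user_input % ("Alice",)


def render_safe(user_input):
    # The format string is fixed; the user input is escaped for the
    # HTML context and passed as an ordinary argument.
    return "<p>%s says: %s</p>" % ("Alice", html.escape(user_input))


payload = '<iframe src="javascript:alert(1)||%s">'

print(render_vulnerable(payload))  # markup differs from what was sent
print(render_safe(payload))        # payload arrives as inert text
```

In the vulnerable variant the server replaces the placeholder, so the markup in the response is no longer identical to the request parameter, which is exactly the kind of change that a comparison-based filter can miss. In the safe variant the placeholder is never interpreted and the tag is rendered as plain text.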

Avoid Format Strings That Contain User Input

The impact of format string vulnerabilities is highly dependent on the language in which you use them. The general rule of thumb is to avoid having format strings that contain user input. Instead, you should always pass that input as a parameter to the formatting function, which is the universal way to avoid format string related vulnerabilities. And, of course, you should always sanitize user-provided input depending on the context it will be used in. You should now have a solid, basic overview of format string exploitation in web applications and elsewhere.

I recommend that you test out the Netsparker Web Application Security Scanner. It seamlessly integrates with the most popular Continuous Integration solutions and issue tracking systems, giving you more time to read about web application vulnerabilities and less time worrying about them.
Sven Morgenroth

About the Author

Sven Morgenroth - Senior Security Engineer