Regardless of which programming language you’re developing software in, you’re likely familiar with the concept of input filtering: the process of validating and sanitizing data before doing anything useful with it.

If not, the general concept is simple and can be applied to most things. Business leads, service engagements, and software development lifecycles all follow pretty much the same pattern:

  • Input stage – data comes in: e.g. a sales lead; new business pursuit; requirements gathering
  • Filter stage – data is transformed: e.g. sanitization; development/implementation; continuous feedback
  • Output stage – data is presented: e.g. final deliverable; engagement assessment; close of cycle

This 3-step pattern is what I’m referring to when I say “input filtering,” as it’s typically applied to software development and web application security. Before digging deeper, let’s take a look at basic security practices.

Web Application Security

From a security standpoint, data that doesn’t originate from its own endpoint is potentially tainted, and needs to be treated as such. Unless your application is completely static, data will be coming from a remote origin. Even if that origin point is within your own network, it should be considered remote. Remote origins can, and often do, change locations down the line.

Regardless of point of origin, your application has no way of knowing whether something unsavory happened to the data in transit, or whether the data itself was constructed or transmitted with malicious intent. It therefore needs to verify that the data isn’t tainted. Always.

The data itself could be anything: user input, a query string, a session value, an API feed, etc. Even $_SERVER variables are, in a sense, remote. In most cases, your application didn’t create the $_SERVER values itself, so it can’t assume their contents are untainted.

Same principle applies to database content. By default, logging into a database means accessing a system that has been altered since the last time you logged in. Which means it’s just a matter of time until that data is tainted.

I think of this as the inevitability of probability. Smarter minds than I can craft scoring rules based on this. All I know is: If something can happen, it will happen. That much is inevitable. If it can be tainted, it will be tainted.

In short: Trust nothing.

So, from a software perspective, all incoming data has to be validated by way of filtration. Any invalid data should be discarded, and not used by your application.

Basic Data Filtration

I’m going to use PHP in this example, but these concepts can, and should, be applied to any development stack.

Input filtering can be as simple as validating an e-mail address:

$valid['email'] = filter_input( INPUT_POST, 'email', FILTER_VALIDATE_EMAIL );

Or as complicated as validating remote addresses:

$ip_keys = [ 'HTTP_CLIENT_IP', 'HTTP_X_FORWARDED', 'HTTP_FORWARDED_FOR', 'HTTP_FORWARDED', 'REMOTE_ADDR' ];
foreach ( $ip_keys as $server_key ) {
	if ( array_key_exists( $server_key, $_SERVER ) ) {
		// Forwarding headers may carry a comma-separated list of addresses.
		foreach ( explode( ',', $_SERVER[ $server_key ] ) as $IP ) {
			// "1.2.3.4, 5.6.7.8" leaves a leading space after explode().
			$IP = trim( $IP );
			if ( filter_var( $IP, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4 | FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE ) ) {
				$valid['IP'] = $IP;
			}
		}
	}
}

In both cases:

  • Incoming data is validated against an approved whitelist
  • Filtration methods used to validate data are native to the language
  • Once filtered, output variables are easily distinguished from ostensibly tainted input variables

As such, data validation in this example complies with the input filtering design pattern.

Input Filtering Design Pattern

The input filtering design pattern is something I picked up from Chris Shiflett, author of Essential PHP Security. Its most important tenets are:

  • Only accept valid, whitelisted data (rather than trying to prevent blacklisted, invalid data).
  • Choose a naming convention that helps you distinguish valid data from unfiltered data.
  • Whenever possible, use the language’s built-in sanitization functions vs. trying to create your own.

(In the previous example, I used filter_input() and filter_var(). Available since PHP 5.2, these functions validate data against pre-defined filters, and are therefore preferred over custom methods.)
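
To give a feel for those pre-defined filters, here is a minimal sketch using filter_var() on a few values. The $input array is a stand-in for request data, and its field names (email, age, site) and the age range are made up for illustration.

```php
<?php
// Stand-in for request data so the snippet runs without a web server.
$input = [
	'email' => 'not-an-email',
	'age'   => '42',
	'site'  => 'https://example.com/',
];

$valid = [];

// filter_var() returns the filtered value on success and false on failure.
$email = filter_var( $input['email'], FILTER_VALIDATE_EMAIL );
if ( $email !== false ) {
	$valid['email'] = $email;
}

// Integers can be range-checked through the "options" array.
$age = filter_var( $input['age'], FILTER_VALIDATE_INT, [
	'options' => [ 'min_range' => 13, 'max_range' => 120 ],
] );
if ( $age !== false ) {
	$valid['age'] = $age; // Stored as an integer, not a string.
}

$site = filter_var( $input['site'], FILTER_VALIDATE_URL );
if ( $site !== false ) {
	$valid['site'] = $site;
}

// Only the values that passed their filter made it into $valid.
print_r( array_keys( $valid ) );
```

The invalid e-mail address never reaches $valid, so code further down only has to check whether the key exists.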

Regardless of what you’re sanitizing for, using the input filtering design pattern helps enforce security by forcing developers to adhere to a specific convention. In other words: Just by having a convention, a developer would have to be willfully malicious to create variables like the following without performing any sanitization:

$sanitized['email'] = @$_GET['email'];
$postgres['firstName'] = @$_POST['username'];

Therefore, the naming convention, itself, encourages best practices.

The naming convention also provides clarity if you’re submitting code for review. A reviewer can quickly scan through the code and check it against a predefined filtering convention, resulting in a more efficient workflow.

It also provides a faster onboarding process. If a script happens to be thousands of lines long, across several files, a new developer jumping in mid-stream can potentially save time/energy backtracing code that’s easily explained by the design pattern.

Examples of the Input Filtering Design Pattern

For example, let’s say a new user is creating a username. You may want to require alphanumeric values, only. Using the input filtering design pattern, this may look like:

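// Make sure the field exists and is a string before handing it to ctype_alnum().
if ( isset( $_POST['username'] ) && is_string( $_POST['username'] ) && ctype_alnum( $_POST['username'] ) ) {
	$valid['username'] = $_POST['username'];
}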

How we know this adheres to the design pattern:

  • Instead of trying to strip out any/all potentially bad characters, we check to see if the suggested username adheres to the standard we do accept.
  • By using PHP’s ctype_alnum() function, we employ the language’s built-in validation to populate a $valid array.
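
To see why validating beats stripping, here is a small sketch that runs both approaches on the same hostile value. The sample input and the stripping regular expression are made up for illustration.

```php
<?php
// A hostile "username" that is not purely alphanumeric.
$raw = '<b>admin</b>';

// Blacklist-style stripping: remove everything that is not a letter or digit.
$stripped = preg_replace( '/[^a-zA-Z0-9]/', '', $raw );
echo $stripped, PHP_EOL; // badminb

// Whitelist-style validation: accept the value only if it already conforms.
$valid = [];
if ( ctype_alnum( $raw ) ) {
	$valid['username'] = $raw;
}
var_dump( isset( $valid['username'] ) ); // bool(false)
```

Stripping quietly turns the input into a different username that the user never typed, while validation rejects it outright and leaves $valid empty.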

Now, let’s say we want to check to see if the username submitted is already taken. We’ll want to sanitize it first, before handing it off to the database:

$postgres = [];
// Pass the connection explicitly; omitting it is deprecated as of PHP 8.1.
$postgres['username'] = pg_escape_string( $database_connection, $valid['username'] );

Regardless of where the database lookup takes place – could be much later in the script, or in another file – we’ll know $postgres['username'] isn’t tainted:

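// pg_query() returns a result handle (or false on failure), so read the count from the first row.
$result = pg_query( $database_connection, "SELECT COUNT(ID) FROM users WHERE username = '" . $postgres['username'] . "'" );
$duplicate = ( $result !== false && (int) pg_fetch_result( $result, 0, 0 ) > 0 );
if ( $duplicate === true ) {
	// let the user know this $valid['username'] is already taken.
}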
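
As an alternative to escaping, and not something the pattern itself requires, the same lookup could be sketched with pg_query_params(), which sends the value separately from the SQL text. The helper name username_is_taken() is hypothetical, and the snippet assumes an open PostgreSQL connection plus the users table and username column used above.

```php
<?php
// Hypothetical helper; assumes $connection is an open PostgreSQL connection
// and that the "users" table and "username" column exist as in the article.
function username_is_taken( $connection, string $valid_username ): bool {
	// $1 is a placeholder; the value travels separately from the SQL text.
	$result = pg_query_params(
		$connection,
		'SELECT COUNT(*) FROM users WHERE username = $1',
		[ $valid_username ]
	);

	if ( $result === false ) {
		throw new RuntimeException( 'Username lookup failed.' );
	}

	// COUNT(*) comes back as a numeric string in row 0, column 0.
	return (int) pg_fetch_result( $result, 0, 0 ) > 0;
}
```

With this approach the function takes $valid['username'] directly, since no escaped copy is needed for the query.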

At this point, we’ll know whether $valid['username'] is already in use. If it isn’t, we may want to let the user know where we’re at:

$html = [];
$html['username'] = htmlentities( $valid['username'], ENT_QUOTES, 'UTF-8' );
echo 'Thanks for signing up, ' . $html['username'] . '!';

Working our way back up through this code, we know the following:

  • $html['username'] is sanitized for HTML output
  • $postgres['username'] is sanitized for use with PostgreSQL
  • $valid['username'] is an approved alphanumeric string

Easy to follow.
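
Pulling the stages together, a rough sketch of one field moving from input to output might look like the following. The greet_new_user() helper is hypothetical, the $request argument stands in for $_POST, and the database step is left out so the snippet runs on its own.

```php
<?php
// Hypothetical helper that walks one field through the pattern.
// $request stands in for $_POST so the snippet runs on its own.
function greet_new_user( array $request ): ?string {
	// Input stage: treat the raw value as tainted.
	$raw = $request['username'] ?? null;

	// Filter stage: accept only a non-empty alphanumeric string.
	$valid = [];
	if ( is_string( $raw ) && ctype_alnum( $raw ) ) {
		$valid['username'] = $raw;
	}
	if ( ! isset( $valid['username'] ) ) {
		return null; // Invalid data is discarded.
	}

	// Output stage: escape for the HTML context.
	$html = [];
	$html['username'] = htmlentities( $valid['username'], ENT_QUOTES, 'UTF-8' );

	return 'Thanks for signing up, ' . $html['username'] . '!';
}

echo greet_new_user( [ 'username' => 'alice42' ] ), PHP_EOL; // Thanks for signing up, alice42!
var_dump( greet_new_user( [ 'username' => '<script>' ] ) );  // NULL
```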

Templating

The input filtering design pattern provides an underlying convention to sanitize potentially malicious data, but that’s not all it’s useful for. It can also be used to prepare data from a templating perspective.

If you’re using an MVC framework, data’s likely being passed from the model to the view by way of the controller. The downside to sharing templates is that the same view may be asked to display similar data that’s in a completely different format. (Or, you may have to switch data sources one day. Or include a brand new data source that formats things totally differently. Things that once looked great now clash with design specs, or are inconsistent from page to page.)

While that’s inherently a data problem, the solution may not always be data oriented. The data source could be a 3rd party feed, or read-only.

For example, let’s say you start using a new feed, supplied by a third party, that gives you address info. The format you receive phone numbers in is different from how you normally display them, but you need to maintain the same formatting for consistency:

$template = [];
$template['phone'] = preg_replace( '~.*(\d{3})[^\d]{0,7}(\d{3})[^\d]{0,7}(\d{4}).*~', '$1-$2-$3', $data['phone'] );
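
One thing to watch: preg_replace() hands back the original string untouched when the pattern doesn’t match, so an unexpected value would pass straight through to the template. A possible variation, sketched below with a hypothetical format_phone() helper, reuses the same pattern with preg_match() and returns null when no phone number is found.

```php
<?php
// Hypothetical helper; reuses the pattern from the example above.
function format_phone( string $raw ): ?string {
	$pattern = '~.*(\d{3})[^\d]{0,7}(\d{3})[^\d]{0,7}(\d{4}).*~';

	// preg_match() returns 1 only when the pattern matched.
	if ( preg_match( $pattern, $raw, $matches ) !== 1 ) {
		return null; // No recognizable 10-digit number.
	}

	return $matches[1] . '-' . $matches[2] . '-' . $matches[3];
}

echo format_phone( '(555) 123-4567' ), PHP_EOL; // 555-123-4567
echo format_phone( '555.123.4567' ), PHP_EOL;   // 555-123-4567
var_dump( format_phone( 'n/a' ) );              // NULL
```

The template can then decide what to show when $template['phone'] is null, instead of printing whatever the feed sent.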

Or, let’s say the style guide changed, and you now have to display different months in a different format:

$template = [];
$timestamp = strtotime( $data['date'] );
// Abbreviate August through February; spell out the remaining months.
if ( in_array( date( "n", $timestamp ), [ 8, 9, 10, 11, 12, 1, 2 ] ) ) {
	$template['date'] = date( "M. j, Y", $timestamp );
} else {
	$template['date'] = date( "F j, Y", $timestamp );
}
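
If the same rule is needed in more than one place, it could be wrapped in a small function. The sketch below is one way to do that; the format_template_date() name is hypothetical, and it returns null when strtotime() cannot parse the input.

```php
<?php
// Hypothetical helper applying the month rule from the example above.
function format_template_date( string $raw ): ?string {
	$timestamp = strtotime( $raw );
	if ( $timestamp === false ) {
		return null; // Unparseable dates are discarded.
	}

	// Months that get the abbreviated format (August through February).
	$abbreviated_months = [ 8, 9, 10, 11, 12, 1, 2 ];
	$month              = (int) date( 'n', $timestamp );
	$format             = in_array( $month, $abbreviated_months, true ) ? 'M. j, Y' : 'F j, Y';

	return date( $format, $timestamp );
}

echo format_template_date( '2020-03-05' ), PHP_EOL; // March 5, 2020
echo format_template_date( '2020-09-15' ), PHP_EOL; // Sep. 15, 2020
```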

In this case, once inside the template file, a front-end developer will know that $template values have already been massaged to look pretty, regardless of what was done to them before they got there.

It’s not uncommon for designers or front-end developers to only have access to HTML templates. Having business logic in the HTML layer can obfuscate the presentation. While there can be some debate about this, doing so can introduce potential bugs if/when HTML elements need to move around. At the end of the day, designers don’t need to worry about regular expressions or complicated if statements.

Conclusion

We’ve seen how using the input filtering design pattern is an easy way to enforce security-conscious programming. While that’s crucial for any developer (a future exploit is only as strong as your current vulnerabilities), my main goal here is to underscore the mindset behind the pattern, itself, which can be applied to anything. Somebody comes to you with something. You transform it in a meaningful way, and return it. That’s the Internet in a nutshell. That’s business. That’s life.

So, it’s not just useful from a security perspective. Approaching anything in your professional or personal life from this perspective can help provide clarity in your actions, and help others along the way.