Fetching a Query String in Perl

2012-05-25

Introduction

When developing web applications in Perl, it is common to use the CGI module to decode any query string variables that are passed from the browser to the script. There are times, however, when you may wish to avoid using the CGI module. As it happens, decoding variables from a query string is quite simple to do. This article explains how to correctly fetch query string key/value pairs.

Limitations of the CGI Module

There are several reasons why you might want to decode your own query string variables in your script and avoid using the CGI module:

1. The CGI module is bloated

The CGI module is fairly large and is overkill in many scripts.

2. The CGI module does not handle Unicode correctly

By default, the CGI module makes no attempt to interpret the character set that was used by the browser to encode the text. If you are using Unicode and UTF-8 encoding consistently (and you should be), then you will have to remember to use the decode_utf8() function on any data you get from the CGI::param() function. Instead, let's have our script interpret the data as UTF-8 automatically.

3. The CGI module does not differentiate between query string variables and POST variables

Sometimes it is useful to be able to use the content of a query string variable before processing the POST data. For example, you might want to check a verification variable, ensuring that the user has the appropriate permission to send POST data to the script. The CGI module does not differentiate between query string variables and POST variables and decodes both at the same time.

Fetching the Query String

The query string, containing our key/value pairs, is made available to a Perl script through the QUERY_STRING environment variable. We can get the value of this variable in the following way:

my $query_string = $ENV{'QUERY_STRING'};

If the script is run from the command line instead of as a CGI script, it might be useful to pass variables as a command line argument. We can create a special case for this:

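my $query_string = '';
if ($ENV{'REQUEST_METHOD'})
{
	# Running as a CGI script. QUERY_STRING may be unset, so fall back
	# to an empty string (the // operator needs Perl 5.10 or later).
	$query_string = $ENV{'QUERY_STRING'} // '';
}
elsif (defined $ARGV[0])
{
	# Running from the command line: use the first argument.
	$query_string = $ARGV[0];
}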

Splitting Up The Pairs

The query string will normally use the following format:

key1=value1&key2=value2&key3=value3

However, there is another, less used, format that we must also be able to handle:

key1=value1;key2=value2;key3=value3

The Perl split function will easily split the query string into separate key/value pairs:

my @pairs = split(/[&;]/, $query_string);
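As a quick check, the following minimal sketch runs the same split on a made-up query string that mixes both separators. The sample string is an assumption chosen for illustration only.

```perl
use strict;
use warnings;

# Made-up query string that uses both separators.
my $query_string = 'key1=value1&key2=value2;key3=value3';

# The character class [&;] matches either separator.
my @pairs = split(/[&;]/, $query_string);

# Print one pair per line.
print "$_\n" for @pairs;
```

Note that split keeps empty fields in the middle of the string, so a query string such as a=1&&b=2 produces an empty pair that has to be skipped later.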

Splitting Pairs Into Keys and Values

We now have an array of strings that look like this:

key1=value1
key2=value2
key3=value3

The next task is to step through the array and split up the key/value pairs:

foreach(@pairs)
{
	my($key, $value) = split(/=/, $_, 2);
	# more processing needed here...
}
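The limit of 2 passed to split matters when a value itself contains an equals sign, which can happen with hand-written URLs. The sketch below compares the limited and unlimited forms; the sample pair is a hypothetical value made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical pair whose value contains a raw '=' character.
my $pair = 'token=YWJj=ZGVm';

# With a limit of 2, split stops after the first '='.
my ($key, $value) = split(/=/, $pair, 2);

# Without a limit, the value is broken apart at every '='.
my @unlimited = split(/=/, $pair);

print "key:   $key\n";
print "value: $value\n";
print "fields without a limit: ", scalar(@unlimited), "\n";
```

With the limit in place, everything after the first equals sign stays together as the value.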

At this point, we have the raw key and value strings that need a little more processing.

Decoding Key/Value Strings

Each key and value is URL-encoded in the way browsers encode form data (the application/x-www-form-urlencoded format). Several steps need to be performed to decode the strings.

Firstly, any pluses (+) should be converted back to spaces.

$key =~ tr/+/ /;
$value =~ tr/+/ /;

Secondly, any 'percent-encoded' bytes need to be converted back to their non-encoded equivalent.

$key =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
$value =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
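The order of these two steps matters: a literal plus sign arrives as %2B, so pluses must be converted to spaces before the percent-encoded bytes are decoded. The following sketch applies both orders to a made-up encoded value to show the difference; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Made-up encoded form of the text "a+b c".
my $encoded = 'a%2Bb+c';

# Correct order: pluses first, then percent-encoded bytes.
(my $right = $encoded) =~ tr/+/ /;
$right =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;

# Reversed order: the plus decoded from %2B is then turned into a space.
(my $wrong = $encoded) =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
$wrong =~ tr/+/ /;

print "correct order:  '$right'\n";
print "reversed order: '$wrong'\n";
```

Only the first order recovers the original text with its plus sign intact.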

Now it is time to interpret the character set of the resulting strings. We will assume that it is encoded as UTF-8, and will decode it as such:

use Encode;
$key = Encode::decode_utf8($key);
$value = Encode::decode_utf8($value);
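To see what decode_utf8() changes, the sketch below measures a sample value before and after decoding. The sample is an assumption: the word "cafe" with an acute accent on the final letter, whose accented character takes two bytes in UTF-8.

```perl
use strict;
use warnings;
use Encode;

# Percent-encoded UTF-8 form of "caf\x{E9}" (cafe with e-acute).
my $value = 'caf%C3%A9';
$value =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;

# Before decoding, the string holds raw bytes: c, a, f, 0xC3, 0xA9.
my $byte_count = length($value);

# After decoding, the two bytes 0xC3 0xA9 become one character, U+00E9.
$value = Encode::decode_utf8($value);
my $char_count = length($value);

print "length before decoding: $byte_count\n";
print "length after decoding:  $char_count\n";
```

Without the decoding step, functions such as length and regular expressions would treat each byte of a multi-byte character separately.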

We can then put the resulting values into an associative array (a hash):

$param{$key} = $value;

Putting It Together

Putting all these steps into a single function results in:

use Encode;

sub fetch_cgi_variables
{
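	my $query_string = '';
	if ($ENV{'REQUEST_METHOD'})
	{
		# Running as a CGI script. QUERY_STRING may be unset, so fall back
		# to an empty string (the // operator needs Perl 5.10 or later).
		$query_string = $ENV{'QUERY_STRING'} // '';
	}
	elsif (defined $ARGV[0])
	{
		# Running from the command line: use the first argument.
		$query_string = $ARGV[0];
	}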

	my %param;
	my @pairs = split(/[&;]/, $query_string);

	foreach (@pairs)
	{
		my($key, $value) = split(/=/, $_, 2);

		next if !defined $key;

		$key =~ tr/+/ /;
		$key =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
		$key = Encode::decode_utf8($key);

		next if ($key eq '');

		if (defined $value)
		{
			$value =~ tr/+/ /;
			$value =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
			$value = Encode::decode_utf8($value);
		}

		$param{$key} = $value;
	}
	return \%param;
}

A careful reader will notice a few additional statements in the finished code: they skip pairs whose key is undefined or empty, and they avoid warnings when a value is undefined.
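These guards can be exercised without a web server. The sketch below uses parse_query, a hypothetical variant of fetch_cgi_variables that takes the query string as an argument instead of reading the environment. Both the helper and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical variant of fetch_cgi_variables that takes the query
# string as an argument so it can be run from anywhere.
sub parse_query
{
	my ($query_string) = @_;
	my %param;

	foreach (split(/[&;]/, $query_string))
	{
		my ($key, $value) = split(/=/, $_, 2);
		next if !defined $key;      # empty pair, as in "a=1&&b=2"

		for ($key, $value)
		{
			next if !defined $_;    # a bare "flag" has no value part
			tr/+/ /;
			s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
			$_ = Encode::decode_utf8($_);
		}

		next if $key eq '';         # pair with no key, as in "=orphan"
		$param{$key} = $value;
	}
	return \%param;
}

my $param = parse_query('a=1&&flag&=orphan&b=two+words');

foreach my $key (sort keys %$param)
{
	my $shown = defined $param->{$key} ? "'$param->{$key}'" : 'undef';
	print "$key => $shown\n";
}
```

A key that appears without a value, such as flag here, is stored with an undefined value, so use exists rather than defined to test whether it was sent.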

Understanding Character Sets for Form Data

Unfortunately, browsers do not always encode form data as UTF-8. Instead, they will encode the form data using the same character set as the document that contains the form. In other words, if your HTML is encoded as UTF-8 (and specifically indicates this), then your form data will also be encoded as UTF-8.

The moral of this story is: USE UTF-8 CONSISTENTLY. Make sure your HTML pages are encoded using UTF-8 and make sure that your server indicates this in the HTTP header:

Content-Type: text/html; charset=UTF-8

Additionally, it is a good idea to set the character set in the HTML head using the http-equiv meta tag:

<meta
	http-equiv="Content-Type"
	content="text/html; charset=UTF-8"
/>

Following these steps will ensure that data does not get garbled due to incorrect character sets when transmitting from the browser to the script.
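When the page containing the form is itself generated by a Perl CGI script, the script can send both declarations. The following is a minimal sketch under that assumption. build_response is a hypothetical helper, and the page content is made up.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: returns a complete CGI response as UTF-8 bytes.
# The body text is inserted as-is, so it must already be safe HTML.
sub build_response
{
	my ($body_text) = @_;

	my $html = <<"HTML";
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Example</title>
</head>
<body>
<p>$body_text</p>
</body>
</html>
HTML

	# A blank line separates the HTTP header from the document.
	# The document is encoded to UTF-8 bytes to match the declared charset.
	return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
		. Encode::encode_utf8($html);
}

print build_response("caf\x{E9}");
```

Encoding the output explicitly keeps the bytes that are sent consistent with the character set declared in the header and the meta tag.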