Fetching The URL Path Within A Perl Script

2012-05-07

Introduction

When developing web applications, it is common to use an Apache mod_rewrite rule, or a similar technique, to route multiple URLs to the same script. For example, an application that displays a photo slideshow might want all of the following URLs to run the same Perl program and dynamically generate each page:

http://www.dispersiondesign.com/slideshow/1/
http://www.dispersiondesign.com/slideshow/2/
http://www.dispersiondesign.com/slideshow/3/
http://www.dispersiondesign.com/slideshow/4/
http://www.dispersiondesign.com/slideshow/5/

Within your Perl CGI script, you then need to know what URL was requested by the user, so that the appropriate page can be generated. This article demonstrates the Perl code necessary to fetch and decode the requested URL.

Perl Environment Variables

The Apache web server passes information about each request to a Perl CGI script through environment variables. These variables can be accessed through the %ENV hash.

There are two environment variables that are of interest to us:

Environment Variable	Description
REQUEST_METHOD	The HTTP method that was used for the request. Usually 'GET' or 'POST', although other methods such as 'HEAD' are possible.
REQUEST_URI	Contains the URL path and query string as requested by the user, for example: `/slideshow/4/?lang=en`

Fetching the URL

First we will fetch the REQUEST_URI environment variable. If the script is run from the command line, neither variable will be set, so we test REQUEST_METHOD to detect a web request and otherwise fall back to the first command line argument ($ARGV[0]):

my $path = '';
if ($ENV{'REQUEST_METHOD'})
{
	$path = $ENV{'REQUEST_URI'};
}
elsif ($ARGV[0])
{
	$path = $ARGV[0];
}
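
For testing, it can help to move the same decision into a small function that receives the environment and argument list as parameters. The following is only a sketch: the name raw_request_path is hypothetical, and unlike the snippet above it uses defined() rather than truthiness, so a command line argument of '0' is accepted.

```perl
use strict;
use warnings;

# Hypothetical helper: same decision as above, but the environment and
# argument list are passed in so the logic can be tested in isolation.
sub raw_request_path
{
	my ($env, $argv) = @_;

	# Running under a web server: use the requested URI
	if ($env->{'REQUEST_METHOD'})
	{
		return defined $env->{'REQUEST_URI'} ? $env->{'REQUEST_URI'} : '';
	}

	# Running from the command line: use the first argument, if any
	return defined $argv->[0] ? $argv->[0] : '';
}

my $path = raw_request_path(\%ENV, \@ARGV);
```

Passing a plain hash reference instead of \%ENV lets you simulate a web request without setting real environment variables.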

Parsing the URL

Now we will remove the query string from the end of the path, if any query string is present:

# Remove query string
$path =~ s/\?.*//s;
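
If the query string is needed later, an alternative is to split the URI at the first question mark rather than discarding everything after it. This is a minimal sketch; split_uri is a hypothetical name, not something the rest of the article relies on.

```perl
use strict;
use warnings;

# Split a request URI at the first '?' into a path and a query string.
# Either part is returned as an empty string if it is absent.
sub split_uri
{
	my ($uri) = @_;

	# A limit of 2 keeps any later '?' characters inside the query string
	my ($path, $query) = split(/\?/, $uri, 2);
	$path  = '' unless defined $path;
	$query = '' unless defined $query;

	return ($path, $query);
}

my ($path, $query) = split_uri('/slideshow/4/?lang=en');
# $path is '/slideshow/4/' and $query is 'lang=en'
```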

It might be desirable to remove consecutive slashes, as well as slashes at the beginning and end of the path:

# Remove excess slashes
$path =~ s/\/\/+/\//g;

# Remove leading and trailing slash
$path =~ s/(^\/)|(\/$)//g;

This allows us to then split the path into segments:

my @split_path = split(/\//, $path);
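
As a worked example, the three steps above can be combined and run against a few inputs. The helper name path_segments is hypothetical, and the substitutions use s{...}{...} delimiters purely to avoid escaping the slashes; they are intended to behave the same as the expressions shown above.

```perl
use strict;
use warnings;

# Hypothetical helper combining the clean-up and split steps.
sub path_segments
{
	my ($path) = @_;

	$path =~ s{//+}{/}g;     # collapse runs of slashes into one
	$path =~ s{^/|/$}{}g;    # strip the leading and trailing slash

	return split(m{/}, $path);
}

my @normal = path_segments('/slideshow//4/');   # ('slideshow', '4')
my @root   = path_segments('/');                # empty list
```

Note that a request for the site root produces an empty list, so code that uses the segments should check how many were returned.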

Handling Non-ASCII paths

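The path needs to be manipulated in several ways to reverse the encoding applied to URLs. Firstly, any pluses (+) are converted back to spaces. Note that the plus-for-space convention comes from HTML form encoding and strictly applies only to query strings; a literal plus sign is valid in a path, so skip this step if your paths may contain real plus signs.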

$segment =~ s/\+/ /g;

Secondly, reserved URL characters and any non-ASCII bytes will be 'percent-encoded'. We need to find any percent-encoded bytes and convert them back to their non-encoded equivalent.

$segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
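
To see what this substitution does, it can be wrapped in a small function and applied to a few strings. The function name percent_decode is an assumption made for this sketch. The result is a string of raw bytes, not yet a string of Unicode characters.

```perl
use strict;
use warnings;

# Decode percent-encoded bytes. Malformed sequences such as '%zz'
# do not match the pattern and are left untouched.
sub percent_decode
{
	my ($text) = @_;
	$text =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
	return $text;
}

my $bytes = percent_decode('caf%C3%A9');

# 'c', 'a', 'f' plus the two bytes 0xC3 and 0xA9
printf "%d bytes\n", length($bytes);   # prints "5 bytes"
```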

After the percent encoding is decoded, we need to decide what character encoding to use when interpreting any non-ASCII characters. We will assume that the path was encoded as a UTF-8 string prior to percent-encoding, so we will convert the UTF-8 string back to a Perl Unicode string using the Encode::decode_utf8() function:

use Encode;
$segment = Encode::decode_utf8($segment);
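
The effect of this step is easiest to see by comparing lengths before and after decoding. The sketch below also shows one possible way to reject malformed input, using Encode::decode() with the FB_CROAK check; the helper name decode_strict is hypothetical, and whether to reject or tolerate bad bytes is a design choice this article does not prescribe.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "caf\xC3\xA9";                  # 5 bytes, as left by percent-decoding
my $text  = Encode::decode_utf8($bytes);    # 4 characters, the last being U+00E9

printf "%d bytes, %d characters\n", length($bytes), length($text);

# Hypothetical stricter variant: returns undef if the bytes are not
# valid UTF-8, instead of substituting replacement characters.
sub decode_strict
{
	my ($octets) = @_;
	my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
	return scalar eval { Encode::decode('UTF-8', $octets, $check) };
}
```

By default, decode_utf8() replaces invalid byte sequences with the Unicode replacement character, which is tolerant but can hide bad input.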

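These steps need to be performed on each segment after the path has been split, so that an encoded slash (%2F) inside a segment is not mistaken for a path separator. We wrap all the steps in a foreach loop: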

use Encode;

my @segments;

foreach my $segment (@split_path)
{
	# Convert '+' to ' '
	$segment =~ s/\+/ /g;

	# Decode percent-encoding
	$segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;

	# Decode UTF-8 encoding
	$segment = Encode::decode_utf8($segment);

	push(@segments, $segment);
}

Putting It Together

Placing all this into a single function, we get:

use Encode;

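# Returns a reference to an array of decoded path segments,
# e.g. ['slideshow', '4'] for a request to /slideshow/4/?lang=en
sub fetch_path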
{
	my $path = '';
	if ($ENV{'REQUEST_METHOD'})
	{
		$path = $ENV{'REQUEST_URI'};
	}
	elsif ($ARGV[0])
	{
		$path = $ARGV[0];
	}

	# Remove query string
	$path =~ s/\?.*//s;

	# Remove excess slashes
	$path =~ s/\/\/+/\//g;

	# Remove leading and trailing slash
	$path =~ s/(^\/)|(\/$)//g;

	my @segments;

	foreach my $segment (split(/\//, $path))
	{
		# Convert '+' to ' '
		$segment =~ s/\+/ /g;

		# Decode percent-encoding
		$segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;

		# Decode UTF-8 encoding
		$segment = Encode::decode_utf8($segment);

		push(@segments, $segment);
	}

	return \@segments;
}
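
Once the segments are available, the script still has to decide which page to generate. The following sketch shows one way to do that for the slideshow URLs from the introduction. The function name slide_number and its validation rules are assumptions for illustration; it takes the array reference returned by fetch_path() as its argument.

```perl
use strict;
use warnings;

# Hypothetical router: expects segments such as ['slideshow', '4'] and
# returns the slide number, or undef if the path is not a slideshow page.
sub slide_number
{
	my ($segments) = @_;

	return undef unless @$segments == 2;
	return undef unless $segments->[0] eq 'slideshow';

	# Accept only positive integers without leading zeros
	return undef unless $segments->[1] =~ /\A[1-9][0-9]*\z/;

	return $segments->[1];
}

my $slide = slide_number(['slideshow', '4']);
print(defined($slide) ? "Showing slide $slide\n" : "Not found\n");
```

In a real script the argument would be the result of fetch_path(), and an undefined result would normally lead to a 404 response.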