Fetching The URL Path Within A Perl Script
2012-05-07
Introduction
When developing web applications, it is possible to use an Apache mod_rewrite rule, or similar technique, to have multiple URLs all use the same script. For example, an application that displays a photo slideshow might want all the following URLs to run the same Perl program and dynamically generate each page:
http://www.dispersiondesign.com/slideshow/2/
http://www.dispersiondesign.com/slideshow/3/
http://www.dispersiondesign.com/slideshow/4/
http://www.dispersiondesign.com/slideshow/5/
Within your Perl CGI script, you then need to know what URL was requested by the user, so that the appropriate page can be generated. This article demonstrates the Perl code necessary to fetch and decode the requested URL.
Perl Environment Variables
The Apache web server passes relevant information to a Perl script through the use of environment variables. These variables can be accessed through the %ENV hash.
There are two environment variables that are of interest to us:
Environment Variable | Description |
---|---|
REQUEST_METHOD | The HTTP method that was used for the request, typically 'GET' or 'POST' (other methods such as 'HEAD' are also possible). |
REQUEST_URI | Contains the URL path and query string as requested by the user. e.g.: /slideshow/4/?lang=en |
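To see what these two variables contain for a given request, a small diagnostic helper can be useful. The following is a minimal sketch: the function name describe_request is hypothetical, and it takes the environment as a hash reference so that it can also be tried outside a web server.

```perl
use strict;
use warnings;

# Hypothetical helper: summarize the two variables of interest.
# Takes a hash reference so it can be tried without a web server.
sub describe_request {
    my ($env) = @_;
    my $method = defined $env->{'REQUEST_METHOD'} ? $env->{'REQUEST_METHOD'} : '(not set)';
    my $uri    = defined $env->{'REQUEST_URI'}    ? $env->{'REQUEST_URI'}    : '(not set)';
    return "method=$method uri=$uri";
}

# Under CGI, pass the real environment
print describe_request(\%ENV), "\n";

# Simulated request, for illustration only
print describe_request({
    REQUEST_METHOD => 'GET',
    REQUEST_URI    => '/slideshow/4/?lang=en',
}), "\n";
```

When run from the command line, the first line of output will normally report both variables as not set, which is the case the next section deals with.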
Fetching the URL
First we will fetch the REQUEST_URI environment variable. If the script is run from the command line, the REQUEST_URI variable will not be set, so we will use the first command line argument ($ARGV[0]) instead:
my $path = '';
if ($ENV{'REQUEST_METHOD'}) {
    $path = $ENV{'REQUEST_URI'};
}
elsif ($ARGV[0]) {
    $path = $ARGV[0];
}
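The fallback logic can be exercised without a web server if the environment and argument list are passed in as parameters. This is only a sketch of that idea: request_path is a hypothetical name, and unlike the snippet above it returns an empty string when REQUEST_URI is missing.

```perl
use strict;
use warnings;

# Hypothetical variant that takes the environment (hash reference) and
# the argument list (array reference) instead of reading %ENV and @ARGV.
sub request_path {
    my ($env, $args) = @_;
    if ($env->{'REQUEST_METHOD'}) {
        # Running under the web server
        return defined $env->{'REQUEST_URI'} ? $env->{'REQUEST_URI'} : '';
    }
    # Running from the command line
    return defined $args->[0] ? $args->[0] : '';
}

print request_path({ REQUEST_METHOD => 'GET', REQUEST_URI => '/slideshow/2/' }, []), "\n";
print request_path({}, ['/slideshow/3/']), "\n";
```

In a real script the call would be request_path(\%ENV, \@ARGV).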
Parsing the URL
Now we will remove the query string from the end of the path, if any query string is present:
# Remove query string
$path =~ s/\?.*//s;
It might be desirable to remove consecutive slashes, as well as slashes at the beginning and end of the path:
# Remove excess slashes
$path =~ s/\/\/+/\//g;
# Remove leading and trailing slash
$path =~ s/(^\/)|(\/$)//g;
This allows us to then split the path into segments:
my @split_path = split(/\//, $path);
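As a worked example, the clean-up steps and the split can be applied to a deliberately untidy path. The helper name split_path and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the clean-up steps and the split
sub split_path {
    my ($path) = @_;
    $path =~ s/\?.*//s;           # remove query string
    $path =~ s/\/\/+/\//g;        # collapse repeated slashes
    $path =~ s/(^\/)|(\/$)//g;    # strip leading and trailing slash
    return split(/\//, $path);
}

# '//slideshow///4/?lang=en' -> '//slideshow///4/' -> '/slideshow/4/'
#                            -> 'slideshow/4'      -> ('slideshow', '4')
my @parts = split_path('//slideshow///4/?lang=en');
print join(',', @parts), "\n";    # prints "slideshow,4"
```

Note that the split happens before any percent-decoding, so an encoded slash (%2F) inside a segment is not mistaken for a path separator.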
Handling Non-ASCII paths
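The path needs to be manipulated in several ways to reverse the encoding applied to URLs. Firstly, any pluses (+) can be converted back to spaces. Note that this is a convention of HTML form encoding (application/x-www-form-urlencoded); RFC 3986 treats a + in the path as a literal plus sign, so omit this step if your paths may legitimately contain pluses.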
$segment =~ s/\+/ /g;
Secondly, reserved URL characters and non-ASCII bytes will be 'percent-encoded': each byte is written as a percent sign followed by two hexadecimal digits. We need to find any percent-encoded bytes and convert them back to the bytes they represent.
$segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
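A short sketch shows what this substitution produces. The helper name percent_decode and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: decode percent-encoded bytes in one segment
sub percent_decode {
    my ($segment) = @_;
    $segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    return $segment;
}

print percent_decode('hello%20world'), "\n";        # prints "hello world"

# 'é' arrives as the two UTF-8 bytes C3 A9, so the decoded
# string is 5 bytes long even though it reads as 4 characters
print length(percent_decode('caf%C3%A9')), "\n";    # prints 5
```

The second result is why a further decoding step is needed: at this point the string still holds raw bytes rather than characters.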
After the percent encoding is decoded, we need to decide what character encoding to use when interpreting any non-ASCII characters. We will assume that the path was encoded as a UTF-8 string prior to percent-encoding, so we will convert the UTF-8 string back to a Perl Unicode string using the Encode::decode_utf8() function:
use Encode;
$segment = Encode::decode_utf8($segment);
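The effect of this step is easiest to see by comparing lengths before and after. The sample bytes below are an assumption: they stand for 'café' as it would look after percent-decoding.

```perl
use strict;
use warnings;
use Encode;

# Raw bytes as they would look after percent-decoding 'caf%C3%A9'
my $bytes = "caf\xC3\xA9";

# Interpret the bytes as UTF-8 to obtain a Perl character string
my $chars = Encode::decode_utf8($bytes);

print length($bytes), "\n";    # prints 5 (bytes)
print length($chars), "\n";    # prints 4 (characters)
print(($chars eq "caf\x{E9}") ? "match\n" : "no match\n");    # prints "match"
```

Remember that a character string like this needs to be encoded again (for example as UTF-8) before it is printed to the browser.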
These steps need to be performed on each segment of the path, so we wrap all the steps in a foreach loop:
use Encode;

my @segments;
foreach my $segment (@split_path) {
    # Convert '+' to ' '
    $segment =~ s/\+/ /g;
    # Decode percent-encoding
    $segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    # Decode UTF-8 encoding
    $segment = Encode::decode_utf8($segment);
    push(@segments, $segment);
}
Putting It Together
Placing all this into a single function, we get:
use Encode;

sub fetch_path {
    my $path = '';
    if ($ENV{'REQUEST_METHOD'}) {
        $path = $ENV{'REQUEST_URI'};
    }
    elsif ($ARGV[0]) {
        $path = $ARGV[0];
    }

    # Remove query string
    $path =~ s/\?.*//s;
    # Remove excess slashes
    $path =~ s/\/\/+/\//g;
    # Remove leading and trailing slash
    $path =~ s/(^\/)|(\/$)//g;

    my @segments;
    foreach my $segment (split(/\//, $path)) {
        # Convert '+' to ' '
        $segment =~ s/\+/ /g;
        # Decode percent-encoding
        $segment =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
        # Decode UTF-8 encoding
        $segment = Encode::decode_utf8($segment);
        push(@segments, $segment);
    }

    return \@segments;
}
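For the slideshow example from the introduction, the returned array reference might be used along the following lines. This is only a sketch: slide_number is a hypothetical helper, and sample segments are supplied directly instead of calling fetch_path() so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the slide number from decoded segments,
# e.g. ['slideshow', '4'] for /slideshow/4/.
# Returns undef (in scalar context) if the path is not a slideshow URL.
sub slide_number {
    my ($segments) = @_;
    return unless @$segments == 2;
    return unless $segments->[0] eq 'slideshow';
    return unless $segments->[1] =~ /\A[0-9]+\z/;
    return $segments->[1] + 0;
}

# In the CGI script this would be: my $segments = fetch_path();
my $segments = ['slideshow', '4'];
my $slide = slide_number($segments);
print defined $slide ? "slide $slide\n" : "not found\n";    # prints "slide 4"
```

Validating the segment as digits before using it keeps unexpected input, such as /slideshow/abc/, from reaching the page-generation code.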