Dispersion Design

Building a URI Regular Expression

2012-05-21

Introduction

Uniform Resource Locators, or URLs, are a type of Uniform Resource Identifier (URI). When solving programming problems, it may be useful to build a regular expression that will match all URIs within a string.

This article will show how to build a regular expression, consistent with the URI specification, that matches URIs.

URI Specification

The syntax of a URI is defined in RFC 3986 using ABNF (Augmented Backus-Naur Form), so building the expression is largely a matter of translating each ABNF rule into regular expression syntax.

ABNF to Regular Expressions

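The following sections show the ABNF definition for each rule of the URI specification, its equivalent regular expression, and that expression written as a Perl string. In the regular expressions, {name} is a placeholder for the expression of the rule with that name.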

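I have made one simplification: the host is limited to registered names (reg-name). IP literals in square brackets, such as IPv6 addresses, are therefore not matched. Dotted-decimal IPv4 addresses still match, because every character they contain is allowed in reg-name.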

URI

ABNF
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
Regex
{scheme}:{hier_part}(?:\?{query})?(?:#{fragment})?
Perl
$uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";

Hierarchical Part

ABNF
hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty
Regex
(?://{authority}{path_abempty}|{path_absolute}|{path_rootless})?
The trailing ? makes the whole group optional, which covers path-empty
Perl
$hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";

URI Scheme

ABNF
scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
Regex
[a-zA-Z][a-zA-Z0-9+\-.]*
Perl
$scheme = '[a-zA-Z][a-zA-Z0-9+\-.]*';

Naming Authority

ABNF
authority     = [ userinfo "@" ] host [ ":" port ]
Regex
(?:{userinfo}@)?{host}(?::{port})?
Perl
$authority = "(?:${userinfo}\@)?${host}(?::${port})?";

User Information

ABNF
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
Regex
(?:{unreserved}|{pct_encoded}|{sub_delims}|:)*
Perl
$userinfo = "(?:${unreserved}|${pct_encoded}|${sub_delims}|:)*";

Host

ABNF
host          = IP-literal / IPv4address / reg-name
Regex
{reg_name}
Modified to only allow registered names
Perl
$host = $reg_name;

Port Number

ABNF
port          = *DIGIT
Regex
[0-9]*
Perl
$port = '[0-9]*';

Registered Name

ABNF
reg-name      = *( unreserved / pct-encoded / sub-delims )
Regex
(?:{unreserved}|{pct_encoded}|{sub_delims})*
Perl
$reg_name = "(?:${unreserved}|${pct_encoded}|${sub_delims})*";

Path Absolute or Empty

ABNF
path-abempty  = *( "/" segment )
Regex
(?:/{segment})*
Perl
$path_abempty = "(?:/${segment})*";

Path Absolute

ABNF
path-absolute = "/" [ segment-nz *( "/" segment ) ]
Regex
/(?:{segment_nz}(?:/{segment})*)?
Perl
$path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";

Path Beginning with Segment

ABNF
path-rootless = segment-nz *( "/" segment )
Regex
{segment_nz}(?:/{segment})*
Perl
$path_rootless = "${segment_nz}(?:/${segment})*";

Path Empty

ABNF
path-empty    = 0<pchar>
Regex
No regular expression is needed for this rule: it matches the empty string, which the optional group in hier-part already allows

Segment

ABNF
segment       = *pchar
Regex
{pchar}*
Perl
$segment = "${pchar}*";

Segment, Non-Zero Length

ABNF
segment-nz    = 1*pchar
Regex
{pchar}+
Perl
$segment_nz = "${pchar}+";

Segment, Non-Zero Length, No Colon

ABNF
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
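Regex
(?:{unreserved}|{pct_encoded}|{sub_delims}|@)+
This rule is only used by relative references (path-noscheme), so the complete URI expression does not depend on it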
Perl
$segment_nz_nc = "(?:${unreserved}|${pct_encoded}|${sub_delims}|\@)+";

Path Characters

ABNF
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
Regex
(?:{unreserved}|{pct_encoded}|{sub_delims}|[:@])
Perl
$pchar = "(?:${unreserved}|${pct_encoded}|${sub_delims}|[:\@])";

Query Component

ABNF
query         = *( pchar / "/" / "?" )
Regex
(?:{pchar}|[/?])*
Perl
$query = "(?:${pchar}|[/?])*";

Fragment Component

ABNF
fragment      = *( pchar / "/" / "?" )
Regex
(?:{pchar}|[/?])*
Perl
$fragment = "(?:${pchar}|[/?])*";

Percent-Encoded

ABNF
pct-encoded   = "%" HEXDIG HEXDIG
Regex
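%[0-9A-F]{2}
Matches uppercase hex digits only; RFC 3986 treats lowercase a-f as equivalent, so use [0-9A-Fa-f] if lowercase escapes must also match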
Perl
$pct_encoded = '%[0-9A-F]{2}';

Unreserved Characters

ABNF
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
Regex
[a-zA-Z0-9\-._~]
Perl
$unreserved = '[a-zA-Z0-9\-._~]';

Subcomponent Delimiters

ABNF
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="
Regex
[!$&'()*+,;=]
Perl
$sub_delims = '[!$&\'()*+,;=]';

Simplifications

Several simplifications shorten the regex and can improve its performance. For example, with its sub-rules expanded, the definition for pchar is:

(?:
	[a-zA-Z0-9\-._~]		# unreserved
|
	%[0-9A-F]{2}			# pct-encoded
|
	[!$&'()*+,;=]		# sub-delims
|
	[:@]				# ':' | '@'
)

Three of the four alternatives are single-character classes, so they can be merged into one class:

(?:
	[a-zA-Z0-9\-._~!$&'()*+,;=:@]
|
	%[0-9A-F]{2}
)
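As a quick sanity check of this simplification, the two forms of pchar can be compared on a set of candidate strings. The sketch below is only illustrative: the variable names and the list of candidates (every ASCII character plus a few percent-encoded triplets) are assumptions, not part of the original derivation.

```perl
use strict;
use warnings;

# The two forms of pchar, kept in single-quoted strings so that Perl does
# not interpolate "$&" or "@" inside the character classes.
my $pchar_long  = '(?:[a-zA-Z0-9\-._~]|%[0-9A-F]{2}|[!$&\'()*+,;=]|[:@])';
my $pchar_short = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-F]{2})';

# Candidates: each ASCII character, plus valid and invalid percent-encodings
my @candidates = ((map { chr($_) } 0 .. 127), '%20', '%7E', '%7e', '%G1', '%2');

# Collect every candidate on which the two patterns disagree
my @mismatches = grep {
    my $long  = /\A$pchar_long\z/  ? 1 : 0;
    my $short = /\A$pchar_short\z/ ? 1 : 0;
    $long != $short;
} @candidates;

my $verdict = @mismatches ? "patterns differ" : "patterns agree on all candidates";
print "$verdict\n";
```

The same approach can be used to check the other merged character classes before they are substituted into the full expression.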

Perl Function

All the expressions can be put into a Perl function to assemble the complete regular expression:

sub build_uri_regex
{
	my $pchar_char  = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@]';
	my $f_q_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]';
	my $seg_nc_char = '[a-zA-Z0-9\-._~!$&\'()*+,;=@]';
	my $reg_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=]';
	my $user_char   = '[a-zA-Z0-9\-._~!$&\'()*+,;=:]';

	my $pct_encoded = '%[0-9A-F]{2}';

	my $pchar = "(?:${pchar_char}|${pct_encoded})";

	my $fragment = "(?:${f_q_char}|${pct_encoded})*";
	my $query = "(?:${f_q_char}|${pct_encoded})*";

	my $segment = "${pchar}*";
	my $segment_nz = "${pchar}+";
	my $segment_nz_nc = "(?:${seg_nc_char}|${pct_encoded})+";

	my $path_abempty = "(?:/${segment})*";
	my $path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
	my $path_rootless = "${segment_nz}(?:/${segment})*";

	my $reg_name = "(?:${reg_char}|${pct_encoded})*";
	my $port = '[0-9]*';
	my $host = $reg_name;
	my $userinfo = "(?:${user_char}|${pct_encoded})*";
	my $authority = "(?:${userinfo}\@)?${host}(?::${port})?";

	my $scheme = '[a-zA-Z][a-zA-Z0-9\-.+]*';
	my $hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";

	my $uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";

	return $uri;
}
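As a usage sketch, the string returned by the function can be compiled with qr// and anchored with \A and \z so that a whole string must match. The block below repeats a condensed form of build_uri_regex so that it runs on its own; the helper is_uri and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Condensed copy of build_uri_regex (same character classes as the article)
sub build_uri_regex {
    my $pct   = '%[0-9A-F]{2}';
    my $base  = 'a-zA-Z0-9\-._~!$&\'()*+,;=';       # unreserved + sub-delims
    my $pchar = "(?:[${base}:\@]|${pct})";
    my $f_q   = "(?:[${base}:\@/?]|${pct})*";        # query and fragment
    my $segs  = "(?:/${pchar}*)*";                    # *( "/" segment )
    my $authority = "(?:(?:[${base}:]|${pct})*\@)?(?:[${base}]|${pct})*(?::[0-9]*)?";
    my $hier_part = "(?://${authority}${segs}|/(?:${pchar}+${segs})?|${pchar}+${segs})?";
    return "[a-zA-Z][a-zA-Z0-9\\-.+]*:${hier_part}(?:\\?${f_q})?(?:#${f_q})?";
}

my $uri_re = build_uri_regex();
my $whole  = qr/\A(?:$uri_re)\z/;    # anchored: the entire string must be a URI

# Hypothetical helper: true if the whole string matches the URI grammar
sub is_uri { return $_[0] =~ $whole ? 1 : 0 }

for my $s ('http://example.com/a/b?x=1#top', 'mailto:user@example.com',
           'urn:isbn:0451450523', 'not a uri', '//example.com/no-scheme') {
    printf "%-32s %s\n", $s, is_uri($s) ? 'match' : 'no match';
}
```

Without the anchors, the expression matches any substring that looks like a URI, which is rarely what is wanted when validating a single value.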

Completed Regular Expression

The following shows the complete URI regular expression. It is wrapped here for display; the line breaks are not part of the expression. A version with white space added can be found at the end of this article.

[a-zA-Z][a-zA-Z0-9\-.+]*:(?://(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;
=:]|%[0-9A-F]{2})*@)?(?:[a-zA-Z0-9\-._~!$&'()*+,;=]|%[0-9A-F]
{2})*(?::[0-9]*)?(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-
F]{2})*)*|/(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+
(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?|(?:[a
-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+(?:/(?:[a-zA-Z0-9\-
._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?(?:\?(?:[a-zA-Z0-9\-._~!$
&'()*+,;=:@/?]|%[0-9A-F]{2})*)?(?:#(?:[a-zA-Z0-9\-._~!$&'()*+
,;=:@/?]|%[0-9A-F]{2})*)?

Final Thoughts

If you are given a URI, this expression can be used to check that it is well formed. To split the URI into its components, the relevant non-capturing groups must first be replaced with capturing groups.
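One way to get at the components is to wrap them in named capture groups (available since Perl 5.10). The following is a minimal sketch under that assumption, not the only possible arrangement: the group names, the split_uri helper and the sample URI are all hypothetical.

```perl
use strict;
use warnings;

# Shared character sets (same classes as the article's function)
my $pct   = '%[0-9A-F]{2}';
my $base  = 'a-zA-Z0-9\-._~!$&\'()*+,;=';            # unreserved + sub-delims
my $pchar = "(?:[${base}:\@]|${pct})";
my $f_q   = "(?:[${base}:\@/?]|${pct})*";             # query and fragment
my $segs  = "(?:/${pchar}*)*";                         # *( "/" segment )

# Named groups mark the components to extract
my $authority = "(?:(?<userinfo>(?:[${base}:]|${pct})*)\@)?"
              . "(?<host>(?:[${base}]|${pct})*)"
              . "(?::(?<port>[0-9]*))?";
my $hier_part = "(?://${authority}(?<path_abempty>${segs})"
              . "|(?<path_other>/(?:${pchar}+${segs})?|${pchar}+${segs}))?";
my $uri_re = qr/\A(?<scheme>[a-zA-Z][a-zA-Z0-9+\-.]*):${hier_part}(?:\?(?<query>${f_q}))?(?:#(?<fragment>${f_q}))?\z/;

# Returns a hash reference of components, or nothing if $uri does not match.
# Components that are absent from the URI are left undefined.
sub split_uri {
    my ($uri) = @_;
    return unless $uri =~ $uri_re;
    return {
        scheme   => $+{scheme},
        userinfo => $+{userinfo},
        host     => $+{host},
        port     => $+{port},
        path     => $+{path_abempty} // $+{path_other} // '',
        query    => $+{query},
        fragment => $+{fragment},
    };
}

my $p = split_uri('http://user@example.com:8080/a/b?x=1#top');
for my $name (qw(scheme userinfo host port path query fragment)) {
    printf "%-8s %s\n", $name, $p->{$name} // '(none)';
}
```

The path is captured under two names because it appears in two different alternatives of hier-part; the helper merges them into a single path entry.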

However, if you try parsing large selections of text with this regular expression, you will quickly discover that many strings that were never meant as URIs match the expression (e.g. "languages:", which is syntactically a scheme followed by an empty path). Because of this, the expression as it stands is not useful for finding all URIs within a block of text.

If you want to locate URIs within text, you can use this regular expression as a starting point and modify the rules to make the expression more restrictive. For example, the scheme could be restricted to only HTTP and HTTPS in the following way:

my $scheme = 'https?';
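To see the effect of this restriction, the sketch below scans a sentence for HTTP and HTTPS URIs. It uses a condensed builder that takes the scheme pattern as a parameter, which is an assumption of this example rather than the signature used earlier; the leading \b and the sample text are also illustrative choices.

```perl
use strict;
use warnings;

# Condensed builder; the optional $scheme argument is hypothetical
sub build_uri_regex {
    my ($scheme) = @_;
    $scheme //= '[a-zA-Z][a-zA-Z0-9\-.+]*';          # default: any scheme
    my $pct   = '%[0-9A-F]{2}';
    my $base  = 'a-zA-Z0-9\-._~!$&\'()*+,;=';        # unreserved + sub-delims
    my $pchar = "(?:[${base}:\@]|${pct})";
    my $f_q   = "(?:[${base}:\@/?]|${pct})*";         # query and fragment
    my $segs  = "(?:/${pchar}*)*";                     # *( "/" segment )
    my $authority = "(?:(?:[${base}:]|${pct})*\@)?(?:[${base}]|${pct})*(?::[0-9]*)?";
    my $hier_part = "(?://${authority}${segs}|/(?:${pchar}+${segs})?|${pchar}+${segs})?";
    return "${scheme}:${hier_part}(?:\\?${f_q})?(?:#${f_q})?";
}

my $http_re = build_uri_regex('https?');

my $text = 'Languages: see http://example.com/docs?lang=perl '
         . 'and https://example.org/#top for details';

# \b keeps the scheme from matching in the middle of a longer word
my @found = $text =~ /\b($http_re)/g;
print "$_\n" for @found;
```

Even with the scheme restricted, a URI at the end of a sentence will pick up the closing full stop or comma, because those characters are allowed in a path. Some trimming of trailing punctuation is usually still needed.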

Completed Regular Expression (White-Space Added)

The following shows the complete URI regular expression with white space added and each part labelled with its rule name in braces:

[a-zA-Z][a-zA-Z0-9\-.+]*:			{scheme}
(?:
	//
	(?:					{authority}
		(?:				{userinfo}
			[a-zA-Z0-9\-._~!$&'()*+,;=:]
		|
			%[0-9A-F]{2}
		)*
		@
	)?
	(?:					{host}
		[a-zA-Z0-9\-._~!$&'()*+,;=]
	|
		%[0-9A-F]{2}
	)*
	(?:
		:
		[0-9]*				{port}
	)?
	(?:
		/
		(?:				{path-abempty}
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)*
	)*
|
	/
	(?:
		(?:				{path-absolute}
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)+
		(?:
			/
			(?:
				[a-zA-Z0-9\-._~!$&'()*+,;=:@]
			|
				%[0-9A-F]{2}
			)*
		)*
	)?
|
	(?:					{path-rootless}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@]
	|
		%[0-9A-F]{2}
	)+
	(?:
		/
		(?:
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)*
	)*
)?
(?:
	\?
	(?:					{query}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
	|
		%[0-9A-F]{2}
	)*
)?
(?:
	#
	(?:					{fragment}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
	|
		%[0-9A-F]{2}
	)*
)?
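A layout like this can be used directly in Perl with the /x modifier, provided the rule labels are turned into # comments and the literal # before the fragment is escaped. The sketch below shows the idea on a deliberately cut-down pattern (scheme, colon and optional fragment only); the variable names and test strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Cut-down pattern: scheme, ":" and an optional "#fragment" only.
# The heredoc is single-quoted so Perl does not interpolate "$&" or "@".
my $spaced = <<'END_RE';
[a-zA-Z][a-zA-Z0-9\-.+]*:       # scheme
(?:
    \#                          # a literal hash sign must be escaped under x
    (?:                         # fragment
        [a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
    |
        %[0-9A-F]{2}
    )*
)?
END_RE

# Under /x, white space and comments in the pattern are ignored
my $re = qr/\A(?:$spaced)\z/x;

for my $s ('about:#top', 'about:', 'about:#two words') {
    printf "%-18s %s\n", $s, $s =~ $re ? 'match' : 'no match';
}
```

White space inside a bracketed character class is still significant under /x, which is not a problem here because none of the classes contain any.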