Building a URI Regular Expression

2012-05-21

Introduction

Uniform Resource Locators, or URLs, are a type of Uniform Resource Identifier (URI). When solving programming problems, it may be useful to build a regular expression that will match all URIs within a string.

This article will show how to build a regular expression, consistent with the URI specification, that matches URIs.

URI Specification

The syntax of a URI is defined in RFC 3986. The definition is written in Augmented Backus-Naur Form (ABNF), so all we need to do is convert each ABNF rule to regular expression syntax.

ABNF to Regular Expressions

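The following sections show the ABNF definition for each part of the URI specification, its equivalent regular expression, and that expression written in Perl. In the Regex rows, a name in braces such as {scheme} stands for the expression defined for that rule.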

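I have made one simplification: the host is limited to registered names (reg-name) only. Bracketed IP literals, such as IPv6 addresses, are therefore not matched; dotted-decimal IPv4 addresses still match because they also satisfy the reg-name rule.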
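The conversion is mostly mechanical, so it may help to see the ABNF operators next to their regex counterparts before going through the rules one by one. The following is a minimal sketch using rules from this article; the names $digits_nz and $optional_port are illustrative and do not appear in RFC 3986.

```perl
use strict;
use warnings;

# ABNF repetition "*X" becomes the regex quantifier "X*".
my $port = '[0-9]*';                        # port = *DIGIT

# ABNF "1*X" (one or more) becomes "X+".
my $digits_nz = '[0-9]+';                   # illustrative rule: 1*DIGIT

# An ABNF alternation of single characters collapses into a character class.
my $scheme = '[a-zA-Z][a-zA-Z0-9+\-.]*';    # ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

# ABNF optional "[ X ]" becomes a non-capturing group followed by "?".
my $optional_port = "(?::${port})?";        # [ ":" port ]

# Anchor each pattern so that the whole string must satisfy the rule.
for my $case (['8080', $port], ['42', $digits_nz], ['http', $scheme], [':443', $optional_port]) {
    my ($text, $pattern) = @$case;
    print "'$text' ", ($text =~ /\A$pattern\z/ ? 'matches' : 'does not match'), "\n";
}
```

Single-quoted strings are used for the character classes so that Perl does not try to interpolate characters such as $ and @.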

URI

ABNF	URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
Regex	{scheme}:{hier_part}(?:\?{query})?(?:#{fragment})?
Perl	$uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";

Hierarchical Part

ABNF	hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty
Regex	(?://{authority}{path_abempty}|{path_absolute}|{path_rootless})?
Perl	$hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";

URI Scheme

ABNF	scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
Regex	[a-zA-Z][a-zA-Z0-9+\-.]*
Perl	$scheme = '[a-zA-Z][a-zA-Z0-9+\-.]*';

Naming Authority

ABNF	authority = [ userinfo "@" ] host [ ":" port ]
Regex	(?:{userinfo}@)?{host}(?::{port})?
Perl	$authority = "(?:${userinfo}\@)?${host}(?::${port})?";

User Information

ABNF	userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
Regex	(?:{unreserved}|{pct_encoded}|{sub_delims}|:)*
Perl	$userinfo = "(?:${unreserved}|${pct_encoded}|${sub_delims}|:)*";

Host

ABNF	host = IP-literal / IPv4address / reg-name
Regex	{reg_name} (modified to allow only registered names)
Perl	$host = $reg_name;

Port Number

ABNF	port = *DIGIT
Regex	[0-9]*
Perl	$port = '[0-9]*';

Registered Name

ABNF	reg-name = *( unreserved / pct-encoded / sub-delims )
Regex	(?:{unreserved}|{pct_encoded}|{sub_delims})*
Perl	$reg_name = "(?:${unreserved}|${pct_encoded}|${sub_delims})*";
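With userinfo, host, port, and reg-name defined, the authority can be assembled and tried out. The sketch below adds capturing groups, which the article's own patterns do not use, to show which part of the input each rule consumes. The names $authority_capture and split_authority and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Character-level rules from this article (single-quoted so that
# '$' and '@' are not interpolated by Perl).
my $unreserved  = '[a-zA-Z0-9\-._~]';
my $pct_encoded = '%[0-9A-F]{2}';
my $sub_delims  = '[!$&\'()*+,;=]';

my $userinfo = "(?:${unreserved}|${pct_encoded}|${sub_delims}|:)*";
my $reg_name = "(?:${unreserved}|${pct_encoded}|${sub_delims})*";
my $host     = $reg_name;
my $port     = '[0-9]*';

# Same shape as $authority, but with capturing groups (added here for
# illustration) so that the three parts can be inspected.
my $authority_capture = "(?:(${userinfo})\@)?(${host})(?::(${port}))?";

sub split_authority {
    my ($text) = @_;
    return unless $text =~ /\A$authority_capture\z/;
    return ($1, $2, $3);    # userinfo and port are undef when absent
}

my ($user, $hostname, $portnum) = split_authority('alice:secret@example.com:8080');
print "userinfo=$user host=$hostname port=$portnum\n";
```

Because ":" is allowed in userinfo but not in reg-name, the first colon stays with the user information and the last one introduces the port.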

Path Absolute or Empty

ABNF	path-abempty = *( "/" segment )
Regex	(?:/{segment})*
Perl	$path_abempty = "(?:/${segment})*";

Path Absolute

ABNF	path-absolute = "/" [ segment-nz *( "/" segment ) ]
Regex	/(?:{segment_nz}(?:/{segment})*)?
Perl	$path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";

Path Beginning with Segment

ABNF	path-rootless = segment-nz *( "/" segment )
Regex	{segment_nz}(?:/{segment})*
Perl	$path_rootless = "${segment_nz}(?:/${segment})*";

Path Empty

ABNF	path-empty = 0<pchar>
Regex	No regular expression is needed for this rule; the empty path is covered by the trailing ? that makes the hier-part group optional.

Segment

ABNF	segment = *pchar
Regex	{pchar}*
Perl	$segment = "${pchar}*";

Segment, Non-Zero Length

ABNF	segment-nz = 1*pchar
Regex	{pchar}+
Perl	$segment_nz = "${pchar}+";

Segment, Non-Zero Length, No Colon

ABNF	segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) ; non-zero-length segment without any colon ":"
Regex	(?:{unreserved}|{pct_encoded}|{sub_delims}|@)+
Perl	$segment_nz_nc = "(?:${unreserved}|${pct_encoded}|${sub_delims}|\@)+";

Path Characters

ABNF	pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Regex	(?:{unreserved}|{pct_encoded}|{sub_delims}|[:@])
Perl	$pchar = "(?:${unreserved}|${pct_encoded}|${sub_delims}|[:\@])";
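With pchar defined, the segment and path rules above can be exercised. The three path rules overlap and differ mainly in how they may start, so the sketch below tests a few sample paths against each one. The classify_path helper and the sample paths are illustrative and not part of the original expressions.

```perl
use strict;
use warnings;

my $pchar = '(?:[a-zA-Z0-9\-._~]|%[0-9A-F]{2}|[!$&\'()*+,;=]|[:@])';

my $segment    = "${pchar}*";
my $segment_nz = "${pchar}+";

my $path_abempty  = "(?:/${segment})*";
my $path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
my $path_rootless = "${segment_nz}(?:/${segment})*";

my @rules = (
    [ 'path-abempty',  $path_abempty  ],
    [ 'path-absolute', $path_absolute ],
    [ 'path-rootless', $path_rootless ],
);

# Return the names of the path rules that match the whole input.
sub classify_path {
    my ($path) = @_;
    my @matched;
    for my $rule (@rules) {
        my ($name, $pattern) = @$rule;
        push @matched, $name if $path =~ /\A$pattern\z/;
    }
    return @matched;
}

for my $path ('', '/a/b', '//a', 'a/b') {
    printf "%-6s => %s\n", "'$path'", join(', ', classify_path($path)) || '(none)';
}
```

Note that '//a' satisfies only path-abempty: path-absolute requires a non-empty first segment, which is what keeps a path from being confused with the "//" that introduces an authority.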

Query Component

ABNF	query = *( pchar / "/" / "?" )
Regex	(?:{pchar}|[/?])*
Perl	$query = "(?:${pchar}|[/?])*";

Fragment Component

ABNF	fragment = *( pchar / "/" / "?" )
Regex	(?:{pchar}|[/?])*
Perl	$fragment = "(?:${pchar}|[/?])*";

Percent-Encoded

ABNF	pct-encoded = "%" HEXDIG HEXDIG
Regex	%[0-9A-F]{2}
Perl	$pct_encoded = '%[0-9A-F]{2}';
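One detail worth checking here is letter case. RFC 3986 treats the lowercase hex digits a-f as equivalent to A-F, whereas the class above accepts uppercase only. The sketch below compares the article's pattern with a relaxed variant; $pct_encoded_icase and is_pct_encoded are names assumed for this example.

```perl
use strict;
use warnings;

my $pct_encoded       = '%[0-9A-F]{2}';      # as defined in this article
my $pct_encoded_icase = '%[0-9A-Fa-f]{2}';   # assumed variant: also accepts a-f

# Report whether the whole string is a single percent-encoded octet.
sub is_pct_encoded {
    my ($text, $pattern) = @_;
    return $text =~ /\A$pattern\z/ ? 1 : 0;
}

for my $text ('%2F', '%2f', '%2', '%GG') {
    printf "%-4s strict=%d relaxed=%d\n", $text,
        is_pct_encoded($text, $pct_encoded),
        is_pct_encoded($text, $pct_encoded_icase);
}
```

If lowercase escapes such as %2f should be accepted, the relaxed class would have to replace the strict one everywhere it appears in the assembled expression.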

Unreserved Characters

ABNF	unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Regex	[a-zA-Z0-9\-._~]
Perl	$unreserved = '[a-zA-Z0-9\-._~]';

Subcomponent Delimiters

ABNF	sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Regex	[!$&'()*+,;=]
Perl	$sub_delims = '[!$&\'()*+,;=]';

Simplifications

Several optimizations simplify the regex and improve its performance. For example, the full definition of pchar expands to:

(?:
	[a-zA-Z0-9\-._~]		# unreserved
|
	%[0-9A-F]{2}			# pct-encoded
|
	[!$&'()*+,;=]		# sub-delims
|
	[:@]				# ':' | '@'
)

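Because three of the four alternatives are single-character classes, they can be merged into one class, which simplifies the expression to: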

(?:
	[a-zA-Z0-9\-._~!$&'()*+,;=:@]
|
	%[0-9A-F]{2}
)
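A merge like this is easy to get wrong by dropping or adding a character, so a quick check may be worthwhile. The sketch below compares the two forms of pchar on every ASCII character; count_disagreements is a helper name assumed for this example.

```perl
use strict;
use warnings;

my $pchar_original   = '(?:[a-zA-Z0-9\-._~]|%[0-9A-F]{2}|[!$&\'()*+,;=]|[:@])';
my $pchar_simplified = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-F]{2})';

# Count the single ASCII characters on which the two forms disagree.
sub count_disagreements {
    my ($first, $second) = @_;
    my $count = 0;
    for my $code (0 .. 127) {
        my $char      = chr $code;
        my $in_first  = $char =~ /\A$first\z/  ? 1 : 0;
        my $in_second = $char =~ /\A$second\z/ ? 1 : 0;
        $count++ if $in_first != $in_second;
    }
    return $count;
}

print "Disagreements: ", count_disagreements($pchar_original, $pchar_simplified), "\n";
```

The percent-encoded alternative is unchanged by the merge, so only the single-character cases need comparing.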

Perl Function

All the expressions can be put into a Perl function to assemble the complete regular expression:

sub build_uri_regex
{
	my $pchar_char  = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@]';
	my $f_q_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]';
	my $seg_nc_char = '[a-zA-Z0-9\-._~!$&\'()*+,;=@]';
	my $reg_char    = '[a-zA-Z0-9\-._~!$&\'()*+,;=]';
	my $user_char   = '[a-zA-Z0-9\-._~!$&\'()*+,;=:]';

	my $pct_encoded = '%[0-9A-F]{2}';

	my $pchar = "(?:${pchar_char}|${pct_encoded})";

	my $fragment = "(?:${f_q_char}|${pct_encoded})*";
	my $query = "(?:${f_q_char}|${pct_encoded})*";

	my $segment = "${pchar}*";
	my $segment_nz = "${pchar}+";
	my $segment_nz_nc = "(?:${seg_nc_char}|${pct_encoded})+";

	my $path_abempty = "(?:/${segment})*";
	my $path_absolute = "/(?:${segment_nz}(?:/${segment})*)?";
	my $path_rootless = "${segment_nz}(?:/${segment})*";

	my $reg_name = "(?:${reg_char}|${pct_encoded})*";
	my $port = '[0-9]*';
	my $host = $reg_name;
	my $userinfo = "(?:${user_char}|${pct_encoded})*";
	my $authority = "(?:${userinfo}\@)?${host}(?::${port})?";

	my $scheme = '[a-zA-Z][a-zA-Z0-9\-.+]*';
	my $hier_part = "(?://${authority}${path_abempty}|${path_absolute}|${path_rootless})?";

	my $uri = "${scheme}:${hier_part}(?:\\?${query})?(?:#${fragment})?";

	return $uri;
}

Completed Regular Expression

The following shows the complete URI regular expression. A version with white space added can be found at the end of this article.

[a-zA-Z][a-zA-Z0-9\-.+]*:(?://(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;
=:]|%[0-9A-F]{2})*@)?(?:[a-zA-Z0-9\-._~!$&'()*+,;=]|%[0-9A-F]
{2})*(?::[0-9]*)?(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-
F]{2})*)*|/(?:(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+
(?:/(?:[a-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?|(?:[a
-zA-Z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+(?:/(?:[a-zA-Z0-9\-
._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)?(?:\?(?:[a-zA-Z0-9\-._~!$
&'()*+,;=:@/?]|%[0-9A-F]{2})*)?(?:#(?:[a-zA-Z0-9\-._~!$&'()*+
,;=:@/?]|%[0-9A-F]{2})*)?

Final Thoughts

If you are given a URI, this expression can be anchored to validate it and, once capturing groups are added around the parts of interest, used to split the URI into its components.
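The expression as built uses only non-capturing groups, so splitting a URI requires capturing groups around the components. The following is one possible sketch, not the original function: it reuses the simplified character classes, and the group names, the $uri_capture variable, and the parse_uri helper are assumptions introduced here. It relies on named captures, which need Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;

my $pct_encoded = '%[0-9A-F]{2}';
my $pchar       = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|' . $pct_encoded . ')';
my $f_q         = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]|' . $pct_encoded . ')*';
my $userinfo    = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:]|' . $pct_encoded . ')*';
my $reg_name    = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=]|' . $pct_encoded . ')*';
my $authority   = "(?:${userinfo}\@)?${reg_name}(?::[0-9]*)?";

my $path_abempty  = "(?:/${pchar}*)*";
my $path_absolute = "/(?:${pchar}+(?:/${pchar}*)*)?";
my $path_rootless = "${pchar}+(?:/${pchar}*)*";

# Anchored URI pattern with one named group per component.
my $uri_capture = qr{
    \A
    (?<scheme> [a-zA-Z][a-zA-Z0-9+\-.]* ) :
    (?: // (?<authority> $authority ) (?<path_auth> $path_abempty )
      |    (?<path_noauth> $path_absolute | $path_rootless )
    )?
    (?: \? (?<query> $f_q ) )?
    (?: \# (?<fragment> $f_q ) )?
    \z
}x;

# Return a hash reference of the components present, or nothing on failure.
sub parse_uri {
    my ($text) = @_;
    return unless $text =~ $uri_capture;
    my %part = %+;    # copy the named captures before another match resets them
    $part{path} = delete($part{path_auth}) // delete($part{path_noauth}) // '';
    return \%part;
}

my $parts = parse_uri('http://user@example.com:8080/a/b?x=1#top');
print "$_ = $parts->{$_}\n" for sort keys %$parts;
```

Components that are absent from the input, such as the authority in a mailto: URI, are simply missing from the returned hash, while the path defaults to an empty string.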

However, if you run this expression over large bodies of text, you will quickly discover that many non-URIs match (e.g. "languages:"), because everything after the scheme and its colon is optional. As written, the expression is therefore not useful for finding all URIs within a block of text.

If you want to locate URIs within text, you should be able to use this regular expression as a starting point and modify the rules to make the expression more restrictive. For example, the scheme could be restricted to only HTTP and HTTPS in the following way:

my $scheme = 'https?';
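Restricting the scheme alone would still let a bare "https:" match, so a text scanner probably needs further tightening. The sketch below goes beyond the change above by also requiring "//" and a non-empty host; those extra restrictions, the $reg_name_nz and $web_url names, and the find_web_urls helper are assumptions made for this example.

```perl
use strict;
use warnings;

my $pct_encoded = '%[0-9A-F]{2}';
my $pchar       = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@]|' . $pct_encoded . ')';
my $f_q         = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:@/?]|' . $pct_encoded . ')*';
my $userinfo    = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=:]|' . $pct_encoded . ')*';

# Assumption: "+" instead of "*" so that the host cannot be empty.
my $reg_name_nz = '(?:[a-zA-Z0-9\-._~!$&\'()*+,;=]|' . $pct_encoded . ')+';

my $scheme = 'https?';

# Scheme, mandatory "//" authority, then the optional path, query and fragment.
my $web_url = "\\b${scheme}://(?:${userinfo}\@)?${reg_name_nz}(?::[0-9]*)?"
            . "(?:/${pchar}*)*(?:\\?${f_q})?(?:#${f_q})?";

# Return every substring of the text that matches the restricted pattern.
sub find_web_urls {
    my ($text) = @_;
    return $text =~ /($web_url)/g;
}

my $text = 'Supported languages: Perl and Python. See http://example.com/docs?lang=perl for details.';
print "$_\n" for find_web_urls($text);
```

Punctuation that directly follows a URL in running text, such as a sentence-ending period, consists of valid URI characters and will be included in the match, so a real scanner would need an extra rule to trim it.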

Completed Regular Expression (White-Space Added)

The following shows the complete URI regular expression with white-space and comments added:

[a-zA-Z][a-zA-Z0-9\-.+]*:			{scheme}
(?:
	//
	(?:					{authority}
		(?:				{userinfo}
			[a-zA-Z0-9\-._~!$&'()*+,;=:]
		|
			%[0-9A-F]{2}
		)*
		@
	)?
	(?:					{host}
		[a-zA-Z0-9\-._~!$&'()*+,;=]
	|
		%[0-9A-F]{2}
	)*
	(?:
		:
		[0-9]*				{port}
	)?
	(?:
		/
		(?:				{path-abempty}
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)*
	)*
|
	/
	(?:
		(?:				{path-absolute}
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)+
		(?:
			/
			(?:
				[a-zA-Z0-9\-._~!$&'()*+,;=:@]
			|
				%[0-9A-F]{2}
			)*
		)*
	)?
|
	(?:					{path-rootless}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@]
	|
		%[0-9A-F]{2}
	)+
	(?:
		/
		(?:
			[a-zA-Z0-9\-._~!$&'()*+,;=:@]
		|
			%[0-9A-F]{2}
		)*
	)*
)?
(?:
	\?
	(?:					{query}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
	|
		%[0-9A-F]{2}
	)*
)?
(?:
	#
	(?:					{fragment}
		[a-zA-Z0-9\-._~!$&'()*+,;=:@/?]
	|
		%[0-9A-F]{2}
	)*
)?