Web Security Framework: Problem Statement and Requirements
During formalization of the WebFinger protocol [I-D.jones-appsawg-webfinger], much discussion occurred regarding the appropriate URI scheme to include when specifying a user’s account as a web link [RFC5988].
acctURI = “acct:” userpart “@” domainpart
|HTTP Content Coding Token
|An encoding format produced by the file compression program "gzip" (GNU zip)
|The "zlib" format as described in RFC 1950.
|The encoding format produced by the common UNIX file compression program "compress".
|GZIP file format
|ZLIB Compressed Data Format
|The compress program's file format
|Deflate compression method
|Deflate consists of LZ77 and Huffman coding
Compress doesn't seem to be supported by popular current browsers, possibly due to its past with patents.
Deflate isn't done correctly all the time. Some servers would send the deflate data format instead of the zlib data format and at least some versions of Internet Explorer expect deflate data format instead of zlib data format.
451 Unavailable for Legal Reasons: The 451 status code is optional; clients cannot rely upon its use. It is imaginable that certain legal authorities may wish to avoid transparency, and not only forbid access to certain resources, but also disclosure that the restriction exists.
That was fast.
This document defines several Structured Syntax Suffixes for use with media type registrations. In particular, it defines and registers the “+json”, “+ber”, “+der”, “+fastinfoset”, “+wbxml” and “+zip” Structured Syntax Suffixes, and updates the “+xml” Structured Syntax Suffix registration.
By the URI RFC there is only one way to represent a particular IPv4 address in the host of a URI. This is the standard dotted decimal notation of four bytes in decimal with no leading zeroes delimited by periods. And no leading zeros are allowed which means there's only one textual representation of a particular IPv4 address.
However as discussed in the URI RFC, there are other forms of IPv4 addresses that although not officially allowed are generally accepted. Many implementations used inet_aton to parse the address from the URI which accepts more than just dotted decimal. Instead of dotted decimal, each dot delimited part can be in decimal, octal (if preceded by a '0') or hex (if preceded by '0x' or '0X'). And that's each section individually - they don't have to match. And there need not be 4 parts: there can be between 1 and 4 (inclusive). In case of less than 4, the last part in the string represents all of the left over bytes, not just one.
For example the following are all equivalent:
The bread and butter of URI related security issues is when one part of the system disagrees with another about the interpretation of the URI. So this non-standard, non-normal form syntax has been been a great source of security issues in the past. Its mostly well known now (CreateUri normalizes these non-normal forms to dotted decimal), but occasionally a good tool for bypassing naive URI blocking systems.
As a professional URI aficionado I deal with various levels of ignorance on URI percent-encoding (aka URI encoding, or URL escaping).
Getting into the more subtle levels of URI percent-encoding ignorance, folks try to apply their knowledge of percent-encoding to URIs as a whole producing the concepts escaped URIs and unescaped URIs. However there are no such things - URIs themselves aren't percent-encoded or decoded but rather contain characters that are percent-encoded or decoded. Applying percent-encoding or decoding to a URI as a whole produces a new and non-equivalent URI.
Instead of lingering on the incorrect concepts we'll just cover the correct ones: there's raw unencoded data, non-normal form URIs and normal form URIs. For example:
In the above (A) is not an 'encoded URI' but rather a non-normal form URI. The characters of 'the' and 'path' are percent-encoded but as unreserved characters specific in the RFC should not be encoded. In the normal form of the URI (B) the characters are decoded. But (B) is not a 'decoded URI' -- it still has an encoded '?' in it because that's a reserved character which by the RFC holds different meaning when appearing decoded versus encoded. Specifically in this case, it appears encoded which means it is data -- a literal '?' that appears as part of the path segment. This is as opposed to the decoded '?' that appears in the URI which is not part of the path but rather the delimiter to the query.
Usually when developers talk about decoding the URI what they really want is the raw data from the URI. The raw decoded data is (C) above. The only thing to note beyond what's covered already is that to obtain the decoded data one must parse the URI before percent decoding all percent-encoded octets.
Of course the exception here is when a URI is the raw data. In this case you must percent-encode the URI to have it appear in another URI. More on percent-encoding while constructing URIs later.
As a professional URI aficionado I deal with various levels of ignorance on URI percent-encoding (aka URI encoding, or URL escaping). The basest ignorance is with respect to the mere existence of percent-encoding. Percents in URIs are special: they always represent the start of a percent-encoded octet. That is to say, a percent is always followed by two hex digits that represents a value between 0 and 255 and doesn't show up in a URI otherwise.
The IPv6 textual syntax for scoped addresses uses the '%' to delimit the zone ID from the rest of the address. When it came time to define how to represent scoped IPv6 addresses in URIs there were two camps: Folks who wanted to use the IPv6 format as is in the URI, and those who wanted to encode or replace the '%' with a different character. The resulting thread was more lively than what shows up on the IETF URI discussion mailing list. Ultimately we went with a percent-encoded '%' which means the percent maintains its special status and singular purpose.
From the document: ‘Appendix B. Implementation Report: The encoding defined in this document currently is used for two different HTTP header fields: “Content-Disposition”, defined in [RFC6266], and “Link”, defined in [RFC5988]. As the encoding is a profile/clarification of the one defined in [RFC2231] in 1997, many user agents already supported it for use in “Content-Disposition” when [RFC5987] got published.
Since the publication of [RFC5987], two more popular desktop user agents have added support for this encoding; see http://purl.org/
NET/http/content-disposition-tests#encoding-2231-char for details. At this time, only one major desktop user agent (Safari) does not support it.
Note that the implementation in Internet Explorer 9 does not support the ISO-8859-1 encoding; this document revision acknowledges that UTF-8 is sufficient for expressing all code points, and removes the requirement to support ISO-8859-1.’
Yay for UTF-8!
Shortly after joining the Internet Explorer team I got a bug from a PM on a popular Microsoft web server product that I'll leave unnamed (from now on UWS). The bug said that IE was handling empty path segments incorrectly by not removing them before resolving dotted path segments. For example UWS would do the following:
In step 1 they are given a URI with dotted path segment and an empty
path segment. In step 2 they remove the empty path segment, and in step 3 they resolve the dotted path segment. Whereas, given the same initial URI, IE would do the following:
IE simply resolves the dotted path segment against the empty path segment and removes them both. So, how
did I resolve this bug? As "By Design" of course!
The URI RFC allows path segments of zero length and does not assign them any special meaning. So generic user agents that intend to work on the web must not treat an empty path segment any different from a path segment with some text in it. In the case above IE is doing the correct thing.
That's the case for generic user agents, however servers may decide that a URI with an empty path segment returns the same resource as a the same URI without that empty path segment. Essentially they can decide to ignore empty path segments. Both IIS and Apache work this way and thus return the same resource for the following URIs:
The issue for UWS is that it removes empty path segments before resolving dotted path segments. It must
follow normal URI procedure before applying its own additional rules for empty path segments. Not doing that means they end up violating URI equivalency rules: URIs (A.1) and (B.2) are equivalent
but UWS will not return the same resource for them.
A bug came up the other day involving markup containing
<input type="image" src="http://example.com/.... I knew that "image" was a valid input type but it wasn't until that moment
that I realized I didn't know what it did. Looking it up I found that it displays the specified image and when the user clicks on the image, the form is submitted with an additional two name
value pairs: the x and y positions of the point at which the user clicked the image.
Take for example the following HTML:
If the user
clicks on the image, the browser will submit the form with a URI like the following:
<input type="image" name="foo" src="http://deletethis.net/dave/images/davebefore.jpg">