S3 is an XML-based API. When you do a list operation, i.e. GET /<bucket>?list-type=2, you get a response of the following form:
HTTP/1.1 200
x-amz-request-charged: RequestCharged
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult>
<IsTruncated>boolean</IsTruncated>
<Contents>
<ChecksumAlgorithm>string</ChecksumAlgorithm>
...
<ETag>string</ETag>
<Key>string</Key>
<LastModified>timestamp</LastModified>
<Owner>
<DisplayName>string</DisplayName>
<ID>string</ID>
</Owner>
<RestoreStatus>
<IsRestoreInProgress>boolean</IsRestoreInProgress>
<RestoreExpiryDate>timestamp</RestoreExpiryDate>
</RestoreStatus>
<Size>long</Size>
<StorageClass>string</StorageClass>
</Contents>
...
<Name>string</Name>
<Prefix>string</Prefix>
<Delimiter>string</Delimiter>
<MaxKeys>integer</MaxKeys>
<CommonPrefixes>
<Prefix>string</Prefix>
</CommonPrefixes>
...
<EncodingType>string</EncodingType>
<KeyCount>integer</KeyCount>
<ContinuationToken>string</ContinuationToken>
<NextContinuationToken>string</NextContinuationToken>
<StartAfter>string</StartAfter>
</ListBucketResult>
The most salient thing for listing is the <Contents> blocks, which enumerate the contents of the bucket. Of particular interest is the <Key> element, which specifies keys you can look up with a GetObject API request.
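As a rough illustration of pulling the <Key> values out of such a response, here is a minimal sketch using Python's standard library parser. The sample document and the helper names (`local_name`, `list_keys`) are made up for this example. Real responses usually carry an XML namespace on the root element, so the sketch strips any namespace prefix rather than assuming one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down listing used only for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>a.txt</Key>
    <Size>3</Size>
  </Contents>
  <Contents>
    <Key>dir/b.txt</Key>
    <Size>5</Size>
  </Contents>
  <KeyCount>2</KeyCount>
</ListBucketResult>"""


def local_name(tag):
    # ElementTree reports namespaced tags as "{uri}Name"; keep only "Name".
    return tag.rsplit("}", 1)[-1]


def list_keys(xml_text):
    """Return the text of every <Key> found directly inside a <Contents> block."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    keys = []
    for child in root:
        if local_name(child.tag) != "Contents":
            continue
        for field in child:
            if local_name(field.tag) == "Key":
                keys.append(field.text or "")
    return keys


print(list_keys(SAMPLE))
```

The same function works whether or not the root element declares a default namespace, which is convenient when comparing responses from different vendors.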
One of the most notorious difficulties with XML is escaping. Markup characters like < are special and should be replaced with &lt;. Slashes / are super special and act as directory delimiters. Furthermore, XML supports a special block syntax <![CDATA[...]]> for character data that should not be escaped.
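To make the two escaping mechanisms concrete, here is a small sketch with Python's standard library. The helper names `key_element` and `element_text` and the sample strings are assumptions for illustration only. It shows that entity escaping and a CDATA block both round-trip text containing markup characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def key_element(text):
    # Entity-escape &, < and > so the text is safe inside an element.
    return "<Key>" + escape(text) + "</Key>"


def element_text(xml_text):
    # Parse a single element and return its decoded text content.
    return ET.fromstring(xml_text).text


tricky = "a<b&c"
escaped = key_element(tricky)
print(escaped)

# The same text wrapped in a CDATA block needs no entity escaping.
cdata = "<Key><![CDATA[a<b&c]]></Key>"
print(element_text(escaped), element_text(cdata))
```

Both forms parse back to the same string, so a consumer cannot tell from the parsed value which one the server chose.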
This raises the question: what does the S3 API return when keys contain special data? I tried uploading files with the following keys through the S3 console interface:
&lt
<![CDATA[...]]>
foo<Contents>
Interestingly, the CDATA key hit something strange, which manifested as noise in the console UI.
However, the underlying XML response is escaped with URL encoding (%26 not &, %3C not <). These represent tricky cases that should be considered when testing vendor conformance or alternative parsing mechanisms.
<Key>%26lt</Key>
<Key>%3C%21%5BCDATA%5B...%5D%5D%3E</Key>
<Key>foo%3CContents%3E</Key>
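The three encoded keys above can be decoded back to what was uploaded. The following is a minimal sketch with Python's `urllib.parse`. It uses `unquote_plus` on the assumption that the listing uses form-style encoding, and the variable names are illustrative.

```python
from urllib.parse import unquote_plus

# Encoded <Key> values copied from the listing response.
encoded_keys = [
    "%26lt",
    "%3C%21%5BCDATA%5B...%5D%5D%3E",
    "foo%3CContents%3E",
]

# Percent-decode each key (a literal "+" would be read as a space).
decoded_keys = [unquote_plus(key) for key in encoded_keys]

for enc, dec in zip(encoded_keys, decoded_keys):
    print(enc, "->", dec)
```

Because the percent-encoding hides every markup character, the XML parser never sees a raw < or & inside <Key>, and the decoding step is where a client can go wrong.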
Amazon publishes a guideline on S3 keynames. Special handling is required for
- &$@=;/:+,?
- multiple spaces
- ASCII range 0–31
Amazon says to avoid
- \{^}%\]">[~#| and backticks
- ASCII range 128–255
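The character classes listed above can be turned into a small checker for picking test keys. This is only a sketch of the lists as written here, not Amazon's full guideline. The function name `classify_key` and the labels "special" and "avoid" are invented for the example.

```python
# Character sets transcribed from the lists above (illustrative only).
SPECIAL_CHARS = set("&$@=;/:+,?")
AVOID_CHARS = set('\\{^}%]">[~#|`')


def classify_key(key):
    """Return the set of labels ("special", "avoid") that apply to a key."""
    labels = set()
    for ch in key:
        code = ord(ch)
        # Listed punctuation or ASCII 0-31 needs special handling.
        if ch in SPECIAL_CHARS or code <= 31:
            labels.add("special")
        # Listed punctuation, backticks, or code points 128-255 should be avoided.
        if ch in AVOID_CHARS or 128 <= code <= 255:
            labels.add("avoid")
    # Runs of two or more spaces are also called out.
    if "  " in key:
        labels.add("special")
    return labels


print(classify_key("reports/2024 q1.csv"))
```

A conformance test could generate one key per label combination and check that each survives a round trip through upload and list.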
Trying to upload files with names
&$@=;  :+,?
\{^}%\]">[~#|
yields
<Key>%26%24%40%3D%3B++%3A%2B%2C%3F</Key>
<Key>%5C%7B%5E%7D%25%5C%5D%22%3E%5B%7E%23%7C</Key>
Notice that spaces are converted to + while + is URL-encoded to %2B; this is x-www-form-urlencoded, not the similar URI encoding(!).
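The difference between the two encodings is easy to see in Python's `urllib.parse`, where `quote_plus` is form-style and `quote` is URI-style. This is a sketch for comparison only. The key below uses two spaces to match the ++ in the listed key, and the variable names are illustrative.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# The first test key; the two spaces match the "++" seen in the listing.
name = "&$@=;  :+,?"

form_encoded = quote_plus(name)     # form style: space -> "+", "+" -> "%2B"
uri_encoded = quote(name, safe="")  # URI style: space -> "%20"

print(form_encoded)
print(uri_encoded)

# Decoding the listed key with the wrong decoder silently corrupts it:
right = unquote_plus(form_encoded)  # "+" is read back as a space
wrong = unquote(form_encoded)       # "+" is left in place as a plus sign
print(repr(right), repr(wrong))
```

A client that applies plain URI decoding to these keys would turn every space into a plus sign, which makes this a good case for a conformance test.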
Tested in conformance testing here. Note that \x00 should work but doesn't (might be a runtime issue).