Will two S3 objects with identical contents have the same ETag?


[ Date | 2020-02-09 22:33 -0500 ]
[ Mod. | 2021-03-09 16:31 -0500 ]

When working with the Amazon Web Services S3 object storage service, a pattern that comes up fairly often is trying to find out from metadata alone whether two objects are identical.

The idea behind the shortcut is to assume that two objects with identical values will have identical ETag values.

Testing the assumption

To test this assumption, I uploaded the same file several times: as multipart uploads with different numbers of parts, and once as a non-multipart upload. I then compared the ETag values.

In this example, the input is 64 MiB (67,108,864 bytes) of zero bytes.

part count	ETag
none	`7f614da9329cd3aebf59b91aadc30bf0`
1	`a78211a9709e5a28de9e2fd6eda275f2-1`
2	`e37c18c50eb968283fd30eee4930d0b2-2`
4	`e4336b5de4e2180a53fe2e17d03abe4f-4`
8	`e025c614e2d55d4ae4a8dbc6eeda3220-8`

Conclusion: Objects with identical values can have differing ETag values. The assumption is invalid.

Going further

How was the ETag value actually computed in this example?

For the non-multipart upload, it is the MD5 hash of the value, printed as a string of hexadecimal digits (head -c 64M /dev/zero | md5sum prints 7f614da9329cd3aebf59b91aadc30bf0).

For the multipart uploads, we can see that the ETag is a string of the same length as a hex MD5, followed by a dash, followed by a number that matches the number of parts. Some fiddling finds that the portion before the dash is the hex MD5 sum of the concatenated binary MD5 sums of each part:

# One part:
# simulate md5(md5(64MB of zero bytes))
$ head -c 64M /dev/zero | openssl md5 -binary | md5sum
a78211a9709e5a28de9e2fd6eda275f2  -

# Two parts:
# simulate md5(md5(32MB of zero bytes), md5(32MB of zero bytes))
$ head -c 32M /dev/zero | openssl md5 -binary | tee - | md5sum
e37c18c50eb968283fd30eee4930d0b2  -

# Four parts:
# simulate md5(four copies of md5(16MB of zero bytes))
$ head -c 16M /dev/zero | openssl md5 -binary | tee - - - | md5sum
e4336b5de4e2180a53fe2e17d03abe4f  -

# Eight parts:
# simulate md5(eight copies of md5(8MB of zero bytes))
$ head -c  8M /dev/zero | openssl md5 -binary | tee - - - - - - - | md5sum
e025c614e2d55d4ae4a8dbc6eeda3220  -

The computations above match the uploads in this test, where all the parts of a given upload had the same size. This does not have to be true in general: lower-level APIs let the user choose the size of each part individually, leading to many more possible ETag values.
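The one-liners above can be generalized to an arbitrary file and part size. What follows is only a sketch under assumptions: multipart_etag is a hypothetical helper name (not an AWS tool), every part except possibly the last has the same size, and dd, openssl and md5sum are available. It reproduces the behaviour observed in this test and nothing more.

```sh
#!/bin/sh
# multipart_etag FILE PART_MB
# Predict the ETag observed above for FILE uploaded in parts of PART_MB MiB.
# Hypothetical helper: assumes equal-size parts (the last one may be shorter).
multipart_etag() {
    file=$1
    part_bytes=$(( $2 * 1024 * 1024 ))
    size=$(wc -c < "$file")
    # Number of parts, rounding up so a short final part is counted.
    parts=$(( (size + part_bytes - 1) / part_bytes ))
    i=0
    while [ "$i" -lt "$parts" ]; do
        # Emit the raw 16-byte MD5 digest of part number i.
        dd if="$file" bs="$part_bytes" skip="$i" count=1 2>/dev/null |
            openssl md5 -binary
        i=$(( i + 1 ))
    done |
        # Hash the concatenated digests, then append "-<part count>".
        md5sum | awk -v n="$parts" '{ print $1 "-" n }'
}

# Example, matching the two-part row of the table above:
#   head -c 64M /dev/zero > zeros.bin
#   multipart_etag zeros.bin 32    # e37c18c50eb968283fd30eee4930d0b2-2
```

Reading each part with dd rather than duplicating a single digest means the same function also works for files whose parts differ from one another.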

What about the reverse?

In the non-multipart case, it is trivially true that two objects with different values may have identical MD5 hashes, and therefore identical ETags.

The same can be said of the multipart case, since the ETag also appears to be a 128-bit value, with some metadata appended.
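To make the "128-bit value plus metadata" structure concrete, here is a small sketch that splits an ETag into its digest and its part count. describe_etag is a hypothetical helper name, and it assumes the surrounding double quotes returned by the API have already been stripped.

```sh
#!/bin/sh
# describe_etag ETAG
# Prints "<digest> <part count>", or "<digest> none" for a non-multipart ETag.
# Hypothetical helper; expects the ETag without its surrounding double quotes.
describe_etag() {
    case $1 in
        # Multipart: everything before the last dash, then the part count.
        *-*) printf '%s %s\n' "${1%-*}" "${1##*-}" ;;
        # Non-multipart: the ETag is the bare digest.
        *)   printf '%s none\n' "$1" ;;
    esac
}

describe_etag e37c18c50eb968283fd30eee4930d0b2-2
describe_etag 7f614da9329cd3aebf59b91aadc30bf0
```

In both cases the digest is 32 hexadecimal digits, that is 128 bits, so the part count is the only extra information the ETag carries.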

References

Code used to create the test objects.

Use a command such as the following to list objects and their ETag values (replace BUCKET and PREFIX with your own values):

aws s3api list-objects --bucket BUCKET --prefix PREFIX |
  jq -r '.Contents[] | [.Key, (.ETag | fromjson)] | @tsv'

www.kurokatta.org
