Date: 2020-02-09 22:33 -0500
Modified: 2021-03-09 16:31 -0500
When using the Amazon Web Services S3 object storage service, one pattern comes up fairly often: trying to determine from metadata alone whether two objects are identical.
The idea behind the shortcut is to assume that two objects with identical value will have identical ETag values.
To test this assumption, I upload the same file several times, as multipart uploads with different numbers of parts and also as a non-multipart upload, and then compare the ETag values.
In this example, the input is 64 MiB (64 × 1024 × 1024 bytes) of zero bytes.
part count | ETag |
---|---|
none | 7f614da9329cd3aebf59b91aadc30bf0 |
1 | a78211a9709e5a28de9e2fd6eda275f2-1 |
2 | e37c18c50eb968283fd30eee4930d0b2-2 |
4 | e4336b5de4e2180a53fe2e17d03abe4f-4 |
8 | e025c614e2d55d4ae4a8dbc6eeda3220-8 |
Conclusion: Objects with identical values can have differing ETag values. The assumption is invalid.
How was the ETag value actually computed on this example?
For the non-multipart upload, it is the MD5 hash of the value, printed as a string of hexadecimal digits (head -c 64M /dev/zero | md5sum prints 7f614da9329cd3aebf59b91aadc30bf0).
For the multipart upload, we can see that the ETag is a string of the same length as a hex MD5, followed by a dash, followed by a number which matches the number of parts. Some fiddling finds that the first dash-separated part is the hex MD5 sum of the concatenated binary MD5 sums of the parts:
# One part:
# simulate md5(md5(64MB of zero bytes))
$ head -c 64M /dev/zero | openssl md5 -binary | md5sum
a78211a9709e5a28de9e2fd6eda275f2 -
# Two parts:
# simulate md5(md5(32MB of zero bytes), md5(32MB of zero bytes))
$ head -c 32M /dev/zero | openssl md5 -binary | tee - | md5sum
e37c18c50eb968283fd30eee4930d0b2 -
# Four parts
$ head -c 16M /dev/zero | openssl md5 -binary | tee - - - | md5sum
e4336b5de4e2180a53fe2e17d03abe4f -
# Eight parts
$ head -c 8M /dev/zero | openssl md5 -binary | tee - - - - - - - | md5sum
e025c614e2d55d4ae4a8dbc6eeda3220 -
The computations above match the uploads in this test, where all parts of a given object have the same size. This does not have to be true in general: lower-level APIs let the user choose the size of each part individually, leading to many more possible ETag values for the same object.
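The same computation can be written as a small Python sketch, which makes it easier to try arbitrary part sizes. The function names (`simple_etag`, `multipart_etag`, `split_evenly`) are made up for illustration, and the scheme is only the one inferred from the observations above, not a documented guarantee; other bucket or upload configurations may behave differently.

```python
import hashlib
from typing import Iterable, List


def simple_etag(data: bytes) -> str:
    """ETag observed for a non-multipart upload: hex MD5 of the whole value."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(parts: Iterable[bytes]) -> str:
    """ETag observed for a multipart upload (inferred, not documented here):
    hex MD5 of the concatenated binary MD5 digests of the parts,
    followed by a dash and the number of parts."""
    digests = [hashlib.md5(part).digest() for part in parts]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


def split_evenly(data: bytes, part_size: int) -> List[bytes]:
    """Cut data into consecutive parts of part_size bytes (last may be shorter)."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


if __name__ == "__main__":
    MIB = 1024 * 1024
    data = bytes(64 * MIB)  # 64 MiB of zero bytes, as in the test above
    print("none", simple_etag(data))
    for count in (1, 2, 4, 8):
        parts = split_evenly(data, len(data) // count)
        print(count, multipart_etag(parts))
```

Run as a script, this is meant to reproduce the rows of the table above. Passing parts of unequal sizes to `multipart_etag` shows how the same value can map to yet more ETag values.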
In the non-multipart case, it is trivially true that two objects with different values may have identical MD5 hashes, and therefore identical ETags.
The same can be said of the multipart case, since the ETag again appears to be a 128-bit value with some metadata appended.
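Taken together, the two observations limit what an ETag comparison can tell us. The sketch below splits an ETag of the shape seen in this post into its digest and part count, and returns a deliberately cautious verdict. The names `parse_etag`, `ParsedETag`, and `etag_verdict` are hypothetical, and the regular expression only assumes the format observed here (32 hex digits, optionally followed by a dash and a part count, optionally wrapped in double quotes).

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, based only on the ETags observed in this post.
_ETAG_RE = re.compile(r'^"?([0-9a-f]{32})(?:-([0-9]+))?"?$')


class ParsedETag(NamedTuple):
    digest: str                 # 32 hex digits
    part_count: Optional[int]   # None for a non-multipart upload


def parse_etag(etag: str) -> ParsedETag:
    """Split an ETag into its hex digest and optional part count."""
    match = _ETAG_RE.match(etag.strip())
    if match is None:
        raise ValueError(f"unrecognized ETag format: {etag!r}")
    digest, count = match.groups()
    return ParsedETag(digest, int(count) if count is not None else None)


def etag_verdict(etag_a: str, etag_b: str) -> str:
    """Say what an ETag comparison alone lets us conclude.

    Equal ETags: the values are probably identical (hash collisions aside).
    Different ETags: nothing can be concluded, because the same value
    uploaded with a different part layout gets a different ETag.
    """
    if parse_etag(etag_a) == parse_etag(etag_b):
        return "probably identical"
    return "unknown"
```

For example, `etag_verdict("7f614da9329cd3aebf59b91aadc30bf0", "a78211a9709e5a28de9e2fd6eda275f2-1")` returns `"unknown"`, even though both ETags in the table above belong to the same 64 MiB value.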
Code used to create the test objects.
Use a command such as the following to list objects and their ETag values (replace BUCKET and PREFIX with your own values):
aws s3api list-objects --bucket BUCKET --prefix PREFIX |
jq -r '.Contents[] | [.Key, (.ETag | fromjson)] | @tsv'