AWS: S3V4RestSignerClient should strip out more request headers for caching#15428
AWS: S3V4RestSignerClient should strip out more request headers for caching#15428steveloughran wants to merge 13 commits intoapache:mainfrom
Conversation
|
I don't like that the server client needs to guess which headers in a new request are excluded from the sign and hence safe to cache. Change encryption for example and everything blows up with a signing failure. Either the signer should parse and cache the header list from the signature, or (slightly better) the rest servlet should return that list independently. The signer can then use that header list in the cache information to decide whether to reuse. Simplest strategy: include the list of headers alongside the signature, and if changed, don't use the cache value. |
c56182e to
7f8cbfb
Compare
|
If the signing string is parsed then the cache should already have everything needed to identify conflicting s3 requests. |
|
Thinking about how best to do this
|
|
FWIW, headers on requests picked up by running some cloudstore commands against s3 london, s3a client set to use sse-kms encryption. The bucket is versioned so on repeated reads the if-match header would be used to declare the version. HEAD /fPUT of directory marker (not relevant to s3file IO). adds x-amz-server-side-encryption x-amz-server-side-encryption-aws-kms-key-id, Content-Length, Content-Type.single object DELETEinitiate multipart PUTthe x-amz-meta-headername is because "headername" is an explicit custom header. partadds Expect: 100-continue, but like Connection: Keep-Alive this is being put out on the wire after signing has taken place. The signer doesn't see it, so it's not relevant. completion of MPUno new headers. single file PUTthis with with options But I think as the |
|
FWIW there's also an x-amz-te which comes with put requests. Ignoring those as it is GET and HEAD which are cacheable. |
|
And a put to S3 express with s3 sessions enabled adds x-amz-session-token, which is the key mechanism for permission performance with s3 express. If set, that MUST be retained for performance. It does not change from request to request. |
|
Just cross-referencing my comment from the 15171. I think we're leaning too much on the client to handle the complexity. This should be catalog responsibility. |
|
@danielcweeks look at the top to my alternative proposal; I've discussed this with @adutra and will implement it. In fact if you look at 777cbcf you can see that this was part of my original design: the auth header spitting exists, absent any tests.
Outcome
Does this make sense? Full control in server, with the client adapting to its choices. Polaris can choose what to sign, clients can choose various checksum options, and the only consequence is an increase in cache misses if the signing service signs more headers than before. |
| } | ||
| } | ||
| if (headerEnum == null) { | ||
| LOG.warn("No signed headers found in response: {}", response); |
There was a problem hiding this comment.
let's not print the signature
A bit concerned that the cache is static so spans all signers; for s3a I'd like a prefix in the cache with every fs instance having a different prefix. This allows for different encryption settings per instance
untested and not-wired-up code to enumerateSignedHeaders() from an auth response.
----
AWS: Fix S3V4RestSignerClient cache key to include all request components
The cache key for signed responses only included method, region, and URI, but not headers like `x-amz-content-sha256` that are part of the signature. This caused 403 errors when different content was uploaded to the same URI within the cache TTL.
This fix uses the full `S3SignRequest` as the cache key. This is the only 100% safe option because we cannot know which headers the server will sign and which ones it will ignore; any header included in the signature *must* be part of the cache key.
**This change reduces cache efficiency**; but that's the price to pay for correctness.
----
However, this version retains the SignedComponent design.
What is different is: headers we consider acceptable to not-sign are not attached to the signing request,
hence not to the cached entry.
This means that
* no need to worry about if the signer will/will-not sign those values
* provides the correctness adutra needed: if a signed header changes, the cache will not retrieve a signature
which will be rejected by the S3 endpoint
Add assertions in TestS3RestSigner to verify cache hit/misses as as expected.
2d4911a to
b0691a5
Compare
This just feels like we're overcomplicating the actual use case and how this is beneficial to the client and the service. The only two requests that should be cached are What led to this line of exploration appears to stem from trying to implement Edit: Just to add one more thing. I'm actually quite supportive of improving this in meaningful ways. I just don't think it's necessary to overly complicate this, so maybe there's a middle ground like having the client only scope to We've had multiple versions of production systems running securely with the existing functionality for years, so I'm very hesitant to say there's something fundamentally wrong with the current approach. Changes should be grounded in real observable problems. |
| assertCacheHitsAndMisses(0, 1); | ||
|
|
||
| // the etag is passed in: the same object is returned and the same cached signature is retained. | ||
| // if the ifMatch header was cached, this would have resulted in a failure as there would |
There was a problem hiding this comment.
| // if the ifMatch header was cached, this would have resulted in a failure as there would | |
| // if the ifMatch header was signed, this would have resulted in a failure as there would |
|
|
||
| @Test | ||
| public void validatePutObject() { | ||
| int hits = S3V4RestSignerClient.cacheHits(); |
There was a problem hiding this comment.
| int hits = S3V4RestSignerClient.cacheHits(); |
| int hits = S3V4RestSignerClient.cacheHits(); | ||
| int misses = S3V4RestSignerClient.cacheMisses(); |
There was a problem hiding this comment.
| int hits = S3V4RestSignerClient.cacheHits(); | |
| int misses = S3V4RestSignerClient.cacheMisses(); |
| s3.listObjectsV2(ListObjectsV2Request.builder().bucket(BUCKET).prefix("some/prefix/").build()); | ||
| // list is a GET. | ||
| assertCacheHitsAndMisses(1, 1); | ||
| assertCacheHitsAndMisses(1, 1); |
There was a problem hiding this comment.
| assertCacheHitsAndMisses(1, 1); |
| static final Cache<S3SignRequest, SignedComponent> SIGNED_COMPONENT_CACHE = | ||
| Caffeine.newBuilder().expireAfterWrite(30, TimeUnit.SECONDS).maximumSize(100).build(); | ||
|
|
||
| private static final AtomicInteger CACHE_HITS = new AtomicInteger(); |
There was a problem hiding this comment.
Can't we use Caffeine's .recordStats() instead? And ideally, we would call .recordStats() only when running tests.
There was a problem hiding this comment.
happily. though I was thinking of another stat "entry in cache but header mismatch" as that's also of interest.
| AwsS3V4SignerParams signerParams = | ||
| extractSignerParams(AwsS3V4SignerParams.builder(), executionAttributes).build(); | ||
|
|
||
| // Strip-off headers that should be signed |
There was a problem hiding this comment.
| // Strip-off headers that should be signed | |
| // Strip-off headers that shouldn't be signed |
But this rule isn't enforced. Nothing in the specs prevents a server from sending a
Yes but in fact, the most problematic scenario for me is a But as I said, having now a good understanding of the limitations, I've already adapted Polaris to adhere to the implicit constraints. I'm leaving the ongoing work here for @steveloughran to continue, and will close #15171. |
instead only worrying about filtering of unwanted headers.
tests. -Servlet is in charge of filtering; tests extended
|
@adutra @danielcweeks I've gone back to the servlet-only decision about what to sign, but with an expanded set of headers: more SDK, more "aws-" service and some of the classic http protocol fluff. That's all I care about, as it's enough to reduce the risk that slightly different GET requests will match in the cache but be inconsistently signed. Alex's tests in TestS3V4RestSignerClient are retained, as are my ones in TestS3RestSigner, which expands validation of multipart upload completion and a few more opeations. It doesn't make any assertions on cache hit/miss though, due to the roll-back of I'm personally happy with just the expanded set of stripped headers, which this iteration provides. |
I believe the default is that the client doesn't cache unless told to do so, which makes caching a server responsibility. While it might make sense to limit to just the two methods we expect, it should be the client's responsibility to fix a bad server implementation. Yes, the client would break, but it's really the server that needs to be fixed.
I think that's putting to much control in the client's hands and limits what functionality the server has in deciding what to sign for. If a client "hides" the range header, a server would only have the option to sign for everything or nothing. While in practice, I don't know of any implementation is protecting ranges of files, it is entirely feasible and since it's the servers responsibility to protect the data, it should have the final say on what it allows to be read. |
|
@danielcweeks it'd be hard to read a parquet file with a signing service restricting range access, not least because that service would need to be aware of the layout of the specific .par files. Restricting avro access is possible. But if the service was signing ranges, then it'd have to decide what to do if the range wasn't specified, as otherwise a client would just bypass any restrictions by not passing in the range. Presumably the service would have to reject the request as "incomplete headers" Which is out of scope of the current PR that simply expands the set of ignored headers. It might be useful to add as a response to the API specification though, somehow. |
|
I should add that's where the discussion on parquet dev about having binary files outside the file itself gets interesting. That's mainly driven by the need to manage and ingest large scale BLOBs, but it'd also permit a catalog which was aware of the external files to restrict access to specific users. I hadn't thought of the security aspects of it. |
|
@danielcweeks @adutra this is ready for review; it's so minimal that it shouldn't be controversial. |
Fixes #15417
Expands the set of headers that the
S3V4RestSignerClientshould ignore when signing to includeThe intent is to reduce the risk of different requests having conflicting headers from any cached signature where the cache key is (verb, url, region), as seen in #15166.