5XX is something we should never return programmatically. This may sit in the same territory as the checked vs. unchecked exceptions debate, but I’m approaching it from the angle of good API design.
|Status|What the API is saying|What the client does|
|---|---|---|
|2XX|I have handled everything you requested and here is the result of the operation.|Show the result of the requested operation, happy days!|
|4XX|I have handled an exception in my API and I know what it is, so here is a bad request response. I might fail fast and return this response in milliseconds, and I may not even call any downstream services, to avoid a downward spiral…|I know I sent a valid request, but for some reason(s) the API didn’t process it. I’ll retry or show an updated result (error screen, retry with new input etc.). However, I’ll space each retry out further than the last to leave the API room to recover. I’ll also note this down in case it is needed for future resolutions…|
|5XX|Something is stopping my program flow and it is an exceptional case. My engineers need to look at it and later either fix the issue, or catch it and give a meaningful result (4XX) to my clients if this case is an expected outcome of the service design… Or there is some big meltdown happening and my API is not even aware of it because the infrastructure or the whole service is down…|Something exceptional is happening in the API. The API doesn’t know what it is, and I’m not even sure my request reached the API. I’ll not retry but rather show the user that something is going on that engineers are fixing. I’ll note this down and will not retry past a meaningful time or budget. It is not great, but everything fails all the time…|
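The client column of the table can be sketched as a small retry helper. This is only an illustration under assumptions: `call_with_backoff`, `send`, `max_attempts` and `base_delay` are hypothetical names that do not come from any library, and doubling the delay is just one way to read "leave the API room to recover".

```python
import time
from typing import Callable, List, Tuple


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, List[float]]:
    """Call `send` (which returns an HTTP status code) and retry only on 4XX.

    Each retry waits twice as long as the previous one. A 5XX is returned
    immediately without retrying, and the retry budget is `max_attempts`.
    Returns the final status and the list of delays that were used.
    """
    delays: List[float] = []
    status = send()
    for attempt in range(1, max_attempts):
        if not 400 <= status < 500:
            break  # 2XX: done. 5XX: do not add traffic to a struggling system.
        delay = base_delay * 2 ** (attempt - 1)
        delays.append(delay)
        sleep(delay)
        status = send()
    return status, delays
```

Passing a no-op `sleep` makes the helper easy to exercise without waiting, for example `call_with_backoff(lambda: 500, sleep=lambda s: None)` returns straight away with no retries.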
Leaky abstractions are your enemy!
In distributed systems, everything fails all the time. Good API design captures failures and handles them gracefully. Returning 5XX to clients creates a false perception and leaks the abstraction. Given that APIs are here forever, a leaked abstraction creates a dependency in clients that is tech debt from day one.
To explain this further: I know my DB fails from time to time (hardware ceilings, SLAs, downstream issues, tech debt etc.). If I return 5XX to my consumers on those occasions, they will accept it as a design decision and assume that this is simply how the API works (leaking the abstraction). A client may then implement logic that treats every 5XX as "the API might have failed" and retries the request. However, while the system is down, sending more traffic only makes things worse. So returning 5XX not only creates a leaky abstraction, it also escalates system issues to a level that is hard to recover from. If the API instead handled the failure gracefully by design and returned 4XX, it could sit in front of a system that puts these requests in an expiring queue (e.g. expire requests after 5 minutes), so that even if the client keeps trying, I have a way to catch up. From the client’s perspective, this is abstract, reliable and predictable API behaviour. Also, because the API can handle the DB failure and queue incoming requests while the DB recovers, it will be performant and fast too…
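The expiring queue idea can be sketched in memory as follows. This is a minimal illustration, not a production design: `ExpiringQueue`, `put` and `drain` are hypothetical names, and a real service would more likely rely on a managed queue with a retention period. The 300 second default mirrors the 5 minute example above.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, List, Tuple


class ExpiringQueue:
    """Holds requests while a dependency recovers and drops the stale ones."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._items: Deque[Tuple[float, Any]] = deque()

    def put(self, request: Any) -> None:
        # Store the request together with the moment it stops being useful.
        self._items.append((self._clock() + self._ttl, request))

    def drain(self) -> List[Any]:
        """Return requests that have not expired yet, oldest first, and empty the queue."""
        now = self._clock()
        live = [request for deadline, request in self._items if deadline > now]
        self._items.clear()
        return live
```

Once the DB is healthy again, the API would call `drain()` and replay whatever is still within its time window; anything older is silently discarded, which keeps a retrying client from building up an unbounded backlog.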
In this sense, clearly, if an API is up and running and only a dependency is failing, it is 4XX. 5XX is when the API is broken for a reason whose details we cannot capture at all. If we cannot write to the DB because the DynamoDB service is down, it is 4XX because, as an API, I know which dependency is failing and I know how to catch up (put failed requests on SQS and retry later, etc.). If API Gateway is down and the request never reaches my API, I know AWS will throw 5XX. If the Lambda service cannot run my function code because of a defect in my code pipeline (wrong compilation, a missing package in the dependencies, wrong packaging etc.), I know Lambda will throw 5XX.
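A handler following this split might look like the sketch below. Everything named here is an assumption made for illustration: `DependencyUnavailable`, `handle`, `save` and `enqueue` are hypothetical, and since the text only says "4XX", the concrete code 429 is a placeholder rather than a recommendation from the article.

```python
from typing import Any, Callable, Dict


class DependencyUnavailable(Exception):
    """Hypothetical error raised when a known dependency (e.g. the DB) is down."""


# Assumption: the article only says "4XX"; 429 is a placeholder choice.
DEPENDENCY_FAILURE_STATUS = 429


def handle(
    request: Dict[str, Any],
    save: Callable[[Dict[str, Any]], None],
    enqueue: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    # Fail fast on invalid input without touching any downstream service.
    if "id" not in request:
        return {"statusCode": 400, "body": "missing field: id"}
    try:
        save(request)
    except DependencyUnavailable:
        # Known failure: park the request for a later replay and answer with 4XX.
        enqueue(request)
        return {"statusCode": DEPENDENCY_FAILURE_STATUS, "body": "queued, retry later"}
    # Any other exception is deliberately not caught here, so it surfaces
    # as a platform-level 5XX that engineers need to investigate.
    return {"statusCode": 200, "body": "ok"}
```

The point of the design is that the only 5XX responses left are the ones nobody planned for.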
As an engineer, in the cases above I immediately know that there is a 5XX going on in my system and I know where to look. Otherwise, for every 5XX, I’ll need to search through logs to see whether it is an API response, an exception in the distributed system, or something at the AWS level…
This perspective also leads to meaningful alerts and observability. Alerts fire at the right level and only in exceptional cases. Otherwise, returning 5XX programmatically creates complex baseline data for observability and alerting. Acting whenever there are more than 0 5XX responses is a much cleaner operational rule than constantly tweaking 5XX thresholds in my monitoring system and dashboards. As an engineer, I need to look under the hood only when there is a real issue, a real 5XX, and not when an arbitrary programming decision in the control flow throws 5XX, such as when a field is missing in the request.
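The "more than 0" rule is simple enough to state in code. The function name `should_alert` is hypothetical, and a real setup would express this in the monitoring system’s own rule language; the sketch only shows that no threshold needs tuning.

```python
from typing import Iterable


def should_alert(status_codes: Iterable[int]) -> bool:
    """Alert as soon as a single 5XX appears in the observed window."""
    return any(500 <= code < 600 for code in status_codes)
```

With 4XX reserved for handled failures, a window such as `[200, 400, 404]` stays quiet, while one stray 500 is enough to warrant a look.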
To conclude this opinion piece: 5XX is something we should see rarely, for exceptional issues that are fixed at the infrastructure, system design or wider distributed system level. Anything above that level should be handled by the program flow and returned as 4XX with a meaningful message to API consumers. Failing fast, handling failure gracefully and having a plan to recover from dependency failures are hard parts of API design, but when they are invested in early and carefully planned as part of the implementation, they earn trust, help the API run more reliably and lead to happier engineers and operations.
What do you think?