Architecting Distributed Apps for Azure – Part 2

The practice of developing distributed systems is an old one in the software industry. Over the years, some common assumptions were made that later turned out to be false. Peter Deutsch originally drafted seven such assumptions in 1994. James Gosling later added the eighth, and together they are now known as the eight fallacies of distributed computing. Let’s walk through them:

  • The network is reliable: In reality, it is not, due to many factors ranging from hardware to software. The designer of the app needs to consider this, and should implement error handling and retries for any operation that reaches another service across the wire.
  • Latency is zero: Many solutions that perform very well on local systems during development may experience additional latency in other environments, such as test or production. The more requests an application makes (chatty applications), the more latency becomes an issue. It’s common that legacy on-premises applications migrated to the cloud experience performance issues, because the application is too chatty. This is not necessarily due to bad architecture or design; on-premises applications are often designed with a different set of assumptions, such as network latency staying below a specific bound.
  • Bandwidth is infinite: Bandwidth has been improving, but it is not infinite. You should also note that you will be deploying services to multitenant environments, where the service provider throttles your requests to mitigate the noisy neighbor problem. Available bandwidth can also change dramatically over the lifetime of a connection, so applications should keep payloads small. This brings up the classic tradeoff between bandwidth and latency.
  • The network is secure: The network is usually a resource the developers of the application cannot control. The application cannot assume the network is secure. The application must secure its data, and authenticate all requests.
  • Topology does not change: It’s not safe to say that a specific service is always in the same location, that traffic will take the same route, or that latency will not change. Applications should not make assumptions about the location of critical components in the infrastructure. A standard way to decouple applications from the location is to use DNS.
  • There is one administrator: A distributed system is a complex one, often managed by a wide variety of teams with sometimes conflicting priorities. A good example is an IIS application requiring full trust on a server that is also performing other tasks which contradict this requirement. The application might work in the development or test environment, but that may not be the case in production. The designer of the application should always remember that other teams might invalidate assumptions. As administrators, we also depend on the services of a nonhuman administrator, the orchestrator, and its human administrators. The application should also emit appropriate diagnostics and monitoring data to help the administrators manage the system.
  • Transport cost is zero: There are two ways to interpret this. One is that the actual data pushed by the application incurs costs in terms of compute power and latency (think SSL). The other is that the equipment carrying traffic between two points costs money. For example, a system running locally with small amounts of data might seem inexpensive. However, when it is deployed on a cloud platform, providers charge for network traffic coming out of (and sometimes into) the datacenter. In addition, with real data, the transport requirement may become so large that it requires a costlier, dedicated Internet connection to the cloud provider’s datacenter.
  • The network is homogenous: Do not depend on proprietary protocols or transports. The transport mechanism is a resource the application designer does not control, and changes to it will invalidate the assumptions on a proprietary protocol choice. Stay interoperable through standard protocols.
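Because the network is not reliable and latency is not zero, the first two fallacies are usually answered in code with retries and backoff. The following is a minimal sketch of such a retry helper; the callable name, delay values, and exception types are illustrative assumptions, not part of any specific Azure SDK:

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter.

    `operation` is any callable that may raise on a transient fault.
    The delay values here are illustrative, not prescriptive.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up: the fault may not be transient after all
            # Exponential backoff with jitter avoids synchronized retry storms
            # when many clients hit the same failing service at once.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)
```

In production you would typically reach for a library (such as Polly in .NET or the retry policies built into the Azure SDKs) rather than hand-rolling this, but the shape of the logic is the same.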


Forward and Reverse Proxies


There are two types of proxies: reverse proxies and forward proxies. Forward proxies have features to support clients, and reverse proxies have features to support services.

A typical setup has two servers exposing a service through a reverse proxy. Reverse proxies perform some of the following tasks:

  • Load balancing
  • Routing
  • SSL termination
  • Caching
  • Authentication/validation
  • Tenant throttling/billing
  • Some DDoS mitigation
  • Aggregation
  • Monitoring
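Load balancing, the first task in the list, is often just a selection policy over a pool of backends. This is a minimal round-robin sketch; the backend names are placeholders for illustration:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Round-robin selection across a pool of backend servers,
    as a reverse proxy might do in front of a service."""

    def __init__(self, backends):
        # cycle() yields the backends in order, forever.
        self._pool = cycle(backends)

    def next_backend(self):
        """Return the backend that should receive the next request."""
        return next(self._pool)
```

Real reverse proxies layer health checks, weighting, and session affinity on top of a policy like this, but the core idea is the same: the client sees one endpoint while traffic fans out across many servers.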

Forward proxies process outgoing requests. A company may set up a proxy and force all of its outgoing traffic through it, for example to provide:

  • Caching
  • Client anonymization
  • Monitoring
  • Routing

A carefully-designed RESTful web API defines the resources, relationships, and navigation schemes that are accessible to client applications. When you implement and deploy a web API, you should consider the physical requirements of the environment hosting the web API and the way in which the web API is constructed rather than the logical structure of the data.

Circuit breaking

In a distributed environment, calls to remote resources and services can fail due to transient faults, such as slow network connections, timeouts, or the resources being overcommitted or temporarily unavailable. These faults typically correct themselves after a short period of time, and a robust cloud application should be prepared to handle them by using a strategy such as the Retry pattern.

However, there can also be situations where faults are due to unanticipated events, and that might take much longer to fix. These faults can range in severity from a partial loss of connectivity to the complete failure of a service. In these situations it might be pointless for an application to continually retry an operation that is unlikely to succeed, and instead the application should quickly accept that the operation has failed and handle this failure accordingly.

Additionally, if a service is very busy, failure in one part of the system might lead to cascading failures. For example, an operation that invokes a service could be configured to implement a timeout, and reply with a failure message if the service fails to respond within this period.

The Circuit Breaker pattern can prevent an application from repeatedly trying to execute an operation that’s likely to fail, allowing it to continue without waiting for the fault to be fixed or wasting CPU cycles while it determines that the fault is long lasting. The Circuit Breaker pattern also enables an application to detect whether the fault has been resolved. If the problem appears to have been fixed, the application can try to invoke the operation.
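The pattern above can be sketched in a few lines. This is a deliberately minimal version, assuming three states (closed, open, half-open after a timeout) and a simple consecutive-failure count; the class name, thresholds, and exception choice are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after `failure_threshold`
    consecutive failures; after `reset_timeout` seconds the breaker goes
    half-open and a single trial call decides whether to close or re-open."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hammering a broken service.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: fall through and let one trial call proceed.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        # Success: reset the breaker to the closed state.
        self.failures = 0
        self.opened_at = None
        return result
```

A production implementation would add thread safety, per-endpoint breakers, and richer failure classification, but the state machine is the heart of the pattern.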

