Architecting Distributed Apps for Azure – Part 1


Architecting distributed cloud applications is something every cloud developer must learn and master. This blog series is for cloud developers and cloud architects who are working on, or planning to work on, a distributed cloud application.

This blog series aims to provide a detailed understanding of distributed application concepts, the advantages and disadvantages of specific technologies, and the cloud design patterns heavily used by distributed cloud applications. It will help you build efficient, fault-tolerant cloud applications.

Part 1 – Basics of Distributed Cloud Applications, Fault Tolerance, Microservices, and Containers

Why Cloud Computing?


Cloud computing is a big shift from the traditional way businesses think about IT resources. Why is cloud computing so popular? Here are six common reasons organisations are turning to cloud computing services:
  • Productivity: Cloud-based applications gain productivity because well-proven, globally scalable cloud components let developers and architects focus on business logic rather than reinventing the wheel.
  • Cost: Cloud computing eliminates the capital expense of buying hardware and software and of setting up and running on-site datacenters. You can scale the number of instances up and down as needed, which saves cost.
  • Speed: Most cloud computing services are provided self-service and on demand, so even vast amounts of computing resources can be provisioned in minutes, typically with just a few mouse clicks, giving businesses flexibility and taking the pressure off capacity planning.
  • Global scale: The benefits of cloud computing services include the ability to scale elastically. In cloud speak, that means delivering the right amount of IT resources—for example, more or less computing power, storage, bandwidth—right when it’s needed, and from the right geographic location.
  • Performance: The biggest cloud computing services run on a worldwide network of secure datacenters that are regularly upgraded to the latest generation of fast and efficient computing hardware. This offers several benefits over a single corporate datacenter, including reduced network latency for applications and greater economies of scale.
  • Reliability: Cloud computing makes data backup, disaster recovery, and business continuity easier and less expensive, because data can be mirrored at multiple redundant sites on the cloud provider’s network.

While cloud computing offers a number of benefits, building applications for the cloud does require a different approach and mindset.

Ways to run your existing/new applications on Azure Cloud

  1. Existing Applications: traditional applications running on-premises as-is, before any migration.
  2. Cloud Infra Ready Apps: applications rehosted ("lift and shift") onto cloud virtual machines (IaaS) with minimal code changes.
  3. Cloud DevOps Ready Apps: applications containerized and deployed with CI/CD pipelines and managed cloud services, without rearchitecting.
  4. Cloud Optimized Apps: applications rearchitected as cloud-native, using microservices, PaaS, and serverless technologies.


(Image Source : MSDN Documentation)

Design applications to be self-healing

Design an application to be self-healing when failures occur. This requires a three-pronged approach:

  • Detect failures.
  • Respond to failures gracefully.
  • Log and monitor failures to give operational insight.

How you respond to a particular type of failure may depend on your application’s availability requirements.


  • Retry failed operations. Transient failures may occur due to momentary loss of network connectivity, a dropped database connection, or a timeout when a service is busy. Build retry logic into your application to handle transient failures.
  • Protect failing remote services (Circuit Breaker). It’s good to retry after a transient failure, but if the failure persists, you can end up with too many callers hitting a failing service. This can lead to cascading failures as requests back up. Use the Circuit Breaker Pattern to fail fast (without making the remote call) when an operation is likely to fail.
  • Isolate critical resources (Bulkhead). Failures in one subsystem can sometimes cascade. This can happen if a failure causes some resources, such as threads or sockets, to not be freed in a timely manner, leading to resource exhaustion. To avoid this, partition a system into isolated groups, so that a failure in one partition does not bring down the entire system.
  • Perform load leveling. Applications may experience sudden spikes in traffic that can overwhelm services on the backend. To avoid this, use the Queue-Based Load Leveling Pattern to queue work items to run asynchronously. The queue acts as a buffer that smooths out peaks in the load.
  • Fail over. If an instance can’t be reached, fail over to another instance. For things that are stateless, like a web server, put several instances behind a load balancer or traffic manager. For things that store state, like a database, use replicas and fail over. Depending on the data store and how it replicates, this may require the application to deal with eventual consistency.
  • Compensate for failed transactions. In general, avoid distributed transactions, as they require coordination across services and resources. Instead, compose an operation from smaller individual transactions. If the operation fails midway through, use Compensating Transactions to undo any step that already completed.
  • Checkpoint long-running transactions. Checkpoints can provide resiliency if a long-running operation fails. When the operation restarts (for example, it is picked up by another VM), it can be resumed from the last checkpoint.
  • Degrade gracefully. Sometimes you can’t work around a problem, but you can provide reduced functionality that is still useful. Consider an application that shows a catalog of books. If the application can’t retrieve the thumbnail image for the cover, it might show a placeholder image. Entire subsystems might be noncritical for the application. For example, in an e-commerce site, showing product recommendations is probably less critical than processing orders.
  • Throttle clients. Sometimes a small number of users create excessive load, which can reduce your application’s availability for other users. In this situation, throttle the client for a certain period of time. See the Throttling Pattern for more information.
  • Block bad actors. Just because you throttle a client, it doesn’t mean the client was acting maliciously. It just means the client exceeded their service quota. But if a client consistently exceeds their quota or otherwise behaves badly, you might block them. Define an out-of-band process for the user to request getting unblocked.
  • Use leader election. When you need to coordinate a task, use leader election to select a coordinator. That way, the coordinator is not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement a leader election algorithm from scratch, consider an off-the-shelf solution such as Zookeeper.
  • Test with fault injection. All too often, the success path is well tested but not the failure path. A system could run in production for a long time before a failure path is exercised. Use fault injection to test the resiliency of the system to failures, either by triggering actual failures or by simulating them.
  • Embrace chaos engineering. Chaos engineering extends the notion of fault injection, by randomly injecting failures or abnormal conditions into production instances.
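As a minimal sketch of the retry bullet above, retry logic with exponential backoff and jitter might look like the following. `TransientError` is a hypothetical stand-in for whatever transient exception your client library actually raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a momentary failure such as a dropped connection."""

def retry(operation, max_attempts=4, base_delay=0.1):
    """Retry an operation, backing off exponentially between attempts.

    Only transient failures are retried; any other exception propagates
    immediately so permanent errors are not hidden by the retry loop.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # Exponential backoff with jitter spreads retries out in time
            # so callers recovering from the same outage do not all hit
            # the service again at the same moment.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

Many client SDKs (including the Azure SDKs) already have retry policies built in, so check for those before rolling your own.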
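The Circuit Breaker Pattern from the list above can be sketched as follows. This is a deliberately minimal, single-threaded illustration; production resilience libraries add per-exception policies and thread safety:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast once a service looks unhealthy."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast without making the remote call at all.
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: go half-open and let one trial call through.
            self.opened_at = None
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The key property is that while the circuit is open, callers get an immediate error instead of queueing up behind a service that is already struggling.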
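Queue-Based Load Leveling can be illustrated in-process with a plain queue. In a real system the queue would be a durable service such as Azure Queue Storage or Service Bus and the worker a separate process; this sketch only shows the shape of the pattern:

```python
import queue
import threading

# The queue is the buffer between bursty producers and a worker that
# drains requests at its own steady rate.
work_queue = queue.Queue()
results = []

def worker():
    while True:
        item = work_queue.get()
        if item is None:          # sentinel: shut the worker down
            break
        results.append(item * 2)  # placeholder for real processing
        work_queue.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(100):              # traffic spike: enqueue instantly
    work_queue.put(i)
work_queue.put(None)
t.join()                          # the backlog drains at the worker's pace
```

The producer finishes enqueueing almost instantly, while the backend processes the backlog at whatever rate it can sustain.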


Orchestrators

Orchestrators automate the lifecycle and management of systems and services. The lifecycle consists of creating and destroying machines (virtual or physical), monitoring the health of deployed resources, and deploying and running the service code. The orchestrator also manages the network, so that communication between your machines is isolated, coordinates the distribution of bandwidth between different tenants, and so on.

Cloud vendors use orchestrators to provide varying levels of automation and services: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Container as a Service (CaaS), and even Function as a Service (FaaS). Cloud vendors' orchestrators are exposed through an API.

Orchestration is fairly broad and is often used to describe cluster management, scheduling of compute tasks, and the provisioning and deprovisioning of resources.

Orchestration often consists of:

  • Provisioning is the process of bringing new resources online and getting them ready to perform work. This may involve creating a new virtual machine, setting up the operating system, creating and configuring the network, and so on.
  • Cluster management involves sending tasks to machines, adding and removing machines, and managing active processes.
  • Scheduling is the process of running a specific service on specific machines in a cluster.

In some cases multiple orchestrators will be used. A cloud vendor's orchestrator may manage the provisioning of managed services and servers when deploying or scaling a cluster, while a cluster orchestrator manages the containers and processes running inside the cluster.
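As a toy illustration of the scheduling step above, here is a round-robin task scheduler sketch. Real orchestrators weigh resource usage, placement constraints, and machine health, all of which this deliberately ignores:

```python
from itertools import cycle

def schedule(tasks, machines):
    """Assign tasks to machines round-robin.

    A toy stand-in for the scheduling an orchestrator performs: each
    incoming task goes to the next machine in a fixed rotation.
    """
    assignment = {machine: [] for machine in machines}
    ring = cycle(machines)  # endless rotation over the machine list
    for task in tasks:
        assignment[next(ring)].append(task)
    return assignment
```

For example, `schedule(["t0", "t1", "t2"], ["vm-a", "vm-b"])` places `t0` and `t2` on `vm-a` and `t1` on `vm-b`.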

Regions, availability zones, and fault domains

Cloud vendors generally build out infrastructure around regions and availability zones. This allows for highly available application deployments that can be located near the clients consuming the applications.

A region is a geographical area on the planet containing one or more datacenters in close proximity, networked together. Those datacenters are sometimes grouped into availability zones. An availability zone has its own independent power and networking and is set up to be an isolation boundary: if one availability zone goes down, the others continue working. The availability zones are typically connected to each other through very fast, private fiber-optic networks.

Within an availability zone, VMs are deployed on physical machines that are organized into racks, and each rack has its own router. A single physical machine may host multiple VMs, and those VMs may in turn run multiple containers.

When a request arrives at an endpoint, it is usually first handled by a load balancer, which routes the traffic to an instance of a service. The goal is to run the code on different VMs that are not close to each other, to reduce the chance of a single point of failure. The unit of potential failure is called a fault domain. In this hierarchy, when:

  • a region goes down, everything inside the region is down.
  • an availability zone goes down, everything inside the availability zone is lost.
  • a rack goes down, the physical machines in that rack are lost.
  • a physical machine goes down, the VMs on it are lost.

You can make services more robust by deploying them across this fault domain hierarchy. As you move up the hierarchy, the likelihood of losing the fault domain gets smaller: losing a whole region is less likely than losing a single physical machine. If you are concerned that a region could be a single point of failure, consider distributing services across multiple regions. Services deployed to different fault domains will often need to replicate data between instances, and intraservice communication is typically used for that replication. As you go up the hierarchy, the latency between fault domains increases: replicating data between two regions takes longer than between two availability zones. Conversely, replication between two VMs on the same physical machine is very fast, but losing that machine is more likely.
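To make the placement idea concrete, here is a small sketch of choosing replica hosts so that no two replicas share a fault domain. The inventory format (a list of host/fault-domain pairs) is invented for illustration, not any Azure API:

```python
def place_replicas(inventory, num_replicas):
    """Pick replica hosts from distinct fault domains.

    `inventory` is a hypothetical list of (host, fault_domain) tuples.
    Walking the list, we take the first host seen in each new fault
    domain until we have enough replicas.
    """
    chosen, used_domains = [], set()
    for host, domain in inventory:
        if domain not in used_domains:
            chosen.append(host)
            used_domains.add(domain)
            if len(chosen) == num_replicas:
                return chosen
    raise ValueError("not enough distinct fault domains for the replica count")
```

In practice the platform does this for you: Azure availability sets and availability zones spread instances across fault domains automatically, so application code rarely makes this choice directly.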

Next Steps: Move to Part 2

