Skip main navigation

Azure Well-Architected Framework pillars

Learn how to balance and align the business requirements with the technical capabilities that are needed to execute those requirements.

In the previous step you were introduced to the activity and what the desired learning outcomes are throughout this activity. In this step you will learn about Azure Well-Architected Framework pillars.

The cloud has changed the way organizations solve their business challenges, and how applications and systems are designed. The role of a solution architect is not only to deliver business value through the functional requirements of the application. It’s also to ensure that the solution is designed in ways that are scalable, resilient, efficient, and secure.

Solution architecture is concerned with the planning, design, implementation, and ongoing improvement of a technology system. The architecture of a system must balance and align the business requirements with the technical capabilities that are needed to execute those requirements. The finished architecture is a balance of risk, cost, and capability throughout the system and its components.

Azure Well-Architected Framework

The Azure Well-Architected Framework is a set of guiding tenets to build high-quality solutions on Azure. There’s no one-size-fits-all approach to designing an architecture, but there are some universal concepts that will apply regardless of the architecture, technology, or cloud provider.

These concepts are not all-inclusive, but focusing on them will help you build a reliable, secure, and flexible foundation for your application.

The Azure Well-Architected Framework consists of five pillars:

  • Cost optimization
  • Operational excellence
  • Performance efficiency
  • Reliability
  • Security

Cost optimization

You’ll want to design your cloud environment so that it’s cost-effective for operations and development. Identify inefficiency and waste in cloud spending to ensure you’re spending money where you can make the greatest use of it.

Three arrows pointed up, labelled, Quality, Speed, Efficiency, and one arrow pointed down, labelled Cost.

Your organization has moved a majority of its systems to the cloud, but you’re now seeing cost increases in areas you didn’t expect. After some observation, you realize that you’re inefficient across your environment, and you’re still doing manual operational work.

In this unit, you’ll learn about cost optimization. You’ll also look at ways to reduce unnecessary expenses and improve operational efficiencies.

What is cost optimization?

Cost optimization is ensuring that the money your organization spends is being used to maximum effect. Cloud services provide computing as a utility. Technologies in the cloud are provided under a service model, to be consumed on demand. This on-demand service offering drives a fundamental change that directly affects planning, bookkeeping, and organizing.

When an organization decides to own infrastructure, it buys equipment that goes onto the balance sheet as assets. Because a capital investment was made, accountants categorize this transaction as a capital expense (CapEx). Over time, to account for the assets’ limited useful lifespan, assets are depreciated or amortized.

Cloud services, on the other hand, are categorized as an operating expense (OpEx), because of their consumption model. Under this scheme, there is no asset to amortize. Instead, OpEx has a direct impact on net profit, taxable income, and the associated expenses on the balance sheet.

When an organization adopts a cloud platform, it must shift away from CapEx-oriented budgeting toward OpEx. This move reflects the shift from owning infrastructure to leasing solutions. Some organizations can derive value just from this new accounting model. For example, a startup company can attract investors by demonstrating a profitable idea at large scale, without needing a large investment up front to purchase infrastructure.

To optimize costs in your organization’s architecture, you can use several principles.

Plan and estimate costs

For any cloud project, whether it’s the development of a new application or the migration of an entire datacenter, it’s important to get an estimate of your costs. This estimate involves identifying any current resources to move or redevelop, understanding business objectives that might affect sizing, and selecting the appropriate services for the project.

With the requirements identified, you can use cost-estimation tools to provide a more concise estimate of the resources that would be required. Transparency is important here, so that all stakeholders can review for accuracy and have visibility into the costs that are associated with the project.

Provision with optimization

Provisioning services that are optimized for cost from the outset can reduce your work effort in the future. For example, you should ensure that you’re selecting the appropriate service level for your workload, and take advantage of services that let you adjust the service level. You should also use discounts when they’re available, such as reserved instances and bring-your-own-license offers.

Where possible, you want to move from IaaS to PaaS services. PaaS services typically cost less than IaaS, and they generally reduce your operational costs.

With PaaS services, you don’t have to worry about patching or maintaining VMs, because the cloud provider typically handles those activities. Not all applications can be moved to PaaS, but with the cost savings that PaaS services provide, it’s worth considering.

Use monitoring and analytics to gain cost insights

If you’re not monitoring your spending, you don’t know what you can save. Take advantage of cost management tools and regularly review billing statements to better understand where money is being spent.

Take time to conduct regular cost reviews across services to understand if the expenditure is appropriate for the resource requirements of the workload. Adjust expenditures as necessary. Identify and track down any cost anomalies that might show up on billing statements or through alerts. If you notice a large spike in cost associated with network traffic, it might uncover both cost savings and potential technical issues.

Maximize efficiency of cloud spend

Efficiency is focused on identifying and eliminating unnecessary expenses within your environment. The cloud is a pay-as-you-go service, and avoidable expenses are typically the result of provisioning more capacity than your demand requires. Operational costs can also contribute to unnecessary or inefficient costs. These inefficient operational costs show up as wasted time and increased error. As you design your architecture, identify and eliminate waste across your environment.

Waste can show up in several ways. Let’s look at a few examples:

  • A virtual machine that’s always 90 percent idle
  • Paying for a license included in a virtual machine when a license is already owned
  • Retaining infrequently accessed data on a storage medium optimized for frequent access
  • Manually repeating the build of a non-production environment

In each of these cases, you’re spending more money than you should. Each case presents an opportunity for cost reduction.

As you evaluate your cost, take the opportunity to optimize environments. Capacity demands can and will change over time, and many cloud services can manually or dynamically adjust the provisioned resources to meet the demands. These adjustments can drive the balance between a well-running application and the most cost-effective size.

Optimize your systems at every level. At the network level, ensure that data transfer is efficient and meets the expectations of your customers. Use services to cache data to increase application performance and reduce the transaction load on your data-storage services. Identify and decommission unused resources. Take advantage of lower-cost data-storage tiers to archive infrequently accessed data.

Operational excellence

By taking advantage of modern development practices such as DevOps, you can enable faster development and deployment cycles. You need to have a good monitoring architecture in place so that you can detect failures and problems before they happen or, at a minimum, before your customers notice. Automation is a key aspect of this pillar to remove variance and error while increasing operational agility. Moving just your resources to the cloud is taking advantage of only a small portion of what the cloud can bring to your organization. Along with the technical capabilities that the cloud brings, you can improve your operational capabilities as well. From improving developer agility to improving your visibility into the health and performance of your application, you can use the cloud to improve the operational capabilities of your organization.

What is operational excellence?

Operational excellence is about ensuring that you have full visibility into how your application is running, and ensuring the best experience for your users. Operational excellence includes making your development and release practices more agile, which allows your business to quickly adjust to changes. By improving operational capabilities, you can have faster development and release cycles, and a better experience for your application’s users.

You can use several principles when driving operational excellence through your architecture.

Design, build, and orchestrate with modern practices

Modern architectures should be designed with DevOps and continuous integration in mind. A modern architecture will give you the ability to automate deployments by using infrastructure as code, automate application testing, and build new environments as needed. DevOps is as much cultural as it is technical, but can bring many benefits to organizations that embrace it.

Regardless of whether your project is a greenfield application that uses full continuous integration and continuous deployment (CI/CD) and containers, or if it’s a legacy application that you’re continuing to service, there are DevOps practices you can bring into your organization.

Breaking down silos within an organization is a common thread throughout DevOps. So is working collaboratively across every stage in a project, including change management. Creating a culture of sharing, collaboration, and transparency will bring operational excellence to your organization.

Use monitoring and analytics to gain operational insights

Throughout your architecture, you want to have a thorough monitoring, logging, and instrumentation system. By creating an effective system for monitoring what’s going on in your architecture, you can ensure that you’ll know when something isn’t right before your users are affected. With a comprehensive approach to monitoring, you’ll be able to identify performance issues and cost inefficiencies, correlate events, and gain a greater ability to troubleshoot issues.

Operationally, it’s important to have a robust monitoring strategy. This helps you identify areas of waste, troubleshoot issues, and optimize the performance of your application. A multilayered approach is essential. Gathering data points from components at every layer will help alert you when values are outside acceptable ranges and help you track spending over time.

Use automation to reduce effort and error

You should automate as much of your architecture as possible. The human element is costly, injecting time and error into operational activities. This increased time and error will result in increased operational costs. You can use automation to build, deploy, and administer resources. By automating common activities, you can eliminate the delay in waiting for a human to intervene.


You should include testing in your application deployment and your ongoing operations. A good testing strategy will help you identify issues in your application before it’s deployed, and ensure that dependent services can properly communicate with your application.

A good testing strategy can also help identify performance issues and potential security vulnerabilities in both pre-production and production deployments. A robust testing plan can uncover issues with infrastructure deployments that can affect the user experience, and testing will help you provide a great experience for your users.

Performance efficiency

Imagine that a news story was just published about one of your organization’s recent product announcements. The additional publicity from the news story will undoubtedly bring a large influx of traffic to your website. Will your website be able to handle this traffic increase, or will the additional load cause your site to be slow or unresponsive?

In this unit, you’ll look at some of the basic principles of ensuring outstanding application performance by using scaling and optimization principles that make up the performance efficiency pillar.

What is performance efficiency?

Performance efficiency is matching an application’s available resources with the demand that it’s receiving. Performance efficiency includes scaling resources, identifying and optimizing potential bottlenecks, and optimizing your application code for peak performance.

Let’s look at some patterns and practices that can enhance the scalability and performance of your application.

Scale up and scale out

Compute resources can be scaled in two directions:

  • Scaling up is adding more resources to a single instance. This is also known as vertical scaling.
  • Scaling out is adding more instances. This is also known as horizontal scaling.

Two small computer with an arrow pointing to a six big computer

Scaling up is concerned with adding more resources, such as CPU or memory, to a single instance. This instance might be a virtual machine or a PaaS service.

The act of adding more capacity to the instance increases the resources that are available to your application, but it does come with a limit. Virtual machines are limited to the capacity of the host that they run on, and hosts themselves have physical limitations. Eventually, when you scale up an instance, you can run into these limits. They restrict your ability to add more resources to the instance.

Scaling out is concerned with adding more instances to a service. They can be virtual machines or PaaS services. Instead of adding more capacity by making a single instance more powerful, we add capacity by increasing the total number of instances.

The advantage of scaling out is that you can conceivably scale out forever if you have more machines to add to the architecture. Scaling out requires some type of load distribution. This might be in the form of a load balancer that distributes requests across available servers, or it might be a service-discovery mechanism for identifying active servers to send requests to.

In both types of scaling, resources can be reduced, which brings cost optimization into the picture.

Autoscaling is the process of dynamically allocating resources to match performance requirements. As the volume of work grows, an application might need more resources to maintain the desired performance levels and satisfy service-level agreements (SLAs). As demand slackens and the additional resources are no longer needed, they can be deallocated to minimize costs.

Autoscaling takes advantage of the elasticity of cloud-hosted environments while easing management overhead. It reduces the need for an operator to continually monitor the performance of a system and make decisions about adding or removing resources.

Optimize network performance

When you’re optimizing for performance, you’ll look at network and storage performance to ensure that their levels are within acceptable limits. These performance levels can affect the response time of your application. Selecting the right networking and storage technologies for your architecture will help you ensure that you’re providing the best experience for your consumers.

Adding a messaging layer between services can have a benefit to performance and scalability. A messaging layer creates a buffer so that requests can continue to flow in without error if the receiving application can’t keep up. As the application works through the requests, they’ll be answered in the order in which they were received.

Optimize storage performance

In many large-scale solutions, data is divided into separate partitions that can be managed and accessed separately. The partitioning strategy must be chosen carefully to maximize the benefits while minimizing adverse effects. Partitioning can help improve scalability, reduce contention, and optimize performance.

Use caching in your architecture to help improve performance. Caching is a mechanism to store frequently used data or assets (webpages, images) for faster retrieval. You can use caching at different layers of your application. You can use caching between your application servers and a database in order to decrease data retrieval times.

You can also use caching between your users and your web servers by placing static content closer to users and decreasing the time it takes to return webpages to the users. This has a secondary effect of offloading requests from your database or web servers, increasing the performance for other requests.

Identify performance bottlenecks in your application

Distributed applications and services running in the cloud are complex pieces of software that comprise many moving parts. In a production environment, it’s important to be able to track the way in which users utilize your system, trace resource utilization, and generally monitor the health and performance of your system. You can use this information as a diagnostic aid to detect and correct issues. You can also use this information to help spot potential problems and prevent them from occurring.

Performance optimization will include understanding how the applications themselves are performing. Errors, poorly performing code, and bottlenecks in dependent systems can all be uncovered through an application performance-management tool. Often, these issues might be hidden or obscured for users, developers, and administrators, but they can have an adverse impact on the overall performance of your application.

Look across all layers of your application and identify and remediate performance bottlenecks. These bottlenecks might be poor memory handling in your application, or even the process of adding indexes into your database. It might be an iterative process as you relieve one bottleneck and then uncover another that you were unaware of.

With a thorough approach to performance monitoring, you’ll be able to determine the types of patterns and practices that your architecture will benefit from.

Performance efficiency

For an architecture to perform well and be scalable, it should properly match resource capacity to demand. Traditionally, cloud architectures accomplish this balance by scaling applications dynamically based on activity in the application. Demand for services changes, so it’s important for your architecture to be able to adjust to demand. By designing your architecture with performance and scalability in mind, you’ll provide a great experience for your customers while being cost-effective.


Every architect’s worst fear is having an architecture fail with no way to recover it. A successful cloud environment is designed in a way that anticipates failure at all levels. Part of anticipating failures is designing a system that can recover from a failure within the time that your stakeholders and customers require.


Data is the most valuable piece of your organization’s technical footprint. In this pillar, you’ll focus on securing access to your architecture through authentication and protecting your application and data from network vulnerabilities. You should protect the integrity of your data too, through tools like encryption.

You must think about security throughout for the entire lifecycle of your application, from design and implementation to deployment and operations. The cloud provides protections against a variety of threats, such as network intrusion and DDoS attacks. But you still need to build security into your application, processes, and organizational culture.

General design principles

In addition to each of these pillars, there are some consistent design principles that you should consider throughout your architecture.

  • Enable architectural evolution: No architecture is static. Allow for the evolution of your architecture by taking advantage of new services, tools, and technologies when they’re available.
  • Use data to make decisions: Collect data, analyze it, and use it to make decisions surrounding your architecture. From cost data, to performance, to user load, using data will guide you to make the right choices in your environment.
  • Educate and enable: Cloud technology evolves quickly. Educate your development, operations, and business teams to help them make the right decisions and build solutions to solve business problems. Document and share configurations, decisions, and best practices within your organization.
  • Automate: Automation of manual activities reduces operational costs, minimizes error introduced by manual steps, and provides consistency between environments.

Shared responsibility

Moving to the cloud introduces a model of shared responsibility. In this model, your cloud provider will manage certain aspects of your application, leaving you with the remaining responsibility.

In an on-premises environment, you’re responsible for everything. As you move to infrastructure as a service (IaaS), then to platform as a service (PaaS) and software as a service (SaaS), your cloud provider will take on more of this responsibility.

This shared responsibility will play a role in your architectural decisions, because these decisions can have implications on cost, security, and your application’s technical and operational capabilities. By shifting these responsibilities to your provider, you can focus on bringing value to your business and move away from activities that aren’t a core business function.

Design choices

In an ideal architecture, you would build the most secure, high-performance, highly available, and efficient environment possible. However, as with everything, there are trade-offs.

To build an environment with the highest level of all these pillars, there’s a cost. That cost might be in money, time to deliver, or operational agility. Every organization will have different priorities that will affect the design choices that are made in each pillar. As you design your architecture, you’ll need to determine which trade-offs are acceptable, and which are not.

When you’re building an Azure architecture, there are many considerations to keep in mind. You want your architecture to be secure, scalable, available, and recoverable. To make that possible, you’ll have to make decisions based on cost, organizational priorities, and risk.


In a complex application, any number of things can go wrong at any scale. Individual servers and hard drives can fail. A deployment issue might unintentionally drop all tables in a database. Whole datacenters might become unreachable. A ransomware incident might maliciously encrypt all your data. It’s critical that your application stays reliable and can handle both localized and broad-impact incidents.

Designing for reliability includes maintaining uptime through small-scale incidents and temporary conditions like partial network outages. You can ensure that your application can handle localized failures by integrating high availability into each component of the application and eliminating single points of failure. Such a design also minimizes the impact of infrastructure maintenance. High-availability designs typically aim to eliminate the impact of incidents quickly and automatically, and to ensure that the system can continue to process requests with little to no impact.

Designing for reliability also focuses on recovery from data loss and from larger-scale disasters. Recovery from these types of incidents often involves active intervention, though automated recovery steps can reduce the time needed to recover. These types of incidents might result in some amount of downtime or permanently lost data. Disaster recovery is as much about careful planning as it is about execution.

Including high availability and recoverability in your architecture design protects your business from financial losses that result from downtime and lost data. They ensure that your reputation isn’t negatively affected by a loss of trust from your customers.

Architecting for reliability ensures that your application can meet the commitments you make to your customers. This includes ensuring that your systems are available to end users and can recover from any failures.

Build a highly available architecture

For availability, identify the service-level agreement (SLA) to which you’re committing. Examine the potential high-availability capabilities of your application relative to your SLA, and identify where you have proper coverage and where you’ll need to make improvements. Your goal is to add redundancy to components of the architecture so that you’re less likely to experience an outage.

Examples of high-availability design components include clustering and load balancing:

  • Clustering replaces a single VM with a set of coordinated VMs. When one VM fails or becomes unreachable, services can fail over to another one that can service the requests.
  • Load balancing spreads requests across many instances of a service, detecting failed instances and preventing requests from being routed to them.

Build an architecture that can recover from failure

For recoverability, you should perform an analysis that examines your possible data loss and major downtime scenarios. Your analysis should include an exploration of recovery strategies and the cost/benefit tradeoff for each. This exercise will give you important insight into your organization’s priorities, and help clarify the role of your application. The results should include the application’s:

  • Recovery point objective (RPO): The maximum duration of acceptable data loss. RPO is measured in units of time, not volume. Examples are “30 minutes of data,” “four hours of data,” and so on. RPO is about limiting and recovering from data loss, not data theft.
  • Recovery time objective (RTO): The maximum duration of acceptable downtime, where “downtime” is defined by your specification. For example, if the acceptable downtime duration is eight hours in the event of a disaster, then your RTO is eight hours.

With RPO and RTO defined, you can design backup, restore, replication, and recovery capabilities into your architecture to meet these objectives.

Every cloud provider offers a suite of services and features that you can use to improve your application’s availability and recoverability. When possible, use existing services and best practices, and try to resist creating your own.

Hard drives can fail, datacenters can become unreachable, and hackers can attack. It’s important that you maintain a good reputation with your customers by using availability and recoverability. Availability focuses on maintaining uptime through conditions like network outages, and recoverability focuses on retrieving data after a disaster.


Healthcare organizations store personal and potentially sensitive customer data. Financial institutions store account numbers, balances, and transaction history. Retailers store purchase history, account information, and demographic details of customers. A security incident might expose this sensitive data, which might cause personal embarrassment or financial harm. How do you ensure the integrity of their data and ensure that your systems are secure?

What is security?

Security is ultimately about protecting the data that your organization uses, stores, and transmits. The data that your organization stores or handles is at the heart of your securable assets. This data might be sensitive data about customers, financial information about your organization, or critical line-of-business data that supports your organization. Securing the infrastructure on which the data exists, along with the identities used to access it, is also critically important.

Your data might be subject to additional legal and regulatory requirements, depending on where you’re located, the type of data you’re storing, or the industry in which your application operates.

For instance, in the healthcare industry in the United States, there’s a law called the Health Insurance Portability and Accountability Act (HIPAA). In the financial industry, the Payment Card Industry Data Security Standard is concerned with the handling of credit card data. Organizations that store data that’s in scope for these laws and standards are required to ensure that certain safeguards are in place for the protection of that data. In Europe, the General Data Protection Regulation (GDPR) lays out the rules of how personal data is protected, and defines individuals’ rights related to stored data. Some countries require that certain types of data do not leave their borders.

When a security breach occurs, there can be substantial impacts to the finances and reputation of both organizations and customers. This breaks down the trust that customers are willing to instill in your organization, and can affect the organization’s long-term health.

Defense in depth

A multilayered approach to securing your environment will increase the security posture of your environment. Commonly known as defense in depth, we can break down the layers as follows:

  • Data
  • Applications
  • VM/compute
  • Networking
  • Perimeter
  • Policies and access
  • Physical security

Each layer focuses on a different area where attacks can happen, and creates a depth of protection if one layer fails or is bypassed by an attacker. If you were to focus on just one layer, an attacker would have unfettered access to your environment if they got through this layer.

Addressing security in layers increases the work an attacker must do to gain access to your systems and data. Each layer will have different security controls, technologies, and capabilities that will apply. When you’re identifying the protections to put in place, cost is often of concern. You’ll need to balance cost with business requirements and overall risk to the business.

The different layers of security in a flow diagrame

No single security system, control, or technology will fully protect your architecture. Security is more than just technology; it’s also about people and processes. Creating an environment that looks holistically at security and makes it a requirement by default will help ensure that your organization is as secure as possible.

Protect from common attacks

At each layer, there are some common attacks that you’ll want to protect against. The following list isn’t all-inclusive, but it can give you an idea of how each layer can be attacked and what types of protections you might need.

  • Data layer: Exposing an encryption key or using weak encryption can leave your data vulnerable if unauthorized access occurs.
  • Application layer: Malicious code injection and execution are the hallmarks of application-layer attacks. Common attacks include SQL injection and cross-site scripting (XSS).
  • VM/compute layer: Malware is a common method of attacking an environment, which involves executing malicious code to compromise a system. After malware is present on a system, further attacks that lead to credential exposure and lateral movement throughout the environment can occur.
  • Networking layer: Unnecessary open ports to the internet are a common method of attack. These might include leaving SSH or RDP open to virtual machines. When these protocols are open, they can allow brute-force attacks against your systems as attackers attempt to gain access.
  • Perimeter layer: Denial-of-service (DoS) attacks often happen at this layer. These attacks try to overwhelm network resources, forcing them to go offline or making them incapable of responding to legitimate requests.
  • Policies and access layer: This layer is where authentication occurs for your application. This layer might include modern authentication protocols such as OpenID Connect, OAuth, or Kerberos-based authentication such as Active Directory. The exposure of credentials is a risk at this layer, and it’s important to limit the permissions of identities. You also want to have monitoring in place to look for possible compromised accounts, such as logins coming from unusual places.
  • Physical layer: Unauthorized access to facilities through methods, such as door drafting and theft of security badges, can happen at this layer.

Shared security responsibility

Revisiting the model of shared responsibility, we can reframe this in the context of security. Depending on the type of service you select, some security protections will be built in to the service, while others will remain your responsibility. Careful evaluation of the services and technologies that you select will be necessary, to ensure that you’re providing the proper security controls for your architecture.

Table showing the different types of responsibility

You are almost at the end of this activity. When you are ready click on Mark as complete and then Next to continue on to the Knowledge Check.

This article is from the free online

Designing Infrastructure Solutions with Microsoft Azure Architecture

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now