
The Up-and-Running Guide to Architectural Fitness Functions

What you need to know about them without any mysteries left to uncover.

You’ve heard about architectural fitness functions, but most of the material you find is missing concrete details. Let’s fix that.

I’ll explain:

  • What architectural fitness functions are;
  • What they help you do;
  • How they relate to other test types and architecture practices;
  • How to build them and use them;
  • Where to find relevant tooling for fitness functions;
  • Numerous examples of fitness functions;
  • How the open-source tool ArchFit can help you get up to speed very quickly with a set of ready-to-use fitness functions.

This will hopefully provide a practical runway for you, so I’ll also stick in a bunch of additional reading and viewing for when you want to level up.

Let’s start with the problem before we get to the concept itself.

The problem is, we are not used to testing architecture — just code

It’s non-trivial for most organizations to succinctly define what they want their software to do — knowing what it should do and then actually making that happen. You can express this in many ways, short and sweet or book-length, but that struggle is what I believe is the simple truth at the center of many of our professional lives. And when you think you’ve got a handle on that part of the equation, having put together an idea for a solution, then comes the next thing storming in:

How do we know that the grand solution, the architecture, is any good?

One of the ways to know this is by using architectural fitness functions.

What architectural fitness functions are

In Building Evolutionary Architectures, architectural fitness functions are defined as

any mechanism that performs an objective integrity assessment of some architecture characteristic or combination of architecture characteristics.

You can therefore express fitness functions almost any way you like, from a monitoring query to actual coded unit tests. They can exist in a range of dimensions, such as proactive automated measures or manually run detective toolsets. Do you run any linting tools today? Security scanning? If so, that’s a type of fitness function right there.

Anything you can measure in an objective manner is fair game. While some metrics overlap with conventional wisdom from adjacent fields (testing, monitoring/observability, manual reviews…), fitness functions are explicitly intended to measure architectural concerns and characteristics rather than the logically correct behavior of the solution. This is a unique aspect of fitness functions—the similarity with other fields comes because their data informs signals we’ve always cared about when we need to understand our architectures. The subject of the test is not, say, a logical, deterministic function any longer. Instead, it’s perhaps resource utilization facts, a list of blob storage buckets with public visibility, security vulnerability enumerations, or other similar data.

Measurement is not enough, however. Failure to meet the fitness function’s expectation needs to lead to an output, such as a test run failure, blocked deployment, alert being triggered to the team or some other concrete effect. This is one way in which we differentiate fitness functions from traditional monitoring.

Fitness functions can measure single (atomic) or multiple (holistic) factors at once. They can also be either automated or manual, depending on the circumstances and what makes sense for a given test. As always, though, automation is a virtue we should strive towards, if at all possible.


We’ll go into some of the concepts and theory very soon, but I’ll cut to the chase for a moment and show you a minimum viable example so you have some practical grounding to start with.

Making a minimum viable fitness function

For our “vertical slice” example, let’s say you have built a small application: It’s an Express server running on a low-tier virtual machine in one of the popular cloud providers. The application simply returns the current GMT (UTC+0) time.
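If you want to picture it, a minimal sketch of such a service could look something like this (the port and route are just illustrative):

// Minimal sketch of the example service: an Express server returning the current UTC time.
// The port and route are illustrative assumptions, not a prescribed setup.
import express from 'express';

const app = express();

app.get('/time', (_req, res) => {
  // toISOString() always renders the timestamp in UTC (GMT/UTC+0)
  res.json({ time: new Date().toISOString() });
});

app.listen(3000, () => console.log('Time service listening on port 3000'));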

The application has tests:

  • A few unit tests to ensure logically correct behavior.
  • A simple smoke/integration test to check that the deployed server works and responds with a status of 200.

You can go overboard with more testing types too, but given the simplicity of the example it’s fair to say we can be confident that the coded solution itself works. What we don’t know is whether the architecture is any good.

As an architect, you have to care about properties of software that may seem more abstract to others than the code is (which is indeed highly concrete and present). You care about the software’s -ilities—resiliency, usability, quality, performance efficiency, security, and many more such properties. These are manifest in (or absent from) the software but require measurement tools that not everyone is aware of. Not even architects, always! Every architecture is also by necessity a compromise between these qualities, which can’t always co-exist—these are the trade-offs architects have to deal with; see Software Architecture: The Hard Parts: Modern Trade-Off Analyses for Distributed Architectures for a lot more on this.

So as the joke/saying/sad truth goes,

“You’re not aiming for the best architecture, you are content with having the least bad one”.

For this specific system, what matters to us?

If I allow myself to shortcut a bit, let’s pull out a simple, generic one: We care about the latency of the response. Correctness is also important to a service like this but is hard to validate with something as non-deterministic as time. However, we know that security and privacy are not high on our priority list for a public service like this, so let’s go with latency.

The basic components for our primitive fitness function will be:

  • Metric(s) or data: The number of requests and the request latencies for those requests.
  • Function: The fitness function will calculate the average response time of the web application’s endpoint.
  • Target: We will set a static maximum allowed latency value, such as 200 milliseconds. Anything under that value will be OK, anything over will be a failure.

We could easily implement the above as a dashboard with an alarm in our cloud monitoring tool (AWS CloudWatch, Datadog, New Relic…) that’s triggered every time we have an average latency over the threshold value. Latency also happens to be a pretty uncomplicated numeric fact, which means we can create averages like we just did. Note that given the detail and quality of your data, you can definitely do p95 or other non-average slicing too. Remember, we’re keeping it simple here.
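On AWS, for example, a rough sketch of setting up that alarm programmatically could look like the following. It assumes the service sits behind an API Gateway, and the names and thresholds are placeholders.

// Sketch: a CloudWatch alarm on average latency, created with the AWS SDK v3.
// Namespace/dimension values assume an API Gateway in front of the service.
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const client = new CloudWatchClient({ region: 'eu-north-1' });

await client.send(
  new PutMetricAlarmCommand({
    AlarmName: 'time-service-high-latency', // hypothetical name
    Namespace: 'AWS/ApiGateway',
    MetricName: 'Latency',
    Dimensions: [{ Name: 'ApiName', Value: 'time-service' }], // hypothetical API name
    Statistic: 'Average',
    Period: 300, // evaluate in 5-minute windows
    EvaluationPeriods: 1,
    Threshold: 200, // our maximum acceptable average latency in milliseconds
    ComparisonOperator: 'GreaterThanThreshold'
  })
);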

This is not bad, but it’s not ideal either. Stuck in the monitoring tool, we can’t really think of it as a ”function” of the system (in the way we think of “everything-as-code”) and the context starts concerning operational, rather than architectural, parameters. Monitoring is detail-oriented and isn’t concerned about the architecture (because that’s not what it monitors) — but it will be a (valuable) input to discussions on architectural properties. So keep doing all of that stuff!

In other words: The signal is right but not the context, and inferring outcomes (such as bad latency) from signals packaged for another context won’t make your life any less complicated. What’s good is that our team will know something is up with the solution, but it’s not an architectural fact at this point.

We’ll write some pseudo code for a binary, atomic fitness function—it will only ever be either OK or not OK. I am a simple man and enjoy simple things, booleans being one of them. It’s smart to think of fitness functions as any other modern test—as code collocated with the system.

// Pseudo code
MAX_ACCEPTABLE_LATENCY = 200

function calculateAverageLatency(totalResponseTime, responseCount):
    return totalResponseTime / responseCount

function calculateFitness(averageLatency):
    if (averageLatency <= MAX_ACCEPTABLE_LATENCY) return true
    return false

function evaluateArchitecture(totalResponseTime, responseCount):
    averageLatency = calculateAverageLatency(totalResponseTime, responseCount)
    isFit = calculateFitness(averageLatency)
    return isFit

In a cloud environment, say AWS, you can collect this data from core tools like CloudWatch and CloudTrail, just like we did above in our monitoring implementation.
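To make that concrete, here is a sketch of the same fitness function as code, pulling its data straight from CloudWatch. It again assumes an API Gateway in front, and the names are illustrative.

// Sketch: the latency fitness function as code, with CloudWatch as the data source.
// Namespace/dimension values assume an API Gateway in front; adjust to your own metrics.
import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch';

const MAX_ACCEPTABLE_LATENCY = 200; // milliseconds

async function evaluateArchitecture(): Promise<boolean> {
  const client = new CloudWatchClient({ region: 'eu-north-1' });

  // Fetch the average latency for the last hour
  const stats = await client.send(
    new GetMetricStatisticsCommand({
      Namespace: 'AWS/ApiGateway',
      MetricName: 'Latency',
      Dimensions: [{ Name: 'ApiName', Value: 'time-service' }], // hypothetical API name
      StartTime: new Date(Date.now() - 60 * 60 * 1000),
      EndTime: new Date(),
      Period: 3600,
      Statistics: ['Average']
    })
  );

  // Missing data counts as a failure rather than a silent pass
  const averageLatency = stats.Datapoints?.[0]?.Average ?? Infinity;

  return averageLatency <= MAX_ACCEPTABLE_LATENCY;
}

evaluateArchitecture().then((isFit) => console.log('Latency fitness:', isFit));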

The result of this function is a simple answer to the question “is my architecture as fast as I expected it to be?”. The function can be automated to run continuously, as part of CI, or as an inspection tool. Functionally it’s very close to the alarm, but having it as code means we have a vast degree of flexibility in how to employ it in several dimensions, for test data or actual production systems, and work explicitly towards better architectures.

Now, if we have a hypothesis that a different size VM will improve latency, we could do a canary deployment and traffic shift a limited set of production traffic to the new VM, fire off the fitness function, and decide automatically if we have met our goals. Programmability is a big shift over the operations-facing alarms. Both are needed, they might overlap in context, but they solve different problems.

With this, we now have a basic fitness function that verifies a key architectural property of our system. But there are so many more, in the real world!

How fitness functions can help shift architecture left

Fitness is the property of being suitable for some purpose.

Architectural fitness functions were made popular together with the concept of evolutionary architecture by Patrick Kua et al., but have been present in fields outside of software engineering for several decades.

Grady Booch wrote that

All architecture is design, but not all design is architecture. Architecture represents the set of significant design decisions that shape the form and the function of a system, where significant is measured by cost of change.

The corollary is that all software has some “shape” and properties. If it’s intentional we may call this architecture. For the software to support its intended use, we as its creators have to make choices on what it should excel at, and what it is allowed to be less concerned about.

Something that’s probably haunted many software architects is how “cross-functional requirements” (also called non-functional requirements)— or an architecture’s quality attributes — have the propensity to disappear, become down-prioritized, or otherwise “never happen”.

If I had a dollar for every time…

Commonly, the solution is to rely on architectural reviews or change boards, whose role is to assess and validate a solution and to create a social context in which to later reassess the work as needed. While architecture reviews and change boards may enforce compliance and adherence to a varying degree, they’ve also proven to create chokepoints and less agile conditions in software organizations.

I believe there are a few reasons this happens (and there are certainly more!):

  • The business/product/project owners are not getting crisp details on how poor architecture is harming them and the system, thus their decision-making powers cannot be exercised adequately for the right issues. This may lead to mismanagement, harmful decisions, and overruns.
  • Developers have a poor understanding of architectural concerns, how these are manifested, and how architecture supports the developers’ part: building the software according to sound principles and outlines. Poor understanding could lead to consequences like flawed execution, team friction, and an overall bad workplace with high attrition.
  • Architects may lack the ability to identify and educate about concrete, objective evidence of an architecture possessing (or not possessing) the qualities it should have; therefore they “don’t scale” well, resulting in tedious manual processes and the continued myth about the ivory tower architect.

Further, these reviews move the sense of responsibility from the producers (the team, the developers) to someone else outside of that team. This can even create a vicious cycle of acrimony between both parties because one party complains and the other might not have all the facts to do the right work. In short, we should do better than this format.

Just giving up, as some stories go, and blasting off like a gung-ho, guns-blazing agilista who forgoes any reviews or talk of architecture is, of course, also fraught with its own issues. I’ve seen this happen and it’s never a pretty sight. And I endorse agile development! But this way doesn’t work out.

If we consider fitness as a property of subjects within an evolutionary system—which most definitely software systems are an example of—then we’ve already seen how fitness can be used as a metric (or degree) to which parts of the system align to their ideal purpose.

In software, as with everything, meaningful change can only come from feedback: knowing and understanding what’s going on with our systems. The faster and more accurate the feedback, the better our chances to course-correct and adjust to better results (whatever that means in a given context). This works regardless of the performance or result we attain:

  • If the feedback shows we are in a negative state, we can identify, adjust and optimize what ails us.
  • If the feedback shows we are doing well, we can spend time elsewhere.

The fitness functions thus present objective facts about the architecture so we can make informed decisions on what to keep (evolve) and what to lose (devolve).

From one-off reviews to continuous feedback

The question turns to where and how to validate a solution, if not with such mechanisms.

Most teams and organizations have the concept of Definition of Done or something similar. Ensuring adherence to standards and expectations already happens in teams’ CI/CD pipelines. Pierre Pureur, one of the authors of Continuous Architecture in Practice, argues the pipeline is a good point from which to make the architecture assessment:

The most effective way to automate the [Definition of Done] is to build it into the team’s automated continuous delivery (CD) pipeline, supplemented by longer-running quality tests that help to evaluate the fitness for purpose of the [Minimum Viable Architecture].

This addresses several possible dimensions of fitness functions, and we’ll see more about that later in this article.

If we go with Pureur’s approach, we now have two problems solved:

  • The location problem, as this happens during CI/CD.
  • The ownership problem, as it’s most likely that these tests should be owned by the team owning the system.

Next is understanding what we want to measure and what data is available: What data or evidence is there to support an assessment (fitness function) in portraying a positive or negative state? Unsurprisingly, this is the same situation as with any testing (unit, integration…) you’ve done before; the twist is that it may feel a tad more abstract with fitness functions, but really, it’s not. Data sources could include any meaningful and truthful mix of:

  • Logs
  • Custom file content parsing of source code
  • Abstract Syntax Tree (AST) parsing of source code
  • Cloud vendor APIs
  • Third-party (SaaS etc.) APIs

These are just examples; you could have many more sources to use!

As with any data or assessments, the quality and validity of an assessment made with these functions are dependent on accurately gauging the subject. If the data is bad, or the assessment is misconstrued, then the fitness function will effectively not test anything of value at all. It may even give misleading information.

The simpler the tests are, and the more trustworthy they are, the better. Don’t forget that they need to have an actual effect on those who build solutions. If fitness functions fail or return with negative results but there is no corrective action taken, then you have an adherence and process problem.

So, where do traditional reviews come in? I still think they have their role as a social, intellectual, and competence-oriented instrument. We can’t easily infer intent from software either, so setting a team up for success is still a valid proposition, and giving rich feedback will always have a place.

However, instead of the unclear outcomes we may recall from our own reviews, a concrete output of a review could be the requirement to set up some number of fitness functions to run while developing. You could also use fitness functions to create high water marks or benchmarks that you target. If you’re in an early stage of development — there might not even be any code at all yet — then you could look at making fitness functions that test the hypotheses of your system. Once the groundwork is laid and there is some clarity in the solutioning phase, then it should be possible to define and deploy fitness functions to give continuous feedback to the team.


Also: Don’t miss Mark Richards’s YouTube channel where he’s recorded several videos on fitness functions.

Dimensions of fitness functions

Fitness functions can have different dimensions, to correspond to their use and temporal location in the software development process. The high-level split goes like this:

Atomic fitness functions are those that interact with clear, precise, singular contexts. Examples of this include unit testing, linting, and other static analysis. Generally, these are things that can be done relatively straightforwardly, and even using test data. Whatever can be thought of as binary and absolute, like guardrails, you probably want to try to push towards preventive/proactive static analysis.

Holistic fitness functions, conversely, look at the “bigger picture”, which implies a shared context — several things interacting to produce the result you are looking to assess. For the dynamic functions, this is where you’ll have to consider behavior and anything that’s not possible to know at (or before) deployment time. Many of the “true” performance-related data points (e.g. latency, memory use, security) will be of the holistic type.

The next dimension concerns the scope of a fitness function: Static fitness functions end up with fixed or binary results, while dynamic fitness functions need additional context to produce a result that might be graded or on a scale, rather than a basic pass/fail.

As with many things in computing, fitness functions can be either automated or manually run. It’s a truism that automation is often worth the time and energy to set up, but I’ll leave that argumentation out here — consider that argument solid, and made convincingly by others already. However, because the architecture process may also be inspective or detective in nature — for example, as a part of reviews, or as an ad hoc mechanism, e.g. fitness functions “on-demand” — and since some factors may be very hard to automate in a safe manner, it’s well worth accepting manual fitness function tests to a higher degree than other, more conventional tests.

Which leads us to triggered fitness functions. The above examples are just that, while continuous fitness functions check… well… continuously over time. Think of how monitoring works. We can run such functions on a pre-deployment environment (or ephemeral environment) or after deployment, on the actual production infrastructure. There are logically valid use cases for both types, and generally anything that is dependent on real data would be a continuous concern, while that which you can check statically can be analyzed with a triggered fitness function. Also, though I’m not sure it’s entirely correct to say so, tests in your CI pipeline — while technically triggered — do work in a continuous fashion of sorts, since they run every time anything changes in the code.

I will even propose that you can further mentally package these dimensions into modes that are applicable for distinct scenarios:

  • Reactive: Holistic, dynamic, continuous functions
  • Proactive: Static, triggered functions
  • Detective: Any manual functions

That makes sense to me, and maybe it will to you, too!

Examples of architectural fitness functions

Note that the categorizations underneath the example headings are suggestions, as they become what you make of them.

Example 0: Latency and performance⌛️

Holistic | Automated | Triggered or Continuous | Static or Dynamic

Data source examples: APIs for compute services and API Gateway; APIs for utilization (e.g. CloudWatch)

Back to where we started!

Likely the crispest and clearest concept to test is latency. This might make sense to test when, for example, making changes to the configuration of your servers/functions.

Relatedly, you could probably use static analysis to check configurations for over- or under-provisioning. The problem is that it would (at least currently, in 2023) follow a static threshold (some values allowed; some not), rather than understand how it actually works for your workload. For Lambda functions, you can use AWS Lambda Power Tuning to profile your code across a range of configurations to find the best fit in performance and cost.

Still, to validate the actual real-life fitness of your solution, you’ll need to have a fitness function for it that runs on production data.

You can easily expand this area with concerns including:

  • Partial factors, such as disk use or network capacity
  • Cold starts
  • Concurrency

Example 1: Sustainability 🌱

Holistic | Automated | Continuous | Dynamic

Data source examples: APIs for compute services and API Gateway; APIs for billing

Utilization is one of the leading factors you can work with when making a more sustainable solution. If you’re using virtual machines, you can consider a model with better performance-per-watt (e.g. ARM) or simply choose a lower-tier model for a better (higher) consistent utilization rate.

You could also create a fitness function to see if there are consistent gaps between requests to a system. Say that you have discovered there is no traffic during certain hours, then you could implement a scaling mechanism or even completely turn off servers during these hours. This specific mechanism would not be part of the fitness function, but the information needed to know this fact and to make an accurate decision would be part of the function’s job.

# Pseudo code example
function calculateFitness(requestIntervals, noTrafficHours):
    consistentGaps = true

    # Check whether the gaps between consecutive requests stay within the allowed window
    for i = 1 to length(requestIntervals) - 1:
        timeGap = requestIntervals[i] - requestIntervals[i - 1]

        if timeGap > MAX_ALLOWED_GAP:
            consistentGaps = false
            break

    # Score high only if traffic is consistent and the known no-traffic hours appear in the data
    if consistentGaps:
        for hour in noTrafficHours:
            if hour in requestIntervals:
                return FITNESS_HIGH
        return FITNESS_LOW

    return FITNESS_LOW

# Example input data
requestIntervals = [10, 20, 30, 60, 70]
noTrafficHours = [2, 3, 4]

# Constants
MAX_ALLOWED_GAP = 15
FITNESS_HIGH = 1.0
FITNESS_LOW = 0.0

# Calculate fitness
fitnessScore = calculateFitness(requestIntervals, noTrafficHours)
print("Fitness Score:", fitnessScore)

Lastly, you could also dig into your coffers and get a service like GaiaGen (www.gaiagen.eu) to calculate this for you.

Example 2: Reliability (no dropped messages) 📨

Holistic | Automated | Continuous | Dynamic

Data source examples: APIs for messaging and event services
# Pseudo code example
function calculateFitness(messageCounts, totalEvents):
    totalMessages = sum(messageCounts)
    droppedMessages = totalEvents - totalMessages

    if droppedMessages == 0:
        return FITNESS_HIGH
    else:
        return FITNESS_LOW

# Example input data
messageCounts = [100, 150, 200]
totalEvents = 500

# Constants
FITNESS_HIGH = 1.0
FITNESS_LOW = 0.0

# Calculate fitness
fitnessScore = calculateFitness(messageCounts, totalEvents)
print("Fitness Score:", fitnessScore)

Example 3: Error rate (i.e. working as intended) ❌

Holistic | Automated | Continuous | Dynamic

Data source examples: APIs for compute services and API Gateway

This one is very close to being a monitoring/observability issue, but I’ll add it anyway. We want the solution to actually work as intended. While development and testing should take care of that in theory, it might be worth checking that there are no unhandled errors (500-class errors for an API), or at least only a minimal degree of them.

# Pseudo code example
function calculateFitness(errorCount, totalCount):
    errorPercentage = (errorCount / totalCount) * 100

    if errorPercentage <= 1.0:
        return FITNESS_HIGH
    else:
        return FITNESS_LOW

# Example input data
errorCount = 15
totalCount = 1000

# Constants
FITNESS_HIGH = 1.0
FITNESS_LOW = 0.0

# Calculate fitness
fitnessScore = calculateFitness(errorCount, totalCount)
print("Fitness Score:", fitnessScore)

Concerns like Service Level Objectives are also still valid, even with fitness functions, since the SLOs are a contract of sorts between a team and its stakeholders.

Example 4: Maintainability (code quality) 🧹

Atomic | Automated | Triggered | Static

Data source examples: Source code

Code quality is one of the more conventional aspects we can run fitness functions for. Lots of organizations use software such as SonarQube/SonarCloud and newer SaaS offerings like DeepSource, Codacy, and CodeScene. CodeScene is especially unique in its capabilities to uncover technical debt, hotspots, knowledge distribution issues, and more of these complex parts of software engineering.

Any tool in this space should support CI runs and — if your CI and integration strategy allows it — running checks in the code quality tool before any branch is integrated. This way the tool de-risks new code that would otherwise introduce quality issues. The tool should also support programmatic access to results so we can extend fitness functions, if needed, based on the most current data it has on our code quality. Generally, you should be able to automate quite extensively for things like limits on code complexity and cyclic dependencies.

As mentioned in this great article by ThoughtWorks, we could set up fitness functions to retrieve tool-calculated quality scores for the codebase.

How various badges can show tool-derived scores in GitHub. These could reasonably be used in your own fitness functions, too.
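For example, if you run SonarQube or SonarCloud, a small sketch of a fitness function on top of its quality gate status could look like this. The server URL, project key, and token handling are placeholders, and you should verify the endpoint against your server version.

// Sketch: check whether the SonarQube quality gate for a project passes.
// SONAR_URL and PROJECT_KEY are placeholders for illustration.
import { Buffer } from 'node:buffer';

const SONAR_URL = 'https://sonar.example.com';
const PROJECT_KEY = 'my-service';

async function qualityGateIsFit(token: string): Promise<boolean> {
  const response = await fetch(
    `${SONAR_URL}/api/qualitygates/project_status?projectKey=${PROJECT_KEY}`,
    {
      headers: {
        // SonarQube accepts a user token as the username in basic auth
        Authorization: `Basic ${Buffer.from(`${token}:`).toString('base64')}`
      }
    }
  );

  const data = await response.json();

  // 'OK' means every quality gate condition passed
  return data.projectStatus?.status === 'OK';
}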

Typically, tests with the atomic/automated/triggered/static dimensions can be run as early as during the pre-commit check. Tools like Husky make it effortless to make such hooks.

Example 5: Maintainability (structure, dependencies…) 📚

Atomic | Automated | Triggered | Static

Data source examples: Source code

A key component of maintainable solutions is a sound software architecture, in terms of structure, packages, namespaces, and dependencies. This is so important, in fact, that a key reason for developer attrition is dissatisfaction with the codebase they work on (incl. technical debt). Everybody wins if this is taken care of!

For this, you can use tools like dependency-cruiser, NetArchTest, ArchUnit, or JDepend, depending on your language. People tell me even ESLint can handle this with plugins. You tell the tool what you expect—such as domain constructs being disallowed to call higher-level constructs and that you don’t want any cyclical relations—and it’ll analyze how the code holds together. If you have a house style or want to follow a known convention, then this is the right way.

For something this complex, you likely don’t want any manual file content parsing—instead rely on dedicated tools (like the above) or ASTs (e.g. TypeScript’s AST) that can accurately handle imports, etc.
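To illustrate the AST route, here is a small sketch using the TypeScript compiler API to flag a forbidden import direction. The rule itself (domain code must not import from adapters) and the file names are made-up examples.

// Sketch: use the TypeScript compiler API to detect imports from a forbidden layer.
// The layer names and rule are illustrative; encode your own conventions.
import * as ts from 'typescript';
import { readFileSync } from 'node:fs';

function findForbiddenImports(fileName: string, forbiddenPattern: RegExp): string[] {
  const source = ts.createSourceFile(
    fileName,
    readFileSync(fileName, 'utf8'),
    ts.ScriptTarget.Latest,
    true
  );

  const violations: string[] = [];

  // Import declarations live at the top level of the source file
  ts.forEachChild(source, (node) => {
    if (ts.isImportDeclaration(node)) {
      const importPath = node.moduleSpecifier.getText(source).replace(/['"]/g, '');
      if (forbiddenPattern.test(importPath)) violations.push(importPath);
    }
  });

  return violations;
}

const violations = findForbiddenImports('src/domain/Time.ts', /adapters/); // hypothetical file and rule
console.log(violations.length === 0 ? 'OK' : `Forbidden imports: ${violations.join(', ')}`);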

In this area, I want to mention something I built: StandardLint.

StandardLint makes it convenient and easy to set up guardrails and guidelines for development teams and make sure they follow your house conventions. It comes with checks for having things like:

  • Code owners
  • Templates in place
  • Diagrams
  • SLO information
  • Warnings for plain Console usage

…and more.

Example 6: Safe deployments ✅

Holistic | Automated | Triggered | Dynamic

Data source examples: APIs for compute services and API Gateway

All-at-once deployments are super risky but still very common. They immediately shift one version of software to another. A safer way to do deployments is to slowly shift traffic onto the new code while checking for any odd, unexpected behavior and stopping if you pass a threshold of errors.

This is very close to a fitness function. In fact, it’s something I hadn’t even thought of as a fitness function until I saw that Danilo Poccia from AWS has an implementation explicitly calling this a fitness function.

One of my own (old) implementations looks like this:

const handler = async (event, context, callback) => {
  const param = event.queryStringParameters ? Object.keys(event.queryStringParameters)[0] : null;

  let statusCode = 200;
  let body = "Hello World!";

  if (param === "throw") throw new Error("SERVER: Throwing error!");
  else if (param === "error") {
    console.error("SERVER: Error!");
    statusCode = 500;
    body = "SERVER: Error!";
  } else if (param === "warn") {
    console.warn("SERVER: Warning!");
    statusCode = 500;
    body = "SERVER: Warning!";
  }

  const response = {
    statusCode,
    body,
    headers: {
      "Content-Type": "text/plain"
    }
  };

  callback(null, response);
};

module.exports = { handler };

The above is run before shifting traffic while doing a gradual (canary) deployment of a Lambda function. Tactics like this are well within the concept of a fitness function.
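A pre-traffic hook in a setup like that typically runs its checks and reports a verdict back to CodeDeploy, roughly like this sketch (the actual check is a placeholder):

// Sketch: a CodeDeploy pre-traffic hook Lambda that runs a fitness check on the new
// version and reports Succeeded or Failed back to CodeDeploy before traffic shifts.
// checkNewVersion() is a placeholder for whatever smoke test or fitness function you run.
import {
  CodeDeployClient,
  PutLifecycleEventHookExecutionStatusCommand
} from '@aws-sdk/client-codedeploy';

const client = new CodeDeployClient({});

async function checkNewVersion(): Promise<boolean> {
  return true; // stand-in for the real check against the canary endpoint
}

export const handler = async (event: {
  DeploymentId: string;
  LifecycleEventHookExecutionId: string;
}) => {
  const passed = await checkNewVersion();

  await client.send(
    new PutLifecycleEventHookExecutionStatusCommand({
      deploymentId: event.DeploymentId,
      lifecycleEventHookExecutionId: event.LifecycleEventHookExecutionId,
      status: passed ? 'Succeeded' : 'Failed'
    })
  );
};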


Example 7: Language and grammar ✍️

Atomic | Automated | Triggered | Static

Data source examples: Source code

You can lint more than code — you can lint the overall language!

If you have any non-trivial degree of documentation, then it’s worth looking into tooling that supports well-written output. This approach is also mentioned here as being used by Neal Ford to check for gendered pronouns in his writing.

In this case, it would be advisable to use off-the-shelf open-source tools such as Alex and Vale to run checks on your writing and Markdown files. This advice can be generally transferred to any grammar/writing application that offers programmable access, such as with APIs; one of these is Grammarly and its API.
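As a sketch of turning this into a fitness function, you could spawn Vale over your docs and fail on any reported errors. The flags and the JSON shape below are assumptions worth checking against the Vale version you use.

// Sketch: run Vale over the docs folder and fail the fitness function on any errors.
// The CLI flags and JSON output shape are assumptions to verify against your Vale version.
import { execFileSync } from 'node:child_process';

function proseIsFit(path: string): boolean {
  const output = execFileSync('vale', ['--output=JSON', '--no-exit', path], {
    encoding: 'utf8'
  });

  // Vale's JSON output maps file paths to arrays of alerts
  const results: Record<string, Array<{ Severity: string }>> = JSON.parse(output);

  const errorCount = Object.values(results)
    .flat()
    .filter((alert) => alert.Severity === 'error').length;

  return errorCount === 0;
}

console.log('Prose fitness:', proseIsFit('docs/'));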

Example 8: Security 🔒

Practically any combination of dimensions

Data source examples: Source code; APIs for compute and networking services (DAST testing)

Nowadays CI tools tend to come with some support for security checking out of the box. GitHub comes with Dependabot which warns about compromised package dependencies and known vulnerabilities in your code. GitLab has similar capabilities, and you’ll likely get it in most mature, modern CI products today.

You could also add SAST tools like Snyk, Checkmarx, or Mend.io to your mix if you want something detached from your CI tool. Security is often interwoven with other adjacent categories of tools, such as code quality tools, as security is an important factor of overall quality. Make sure you configure tooling to use the same criteria across your entire toolset if you have overlapping capabilities.

Example 9: Compliance 👮‍♀️

Practically any combination of dimensions

Data source examples: Source code; APIs for storage and database resources; assessment tools like AWS GuardDuty

Something that’ll score high on your “adult points” scorecard is compliance-related fitness functions for areas including GDPR, PCI DSS, or HIPAA. Hardly sexy, but necessary to deal with in many cases and smart to automate because, frankly, most developers and architects have more interesting work to deal with. Might as well automate all of it.

This subject is hard to encapsulate in any single example, but you could run checks that:

  • Enumerate the number of databases and storage buckets with public access (or similar);
  • Verify that sensitive resources have appropriately slim IAM accesses;
  • Check on the prevalence of manual user access to sensitive data;
  • Check on mundane things like missing resource tags;
  • Ensure any external package dependencies follow vetted open-source licenses.

These are all in the prime territory for static analysis and (mis)configuration checking with tools such as Checkov, Terrascan, cfn-lint, cdk-nag, or license-checker. The fitness functions could work in a “detective” mode, while static analysis gets gradually introduced in teams’ work.
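As a small sketch of the first check in the list above, counting publicly exposed storage buckets on AWS could look like this. Note that a bucket can also be public through ACLs or account-level settings, so treat the policy status as only one signal.

// Sketch: count S3 buckets whose bucket policy makes them public, using the AWS SDK v3.
// Policy status is only one public-access signal; ACLs and account settings also matter.
import {
  S3Client,
  ListBucketsCommand,
  GetBucketPolicyStatusCommand
} from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'eu-north-1' });

async function countPublicBuckets(): Promise<number> {
  const { Buckets = [] } = await s3.send(new ListBucketsCommand({}));
  let publicCount = 0;

  for (const bucket of Buckets) {
    if (!bucket.Name) continue;

    try {
      const { PolicyStatus } = await s3.send(
        new GetBucketPolicyStatusCommand({ Bucket: bucket.Name })
      );
      if (PolicyStatus?.IsPublic) publicCount++;
    } catch {
      // No bucket policy at all means it can't be public through a policy; skip it
    }
  }

  return publicCount;
}

countPublicBuckets().then((count) => console.log('Publicly exposed buckets:', count));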

Example 10: Cost efficiency 💸

Practically any combination of dimensions

Data source examples: Source code; APIs for utilization (e.g. CloudWatch); APIs for billing

Keep tabs on the costs for your cloud account, or even specific services in it. You could set a threshold for a maximum percentage increase in spending compared to the last billing period. Or set up a function to calculate the amount of income made through the system versus the cost of running it!

Sure, some of this is already widely available in AWS, GCP, and Azure, but it might not be directly available (or notified) to teams when they are on a bad upward cost slope. Fitness functions could alleviate this.

# Pseudo code example
function predictCost(currentMonthData, previousMonthData):
    # Implement your prediction logic here to estimate the cost for the current month
    predictedCost = ... # Your prediction calculation

    return predictedCost

function calculateFitness(predictedCost, previousMonthCost):
    costIncreasePercentage = ((predictedCost - previousMonthCost) / previousMonthCost) * 100

    if costIncreasePercentage <= 10.0:
        return FITNESS_HIGH
    else:
        return FITNESS_LOW

# Example input data
previousMonthCost = 10000.0
previousMonthData = ... # Data from the previous month's usage
currentMonthData = ... # Data for the current month's usage

# Constants
FITNESS_HIGH = 1.0
FITNESS_LOW = 0.0

# Predict cost for the current month
predictedCost = predictCost(currentMonthData, previousMonthData)

# Calculate fitness
fitnessScore = calculateFitness(predictedCost, previousMonthCost)
print("Fitness Score:", fitnessScore)

A tool like Infracost will be able to tell you what an approximate cost would be for your infrastructure when you make changes to it, which works well for spending that is linear or otherwise easy to extrapolate. However, note that dynamic running costs, such as pay-per-use APIs, still need continuous monitoring as you’ll never really know what the spend is otherwise.
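A sketch of turning Infracost into a triggered fitness function could be to parse its JSON output and compare it against a budget. The output field name below is an assumption about the report format, so double-check it against the version you run.

// Sketch: use Infracost's JSON output as input to a cost fitness function.
// The totalMonthlyCost field is an assumption about the report format; verify it first.
import { execFileSync } from 'node:child_process';

const MAX_MONTHLY_COST = 500; // made-up budget in your billing currency

function costIsFit(infraPath: string): boolean {
  const output = execFileSync(
    'infracost',
    ['breakdown', '--path', infraPath, '--format', 'json'],
    { encoding: 'utf8' }
  );

  const report = JSON.parse(output);
  const projectedCost = parseFloat(report.totalMonthlyCost ?? '0');

  return projectedCost <= MAX_MONTHLY_COST;
}

console.log('Cost fitness:', costIsFit('.'));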

See also FinOps as an interesting overall approach, together with DevOps, to put both visibility and accountability around costs back with the teams running and building your systems.




Simplifying by using ArchFit to run architectural fitness functions

Some of the above areas are actually possible to start addressing with minimal work from your side if you choose to use ArchFit, a tool I’ve built to run architectural fitness functions that are generic enough to be interesting to many organizations and types of users.

By the way, it’s open source, so you are free to contribute to the project on GitHub! I’m always happy for new ideas on improvements and new fitness functions to introduce.

The way ArchFit is implemented and aimed (at least in version 1) is to pull data on actual usage for a period of time, and then run fitness functions to evaluate and return results. Usually, the fitness functions require 1–3 data sources. More elaborate approaches might run real traffic (or simulate it) or do complex multivariate analysis. I think there is beauty in keeping things as simple and lightweight as possible until it’s clear something more is actually needed.

You can use it as either a CLI tool or as a Node library, so it should fit into most use cases. Library usage is as simple as:

import { ArchFitConfiguration, createNewArchFit } from 'archfit';

async function run() {
  const config: ArchFitConfiguration = {
    region: 'eu-north-1', // AWS region
    currency: 'EUR', // AWS currency
    period: 30, // period in days to cover
    writeReport: true, // writes a report to `archfit.results.json`
    tests: [
      { name: 'APIGatewayErrorRate', threshold: 0 },
      { name: 'APIGatewayRequestValidation', threshold: 0 },
      {
        name: 'CustomTaggedResources',
        threshold: 50,
        required: ['STAGE', 'Usage']
      },
      { name: 'DynamoDBOnDemandMode', threshold: 100 },
      { name: 'DynamoDBProvisionedThroughput', threshold: 5 },
      { name: 'LambdaArchitecture', threshold: 100 },
      { name: 'LambdaDeadLetterQueueUsage', threshold: 100 },
      { name: 'LambdaMemoryCap', threshold: 512 },
      { name: 'LambdaRuntimes', threshold: 100 },
      { name: 'LambdaTimeouts', threshold: 0 },
      { name: 'LambdaVersioning', threshold: 0 },
      { name: 'PublicExposure', threshold: 0 },
      { name: 'RatioServersToServerless', threshold: 0 },
      {
        name: 'SpendTrend',
        threshold: 0
      }
    ]
  };

  const archfit = await createNewArchFit(config);
  const results = archfit.runTests();

  console.log(results);
}

run();

Using the CLI is just a matter of having the configuration as a JSON file, named archfit.json, in whatever directory you are running the archfit command from.
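A minimal archfit.json mirroring the library configuration above could look something like this (trimmed to a few tests):

{
  "region": "eu-north-1",
  "currency": "EUR",
  "period": 30,
  "writeReport": true,
  "tests": [
    { "name": "APIGatewayErrorRate", "threshold": 0 },
    { "name": "LambdaMemoryCap", "threshold": 512 },
    { "name": "PublicExposure", "threshold": 0 }
  ]
}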

The documentation should be able to answer any questions you have, otherwise post a GitHub Issue and I’ll respond to it in due course.

In closing, ArchFit offers a pre-packaged set of fitness functions that should help with informing decisions around several typical use cases. I’ve offered it as a way to make fitness functions somewhat less “hidden” and org-specific, as I believe not every single utility is completely unique.

Where to start?

Amazon writes that you should start with,

* Gathering the most important system quality attributes.
* Beginning with approximately three meaningful fitness functions relying on the API operations available.
* Building a dashboard that shows progress over time, share it with your teams, and rely on this data in your daily work.

Part of your job is to come to terms with which unique functions you need to test specific conditions, e.g. security relating to scalability, resiliency measures relating to latency, etc. Which fitness functions you should set up is entirely dependent on your needs, regulations and legal requirements, your organization, known issues, and so on.


This guide is at its end, and I hope you’ve found it useful and helpful to your own journey. If I’ve done my part well, architectural fitness functions should now be far less mysterious and more intuitive for you.

Best of luck, and don’t miss the bonus resources below!


Where to go next — more resources