Value Created by Testing Infrastructure Code

The simplest answer is Customer Trust.

As you can see from the screenshot above, my infrastructure code (a CDK application) gives me quick feedback, in as little as 5 seconds, on whether a deployment is going to earn more Customer Trust or not.

In this tiny setup, I have a DynamoDB table that stores important customer data. This data needs to be available at all times, and if demand rises, the table should scale up quickly once it reaches 80% utilisation. DynamoDB supports all of this, but none of it works unless you invest in it. That investment can be configuring these settings manually in the AWS Console. Worse is making no investment up front and configuring them only when your system slows down or customers start complaining. No business wants its customers complaining about its service, so you can handle this proactively by investing in automation and Infrastructure as Code using CDK.

However, even with this setup you may face issues. When you want to move faster, errors can happen, and you want fast feedback rather than slow and painful customer feedback in production. TDD and CI/CD, used reasonably and responsibly, help you achieve this. CDK is no stranger to these concepts: it provides great tooling and support for testing your infrastructure with unit tests as you build it.

The biggest value of these tests is that they let you reflect Operational Excellence (OpEx) concerns as early as the unit test stage. Writing unit tests for the OpEx concerns in your CDK application gives you more confidence when running the service in production. That confidence earns more Customer Trust, because you can run the service reliably and find issues before your customers do.

In the example screenshot above, my OpEx concerns are well covered by unit tests. Now, every time I deploy my application, I can simply run npm run test on my infrastructure code and make sure my OpEx concerns have not drifted before the change reaches production. You can take this further by using bots to scan the existing repositories in your organisation and alert service owners when company-wide OpEx concerns are not covered at code level. These coverage metrics can also be extremely useful on service dashboards.
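As a rough illustration of the bot idea, here is a hypothetical check that could run over the CloudFormation templates that cdk synth writes to the cdk.out directory. The function name, the rule identifiers (dynamodb-pitr-enabled, dynamodb-has-alarms) and the two rules themselves are assumptions chosen to mirror the concerns in this post. A real organisation would plug in its own list.

```typescript
// Hypothetical organisation-wide OpEx policy check over a synthesized
// CloudFormation template. Rules and rule names are illustrative assumptions.
interface CfnResource {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface CfnTemplate {
  Resources?: Record<string, CfnResource>;
}

interface PolicyViolation {
  logicalId: string; // '*' when the rule applies to the whole template
  rule: string;
}

function checkOpExPolicies(template: CfnTemplate): PolicyViolation[] {
  const resources: Record<string, CfnResource> = template.Resources ?? {};
  const violations: PolicyViolation[] = [];
  let tableCount = 0;
  let alarmCount = 0;

  for (const logicalId of Object.keys(resources)) {
    const resource = resources[logicalId];
    if (resource.Type === 'AWS::CloudWatch::Alarm') {
      alarmCount++;
    }
    if (resource.Type === 'AWS::DynamoDB::Table') {
      tableCount++;
      // Rule 1: every table must enable point-in-time recovery.
      const pitr = resource.Properties?.PointInTimeRecoverySpecification as
        | { PointInTimeRecoveryEnabled?: boolean }
        | undefined;
      if (pitr?.PointInTimeRecoveryEnabled !== true) {
        violations.push({ logicalId, rule: 'dynamodb-pitr-enabled' });
      }
    }
  }

  // Rule 2: a template that declares tables must also declare at least one alarm.
  if (tableCount > 0 && alarmCount === 0) {
    violations.push({ logicalId: '*', rule: 'dynamodb-has-alarms' });
  }
  return violations;
}
```

A bot could, for example, read each template JSON file in cdk.out with Node's fs module, parse it with JSON.parse, pass it to a check like this and report any violations to the owning team. Note that nested stacks are synthesized into their own template files, so each file would be checked separately.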

Speaking of OpEx concerns, I can safely say that roughly 90% or more of the incidents I have been part of ended with an action item or two about alerting, monitoring or observability in the incident report. These can now become part of the code and guard those concerns permanently. It also becomes much easier to close the loop on an incident report: reflect the new checks on a dashboard and link the code change (PR, MR, CR, whatever your organisation calls it) in the report. With this simple methodology, the long-term OpEx learnings from an incident can be internalised and automated.

Now that we have covered the business value, let's look at the code and connect the dots. The next section briefly summarises the code setup and will hopefully encourage you to write more tests for your infrastructure code.

Sample Setup

I have a very simple CDK application stack with two NestedStack constructs: one for monitoring (an SNS alarm topic and a CloudWatch dashboard) and one for the database. The sample code uses CDK v2.

The AWS docs describe nested stacks as follows:

The NestedStack construct offers a way around the AWS CloudFormation 500-resource limit for stacks. A nested stack counts as only one resource in the stack that contains it, but can itself contain up to 500 resources, including additional nested stacks.

The scope of a nested stack must be a Stack or NestedStack construct. The nested stack needn’t be declared lexically inside its parent stack; it is necessary only to pass the parent stack as the first parameter (scope) when instantiating the nested stack. Aside from this restriction, defining constructs in a nested stack works exactly the same as in an ordinary stack.
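To see how close a single template is to the 500-resource limit quoted above, counting the entries in its Resources section is enough. The helper below is a hypothetical sketch, not part of CDK. The 80% warning ratio is an arbitrary assumption. Because a nested stack shows up in its parent as a single AWS::CloudFormation::Stack resource, each synthesized template is measured on its own.

```typescript
// Hypothetical helper: how much of CloudFormation's 500-resource budget does a
// synthesized template use? Each template (parent or nested) is checked separately.
const CFN_MAX_RESOURCES = 500;

interface ResourceBudget {
  count: number;      // resources declared in this template
  remaining: number;  // how many more fit before the limit
  nearLimit: boolean; // true once usage reaches warnRatio of the limit
}

function resourceBudget(
  template: { Resources?: Record<string, unknown> },
  warnRatio = 0.8, // assumption: warn at 80% of the limit
): ResourceBudget {
  const count = Object.keys(template.Resources ?? {}).length;
  return {
    count,
    remaining: CFN_MAX_RESOURCES - count,
    nearLimit: count >= CFN_MAX_RESOURCES * warnRatio,
  };
}
```

A template that reports nearLimit is a candidate for moving a group of constructs into its own NestedStack.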

Monitoring Stack

The code below creates a dedicated CloudWatch dashboard and an SNS alarm topic with an email subscription. It exposes both as public properties so other stacks can use them. This way, you can put all of your service's related metrics and alerts on one central dashboard.

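import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';

// No extra props yet; callers can pass {} or omit props entirely.
export type MonitoringStackProps = cdk.NestedStackProps;

export class MonitoringStack extends cdk.NestedStack {

    // Shared dashboard that other stacks add their widgets to.
    public readonly dashboard: cloudwatch.Dashboard;

    // Topic that every service alarm publishes to.
    public readonly alarmTopic: sns.Topic;

    constructor(scope: Construct, id: string, props?: MonitoringStackProps) {
        super(scope, id, props);

        this.dashboard = new cloudwatch.Dashboard(this, 'Dashboard', {});

        // The topic is created in the parent scope, so it is synthesized into
        // the parent stack's template rather than this nested stack's.
        this.alarmTopic = new sns.Topic(scope, 'AlarmTopic', {
            displayName: 'My Service Alarms',
        });

        // Email every alarm notification to the team.
        const emailSubscription = new subscriptions.EmailSubscription('dev@mycompany.com');

        this.alarmTopic.addSubscription(emailSubscription);
    }
}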

Database Stack

The code below creates a DynamoDB table with read and write capacity set to 5, point-in-time recovery enabled and a stream of new and old item images. It also enables read and write auto-scaling up to 5x the base capacity (from 5 to 25 units) at a target utilisation of 80%. Finally, it adds read and write capacity alarms that notify the alarm topic, and places them on the CloudWatch dashboard created in the Monitoring Stack.
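To make "5x at 80% utilisation" concrete, here is a deliberately simplified model of target tracking. It picks the smallest whole-unit capacity that keeps utilisation at or below the target, clamped to the minimum and maximum. This is an assumption-laden sketch, not how Application Auto Scaling actually computes capacity. The real service also involves CloudWatch alarms, cooldowns and DynamoDB's limits on capacity decreases, none of which is modelled here. The function name is hypothetical.

```typescript
// Simplified model (assumption): the smallest whole-unit capacity that keeps
// utilisation at or below the target, clamped to [minCapacity, maxCapacity].
// Cooldowns, alarm evaluation and DynamoDB's limits on capacity decreases
// are deliberately ignored.
function approxTargetCapacity(
  consumedUnitsPerSecond: number,
  targetUtilizationPercent: number,
  minCapacity: number,
  maxCapacity: number,
): number {
  // consumed / (target / 100), written as (consumed * 100) / target to avoid
  // an extra floating-point rounding step.
  const ideal = Math.ceil((consumedUnitsPerSecond * 100) / targetUtilizationPercent);
  return Math.min(maxCapacity, Math.max(minCapacity, ideal));
}

// The values used by the Database Stack below: min 5, max 25, target 80%.
for (const load of [2, 12, 30]) {
  console.log(load, approxTargetCapacity(load, 80, 5, 25));
}
// Prints "2 5", "12 15" and "30 25": light load stays at the floor,
// 12 units/s needs 15 units to stay at 80%, and 30 units/s hits the 25-unit cap.
```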

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';

interface DatabaseStackProps extends cdk.NestedStackProps {
    dashboard: cloudwatch.Dashboard;
    alarmTopic: sns.Topic;
}

export class DatabaseStack extends cdk.NestedStack {
  
  public readonly customerOrdersTable: dynamodb.Table;

  constructor(scope: Construct, id: string, props: DatabaseStackProps) {
    super(scope, id, props);

    this.customerOrdersTable = this.setupTable(scope, props, 'CustomerOrdersTable');
  }

  private setupTable(scope: Construct, props: DatabaseStackProps, id: string): dynamodb.Table {
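    // Provisioned table keyed by customer and order, with point-in-time recovery
    // and a change stream. Like the alarm topic, it is created in the parent scope,
    // so it is synthesized into the parent stack's template.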
    const table = new dynamodb.Table(scope, id, {
      partitionKey: { name: 'customerId', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'orderId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PROVISIONED,
      readCapacity: 5,
      writeCapacity: 5,
      encryption: dynamodb.TableEncryption.DEFAULT,
      pointInTimeRecovery: true,
      stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
    });

    this.setupAutoScaling(scope, table);
    this.setupDashboard(scope, props, table, id);

    return table;
  }

  private setupAutoScaling(scope: Construct, table: dynamodb.Table) {

    const tableReadScaling = table.autoScaleReadCapacity({ minCapacity: 5, maxCapacity: 25 });
    tableReadScaling.scaleOnUtilization({
      targetUtilizationPercent: 80,
    });

    const tableWriteScaling = table.autoScaleWriteCapacity({ minCapacity: 5, maxCapacity: 25 });
    tableWriteScaling.scaleOnUtilization({
      targetUtilizationPercent: 80,
    });
  }

  

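  private setupDashboard(scope: Construct, props: DatabaseStackProps, table: dynamodb.Table, tableName: string) {

    // Both alarms use the Sum of consumed capacity over 5-minute periods, so a
    // threshold of 25 means 25 units consumed in total per period, not 25 per second.
    const readCapacityAlarm = new cloudwatch.Alarm(scope, `${tableName}ReadCapacity`, {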
      metric: table.metricConsumedReadCapacityUnits({
        period: cdk.Duration.minutes(5),
        statistic: 'sum',
      }),
      threshold: 25,
      alarmDescription: `${tableName} Table Read Capacity`,
      evaluationPeriods: 1,
    });

    readCapacityAlarm.addAlarmAction(new cloudwatchActions.SnsAction(props.alarmTopic));

    const readCapacityAlarmWidget = new cloudwatch.AlarmWidget({
      title: `${tableName} Table Read Capacity`,
      alarm: readCapacityAlarm,
    });

    const writeCapacityAlarm = new cloudwatch.Alarm(scope, `${tableName}WriteCapacity`, {
      metric: table.metricConsumedWriteCapacityUnits({
        period: cdk.Duration.minutes(5),
        statistic: 'sum',
      }),
      threshold: 25,
      alarmDescription: `${tableName} Table Write Capacity`,
      evaluationPeriods: 1,
    });

    writeCapacityAlarm.addAlarmAction(new cloudwatchActions.SnsAction(props.alarmTopic));

    const writeCapacityAlarmWidget = new cloudwatch.AlarmWidget({
      title: `${tableName} Table Write Capacity`,
      alarm: writeCapacityAlarm,
    });

    props.dashboard.addWidgets(readCapacityAlarmWidget, writeCapacityAlarmWidget);
  }
}
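The alarms above use the sum statistic over a 5-minute period, so each threshold is compared against the total capacity consumed in that period. The DynamoDB CloudWatch metrics documentation suggests dividing the Sum by the number of seconds in the period to get an average per second. The arithmetic below uses that approach to show what a threshold of 25 means, and what a threshold tied to the 25-unit scaling ceiling could look like. The function names and the 80% alarm level are hypothetical and do not come from this post.

```typescript
// Conversions between a sustained per-second consumption rate and the value
// that a Sum-statistic alarm sees over one evaluation period.
function sumOverPeriod(unitsPerSecond: number, periodSeconds: number): number {
  return unitsPerSecond * periodSeconds;
}

function averagePerSecond(sum: number, periodSeconds: number): number {
  return sum / periodSeconds;
}

const PERIOD_SECONDS = 5 * 60; // cdk.Duration.minutes(5) in the alarms above

// The stack's threshold of 25 corresponds to an average of 1/12 (about 0.083) units/s.
const currentThresholdRate = averagePerSecond(25, PERIOD_SECONDS);

// Hypothetical alternative: alarm when consumption averages 80% of the
// 25-unit auto-scaling ceiling across the whole period.
const ceilingRate = (25 * 80) / 100; // 20 units per second
const ceilingThreshold = sumOverPeriod(ceilingRate, PERIOD_SECONDS); // 6000
```

Which threshold is right depends on what the alarm should signal. A tiny threshold effectively fires on almost any traffic, while one derived from the scaling ceiling warns that auto-scaling is about to run out of headroom.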

Now that the stacks are set up, this is how I test them. I haven't shared the full test suite, but the tests below cover the important OpEx concerns.

Testing Database OpEx Concerns

In this simple Jest setup, I create a new test stack and add the stacks under test to it. By synthesizing a CloudFormation template from that stack, I can unit test whether my CDK setup covers my concerns.

import * as cdk from 'aws-cdk-lib';
import { Template } from 'aws-cdk-lib/assertions';
import { MonitoringStack } from '../lib/stacks/monitoring-stack';
import { DatabaseStack } from '../lib/stacks/database-stack';

describe('Test Database Stack', () => {
    let template: Template | null;

    beforeEach(() => {
        const app = new cdk.App();

        // Parent stack that hosts both nested stacks under test.
        const testStack = new cdk.Stack(app, 'TestStack');

        const monitoringStack = new MonitoringStack(testStack, 'TestMonitoringStack', {});

        new DatabaseStack(testStack, 'TestDatabaseStack', {
            dashboard: monitoringStack.dashboard,
            alarmTopic: monitoringStack.alarmTopic,
        });

        // Synthesize the parent stack's template. The table, alarms and scaling
        // resources are created in the parent scope, so they appear here; resources
        // created with `this` inside a nested stack would need
        // Template.fromStack(nestedStack) instead.
        template = Template.fromStack(testStack);
    });

    afterEach(() => {
        template = null;
    });

    test('Tables have point in time recovery and base provisioned capacity', () => {
        template?.hasResourceProperties('AWS::DynamoDB::Table', {
            PointInTimeRecoverySpecification: {
                PointInTimeRecoveryEnabled: true,
            },
            ProvisionedThroughput: {
                ReadCapacityUnits: 5,
                WriteCapacityUnits: 5,
            },
        });
    });

    test('Tables have READ capacity alarms', () => {
        template?.hasResourceProperties('AWS::CloudWatch::Alarm', {
            MetricName: 'ConsumedReadCapacityUnits',
            Namespace: 'AWS/DynamoDB',
        });
    });

    test('Tables have WRITE capacity alarms', () => {
        template?.hasResourceProperties('AWS::CloudWatch::Alarm', {
            MetricName: 'ConsumedWriteCapacityUnits',
            Namespace: 'AWS/DynamoDB',
        });
    });

    test('Tables have READ capacity scalable target', () => {
        template?.hasResourceProperties('AWS::ApplicationAutoScaling::ScalableTarget', {
            ScalableDimension: 'dynamodb:table:ReadCapacityUnits',
            ServiceNamespace: 'dynamodb',
        });
    });

    test('Tables have WRITE capacity scalable target', () => {
        template?.hasResourceProperties('AWS::ApplicationAutoScaling::ScalableTarget', {
            ScalableDimension: 'dynamodb:table:WriteCapacityUnits',
            ServiceNamespace: 'dynamodb',
        });
    });

    test('Tables have READ auto-scaling policy at 80% utilization', () => {
        template?.hasResourceProperties('AWS::ApplicationAutoScaling::ScalingPolicy', {
            TargetTrackingScalingPolicyConfiguration: {
                PredefinedMetricSpecification: {
                    PredefinedMetricType: 'DynamoDBReadCapacityUtilization',
                },
                TargetValue: 80,
            },
        });
    });

    test('Tables have WRITE auto-scaling policy at 80% utilization', () => {
        template?.hasResourceProperties('AWS::ApplicationAutoScaling::ScalingPolicy', {
            TargetTrackingScalingPolicyConfiguration: {
                PredefinedMetricSpecification: {
                    PredefinedMetricType: 'DynamoDBWriteCapacityUtilization',
                },
                TargetValue: 80,
            },
        });
    });
});
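The tests above rely on hasResourceProperties. As I understand the aws-cdk-lib assertions documentation, this assertion passes when at least one resource of the given type has properties that deep-partially match the pattern. Every key the pattern lists must match, and extra keys in the template are ignored. Below is a hypothetical, much-simplified matcher that reproduces only that idea, to show why a pattern such as TargetValue: 80 matches a larger scaling policy. It is not the library's implementation and leaves out its matchers (Match.objectLike, Match.arrayWith and so on). Requiring arrays to match element by element is an assumption of this sketch.

```typescript
// Hypothetical, simplified illustration of deep partial matching.
// This is NOT the aws-cdk-lib implementation and supports no Match.* helpers.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function partialMatch(actual: Json, pattern: Json): boolean {
  if (Array.isArray(pattern)) {
    // Assumption for this sketch: arrays must match element by element.
    if (!Array.isArray(actual) || actual.length !== pattern.length) {
      return false;
    }
    const items: Json[] = actual;
    return pattern.every((p, i) => partialMatch(items[i], p));
  }
  if (pattern !== null && typeof pattern === 'object') {
    if (actual === null || typeof actual !== 'object' || Array.isArray(actual)) {
      return false;
    }
    const obj: { [key: string]: Json } = actual;
    const pat: { [key: string]: Json } = pattern;
    // Every key in the pattern must exist and match; extra keys are ignored.
    return Object.keys(pat).every(
      (key) => Object.prototype.hasOwnProperty.call(obj, key) && partialMatch(obj[key], pat[key]),
    );
  }
  // Primitives (and null) must be strictly equal.
  return actual === pattern;
}

interface TemplateJson {
  Resources: { [logicalId: string]: { Type: string; Properties?: Json } };
}

// True when at least one resource of the given type has matching properties.
function someResourceMatches(template: TemplateJson, type: string, pattern: Json): boolean {
  const resources = template.Resources;
  const noProperties: Json = {};
  return Object.keys(resources).some((id) => {
    const resource = resources[id];
    return resource.Type === type && partialMatch(resource.Properties ?? noProperties, pattern);
  });
}
```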

Closing Notes

I hope the sample code and overall context above encourage you to write more unit tests for your infrastructure. I'm not sure Terraform offers an equivalent setup, and even if it does, in my recent attempts I couldn't find comparable productivity and tooling support, such as simply running npm run test from WebStorm or VS Code. This is why I believe CDK is a game changer for developer confidence and productivity when building and running services that delight customers on AWS.

Where to Next?

For more detail on why OpEx matters, dive into the AWS Well-Architected Framework whitepaper, in particular its Operational Excellence pillar.

There are also great tutorials in the CDK documentation to get started:

The CDK documentation itself explains how to use each service in great detail.
