Building Towards Continuous Innovation: Five Essential and Four Advanced Must-Have Capabilities for Agile Retailers

With Amazon setting the pace of innovation at 3,000 live upgrades per day, agility has become every retailer's top strategic priority. But the replacement of the old, months-long release cycle with a continuous stream of innovation forces a fundamental re-think of how software is created and operated.

Founder and CEO of Qubell. Founder and executive chair at Grid Dynamics. Working hard to turn good ideas into great products.

Agility is a business imperative

According to a recent survey, retail CEOs believe that their industry will see more change in the next 10 years than in the previous 50 years. Not surprisingly, technology will be the key driver of growth. Using technology to keep up with sophisticated, connected consumers across all channels, enable personalization, provide new electronic payment options, offer same-day delivery of goods, optimize the supply chain and go global is quickly becoming essential for every retailer.

All this requires brand new software and extensive experimentation. The traditional ways of planning, developing, testing and deploying software in "big releases" that happen at the end of a long, multi-month planning and development cycle are quickly becoming a liability.

The best web companies like Amazon, Google and Facebook are extremely adept at launching new digital services quickly and then continuously improving the user experience through rapid iterations. They are able to control the cadence for new capabilities down to a minor feature enhancement or a bug fix, releasing multiple improvements every day, and in the case of Amazon a whopping 3,000 times per day.

That kind of agility helps remove the guesswork about what customers want from a product. Google reportedly tested over 500 shades of blue in search of that perfect color of the linked text that somehow makes people want to click on it. Agile companies can try new ideas quickly, measure which parts of the user experience customers love or hate and make informed decisions about which features to fund or kill.

While everyone agrees that becoming more agile is a business imperative, getting the agility engineered into every facet of software operations is a huge challenge. Here are the five basic capabilities that every retailer must embrace for a chance to stay in the game, and four advanced capabilities required to compete for market leadership with the fastest and fittest competitors.

Five foundational capabilities of enterprise automation

1. Virtualized infrastructure

Speed and agility boil down to control over your software and everything required to run it. Virtualization is the basic level of separation of software applications from the physical hardware, and it is a prerequisite for all advanced automation capabilities. While most retail companies have already embraced virtualization, some are still using non-virtualized hardware for production systems.

If you are one of those companies, you'll find it increasingly hard to make quick changes in the configuration of your infrastructure for development, testing and continuous deployment of new features. The best place to start is to virtualize your development and test environments before tackling production.

2. The cloud: fluid capacity on-demand

The cloud is profoundly more than a virtualized infrastructure. In the context of this discussion, the cloud means that a person or application can always ask for one or more VMs (fluid capacity), which will be provided immediately upon request (on-demand). Fluid capacity on-demand is essential for spinning up test environments as a part of agile development and continuous delivery processes, and for releasing the resources when they are no longer needed.

Most data centers run by retail companies, even those that have been fully virtualized, don't permit dynamic, event-driven spin-up of new VMs. Organizations must choose from a growing array of reliable and secure public clouds, including Amazon EC2, Google Cloud Platform and Microsoft Azure, or deploy a private cloud solution.

3. Node-level configuration management

Popular configuration management tools like Chef and Puppet ensure that the infrastructure is always configured correctly and stays in compliance with desired policies over time. The advantage of configuration management over typical automation scripts is that it is "goal-oriented": instead of focusing on the sequence of steps to do something (for example, deploy a database server), the configuration management software focuses on the desired resulting state of a node (a database server is deployed on that node). Although we'll show in the later sections that a correctly configured application is much more than a collection of correctly configured nodes, node-level configuration management is an essential basic capability of modern infrastructure management.
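To make the "goal-oriented" idea concrete, here is a minimal Python sketch. It is not how Chef or Puppet are implemented; the Node class and the converge function are hypothetical names invented for illustration, and the "node" is just an in-memory object. The point it tries to show is that the caller declares the desired state, and only the difference from the current state is acted on, so repeating the run changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a server (an assumption, not a real API)."""
    name: str
    packages: set = field(default_factory=set)
    services: set = field(default_factory=set)


def converge(node, desired_packages, desired_services):
    """Bring the node to the desired state and return the actions taken.

    Only the gap between current and desired state is acted on,
    so a second run against an already-correct node does nothing.
    """
    actions = []
    for pkg in sorted(desired_packages - node.packages):
        node.packages.add(pkg)
        actions.append(f"install {pkg}")
    for svc in sorted(desired_services - node.services):
        node.services.add(svc)
        actions.append(f"start {svc}")
    return actions


db_node = Node("db-01")
first_run = converge(db_node, {"postgresql"}, {"postgresql"})
second_run = converge(db_node, {"postgresql"}, {"postgresql"})
print(first_run)   # ['install postgresql', 'start postgresql']
print(second_run)  # []
```

A step-by-step script would try to install and start the database on every run; the state-based version only describes where the node should end up.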

4. Agile software development

Agile development methodology has been well established for over a decade and a half, but it wasn't until very recently that it became widely accepted as the preferred way to develop and evolve software. Agile methodology advocates incremental development of features over time, direct involvement of business owners in defining and prioritizing features, and a focus on software testability as a matter of design.

While there are multiple mainstream variations of agile methodologies like "scrum" or "kanban", successful agile organizations adapt the agile principles to their specific needs, tools and culture. If agile is not yet practiced in your organization in some practical form, a fast innovation cycle is not likely to be achievable.

5. Continuous integration

Continuous integration (CI) is a software development practice that allows teams of developers to work "safely" and concurrently on the same code base. The continuous integration process forces every code change to be instantly integrated into the shared codebase and tested to quickly figure out if anything is "broken". The CI server constantly monitors changes to the source code and runs a build and test process whenever something changes. A change that "breaks" CI is rejected and sent back to the author to be fixed.

CI achieves two important objectives: (a) it keeps the common code base free of obvious problems introduced by concurrent development, and (b) it forces the developers to commit their code often. The last point is essential to agility. The longer a developer waits to integrate her code with the main codebase, the more likely it is that someone else has already introduced a change that will break her new code. Since the last person to break CI "loses" and has to redo the work, it pays to commit continuously in small increments rather than infrequently in big batches.
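The gating behavior described above can be sketched in a few lines of Python. This is an illustration only, not the behavior of any particular CI server: the "codebase" is modeled as a dictionary of functions, and integrate, price_is_never_negative and the sample changes are all hypothetical names. A change is merged only if every check passes on the combined result; otherwise the mainline stays as it was.

```python
def integrate(mainline, change, checks):
    """Return (resulting_mainline, accepted).

    The change is applied to a copy of the mainline and merged only
    if every check passes against the combined result.
    """
    candidate = {**mainline, **change}
    if all(check(candidate) for check in checks):
        return candidate, True
    return mainline, False


def price_is_never_negative(code):
    """Hypothetical automated test run on every change."""
    return all(code["price"](qty) >= 0 for qty in (0, 1, 3))


# Hypothetical "codebase": a mapping of function name to implementation.
mainline = {"price": lambda qty: qty * 10}
checks = [price_is_never_negative]

good_change = {"discount": lambda qty: 1 if qty > 5 else 0}
bad_change = {"price": lambda qty: -qty}

mainline, accepted_good = integrate(mainline, good_change, checks)
mainline, accepted_bad = integrate(mainline, bad_change, checks)

print(accepted_good)        # True
print(accepted_bad)         # False
print(mainline["price"](3)) # 30
```

The rejected change never reaches the shared mainline, which is what keeps the common code base usable for everyone else.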

To summarize, the combination of virtualization, cloud, configuration management, agile development and CI creates a sound foundation for control over the process, application and infrastructure, upon which more advanced capabilities can be built, ultimately enabling continuous innovation.

Four advanced capabilities required for continuous innovation

1. Formalized test pipeline

How frequently your organization can process changes comes down to your control over software quality. Put simply, a shorter cycle requires better testability. To make frequent changes to production safe and routine, the cost of verifying that a change is safe to introduce into a production environment must be minimal.

Intuitively, the relationship between release cycle and testing cycle should be self-evident. If you release weekly, the testing cycle should be less than one week. Daily releases require a testing cycle of less than 24 hours, hourly releases require less than 60 minutes, and so on. Theoretically, if the incremental cost of one full test cycle is zero, software changes can be released constantly and instantly. While achieving 100% reliable verification at zero cost for any change is not practical, the need for fast, reliable and inexpensive testing is clear.
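The arithmetic behind this relationship can be written down as a small Python sketch. It rests on a simplifying assumption that is not in the original text: full test cycles run one at a time, back to back, on a single pipeline. The function name max_releases_per_day is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def max_releases_per_day(test_cycle_minutes):
    """Upper bound on verified releases per day.

    Assumes (for illustration) that full test cycles run strictly
    one after another on a single pipeline.
    """
    if test_cycle_minutes <= 0:
        raise ValueError("test cycle must take a positive amount of time")
    return int(MINUTES_PER_DAY // test_cycle_minutes)


print(max_releases_per_day(7 * 24 * 60))  # 0  (a week-long test cycle)
print(max_releases_per_day(24 * 60))      # 1  (a day-long test cycle)
print(max_releases_per_day(60))           # 24 (an hour-long test cycle)
```

Running test cycles in parallel raises the bound, but the cost of each cycle still has to be paid, which is why cheap verification matters.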

The first step towards continuous delivery is the formalization of the release verification process into a series of test stages and quality controls. The inspiration for the pipeline comes from the greatest industrial invention of the 20th century: the conveyor belt.

Some tests can be quick and fully automated, such as unit tests and smoke tests. These tests can be applied to every change as soon as it is committed to the source control system, and they are usually integrated with a CI process.

Other tests, such as full regression tests, can take many hours to execute even on optimized test clusters. Such tests are typically scheduled to run nightly, or on major software builds. Yet other tests related to system-level integration, performance or scalability may be done by specialized groups, mix automated and manual steps and be applied selectively to major release candidates only. Most companies also have some form of purely manual user acceptance testing.

While there are common principles and blueprints, every organization must define its own test pipeline optimized around its applications, test capabilities, infrastructure, resources and skills. Initially, it is common to start with a very simple pipeline consisting of a continuous integration server running unit tests, partially automated nightly regression testing and manual user acceptance tests. Over time, with proper investment, the pipelines become more automated and sophisticated, leading to a shorter release cycle and hence increased agility of the organization.
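The simple starting pipeline described above can be sketched in Python as an ordered list of stages, where a change moves to the next stage only after passing the previous one. The stage checks here are placeholders that read hypothetical flags from a dictionary; in a real pipeline each would invoke actual test suites or a manual sign-off.

```python
def run_pipeline(change, stages):
    """Run stages in order and stop at the first failure.

    Returns (passed, names_of_stages_that_ran).
    """
    executed = []
    for name, check in stages:
        executed.append(name)
        if not check(change):
            return False, executed
    return True, executed


# Hypothetical stages, cheapest and fastest first.
stages = [
    ("unit tests", lambda c: c["unit_ok"]),
    ("nightly regression", lambda c: c["regression_ok"]),
    ("user acceptance", lambda c: c["uat_ok"]),
]

clean_change = {"unit_ok": True, "regression_ok": True, "uat_ok": True}
risky_change = {"unit_ok": True, "regression_ok": False, "uat_ok": True}

clean_result = run_pipeline(clean_change, stages)
risky_result = run_pipeline(risky_change, stages)

print(clean_result)  # (True, ['unit tests', 'nightly regression', 'user acceptance'])
print(risky_result)  # (False, ['unit tests', 'nightly regression'])
```

Ordering the stages from cheapest to most expensive means a bad change is usually stopped before it consumes slow or manual testing effort.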

2. Configuration management at business service level

Studies show that at least 40% of production outages are attributed to system misconfigurations. Configuration problems are not only pervasive, but hard to detect, replicate and eradicate. Given the number of components and parts that make up a running application, this is hardly surprising. It is too easy to deploy one of the services into the wrong VM, causing subtle routing problems or performance degradation evident only under certain conditions, such as a load spike or the failure of another component.

The number goes significantly higher if we include configuration problems at the level of business services, not just infrastructure. One business service might stop working when another business service gets upgraded, changes its APIs or loads new data from a source with an incompatible schema.

If we think of agility as an organization's ability to rapidly evolve its portfolio of business services, it becomes clear that companies need to treat a "business service" as a top-level entity and guarantee its 100% availability despite the fact that all of its internal components and external dependencies change all the time.

Because business services are composite entities made out of components that themselves are composite services, the configuration is a tree of dependencies of arbitrary complexity. Basically, it's "turtles all the way down". And since we are primarily concerned with the uptime of live services, the dependency management of a "live configuration" is a runtime problem rather than a deployment problem.

Today, no common enterprise tool solves this problem well. CMDBs are limited to infrastructure-centric configuration management, while tools like Chef and Puppet are focused on VM-level configuration. Qubell is a new kind of devops platform that gives enterprises the ability to define, codify and enforce the validity of configurations of applications and business services from deployment to destruction. Whether the organization uses Qubell or home-grown alternatives, there are common principles of configuration management that have to be supported:

Decompose applications into internal components and services, codify the dependencies, version all configuration data and keep it under source control.
Identify external dependencies that can affect the validity of the configuration. This includes database connections, middleware containers, network services, internal SOA services, external APIs, tools, libraries and infrastructure services, to mention some.
Identify implicit dependencies, such as reliance on shared data that can be changed by other applications. Implicit dependencies are hard to track, test and manage. Whenever possible, replace implicit dependencies with explicit interfaces and APIs.
At deployment time, check that all dependencies have been successfully resolved before finishing the deployment.
For deployment into test environments, ensure that all external services are accessible from that environment. This often means distributing a test instance, a proxy or a stub of the external service with the test instance of the application.
Write automated regression tests that exercise internal, external and implicit dependencies. For stress and performance testing, investigate the impact of the load and throughput of the external systems on the behavior of the service.
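As an illustration of the deployment-time check in the principles above, the following Python sketch walks a codified dependency tree and reports every transitive dependency that is not currently healthy. It does not describe how Qubell or any other tool works; the service names, the depends_on mapping and the unresolved_dependencies function are all hypothetical.

```python
def unresolved_dependencies(service, depends_on, healthy, _seen=None):
    """Return the set of transitive dependencies of `service` that are not healthy.

    `depends_on` maps a service name to the names it depends on;
    `healthy` is the set of services known to be up and reachable.
    The `_seen` set guards against loops in the dependency data.
    """
    seen = set() if _seen is None else _seen
    missing = set()
    for dep in depends_on.get(service, []):
        if dep in seen:
            continue
        seen.add(dep)
        if dep not in healthy:
            missing.add(dep)
        missing |= unresolved_dependencies(dep, depends_on, healthy, seen)
    return missing


# Hypothetical codified dependencies, kept under source control.
depends_on = {
    "checkout": ["cart", "payments"],
    "cart": ["catalog-db"],
    "payments": ["payment-gateway-api"],
}

healthy = {"cart", "payments", "catalog-db"}
missing_before = unresolved_dependencies("checkout", depends_on, healthy)

healthy.add("payment-gateway-api")
missing_after = unresolved_dependencies("checkout", depends_on, healthy)

print(missing_before)  # {'payment-gateway-api'}
print(missing_after)   # set()
```

A deployment script built on this idea would refuse to finish while the returned set is non-empty, and the same check could be repeated at runtime as dependencies change.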

3. Dynamic test environments replace static infrastructure

Most companies have dedicated infrastructure for different functions, such as development, testing, staging, production. The infrastructure is allocated once, configured for a specific purpose and never taken down until the project or the function is no longer relevant.

There are several serious problems with this approach in highly agile environments:

Traffic jam between projects. When many teams of developers work on different parts of the system in parallel, they will bump heads on the access to shared test environments. Someone will end up waiting a long time to run short tests. The need to schedule access to an environment causes teams to want to book it for a long time to get all of their testing done before they lose their spot. Instead of iterative, just-in-time testing in small increments, teams are pushed into longer releases by the bottleneck.
Capacity fragmentation. Most of the environments are not used most of the time, and a few environments are sometimes over-subscribed. For example, at night time the regression tests are run while development environments are idle, and vice versa. The overall capacity utilization is low, while the environments that need extra capacity can't get it when they need it most.
Configuration drift. The configuration of the infrastructure in every environment undergoes constant change, making it very difficult to ensure that testing starts from a known state. This is especially problematic when the infrastructure is reconfigured between application versions and test runs.

Together, these issues conspire to jam the conveyor belt of continuous delivery. To eliminate these bottlenecks once and for all, the dedicated static infrastructure created once to a fixed configuration has to be replaced by dynamic environments provisioned automatically on demand, reconfigured to the right version with the click of a button and torn down when no longer needed.
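The lifecycle of such a dynamic environment can be sketched in Python with a context manager: the environment is provisioned at a known version when testing starts and is always torn down afterwards, even if the tests fail. The Cloud class below is a hypothetical in-memory stand-in, not a real cloud provider API, and ephemeral_environment is a made-up name.

```python
import itertools
from contextlib import contextmanager


class Cloud:
    """Hypothetical in-memory stand-in for a cloud API (an assumption)."""

    def __init__(self):
        self.running = {}
        self._ids = itertools.count(1)

    def provision(self, app_version, vm_count):
        env_id = f"env-{next(self._ids)}"
        self.running[env_id] = {"version": app_version, "vms": vm_count}
        return env_id

    def tear_down(self, env_id):
        del self.running[env_id]


@contextmanager
def ephemeral_environment(cloud, app_version, vm_count=3):
    """Provision a fresh environment and guarantee it is released afterwards."""
    env_id = cloud.provision(app_version, vm_count)
    try:
        yield env_id
    finally:
        cloud.tear_down(env_id)


cloud = Cloud()
with ephemeral_environment(cloud, "1.4.2") as env_id:
    during = dict(cloud.running)  # snapshot while the tests would be running
after = dict(cloud.running)

print(during)  # {'env-1': {'version': '1.4.2', 'vms': 3}}
print(after)   # {}
```

Because each run starts from a freshly provisioned environment, there is no accumulated drift to reason about and no capacity held while nothing is being tested.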

4. Self-service for developers

Traditionally, the responsibility for providing computing infrastructure for all development needs lies with IT ops. Every time a developer needs some computing infrastructure, she must file a ticket. This ticket must be filed -> scheduled to be analyzed -> analyzed -> sent back for clarification -> clarified (a few times) -> scheduled to be approved -> approved -> scheduled to be executed -> executed -> verified -> problem reported (a few times) -> problem scheduled to be fixed -> fixed.

A typical exchange between a developer and IT that can take months:

Developer: please, provision 3 VMs: a database, an app server and a web server
IT Ops (1 week later): what datacenter, what security zone?

Developer: I don't know and I don't care, I can use anything you give me
IT Ops (frustrated, 1 week later): we need to know that information to proceed

Developer (frustrated, 1 week later): how do I find out what information to give you?

Even if the infrastructure is virtualized, the capacity is available, VM provisioning is automated and templates exist for all three VMs requested, the fact that the actual provisioning will eventually take only 10 minutes is masked by the month-long negotiation process between people with totally different contexts.

The only way to solve the problem and eliminate the process bottleneck is to empower developers to get what they need, when they need it, with self-service.

The virtualization of the infrastructure, the availability of resources, the automation of infrastructure provisioning, the configuration management and the cloud all need to be in place as prerequisites. Yet without self-service that empowers developers to get what they need, when they need it, acting within the policies and rules and getting instant gratification with the order - or immediate feedback that something has not worked correctly and needs to be reordered - organizational agility is simply not possible.
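A minimal Python sketch of self-service within policy might look like the following. The policy values, template names and the request_environment function are all hypothetical; the point is only that the rules IT ops would otherwise enforce through a ticket queue are codified once, so the developer gets an answer immediately, whether it is an approval or a reason for refusal.

```python
# Hypothetical policy, defined once by IT ops rather than negotiated per ticket.
POLICY = {
    "max_vms": 5,
    "allowed_templates": {"database", "app-server", "web-server"},
}


def request_environment(templates, policy=POLICY):
    """Return (approved, message) immediately, with no human in the loop."""
    unknown = sorted(set(templates) - policy["allowed_templates"])
    if unknown:
        return False, f"unknown templates: {', '.join(unknown)}"
    if len(templates) > policy["max_vms"]:
        return False, f"requested {len(templates)} VMs, limit is {policy['max_vms']}"
    return True, f"provisioned {len(templates)} VMs"


approved = request_environment(["database", "app-server", "web-server"])
refused = request_environment(["database", "mainframe"])

print(approved)  # (True, 'provisioned 3 VMs')
print(refused)   # (False, 'unknown templates: mainframe')
```

The developer from the exchange above would get the three VMs in one step, and a request outside the rules would fail in seconds with a message that says what to change.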

Victoria Livschitz