Ballast Lane Applications takes on all kinds of projects: those we start from scratch, putting all our engineering capacity to work; those that already have a base that our team is in charge of improving; and those where we join an existing client's team to help deliver more features, more efficiently. In this discussion we focus on the case where the need is to migrate infrastructure and integrations from one provider to another, including data and mission-critical systems.
First, what should be considered before the migration happens? Although some requirements are imposed by the client under special conditions, as a team we follow a standardized process that includes:
1. Systems compatibility:
As a team we must guarantee that the versions of the systems to be migrated (servers, DBs, storage, cache servers) are the same on the source and destination platforms. If there are pending upgrades, these must be executed BEFORE migrating to the new system. This guarantees two things. First, the setups are similar enough that we can keep executing all operations as we did on the origin platform. Second, if errors such as package incompatibilities appear after the migration, we know they are a result of the migration itself and not of a version difference between the two systems.
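The version check described above can be sketched as a simple comparison of two inventories. This is a minimal illustration, not our actual tooling: the function name `find_version_mismatches`, the component names, and the version strings are all hypothetical, and in practice the inventories would be gathered from the platforms themselves.

```python
# Hypothetical inventories: component name -> version string.
# How these are collected (CLI output, provider API, config files) is assumed.

def find_version_mismatches(source, destination):
    """Return {component: (source_version, destination_version)} for every
    component whose version differs or that is missing on the destination."""
    mismatches = {}
    for component, src_version in source.items():
        dst_version = destination.get(component)  # None if not present
        if dst_version != src_version:
            mismatches[component] = (src_version, dst_version)
    return mismatches


source = {"postgres": "14.9", "redis": "7.0.12", "nginx": "1.24.0"}
destination = {"postgres": "14.9", "redis": "7.2.1", "nginx": "1.24.0"}

mismatches = find_version_mismatches(source, destination)
print(mismatches)  # {'redis': ('7.0.12', '7.2.1')}
```

An empty result is the precondition for moving on; anything else points to an upgrade that should happen before the migration.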
2. Test, test and more testing:
It may seem obvious, but it is still the most critical phase of a migration. Our approach is to rehearse the migration of each environment and its resources as many times as possible, building parallel replica environments that receive traffic and behave like the real ones. Ideally we use the same data that the real migration would have. This allows us to estimate the time the operation will take, see how the new platform will behave, and discover any compatibility issues that may occur.
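One way to turn the rehearsal timings into a planning figure is sketched below. The rule used here (worst observed run plus a safety margin) and the 25% default are assumptions chosen for illustration, as are the function name and the sample durations.

```python
import statistics


def estimate_window_minutes(rehearsal_minutes, safety_margin=0.25):
    """Summarize rehearsal durations (in minutes) and derive a planned window.

    The planned window is the worst rehearsal plus a safety margin; both the
    rule and the default margin are illustrative assumptions.
    """
    if not rehearsal_minutes:
        raise ValueError("at least one rehearsal duration is required")
    worst = max(rehearsal_minutes)
    return {
        "median": statistics.median(rehearsal_minutes),
        "worst": worst,
        "planned": worst * (1 + safety_margin),
    }


# Hypothetical durations from four rehearsals of the same environment.
estimate = estimate_window_minutes([95, 110, 102, 120])
print(estimate)  # {'median': 106.0, 'worst': 120, 'planned': 150.0}
```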
3. Data integrity and post migration smoke test:
If all the implemented strategies were correct and the plan was strictly followed, the infrastructure could be operational immediately. However, it is better to be absolutely sure. Before declaring that the systems can start operating, we define a smoke test that verifies that the data is correct and that the behavior is as expected. No writes are made to the databases during this period, so that the data cannot drift from the original. If all goes well, DNS is switched to point to the new infrastructure.
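A data integrity check of this kind can be as simple as comparing a row count and a digest per table on both sides. The sketch below assumes the rows have already been fetched as Python tuples with identical types on source and destination; the function names are hypothetical, and a real check on large tables would more likely be computed inside the database.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest) for an iterable of row tuples.

    Each row is hashed separately and the hashes are sorted, so the result
    does not depend on the order in which rows were returned.
    """
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined


def tables_match(source_rows, destination_rows):
    """True when both sides hold the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(destination_rows)


# Hypothetical rows read from the same table on both platforms.
source_rows = [(1, "alice"), (2, "bob")]
destination_rows = [(2, "bob"), (1, "alice")]
print(tables_match(source_rows, destination_rows))  # True
```

Because writes are paused during the smoke test, both fingerprints are taken against data that is not changing, which is what makes the comparison meaningful.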
4. Downtime, the perfect time window:
Ideally there would never be downtime, but since the operation involves data migration, it is best to at least disable writes to the storage units to prevent the data from being corrupted. To find the ideal time window, we analyze the traffic and select a period that fits the duration we already measured during testing. Then we define the required team members and adjust the pieces so that the plan is executed within the defined window.
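The traffic analysis above can be illustrated by searching for the contiguous block of hours with the fewest requests. This is a sketch under assumptions: traffic is summarized as one request count per hour of the day, the window may wrap past midnight, and the function name and sample numbers are hypothetical.

```python
def quietest_window(hourly_requests, window_hours):
    """Return (start_hour, total_requests) for the contiguous window with the
    least traffic. The window may wrap around the end of the list (midnight).
    On ties, the earliest start hour wins."""
    n = len(hourly_requests)
    if not 0 < window_hours <= n:
        raise ValueError("window_hours must be between 1 and len(hourly_requests)")
    best_start, best_total = 0, None
    for start in range(n):
        total = sum(
            hourly_requests[(start + offset) % n] for offset in range(window_hours)
        )
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical average requests per hour, index 0 = 00:00-01:00.
traffic = [120, 80, 40, 30, 35, 60, 150, 400, 900, 1100, 1200, 1250,
           1300, 1200, 1150, 1100, 1000, 950, 900, 700, 500, 350, 250, 180]

print(quietest_window(traffic, 3))  # (2, 105)
```

Here a three-hour migration would be scheduled to start at 02:00, when the combined traffic is lowest.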
There are never too many elements to consider: the more you test and analyze, the less chance there is that something will go wrong. However, success cannot be guaranteed. This is why another requirement is a clear rollback guide that everyone understands and that allows returning to a stable state if there are issues. It relies on engaging the QA team with a predefined plan for the acceptance criteria for new products. If the acceptance criteria are not met, the rollback is executed. This should never be improvised or created on the fly; it must be written beforehand.
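The go or rollback decision can be made mechanical once the acceptance criteria are written down. The sketch below is illustrative only: the criterion names and the `evaluate_acceptance` function are hypothetical, and a criterion with no recorded result is treated as failed, on the assumption that an unverified check should not count as a pass.

```python
def evaluate_acceptance(criteria, results):
    """Return (decision, failed_criteria).

    criteria: names of the checks agreed on with QA before the migration.
    results:  mapping of check name -> bool recorded after the migration.
    A check that is missing from results counts as failed.
    """
    failed = [name for name in criteria if not results.get(name, False)]
    decision = "rollback" if failed else "proceed"
    return decision, failed


# Hypothetical criteria defined beforehand and results gathered by QA.
criteria = ["login_works", "orders_readable", "row_counts_match"]
results = {"login_works": True, "orders_readable": True, "row_counts_match": False}

print(evaluate_acceptance(criteria, results))
# ('rollback', ['row_counts_match'])
```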
Speaking of post-migration steps, it is also important to enable all the monitoring systems that give observability into the new systems. These are normally defined beforehand, then activated and verified once the migration is done. In the time following the execution, this work continues to be part of our sprint: we identify points of failure and add alarms that let us know the real-time behavior of the newly created resources. At this stage, topics such as performance, CI/CD and incident management become relevant; in many cases they did not previously exist or were not implemented properly.
Finally, our approach includes retaining the old infrastructure until we are sure the new one is working properly. For this, we create backups that enable us to return to the old infrastructure if something goes wrong. After the migration has been approved, we decommission the old resources, archiving or deleting them as requested.
Each migration of this type is different: the steps can change, there are additions, or sometimes it turns out to be simpler. Nonetheless, the elements mentioned above are the standard that allows us to guarantee to our clients that we will carry out a clean process that reduces the margin of error as much as possible. Each migration of data, infrastructure or similar teaches us something as a company and lets us extend our internal knowledge base, which in the end leads us to provide a better-quality service that covers increasingly complex scenarios.