In recent years, Kubernetes has become a popular platform for orchestrating containerized applications. Using Kubernetes in production is a great way to manage deployments, but using it in your test environment also gives developers confidence that they are validating their code under the same conditions it will run in production.
For this and many other reasons, Retool recently transitioned its continuous integration (CI) environment from Azure Pipelines to Buildkite, using Kubernetes in the test environment. Infrastructure engineer Anna Yan led this effort, and recently presented on how Retool accomplished this work at the Unblock Conference on CI/CD. To learn more, you can watch Anna's talk (embedded below), and read on afterward as we dive deeper with Anna by asking questions about her talk and approach to this project.
We caught up with Anna after the talk to ask a few questions about the process of moving Retool's CI to Kubernetes.
How did you find Garden? Are there similar tools/projects in this space that you considered?
Anna: I did some research online, but eventually, an ex-coworker recommended Garden to me. I heard great things about it and tried it out. I also considered another toolkit called Tilt, but it didn't seem to focus on running tests the way Garden does. At the time, Tilt seemed mostly focused on local development, which Garden also does. We were trying to hammer in this idea of implementing everything on the same stack, and Garden could help us do both developer environments and CI. That made it easier for us to unify on one tool, which is why we ultimately went with Garden.
You mentioned that more granular metrics about your test runs were a benefit of switching to this approach - are there any specific insights you’ve gotten from these metrics? If someone else were to implement this approach to CI, what could they look for in these metrics?
Anna: I went over this in my talk, but one of the most useful things we got out of this was metrics per test run, such as CPU, memory, and timing, as well as being able to put each test in a separate container. You can do this with EC2 as well. But with our previous setup, while we got some free metrics, they came in at 5-minute intervals. So if I wanted to see how much CPU or memory a test is using, which is pretty important when setting up tests because I want to give it an appropriate amount of resources and provision it correctly, I couldn’t do it as accurately since the intervals were so big.
When the metrics aren’t granular enough, you have to guess. For example, if your test starts at 12:43 and ends at 12:50, you can say, “Oh, that’s probably my test running”, but you’re not entirely sure. But if the tests have their own container with their own metrics, you can see exactly how many resources that test is using and how long it took to run. I’m pretty sure you can do that without Kubernetes as well, but it’d probably just take some more work.
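To make the interval problem concrete, here is a minimal sketch using synthetic numbers (not Retool's data). It assumes a host that idles at 0.1 cores and a hypothetical test that uses 2.0 cores for one minute, then compares a 5-minute average with the samples scoped to the test itself.

```python
from statistics import mean


def bucket_averages(samples, bucket_seconds):
    """Average (timestamp_seconds, cpu_cores) samples into fixed-width buckets."""
    buckets = {}
    for timestamp, cpu in samples:
        buckets.setdefault(timestamp // bucket_seconds, []).append(cpu)
    return {index * bucket_seconds: mean(values) for index, values in sorted(buckets.items())}


# Synthetic data (an assumption for illustration): one sample per second for
# ten minutes. The host idles at 0.1 cores; a test runs from 180s to 240s
# and uses 2.0 cores while it runs.
TEST_START, TEST_END = 180, 240
samples = [(t, 2.0 if TEST_START <= t < TEST_END else 0.1) for t in range(600)]

# Coarse view: one averaged data point every 5 minutes.
five_minute_view = bucket_averages(samples, 300)

# Per-test view: only the samples taken while the test was running.
per_test_samples = [cpu for t, cpu in samples if TEST_START <= t < TEST_END]
per_test_peak = max(per_test_samples)

print(five_minute_view)
print(per_test_peak)
```

In the coarse view the first bucket averages the one-minute spike together with four idle minutes, so the test's real peak is not visible. Sizing the test from that number would under-provision it.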
What was the hardest technical challenge you had to overcome while building out this system?
Anna: Finding a great tool like Garden took away a lot of the challenges. I think that’s part of the learning there - if you can find a good tool, it’s usually a lot better than writing a lot of bad scripts yourself. There’s always a risk of vendor lock-in, but eventually, you need to make a call for your own use case.
I know for a fact that we’d never build something like Garden ourselves. Well, maybe not now, but maybe in like 5 years! But you know, there are companies with massive developer tooling teams, like Uber mobile’s developer tooling team has tons of developers on it. And the scale of the tools they work on - Retool isn’t going to get there anytime soon. We have to reap the benefits sooner than that, so we just have to find good tools to use. I feel like sometimes people have reservations about that - people always want to build something from scratch. It makes sense sometimes, but not always. And it always depends on your individual situation, so we just did what made sense for us, and went with Garden instead of building something ourselves.
Larger companies tend to have more resources to implement something like this, but for many smaller companies that are leaner, it is important to figure out where to allocate resources. What are your thoughts on smaller companies hoping to implement something similar?
Anna: It’s always nice to see that someone has already done it - to know that it's possible to implement what you’re thinking of. I don't want to say that all people can look at my project and be like, “Oh, this is obviously easy to do now, and therefore we could do the same”. I feel like that's not a great way to put it, but when I'm researching products, I would also love to learn about, for example, how big their product is, how big the team is, how long it took, and whether that situation matches our situation. If it does match, it helps a lot by de-risking a project and we can say, “Oh, okay, this company did it with this many people in this many months”. You know, maybe we don't have the exact same skill sets, but it sounds reasonable to do this thing with a similar timeline.
For this specifically, we almost went with a more traditional and safer way. It was only because we tested it and found Garden to help with a lot of the heavy lifting that we realized that we could implement this much quicker than we originally anticipated. So we came to the conclusion that we should spend a little more time and reap the benefits from the beginning instead of playing it safe and redoing the whole thing later.
What were some of the other top options when considering what platform to move to and why did we eventually settle on Kubernetes?
Anna: I think there are two decisions we had to make here. One is the CI platform, which is what I describe as the orchestration platform, and the other is how we decide to run Kubernetes on our platform of choice. For the CI platforms, we have famous ones like Jenkins, Travis CI, and AWS CodeBuild, for example. And we run Kubernetes workers on these scheduling platforms, which is a different thing. So I can explain how we decided to run our workers on these platforms. Technically, it’s only possible because of the platform that we chose to run it on. But the specific one we chose, Buildkite, is pretty famous for its flexibility. There were only a few other options available, really.
If we decided to run our workers in VMs, or in the cloud, like Azure or GCP, the main pro is that it’s a well-trodden path. Everyone has done it. Especially with Buildkite, they have an Elastic CI Stack that I shared earlier which is super easy to set up. You can easily scale it up to thousands of workers if you need to, and it all works pretty well, which is a low-effort guaranteed success kind of situation.
The reason we ultimately decided to not go with that was that we wanted our development and production environments to be as similar as possible, so we were pretty set on using Kubernetes. With VMs, you don’t have as much flexibility on things like resource management. Kubernetes helps with this problem since it lets you be as granular as you like. You can have one pod with 0.05 cores of CPU and maybe 100 megabytes of memory, and then you can have another one with 2.5 cores of CPU and 4 gigabytes of memory. Kubernetes figures out where those pods should go. Another reason was that we wanted our core stack to be the same everywhere and we wanted our engineering team to share the same skill set as the core team, so we get to have that knowledge transfer and share best practices, which turns out to be pretty important.
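As a rough illustration of the per-pod sizing Anna describes, the sketch below builds two Kubernetes Pod manifests as plain Python dictionaries, one small and one large. The helper names, pod names, and image are hypothetical, and setting limits equal to requests is an assumption rather than something stated in the interview. The `resources.requests` and `resources.limits` fields and the `m` (millicore) and `M` (megabyte) suffixes are standard Kubernetes quantities.

```python
def cpu_quantity(cores):
    """Format CPU cores as a Kubernetes millicore quantity, e.g. 0.05 becomes '50m'."""
    return f"{round(cores * 1000)}m"


def memory_quantity(megabytes):
    """Format memory as a Kubernetes quantity in megabytes, e.g. 100 becomes '100M'."""
    return f"{megabytes}M"


def build_pod_manifest(name, image, cores, megabytes):
    """Build a Pod manifest whose single container requests the given resources.

    Hypothetical helper: limits are set equal to requests here, which is an
    assumption for this sketch, not Retool's documented configuration.
    """
    resources = {"cpu": cpu_quantity(cores), "memory": memory_quantity(megabytes)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"requests": resources, "limits": dict(resources)},
                }
            ],
        },
    }


# The image and pod names below are placeholders for illustration only.
small_pod = build_pod_manifest("lint", "example/test-runner:latest", 0.05, 100)
large_pod = build_pod_manifest("e2e", "example/test-runner:latest", 2.5, 4000)

print(small_pod["spec"]["containers"][0]["resources"]["requests"])
print(large_pod["spec"]["containers"][0]["resources"]["requests"])
```

The scheduler uses the requested amounts to pick a node with enough free capacity for each pod, which is the placement step Anna refers to.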
What are some of your future plans for improving Retool's Kubernetes setup?
Anna: We have around 5-7 engineers working on this now, working together and sharing experience. We have a dedicated internal engineering team here which is a great feeling. I don’t want to boast about it, but I’m definitely more hopeful here at Retool compared to other companies. We’re trying to get rid of a lot of the remaining flakiness. We have a few tests that return true for one run and return false for another. We have dashboards now and we’re working with the product team to figure that out, so we’re working on burning down those flaky tests.
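One common way to surface flaky tests like these is to look for tests that both passed and failed against the same commit. The sketch below is a minimal version of that idea, assuming a hypothetical run history of `(commit, test_name, passed)` tuples. It is not a description of Retool's dashboards.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return sorted names of tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    for commit, test_name, passed in runs:
        outcomes[(commit, test_name)].add(passed)
    return sorted(
        {name for (_, name), seen in outcomes.items() if seen == {True, False}}
    )


# Hypothetical run history; commit hashes and test names are made up.
history = [
    ("abc123", "test_login", True),
    ("abc123", "test_login", False),   # same commit, different result: flaky
    ("abc123", "test_export", True),
    ("abc123", "test_export", True),   # consistently passing
    ("def456", "test_billing", False),
    ("def456", "test_billing", False), # consistently failing: broken, not flaky
]

flaky = find_flaky_tests(history)
print(flaky)
```

Grouping by commit matters: a test that fails consistently on one commit is simply broken, while a test whose result changes with no code change is the kind worth burning down.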