Retool Blog | Internal Tools Interview with Ken Liu (Cockroach Labs)

New Developments is a series about internal tools and the people building them. This week, we talked to Ken Liu, an Engineering Director at Cockroach Labs. Ken walked through what internal tools look like at a database company, managing build times, and when to buy SaaS.

Introduce yourself and your team!

I'm Ken Liu, and I work for Cockroach Labs. Cockroach Labs, if you haven't heard of it, is a distributed database company. We make a distributed SQL product called CockroachDB, and we also have a cloud-hosted managed CockroachDB product called Cockroach Cloud. I'm an engineering director here, and I run a couple of teams: I run security and compliance, I indirectly manage two database teams called BulkIO and CDC, and, more relevant to this discussion, I run our developer infrastructure team.

I've been with the company for about two years. I live in New Jersey, but the company's headquarters is based in New York City, and we're pretty distributed. We have offices all over North America and also in Europe.

How do internal tools help your company grow?

So around two years ago, we decided that we really should have a team that owns all these developer tools that we use every day. We have a pretty strong culture in our company. We're an open source company, and we have a lot of engineers who have particular needs, and often we turn to building tools to solve them. Especially being a database company, there are things that we need to do every day that you can't just go and find a tool for. So the company decided that, instead of having everyone build out these tools all the time, we should have a dedicated team really be responsible for them, instead of diffused ownership.

Our focus is on developer infrastructure. So it's really about developer productivity, or engineering productivity. Our job is to help our engineering team be more productive by building tools. Kind of a funny saying that someone had is that "we want to hire engineers without hiring": if we can make people more productive, then we can basically increase the capacity of an engineering team without having to hire more people.

We're definitely growing. We're definitely hiring a lot. And the work that we do is actually part of how the company grows, and its impact is really multiplied by the size of the engineering team. Recently we've gone through a phase where we started hiring a lot of people really fast; we're in a really fast growth stage as a company. And so we're excited that the work we're doing in developer infrastructure is really able to have more of an impact.

When you have people kind of hacking on tools in their free time, it's definitely fun, but eventually those tools become a critical part of everyone's workflow. And so if you're depending on someone having some time on a Friday to fix something, over time that's not going to be very productive for the whole engineering team.

What kinds of internal tools does your team build and maintain?

We have a lot of the standard things like CI and our build. It's actually maybe more complicated than what you might see at other companies. CockroachDB is a pretty complex piece of software. It's interesting because if you've ever worked with other databases, our Cockroach distribution is just one binary file. So the build is pretty complicated because it's got to integrate Go code, C++ code, there's JavaScript code in there for the internal UI. So just the build itself and doing CI is very, very complex. And we run our CI environment on prem (or not really on-prem, it's hosted in our cloud environment).

So we invest a lot of money in just doing builds. That's pretty complex. We've also built some internal testing tools. Our tools have funny names, named after roaches. So we have something called Roach Prod, another thing called Roach Test. There's something called Roach Dash. These are internal tools that we use for testing, and Roach Dash is sort of a developer dashboard. These are tools that do things that you really can't find elsewhere. And they're really closely integrated with our software.

And then there are things like GitHub integrations. We recently started a data engineering team. We also own the cloud infrastructure that our developers are working in: one thing that's unique about Cockroach is that we are multi-cloud. So our developers are working in GCP and AWS and sometimes Azure.

And so that's a lot to maintain. Especially as a team grows, you just have more and more stuff running in these dev environments. And at some point, you have to manage it effectively or else it becomes an issue. So we do that too.

What does your internal tool stack look like?

CockroachDB is written in Go; we're a really big Go shop. CockroachDB and Cockroach Cloud are written in Go, so our tools are mostly in Go as well. There's a little bit of Python in there. We also mostly run our internal tools in GCP because we find it's a little bit easier to manage from a security perspective: it's integrated with G Suite and there are things like IAP to help with managing access.

Like I mentioned before, we also run everything in AWS or many things in AWS. Everything's on GitHub. We're an open source company. We started to use some things like JIRA internally. But yeah, pretty straightforward. Mostly just Go and JavaScript and a little bit of Python and a few other little things here and there, but we're just a very heavy Go shop.

What's the most frustrating part of building and maintaining internal tools?

I think the challenge is that a lot of the tools started their lives as someone's Friday project. We have this thing at the company called Flex Fridays. It's a little bit like Google's 20% time: we view it as a way for engineers to be able to self-direct, whether it's learning or building. You know, solving a gnarly bug that they didn't have time to figure out, or testing a hypothesis or an idea.

That's the genesis of a lot of our internal tools. And so there's the challenge of lots of tools built by a lot of different people, built as side projects without necessarily any clear roadmap. They're just trying to solve a small problem.

When we formed our team, we inherited a lot of these tools. And so I think the challenge with that is the lack of consistency across all of these things. They were built by a lot of different people, and these tools aren't necessarily at operational maturity. They work well, but sometimes things will die, and without sufficient logging, a normal process for updating libraries, or even consistent software design behind some of these tools, it's difficult to work on them.

This year we're focusing on some of the operational maturity I mentioned, like putting alerting, logging, and monitoring into the environments that we're running these things in. Ideally we don't have engineers coming to us and saying, oh, this tool is broken. We want to be able to detect those failures in advance so nobody is getting interrupted by something breaking.

How does your approach differ between internal tools and the core product?

Most of the engineering teams are working on a specific product or part of a product. They get to spend a longer period of time focusing on a narrower set of problems. CockroachDB is really complicated and big, but there are specialized teams focusing on every part of the database or every part of our cloud product. In developer infrastructure, by contrast, we tend to focus on smaller projects that have shorter cycles.

But we're building for engineers. We think of our engineers as our customers. And so we get that direct feedback, which is helpful, because if something's breaking, you can just talk to so-and-so on Slack and say, hey, I see you have this problem. Or we have a developer infrastructure channel where people are welcome to let us know.

We actually have a support rotation where one person on the team is responsible for answering all the questions, whenever things are breaking or people just have questions. So that's a little different, to treat people inside the company as your customers: you get shorter feedback loops.

But as I mentioned, we're responsible for a lot of different things. So there's also the challenge of prioritization and figuring out where we can spend the most effort to get the most benefit. And right now we don't have a dedicated product manager. So I also wear that hat. It's interesting to try to balance the needs of different teams within the company.

When should you build an internal tool vs. buy it?

Very often engineers are really smart and they want to solve problems and they enjoy working on internal tools. As a company, we are an infrastructure company, so we tend to lean more towards solving these problems ourselves. There is a tendency to try to come up with a solution ourselves before we go off and buy something. Personally, I tend to lean more on the side of "let's use SaaS" because there are so many tools out there to support basically every need that developers could have, and there are best-of-breed tools like CircleCI. So I'd rather spend a little bit of money to buy a tool that can do what we need and not have to spend a lot of engineering effort reinventing the wheel or solving a problem that's already been solved.

I think there's definitely value to solving a problem quickly. We've also seen that you can really go overboard with SaaS. And for an infrastructure company, security is definitely a concern, right? If you have all your company data spread out across all these tools, you know, that's a security challenge.

Our cloud product, where our customer data is hosted, is definitely not like that. It's a very tightly controlled set of tools that back our cloud product. But with engineering tools, there are so many GitHub plugins or Slack plugins that do so many things, and you want to be able to experiment with them. So it's a challenge to keep up with.

I don't think it's that black and white either, but in general, I'd rather focus my team on solving interesting problems that aren't easily solved by tools that already exist.

How do you measure your success and improve your internal tools?

It's a little challenging, right? Because just the term developer productivity, or engineering productivity, what does that mean? There's actually been research done on this. You could look at kind of naive metrics, like how long does it take to do a build? I think that's definitely valuable, because you can easily say: look, if someone is spending this many minutes in a day just waiting for a build and you can cut that in half, then it's pretty easy to say that you had an impact on productivity. But it's often not that clear cut.
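As a rough illustration of the build-wait arithmetic described above, here is a minimal Go sketch. The function name and every input value (a 20-minute build cut to 10, 4 builds per engineer per day, 50 engineers) are hypothetical and chosen only for the example; they are not figures from Cockroach Labs.

```go
package main

import "fmt"

// minutesSaved estimates the total engineer-minutes recovered per day when a
// build that took beforeMin minutes now takes afterMin minutes.
// All parameters are illustrative assumptions, not measured values.
func minutesSaved(beforeMin, afterMin, buildsPerDay, engineers int) int {
	perBuild := beforeMin - afterMin
	return perBuild * buildsPerDay * engineers
}

func main() {
	// Hypothetical inputs: a 20-minute build cut in half, 4 builds per
	// engineer per day, 50 engineers.
	saved := minutesSaved(20, 10, 4, 50)
	fmt.Printf("engineer-hours saved per day: %.1f\n", float64(saved)/60)
}
```

A number like this is easy to compute, which is part of why build time is an attractive first metric, but it says nothing about whether the time waiting was actually lost or spent on something else.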

Maybe you want to fail the build faster. I mentioned we spend a lot of money on CI, so should we spend more money on more VMs to parallelize builds more, or should we invest more of our energy in updating some of our build tools, which is kind of a long and complicated proposition? Or would our energy be better spent on testing tools?

So not everything is as easy as just measuring build time. The way we've been approaching it is, yeah, we are measuring build times, because that's a pretty easy thing to measure. But as I mentioned before, we have a data engineering function, so we've started to look at what the various metrics are. What are the things we can measure? And then we start to build a data pipeline where we can do some analysis and look at trends over time. That's very quantitative.
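To make "look at trends over time" concrete, here is a minimal Go sketch that groups build durations by week and reports the median for each. The `buildRecord` type, the week labels, and the sample durations are all hypothetical; the interview does not describe the actual pipeline or its schema.

```go
package main

import (
	"fmt"
	"sort"
)

// buildRecord is a hypothetical row exported from CI: the week the build ran
// in and how long it took, in seconds.
type buildRecord struct {
	Week    string
	Seconds float64
}

// medianByWeek groups build durations by week and returns each week's median.
func medianByWeek(records []buildRecord) map[string]float64 {
	byWeek := make(map[string][]float64)
	for _, r := range records {
		byWeek[r.Week] = append(byWeek[r.Week], r.Seconds)
	}
	medians := make(map[string]float64, len(byWeek))
	for week, durations := range byWeek {
		sort.Float64s(durations)
		mid := len(durations) / 2
		if len(durations)%2 == 1 {
			medians[week] = durations[mid]
		} else {
			medians[week] = (durations[mid-1] + durations[mid]) / 2
		}
	}
	return medians
}

func main() {
	// Made-up sample data for two weeks.
	records := []buildRecord{
		{"week-1", 1200}, {"week-1", 1500}, {"week-1", 1300},
		{"week-2", 1100}, {"week-2", 1250},
	}
	medians := medianByWeek(records)

	// Print weeks in sorted order so the trend reads top to bottom.
	weeks := make([]string, 0, len(medians))
	for w := range medians {
		weeks = append(weeks, w)
	}
	sort.Strings(weeks)
	for _, w := range weeks {
		fmt.Printf("%s median build: %.0fs\n", w, medians[w])
	}
}
```

The median is used here rather than the mean so that a handful of unusually slow builds does not dominate a week's number; that is a design choice for the sketch, not something stated in the interview.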

We also do regular outreach to engineers. One thing we recently started is a rotation where, once a week, every person on our team has to go and spend half an hour meeting with someone to ask a list of questions, like, what are the things that you're finding to be painful? Then we collect all these answers. It's sort of like product management: just talking to customers and getting their input. You've got to talk to people to find out what problems they're running into.

If we can reduce the complexity of the tools that we own, then that makes our team more efficient and that makes us more capable. It gives us more capacity to work on other things, to help other teams. If we're spending a lot of time just being reactive to a tool breaking, then that makes our team unproductive. That's why we've been focusing on operational maturity, because if we can make ourselves more productive, that ultimately helps the people we're working for.

Cockroach Labs is hiring!

We're definitely hiring engineers. You can take a look at our careers page. We're looking for engineers at every level, from interns all the way to very senior engineers. We have a lot of roles in backend engineering for CockroachDB, but cloud is a big focus for us too.

Sometimes there's a perception that everyone at this company is working on the database, but we actually need a lot of folks doing what might be called full stack development or doing work in frontend. These are really, really important because the database isn't just SQL, right? We really need to build good observability tools, and people need to be able to provision CockroachDB in their cloud environment.

So we have needs across the board. And even if roles aren't open, our recruiting team loves to hear from people. So we'd be excited to talk to anyone who's interested in talking to us.
