The intricacies of large-scale software systems are nothing out of the ordinary for a site reliability engineer. Maxime Brugidou knows what it’s like to spend each day keeping one of the largest computing platforms in the ad tech world running like clockwork—he was Criteo’s first site reliability engineer.
Now he’s an engineering director on the site reliability engineering (SRE) team at Criteo’s Paris headquarters, where he is responsible for high-level coordination, setting priorities, and keeping his team motivated to continuously improve our systems.
We talked with Maxime about his role at Criteo. Keep reading to hear about the future of our SRE team and his advice for aspiring engineers.
Tell us about your journey with Criteo.
I joined Criteo in 2010, first working as a software developer on our recommendation engine. Then I became a scalability engineer and pioneered our Hadoop stack. Finally, I became the first site reliability engineer at Criteo and I led a small team to automate our servers with Chef. Now I’m an engineering director in the SRE department of Criteo.
The SRE team was created in 2014 as a natural evolution of our practices to apply engineering principles to operations. We did not really have a formal strategy, but we were facing operational challenges at such a large scale that it was impossible not to apply engineering tactics and software development principles to the job.
Most teams within SRE have evolved a lot, and all of them are doing software engineering on a daily basis.
How big is the SRE team now?
The entire department has more than 140 engineers and has the biggest fleet of servers—15,000, which is about 50% of our total server count. And it is all managed by just five people! Of course, this is possible thanks to the support and tooling provided by other Criteo teams.
Can you tell us more about your current role?
I am responsible for the core teams of SRE that provide the services the organization depends on. We provide a large-scale compute and storage platform built on Mesos and Hadoop.
My role is mainly to provide high-level direction and to coordinate teams both within and outside my group. I also make sure that the right priorities are set so that every team has the staffing and motivation it needs to function.
What’s your biggest challenge in your role?
Over the past eight years, I’ve seen many phases of growth. I believe we have already tackled most of the low-hanging fruit in terms of system automation.
However, now we are facing challenges that are harder. There is only a small group of companies that operate at such a scale, and even if bigger players have shared a lot about their findings, we can’t simply reuse existing open-source tools in our production environment as easily as we used to. We must design our own, or make the existing ones evolve to suit our scale.
Because of this, we have matured a lot and it is definitely for the better. We have built a strong culture of engineering and collaboration. Now we are focusing on more important things, like greater diversity.
What do you like most about your job?
I really enjoy tackling hard problems with incredible engineers. I like to focus on large-scale infrastructure and architectures. Every day I feel like I am part of an organization that has the means to transform our field.
From day to day, we are facing very interesting problems. This includes low-level technical optimizations, high-level designs of distributed systems, and the organizational considerations that come from them. Every person I interact with has a great approach to problem-solving, can provide feedback, and challenges the status quo, which is very exciting!
Lightning round: What are three things you want people to know about SRE?
Site reliability engineers do not do manual operational tasks often—it’s their job to get rid of that.
The best site reliability engineers are great software engineers.
We don’t worship our pager, but we do take pride in keeping the lights on.
What career advice would you tell your younger self?
When I was a junior engineer, I had the tendency to criticize things without seeing the big picture, especially without helping to fix things that I criticized. I realized later that within a company such as Criteo, anyone can help build something better.
Be curious about how things are working under the hood. If you are in a company and you depend on something that you feel is not working well or could be improved, ask to move to the right team to fix that.
How can one become an SRE today when there’s no official school for it?
Most schools have some distributed systems and systems engineering courses, but they rarely train students in running operations, monitoring, or large-scale architecture. You can learn all that on the job.
If you prefer to join a small startup, make sure that someone senior will be with you to accelerate your learning curve. If you start at a larger company, make sure to connect with all the teams around you to explore how things are working at the company level.
You need to enjoy problem solving and fixing things and have a strong sense of ownership to enjoy improving complex systems. A good way to do that is to tinker with Linux servers, for example, no matter the purpose. You will learn Linux, networking, and the joy of spending hours understanding why you broke your email or IRC server while applying some random sysctl.
What’s your parting advice for SRE candidates?
I’d like to emphasize that candidates do not need to talk about all the latest catchwords they hear about on blogs or Reddit. Many of these buzzwords are very interesting, but it’s important to ensure we solve the actual problems and propose solutions that help our business. Using the latest tech is cool and all, but let’s not miss the big picture.
If technology, innovation, and complexity get you out of bed in the morning, SRE could be for you!