Automating DevOps: How Netdata Is Redefining Real-Time Infrastructure Monitoring

The following interview is a conversation we had with Costa Tsaousis, CEO and Founder of Netdata, on our podcast Category Visionaries. You can view the full episode here: Over $30 Million Raised to Power the Future of Infrastructure Monitoring

Brett
Hey founders, and thank you for listening. Today I’m speaking with Costa Tsaousis, CEO & Founder of Netdata, an infrastructure monitoring platform that’s raised over 30 million in funding. Costa, thanks for chatting with me today.

Costa Tsaousis
Hi, nice to meet you.

Brett
I’m super excited for this conversation. Let’s go ahead and start with telling our audience a little bit more about where you’re calling from and where you come from. Costa.

Costa Tsaousis
So Netdata was born out of frustration. I was a C-level executive in a fintech company, actually here in Greece, and we were migrating some infrastructure from on-prem to the cloud, and we were facing significant issues. So after testing almost everything, I spent a couple of million on monitoring just to figure out what was happening and what was wrong. I realized that something is wrong with monitoring systems. Initially, it was curiosity: why have they done it like this, why is there such a big learning curve, such a big setup and preparation that you have to do, why is it not real time, why do you have to know every metric and go through all the burden to understand exactly what’s happening in great detail?

Costa Tsaousis
Initially, it was out of curiosity to understand why monitoring systems work like this. But then I started experimenting, writing code and working nights and weekends and the like. After a few months, I managed to solve the problems that we were facing. Actually, these were very nasty bugs in the cloud provider’s infrastructure, and we managed to find them. But anyway, Netdata was born this way. After working on it for several months, I decided to release it on GitHub. And you know what? Nothing happened. You have spent a lot of time working on a project. It solves your problems. You release it and nothing. So one morning, I wrote a post on Reddit and said, okay, guys, I built this tool, check it out on GitHub if you like it. And boom, it went viral, at the top of Hacker News. Hundreds of people, thousands of installations.

Costa Tsaousis
You know, it was crazy, amazing. I had never experienced anything like it. I think it is something unique for a person to live this love, this acceptance, this adoption. So after that point, my life changed completely.

Brett
What do you think you got right to get that type of early traction? Did you just have a deep understanding of the problem and you could empathize to really understand the problem? Or how did you get such an extreme level of success so early on?

Costa Tsaousis
The problem with monitoring is that you really have to spend a tremendous amount of time, and you need serious skills, to set up a monitoring system and start using it. Initially, that was the biggest concern I had. To my understanding, you know, all of the companies across the world have to go through the same process. It’s exactly the same process in most cases. We all use packaged applications: we use a database server or multiple database servers, web servers. We have some custom applications, but most of the infrastructure is packaged applications. But then if you think a bit about it, why do people across all these companies have to go through the same process again and again in order to monitor their standardized infrastructure? That was the initial thing. Initially my thought was not to build a monitoring solution.

Costa Tsaousis
I was looking for a solution to kill the console. I didn’t want my engineers and my team to spend time on console tools and the like. Also, because it was a fintech company, I had several issues with direct access to the systems: it had to be documented, authorized, et cetera. So my initial idea was, okay, let’s build something that will provide everything the console provides, everything you can find by SSH-ing into a server. So all the metrics, independently of whether they are useful or not, or whether we know about them or not. Add everything there, and do it at the same granularity, in the same detail as the console provides, so per second as a standard for everything. Now, once you build this thing, it starts up and starts collecting stuff by itself. You don’t do anything.

Costa Tsaousis
It finds a database server, it connects to it, it starts collecting stuff from the database, it finds the containers, the network interfaces, whatever is there. So once you have an application that starts collecting stuff by itself and now you have everything, then you have more problems later. Okay, how do I visualize all this? Oh, I need something to visualize it automatically. Why go through the process of creating every chart and every metric and every alarm by hand? Let the application know all the metrics, all the dashboards, all the alarms that need to be used here, and let it start them up automatically. So that was the idea of killing the console: optimize the time required from people in order to troubleshoot. And in this journey you find all the loose ends here and there, and you try to solve them.

Costa Tsaousis
So I think that this is why people loved Netdata, because the moment you install it, you have a comprehensive monitoring solution. You actually did nothing to get it. You just installed an application. And the beauty of it is that in many cases, it is better than what you can build by hand. Today we have many Fortune 500 companies that shut down the monitoring systems they had developed themselves, using open source tools or proprietary tools or whatever, in order to use Netdata. Why? Because they find that the completeness of Netdata is such that they can never achieve it by themselves. They don’t have the skills or the time, or they don’t want to put in the effort to achieve the same level of completeness.

Costa Tsaousis
So for users, this was initially shocking. I was receiving emails from people saying, hey, I am here, I want to work with you. This is amazing, what you have built. It was unlike any other monitoring. Initially, in 2016, it was unlike anything else you had seen as a monitoring system: real time, comprehensive, a lot of metrics, thousands of data collections, very fast, et cetera.

Brett
Can you give us an idea of the type of adoption and growth that you’re seeing right now?

Costa Tsaousis
Yes, of course. So today Netdata has about 66,000 stars on GitHub. It is leading the observability category in the CNCF landscape. We surpassed Elastic, so we are the first observability platform in the CNCF landscape in terms of stars. Of course, this is user acceptance. Then we have about five to ten thousand new users every day signing up to the project, and about a quarter of a million Docker Hub downloads every day. And then there is the SaaS offering that we have, which is a complementary service, let’s say, to the open source. On the SaaS offering, we have about 150 to 200 business signups every day. The SaaS offering currently monitors 100,000 nodes, and about 2,000 new nodes are added every day. So big numbers, big project.

Brett
What do you attribute that success to? What have you gotten right from a marketing and growth perspective?

Costa Tsaousis
I think the simplicity of the design is the first thing. The second is that monitoring is hard and we’re trying to simplify it. So we’re trying to figure out ways to make it work by itself. If you think a bit about it, it’s like putting the knowledge of monitoring into the tool. For example, the most popular open source monitoring solution today is Prometheus and Grafana. But if you install Prometheus and Grafana, you have to go through a very steep learning curve. You have to learn really a lot of stuff, and you have to install a lot of stuff. And in order to have a proper monitoring solution, you need skills. This is the thing that I call the DevOps utopia.

Costa Tsaousis
So if you go to the analysts, the analyst sites, et cetera, they will tell you that a DevOps is a person that has amazing development skills, amazing infrastructure understanding skills, and amazing data analysis skills. Well, these guys do not exist. You cannot find someone that is an amazing developer, a super duper sysadmin, and at the same time a data analyst. So this is a utopia. This is something that cannot happen. And the moment we tried to put the knowledge into the tool, things became a lot simpler for people. Suddenly, you can use a monitoring solution that is as comprehensive, as complete, as real time as, for example, the monitoring systems of Facebook and Netflix, et cetera, and you can have it on your computers, you can have it on your servers. It’s there, it’s free, it works by itself.

Costa Tsaousis
It has the knowledge, it knows what the metrics mean and what they do and how they should be correlated. So this gives you the freedom to work on the actual problem that you have, setting up the infrastructure or improving the performance or doing the actual work of your job, instead of babysitting the monitoring system. So I think that this is what made people love Netdata. I think that this is the main reason people love it: it’s so easy, it’s so complete, it’s so out of the box, and it gives them very good solutions, even to troubleshoot stuff that they never knew existed.

Brett
What about your market category? Is it observability? Is it infrastructure monitoring? How do you think about market category?

Costa Tsaousis
Observability is a very broad idea. It has infrastructure monitoring, of course. It has APM, application performance monitoring, so if you are building your own application, to monitor it and improve it. And it has other stuff. Network monitoring is also there. Traces are also there, for microservice environments, et cetera. We focus mainly on infrastructure monitoring, so logs and metrics about infrastructure. We don’t do traces yet, but I think all of them, behind the scenes, are interconnected. As a startup, we had to start from something, do it well, and then proceed and innovate as time passes, as revenue comes in and we have more funds, et cetera, so we can address a broader market. Of course, observability is a market that is congested. Everyone has something; even companies develop their own observability platforms.

Costa Tsaousis
I think that we stand out mainly due to the out-of-the-box functionality, the very soft, let’s say, learning curve, and the completeness that we provide. That is extremely hard to achieve, even if you spend a ton of time trying to do it yourself. So all this combined makes Netdata an amazing solution for, you know, the companies that don’t have the time or the resources or the budget to go through a monitoring setup by themselves. You just install Netdata and you’re done. You start troubleshooting immediately after.

Brett
Now, as I mentioned there in the intro, you’ve raised over 30 million to date. What have you learned about fundraising throughout this journey?

Costa Tsaousis
Oh, yes, fundraising. Initially, I was not familiar with the funding arena, let’s say, in the US. Netdata was my first time. I would say that for me, it was relatively easy, mainly because the project has a lot of traction. Even several years ago, the numbers were amazing, the figures that we could showcase to investors. So this made it quite easy for Netdata to attract funds. Now, I would say that investors generally are very keen to invest. They are trying to find the next unicorn; they are looking for it. The most important thing is to have something robust to showcase, something amazing, to gain their trust.

Costa Tsaousis
I think that Netdata helped a lot in this, mainly because all this community and all these figures and all these stars and all these downloads, you know, all this kind of stuff made it very appealing for them. Of course, as the company matures, you are in a position where, you know, it gets tougher and tougher. So, for Netdata, for example, we started out as a 100% remote company. We are still a 100% remote company. But I have to tell you that this is probably the toughest thing among everything else. Building the product is not that hard, but managing a company that is 100% remote is probably the toughest. To see the problem, you have to think a bit.

Costa Tsaousis
If you work in a company, and let’s assume that you are hired today and the company is 100% remote. So you are at home and you say, okay, let’s start working. You understand that the company needs to have something in place to make sure that you are going to be productive at the end of the day. Because if the company doesn’t have this, then most likely you will be struggling. There will be a lot of noise or misunderstandings or things that you heard here and there that are not company decisions, and they will be blocking your way and your productivity. So I think we learned this the hard way, because the path for Netdata was tough and is still tough. I think the hardest problem to tackle is this remote working problem.

Costa Tsaousis
Now, apart from that, I think it is mainly that the market is thirsty. People need solutions. And we see this even among Fortune 500 companies. We have a few of the biggest enterprises of this world that use Netdata, and we see that even these companies are struggling with monitoring, and they need solutions, they need tools. So what I think, and I was thinking this from the beginning, is that at Netdata it is like we are racing against ourselves. We’re not racing against someone else, because the product is so unique. The ease of use and the automatic setup and the like are one aspect. Another aspect is that the product is distributed by nature. All monitoring solutions traditionally centralize all data in one place. So you have databases; even the commercial providers do this.

Costa Tsaousis
So they collect all the data and they centralize all the data. And then they have to manage this huge pipeline. What Netdata does is that it allows the data to be distributed. So you install as many Netdata agents as you need out there on all your servers. You can build as many data centralization points as is convenient for you, if you need centralization points. You may need centralization points because your servers are ephemeral or something: you have a Kubernetes setup and nodes go up and down all the time. But if you don’t need centralization points, you don’t have to have them. Then what we do is that when all of them connect together, they build a massive distributed database that is spread all over the infrastructure. So you may have thousands of database servers.

Costa Tsaousis
Let’s say that all of them become one instantly at the dashboard level. So we built everything required in order to have this distributed database functionality. And this allows people to think about monitoring in completely different terms. The scalability issue is not there anymore. You can scale to infinity, and still you don’t need to scale up to bigger servers and the like just for monitoring. You just add more, as many as you want. Such innovations happened in many, many cases. For example, in Netdata we also have anomaly detection with machine learning.

Costa Tsaousis
In order to achieve this anomaly detection, for example, we had to rethink what machine learning can do for monitoring, how it can do it, and how we can achieve a situation where, with zero configuration, zero learning, zero training, zero involvement from the engineers, you have machine learning and anomaly detection that is useful. So what we do there, unlike everyone else, is that we don’t train machine learning models ourselves and then distribute them with the training that we have done. We don’t do that. What we do is that we train machine learning models on each server. Each server collects its own metrics and trains its own models at the edge. And once you train these models, it’s easy to have anomaly detection at the edge.
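The per-server training described here can be pictured with a small sketch. This is only an illustration under stated assumptions, not Netdata’s actual models or code: it swaps in a plain mean and standard deviation (a z-score check) as a stand-in for whatever model is really trained, and the names EdgeMetricModel, train_node, window and threshold are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class EdgeMetricModel:
    """Hypothetical per-metric model, trained only on this node's own samples."""

    def __init__(self, window=3600, threshold=3.0):
        self.samples = deque(maxlen=window)  # most recent samples kept locally
        self.threshold = threshold           # z-score cutoff (an assumed value)
        self.mu = None
        self.sigma = None

    def observe(self, value):
        """Store a newly collected sample for later training."""
        self.samples.append(value)

    def train(self):
        """Refit from local history; no data leaves the node."""
        if len(self.samples) < 2:
            return
        self.mu = mean(self.samples)
        self.sigma = pstdev(self.samples)

    def is_anomalous(self, value):
        """Flag a value that is far from what this node has seen before."""
        if self.mu is None:
            return False  # untrained model never flags anything
        if self.sigma == 0:
            return value != self.mu  # constant history: any change is unusual
        return abs(value - self.mu) / self.sigma > self.threshold


def train_node(history_by_metric):
    """Train one model per metric from the history collected on this node."""
    models = {}
    for name, history in history_by_metric.items():
        model = EdgeMetricModel()
        for value in history:
            model.observe(value)
        model.train()
        models[name] = model
    return models
```

The point of the sketch is the shape of the design: every node builds its own notion of normal from its own data, so nothing has to be configured or shipped in advance.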

Costa Tsaousis
And then what we do is that we say, okay, machine learning and anomaly detection for observability is noisy. You have a lot of false positives. Just because you have an anomaly somewhere does not mean you have to wake up at 3:00 a.m. So what we did then is that we said, okay, when there is a problem with your infrastructure, when something is faulty, not one metric but really a lot of them are going to have anomalies. So the amount of metrics that go anomalous together is what triggers an alarm for us. All the metrics have a little bit of anomalies here and there.

Costa Tsaousis
But when all these anomalies get synchronized, and a lot of metrics have a lot of anomalies together, concurrently, then we know for sure that there is something bad happening in the infrastructure and you need to wake up and fix it. So the idea is that we have to innovate in many, many different aspects, even on the visualization. How do you present this anomaly detection, this anomaly information that we now have for every metric? So we had to innovate on the charts. We had to change how the charts are presented and how they can be used by users. Netdata is full of such innovations from the bottom up.

Costa Tsaousis
That goes from the installation, the configuration, the data collection, the database itself, the design, the topology that you can build with Netdata, to the user interface, the charts, the dashboards, the alarms, et cetera. At the same time, this is fun, because you are trying to solve a problem that is somehow solved today, but you are trying to solve it in a way that will save people time. You will make people a lot more productive. I think that this is what is most appealing about the solution.
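The alarm idea from this answer, where a single anomalous metric is treated as noise and only many metrics going anomalous together raise an alarm, can be sketched as follows. This is a minimal illustration and not Netdata’s implementation: the function names, the 20% rate threshold and the requirement of three consecutive samples are all assumptions made for the example.

```python
def anomaly_rate(flags):
    """Fraction of metrics flagged anomalous at one point in time.

    `flags` maps a metric name to True (anomalous) or False (normal).
    """
    if not flags:
        return 0.0
    return sum(1 for flagged in flags.values() if flagged) / len(flags)


def should_alert(flags_over_time, rate_threshold=0.2, min_consecutive=3):
    """Alert only when many metrics are anomalous together for a while.

    Both thresholds are hypothetical values chosen for illustration.
    """
    streak = 0
    for flags in flags_over_time:
        if anomaly_rate(flags) >= rate_threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the anomalies did not stay synchronized
    return False
```

With ten metrics, one metric flagged in every sample gives a rate of 0.1 and never alerts, while six of ten flagged for three samples in a row does.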

Brett
Amazing, Costa. We are up on time, so we’re unfortunately going to have to wrap here. I have a lot more questions I want to ask you, but we’ll have to save that for a part two sometime in the future. Before we wrap, if there are any founders listening in who want to follow along with your journey, where should they go?

Costa Tsaousis
Of course, on LinkedIn, on X now, on Reddit of course, and of course on GitHub. I code even today. Most of my time is still spent coding.

Brett
Amazing, Costa, thank you so much for taking the time to chat. This has been a lot of fun.

Costa Tsaousis
Thank you very much.

Brett
All right, keep in touch.
