Planet Odoo

Damage Control: What Could Possibly Go Wrong?

May 30, 2023 Odoo Season 1 Episode 18

When things go wrong, it's always better to have strong foundations and be prepared to fight back.

In today's episode, Denis Ledoux, a software developer at Odoo, shares tips and tricks on how to prepare for hackers' attacks, and explains how Odoo recovered from the attacks and server crashes it encountered in the last few years.
______________________________________________________

Don’t forget to support us by clicking the subscribe button, leaving a review, and sharing your favorite episode!

- See it in action by trying Odoo: https://odoo.com/trial

Concept and realization: Manuèle Robin, Ludvig Auvens, Marine Louis
Recording and mixing: Lèna Noiset, Judith Moriset
Host: Olivier Colson

Denis Ledoux:

Damage control is the fact that we know some failures are inevitable. Take the matter seriously and do not neglect it. You should do it now. It's not because it's not happening today that it won't happen in the near future. I would say, quoting John Cena, if you don't learn from your mistakes, then they become regrets. The longest availability interruption we had lasted five hours. This is huge. It never happened again. This is the biggest one we had. So the World Data Center burned, and people were trusting the data center and didn't do any backups. And in this case, it even shows that you need to do backups in different places, ideally on different continents. When you have only one server, it's okay because if it goes down, you see immediately that it went down. But if you have ten servers, 100 servers, or 1000 servers, you cannot know which server is going down. So you need mechanisms to be able to know when a server goes down or when something happens. Human error, hackers, or system failures can cause chaos in your enterprise. And so you need to train and to be prepared for them.

Olivier Colson:

Welcome back for another tech and dev episode. This time we'll explore everything about damage control with our firefighter developer Denis Ledoux. In the fast-paced and constantly evolving world of tech, even the best-laid plans can sometimes go wrong, causing unexpected and costly damage. Whether it's a security breach, human error, or a critical system failure, effective damage control strategies can make the difference between a minor setback and a full-blown disaster. From the time we had a fiber cut at the other end of the world to the time we negotiated with a hacker, he will guide us through damage control best practices to extinguish even the most raging fires. Hi, Denis.

Denis Ledoux:

Hi, Olivier.

Olivier Colson:

Nice to have you here.

Denis Ledoux:

Thank you.

Olivier Colson:

So before starting, I like to begin the episode with a small introduction of who you are and what you are doing. So what can you tell us?

Denis Ledoux:

So I am Denis Ledoux. I have worked at Odoo in the research and development team as a developer for the past ten years, and still counting. And the reason I'm here today is because I worked on the Odoo online cloud platform, the Odoo Cloud platform, the upgrade platform, while also working at the same time for the support.

Olivier Colson:

Yeah, you did plenty of things.

Denis Ledoux:

I did plenty of things. So I know I have plenty of shit to tell. And so I have known a lot of failures, like hardware failures, servers going down for no reason, customers complaining. So this is the reason why I'm here today to talk about damage control.

Olivier Colson:

And what is damage control exactly?

Denis Ledoux:

So damage control is the fact that we know some failures are inevitable: servers will go down, people will try to hack us and succeed, like being able to take our servers down while we don't want that. So knowing that it's inevitable, we put mechanisms in place to recover from those failures as soon as possible and as completely as possible.

Olivier Colson:

Yeah, because I guess you have to ensure the availability of your service, your server, and you, well, you're doing things for security, but it's never enough. That's what you mean.

Denis Ledoux:

Yeah, exactly. So basically, whatever you do, whatever mechanism you put in place to prevent servers from going down, or whatever security you put in place so people are not able to hack you, it will still happen. So knowing that fact, we put some mechanisms in place to recover as soon as possible. And for the cloud platform, we ensure availability 99.9% of the time, which basically means that your database should always be available except for 45 minutes a month. So it should be unavailable at most 45 minutes a month.

Olivier Colson:

If you sum up all the downtime, that is.

Denis Ledoux:

But of course, we do much better than that. Most months, you don't have any unavailability for your server.
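As a rough check of the figure quoted above, the arithmetic can be sketched in a few lines of Python. The function name and the 30-day month are assumptions made for illustration; the 99.9% target comes from the conversation. The result is a little over 43 minutes for a 30-day month, which is in line with the rounded "45 minutes a month" mentioned by Denis.

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Return the maximum downtime, in minutes, that still meets the availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Hypothetical helper applied to the 99.9% target quoted in the episode.
budget_30 = downtime_budget_minutes(99.9)
budget_31 = downtime_budget_minutes(99.9, days_in_month=31)
print(f"30-day month: {budget_30:.1f} minutes")
print(f"31-day month: {budget_31:.1f} minutes")
```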

Olivier Colson:

Sure. And how do you do that? Because in practice, what can be done for that?

Denis Ledoux:

So what can be done for that? As we grew and had more and more customers and more and more servers, we put more mechanisms in place. But what we have had from the very beginning of Odoo is backups, like everyone. So for each database on the online platform, for instance, we have a backup, and even multiple backups per day, week, etcetera, and they are replicated on multiple servers across the world. So, for instance, your backup is hosted in Europe and also in America. So let's say there is an earthquake in Europe and all data centers go down, then we still have an available backup in America. That's the point of it. And then, as Odoo grew, we put other mechanisms in place, such as having a master and a slave for each server. What that means is that for each server we have hosting customer databases, we also have a second server that replicates everything the first does.

Olivier Colson:

So if anything goes wrong on the server, you still have the copy.

Denis Ledoux:

Exactly. So for instance, if the hardware is slowing down or the memory is failing, or for any other reason (it could also be the Ethernet cable being unplugged because, trust me, it happens more than you think), we can switch in five minutes, even less than five minutes, to the slave server. Then the slave server becomes the main server and the availability is back up very fast.
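The switch described here can be sketched as a small Python model. This is only an illustration under assumptions: the class names, the `healthy` flag, and the server names are hypothetical, and a real failover also involves replication lag, DNS or routing changes, and so on, none of which is detailed in the episode.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class ReplicatedPair:
    primary: Server   # the "master" in the conversation
    replica: Server   # the "slave" that replicates everything

    def failover_if_needed(self) -> bool:
        """Promote the replica when the primary is unhealthy.

        Returns True if a switch happened, False if nothing had to be done.
        """
        if self.primary.healthy:
            return False
        if not self.replica.healthy:
            # Both are down: this is where a disaster recovery decision starts.
            raise RuntimeError("primary and replica are both down")
        self.primary, self.replica = self.replica, self.primary
        return True


pair = ReplicatedPair(Server("srv-a"), Server("srv-b"))
pair.primary.healthy = False  # e.g. someone unplugged the Ethernet cable
switched = pair.failover_if_needed()
print(switched, pair.primary.name)
```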

Olivier Colson:

It's funny what you're saying, because it's not the first thing that comes to mind when you say, Oh, something might go wrong with the servers, because you mentioned hacking at the beginning, in the introduction. It might also be just human mistakes like that. Someone unplugs a cable, something is just broken on the machine, and well, it happens.

Denis Ledoux:

It's just human error. So sometimes it's hardware failure, sometimes it's just human error. For instance, no longer than two weeks ago, it was just an employee from the data center. They unplugged our server, but they meant to unplug another one. They just made a mistake and chose the wrong server. Oopsie. And then an entire server goes down, and we have to do something about it.

Olivier Colson:

And typically, how long does it take for them to realize that there was an error like that? Is it always up to you to say, Oh, by the way, my site is not working anymore? Or...

Denis Ledoux:

No. The data center also has a monitoring service. So when something like that happens, they know as well. But it still takes like 5 or 10 minutes to realize. It has to be fast. And sometimes we are faster than them.

Olivier Colson:

About replication with the master and slave: this is something we are doing at Odoo, but is that something that the data centers sometimes offer to do for you?

Denis Ledoux:

Yes. So for instance, those master and slave servers, this is what we do for the Odoo online platform, because we have had it for years, since before data centers offered services to do it on your behalf. Now for Odoo SH, we use Google Cloud, and on Google Cloud, you don't have to worry about that because the disks are not attached to the server, they are on the network. These are network disks, and this allows permanent, persistent replication, so we don't have to worry about that. It's the data center.

Olivier Colson:

That's like another layer of abstraction: you're calling the disk, and what it is doing exactly, you don't know. But the point is that we have access to it.

Denis Ledoux:

Yeah, we don't have to care about it. It's just the data center that does it. Of course, it has a price. It's more expensive than regular servers, but it's one less layer that we have to care about.

Olivier Colson:

And is it possible that the master and slave server are not enough and that the slave server is failing as well at the same time as the master server?

Denis Ledoux:

Yeah, it happens.

Olivier Colson:

Do we have a plan for that and how can it happen?

Denis Ledoux:

Well, it depends. If we know it will recover soon, like the data center will do something within ten minutes, then we don't do anything. We just wait. But then if it takes too long, like it takes five, six, seven hours, we have to take a decision and to decide, well, let's do a disaster recovery and take all backups from the backup server and restore them. But this has a cost because we don't do backups every hour or every minute. So there will be a loss of data. The last two, three, four hours will be lost for the customer. So this is a hard decision to take between losing data and recovering as fast as possible.
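The trade-off Denis describes (wait for the data center, or restore from backups and lose recent data) can be sketched as a simple comparison. This is a deliberately naive model under stated assumptions: the function name, the `loss_weight` parameter, and the 1.5-hour restore time are hypothetical, and treating one hour of lost data as equivalent to one hour of downtime is an arbitrary choice, not Odoo's actual rule.

```python
def should_restore_from_backup(
    eta_hours: float,
    restore_hours: float,
    data_loss_hours: float,
    loss_weight: float = 1.0,
) -> bool:
    """Return True when restoring from backups looks cheaper than waiting.

    eta_hours: how long the data center says the outage will still last.
    restore_hours: how long restoring all databases would take (assumed).
    data_loss_hours: age of the latest backup, i.e. work customers would lose.
    loss_weight: how many hours of downtime one hour of lost data is worth (assumed).
    """
    cost_of_waiting = eta_hours
    cost_of_restoring = restore_hours + loss_weight * data_loss_hours
    return cost_of_restoring < cost_of_waiting


# Data center announces two more hours, latest backup is four hours old: wait.
short_outage = should_restore_from_backup(eta_hours=2, restore_hours=1.5, data_loss_hours=4)
# No recovery expected for two days: restoring becomes the better option.
long_outage = should_restore_from_backup(eta_hours=48, restore_hours=1.5, data_loss_hours=4)
print(short_outage, long_outage)
```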

Olivier Colson:

It's like the red button: in case of chaos, push here. And you have to know when it's actually chaos.

Denis Ledoux:

And thanks to all the other mechanisms we have, we never had to use the disaster recovery up to now.

Olivier Colson:

Hopefully, it lasts. So you were mentioning the backup server. How does it work exactly? That's an additional layer that we didn't really talk about so far.

Denis Ledoux:

The backup server? It's just that every day, or every 12 hours, the backup server takes the backup from each server: they make a backup of the databases, take it, and put it on another server, and that's it. So when we want to do a disaster recovery, we just spawn a new server, take all the backups from that backup server, and then we put the services back in place.
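A minimal sketch of the "take the backup and put it somewhere else" step, with a checksum to confirm each copy is intact. Everything here is an assumption for illustration: local directories stand in for backup servers in different regions, and the file name and function names are hypothetical. A real setup would copy over the network and run on a schedule.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum used to verify that a copy matches the original dump."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate_backup(dump: Path, destinations: list[Path]) -> list[Path]:
    """Copy one database dump to several destinations and verify each copy."""
    expected = sha256(dump)
    copies = []
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(dump, dest / dump.name))
        if sha256(copy) != expected:
            raise IOError(f"corrupted copy in {dest}")
        copies.append(copy)
    return copies


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    dump = root / "customer_db.dump"          # hypothetical dump file
    dump.write_bytes(b"fake database dump")
    # Directories standing in for backup servers on two continents.
    copies = replicate_backup(dump, [root / "europe", root / "america"])
    copy_locations = [c.parent.name for c in copies]
    print(copy_locations)
```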

Olivier Colson:

And when there's an attack? So, for example, there's some denial of service going on with your provider and the servers are unreachable. Do we have something that can move the database somewhere, or alleviate that in some way?

Denis Ledoux:

Yeah. So this is also something we learned from past failures. Before, we were just moving databases by hand when something like that happened. But when I say that, it's like ten years ago that we were doing that. Since then we wrote a script that does everything for you: you launch the script and then suddenly your database is moved within seconds to another server. And not just your database. It can be like 100 databases that are moved within five minutes. Those things are automated, and we just have to push a button and then it just goes to another server.

Olivier Colson:

So you have to monitor what is going on on the whole platform, like all the time. How do you do that? I assume you're not running commands server by server and taking the numbers that it gives you.

Denis Ledoux:

Well, I think when you are a small company, this is what you do.

Olivier Colson:

But when you have a bunch of servers, you can do that. But I guess not. Not anymore.

Denis Ledoux:

So when you have 1 to 10 servers, you can do that. But we have thousands of servers now, so.

Olivier Colson:

We can't hire thousands of people to do that server by server.

Denis Ledoux:

And we cannot rely on customers to say, Hey, my server is down. So we have a monitoring service. Before, it was Munin that we were using. So basically just a list of servers and then alerts to tell you, okay, this server is down or this server is slower than usual. From Munin we moved to Grafana because it's more suited to bigger companies, companies with more servers.

Olivier Colson:

Okay, you have more stats on it, or...

Denis Ledoux:

It's not more stats, but basically on Munin, if we have 1000 servers, it shows you a list of 1000 servers. Really. Like you go in your browser and you have 1000 servers and then you have to scroll to find the one which is having a failure. While on Grafana, you can make alerts. Let's say that the requests of customers are taking 10 seconds, which is unusual, then it immediately gives you an alert telling you, okay, on this server there is a problem. And so in the blink of an eye you can see what is failing on which server.
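The idea of a threshold alert, as opposed to scrolling through a list of servers, can be illustrated in plain Python. This is not Grafana configuration; the function name, server names, and latency samples are made up for the example, and only the 10-second threshold comes from the conversation.

```python
from statistics import mean


def servers_to_alert(
    latencies_by_server: dict[str, list[float]],
    threshold_s: float = 10.0,
) -> list[str]:
    """Return the servers whose average request time reaches the threshold."""
    return sorted(
        name
        for name, samples in latencies_by_server.items()
        if samples and mean(samples) >= threshold_s
    )


# Hypothetical request durations, in seconds, for three servers.
samples = {
    "srv-001": [0.2, 0.3, 0.25],
    "srv-002": [11.0, 12.5, 9.8],   # unusually slow requests
    "srv-003": [0.4, 0.5, 0.3],
}
alerts = servers_to_alert(samples)
print(alerts)
```

Only the slow server is reported, which is the point Denis makes: the tool surfaces the failing server instead of leaving a person to find it among a thousand.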

Olivier Colson:

And those tools, how do they work in practice when you need to deploy them on your side? How is it working?

Denis Ledoux:

When we have to deploy it on our servers, we just have an automated script that installs everything. Each time we spawn a new server, we have an automated script that does everything for us, and that way we do not forget anything. So it installs, for instance, Odoo on the server, and then it also installs a node on the server which gathers data about the server. Then the master server, or Grafana, comes and gets that data from each server, collects it, and then builds statistics and a dashboard.

Olivier Colson:

So the idea is that you have a central server for aggregating the data that connects to the others and just takes the information that it needs from them. And on each server you have something running, gathering that information, right?

Denis Ledoux:

Exactly. And then it gives you a dashboard to see what is happening on each server.
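The pull model described above (an agent on each server, a central collector that fetches from all of them) can be sketched as follows. This is a toy, in-process illustration: the agents are plain Python functions with hypothetical names and values, whereas a real deployment would expose metrics over the network.

```python
from typing import Callable

Metrics = dict[str, float]


def make_agent(cpu_pct: float, mem_pct: float) -> Callable[[], Metrics]:
    """Build a fake per-server agent that reports fixed readings."""
    def read() -> Metrics:
        return {"cpu_pct": cpu_pct, "mem_pct": mem_pct}
    return read


def unreachable_agent() -> Metrics:
    """Simulates a server that cannot be contacted."""
    raise ConnectionError("no route to host")


def collect(agents: dict[str, Callable[[], Metrics]]) -> dict[str, Metrics]:
    """Central collector: pull metrics from every agent and record who is up."""
    dashboard: dict[str, Metrics] = {}
    for server, read in agents.items():
        try:
            dashboard[server] = {"up": 1.0, **read()}
        except ConnectionError:
            dashboard[server] = {"up": 0.0}
    return dashboard


agents = {"srv-001": make_agent(35.0, 60.0), "srv-002": unreachable_agent}
dashboard = collect(agents)
print(dashboard)
```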

Olivier Colson:

Okay. Is there another consideration to have when it comes to damage control for handling Odoo in day-to-day life?

Denis Ledoux:

So yes. One example I think is email reputation. Our customers have databases on our platform and they send emails, and those emails are sent by our email servers. And one of the things that customers do not like is having their emails land in the spam folders of their own customers. So what we have to maintain is a good reputation for our email servers.

Olivier Colson:

And how do you do that?

Denis Ledoux:

How do I do that? First, how does it happen that the reputation goes down? It's just people that send spam. They use our platform to send spam email: they basically create a new database and then send spam email to customers. And what we have to prevent that are multiple mechanisms. But the biggest one is that we limit the number of emails you can send for each new database you create on the online platform. So like eight years ago, this limit was at 400 emails. Each time you were creating a new database, you could send 400 emails. Nowadays we had to decrease this limit to 20 emails because it was really becoming a problem. There are people out there just making scripts to create databases, automatically send 20 emails, then create another database, send 20 emails, et cetera, et cetera. So even a limit of 20 emails is not enough. We need other mechanisms than this limit to maintain the reputation of our email servers.
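A per-database sending limit like the one described can be sketched with a simple counter. The class and method names are hypothetical, and the sketch leaves out any reset period because the episode does not spell one out; only the default of 20 emails comes from the conversation.

```python
class EmailQuota:
    """Track how many emails each database has sent against its limit."""

    def __init__(self, default_limit: int = 20) -> None:
        self.default_limit = default_limit
        self.limits: dict[str, int] = {}   # per-database overrides
        self.sent: dict[str, int] = {}

    def raise_limit(self, database: str, new_limit: int) -> None:
        """Manually increase the limit, e.g. after a support review."""
        self.limits[database] = new_limit

    def try_send(self, database: str) -> bool:
        """Count one email if the database is under its limit, else refuse."""
        limit = self.limits.get(database, self.default_limit)
        used = self.sent.get(database, 0)
        if used >= limit:
            return False
        self.sent[database] = used + 1
        return True


quota = EmailQuota()
results = [quota.try_send("new-db") for _ in range(21)]
accepted = results.count(True)
print(accepted, results[-1])

quota.raise_limit("new-db", 100)       # trust granted by support
after_review = quota.try_send("new-db")
print(after_review)
```

As Denis notes, a limit like this is defeated by creating many databases, which is why it is only one mechanism among several.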

Olivier Colson:

And I guess when you're running an actual business, you have to kind of ask for more emails, because 20 emails a day is not enough. How does it go then?

Denis Ledoux:

When you start to pay, the limit already increases automatically, when you use your credit card, etcetera. If you need even more, then you can contact either your salesman or the support, and the support will have a look at the emails you sent. And according to the content of those emails and the contacts you send them to, we will increase the limit little by little, because we need to trust that you won't send spam emails.

Olivier Colson:

Yeah, it's like we are extra careful on what we accept and what we allow people doing with that because of this reputation thing.

Denis Ledoux:

And sometimes it's not that easy because you read the content and you say it really looks like spam.

Olivier Colson:

But it's actually...

Denis Ledoux:

But actually they send it to their customer list, so people that agreed to receive those kinds of emails. So it's okay, but it's hard to tell that they gathered that contact list from their own customers, and did not get it from I don't know where on the web.

Olivier Colson:

From what you were saying, I gather that you have a little story about that, right?

Denis Ledoux:

Yes, I do. So I said that there are people writing scripts to automatically create databases. There is also one guy in particular that I met during my support years who was paid by another company to manually create databases, send 20 emails, manually create another database, send 20 emails, etcetera. And he even paid for that: he paid Odoo a subscription to be able to send even more emails. So this is how I met the guy. Basically, the salesman sent us a support ticket saying, increase the number of emails of this database. So first thing I do, standard procedure: you go to the database and you have a look at the kind of emails sent, and to whom. And then I just see that it's spam. It's clearly spam.

Olivier Colson:

Was it phishing?

Denis Ledoux:

I don't remember exactly, but I think it was about trading. So it's not.

Olivier Colson:

It often is.

Denis Ledoux:

It often is. Not always, but it often is, when it's like, Oh, we will help you to trade. You will become a successful trader and have a lot of money. So this was the kind of email. And so the first thing I said to the salesman is, I think it's a stolen credit card, because it's registered to a guy in Utah, in America. And given the kind of email, it must be a stolen credit card. What I didn't see is that the customer was in the ticket, and it wasn't a stolen card. It was really his credit card. And he was outraged that I said that it must be a stolen credit card and that the content of his emails was spam. Then the conversation went on and I realized that he was paid by a company to send those emails, but he didn't realize himself that it was spam. He said, okay, maybe it's a bit on the spammy side, but it's not actual spam. But clearly, it was. So in the end we had to just shut down his database and cut the conversation with him because he didn't want to hear that it was spam.

Olivier Colson:

Funny. Funny because you'd expect those people to know what they're doing.

Denis Ledoux:

Yes, exactly. But the fact he was paid by another company is amazing.

Olivier Colson:

Do you have other stories of things that happened? Because, as you said, things happen. And I think it's interesting to have a bunch of examples of things that went wrong with Odoo and how they were managed by our team, just to give people examples of what we were saying earlier.

Denis Ledoux:

The longest availability interruption we had lasted five hours, and this is huge. It never happened again. This is the biggest one we had, and during those five hours we had to decide if we wanted to do that disaster recovery or not. But first I will tell the story. So it was in a data center in Canada, and a car crash happened. And the car just cut an optic fiber.

Olivier Colson:

Crazy. You wouldn't expect them to be just next to the road like that.

Denis Ledoux:

You wouldn't expect that. But it happened, and there was no backup yet. There was a plan to have a second optic fiber going under the road, but it wasn't done yet. So it was the only optic fiber of the data center available. As soon as it was cut, no more servers, no more network. The data center was completely shut down.

Olivier Colson:

That's crazy. It's just bad luck, because that's one thing I remember from university and from the network courses: avoid having a single point of failure. And this is exactly what they had here.

Denis Ledoux:

This is exactly that.

Olivier Colson:

It's too bad that something went wrong with it while they had a plan for a backup, but it wasn't there yet.

Denis Ledoux:

Yeah. And I even remember that we were a bit stressed because, sure, the services were interrupted and we just didn't know what to do. Having a look at the status page of the data center company, we saw that it was an optic fiber cut in the north of the lake. And I read it wrong and I thought it was going through a lake, and I was already thinking there would be scuba divers with their tools repairing the cable at the bottom of the lake. Not at all. But it was kind of funny. So we had to decide what to do. Basically we were looking at the status page of the data center to see how things progressed, and we had to decide if we put our disaster recovery plan into action or if we just waited. And at some point, after three hours of interruption, the data center said it should be back up in two hours. And we decided that, okay, two hours is better than losing customer data plus the time we would need to recover all the databases. So we just waited for it, and that's it.

Olivier Colson:

And it ended well in the end?

Denis Ledoux:

Yeah. Basically, that backup optic fiber that they had planned, they just rushed the production of it, and so it was there in two hours.

Olivier Colson:

Do such fiber cuts occur often, and always with consequences like that?

Denis Ledoux:

The same consequences? I wouldn't say so, because this one was like five hours. But we have known other fiber cuts, and they lasted like two hours, or maybe 20 minutes sometimes. But yes, we have known a lot of fiber cuts. For instance, for that data center in particular, we had four of them in the same year, and the same customer was hit by those fiber cuts each time. So it was like, does that always happen, four fiber cuts in the same year?

Olivier Colson:

I guess for Odoo's reputation it's not good when it happens to the same customer, because it's hard explaining to people, But you know, it's not us, it's our provider. Because at the end of the day, for them, it's like you use the Internet every day. You don't think about the data center, you don't think about the servers. For you, it's something magic going on in your phone or your computer, and you expect it to just work.

Denis Ledoux:

Exactly. And you expect that it will never fail. You trust the data center? Sure. And then it happens.

Olivier Colson:

And then it happens.

Denis Ledoux:

And also, we ensure an availability of 99.9%, like I said, so 45 minutes a month. And for that customer, it was more than that for a few months. So this was kind of surprising for him.

Olivier Colson:

Okay. So you had to explain that, you know, it's the data center.

Denis Ledoux:

I remember in the ticket that I had to explain that, okay, this is very exceptional. This never happens, and it actually never happens. This is the only year we had so many fiber cuts.

Olivier Colson:

Do you have another story like that, with a hardware problem that impacted our servers?

Denis Ledoux:

Yes, I do, of course.

Olivier Colson:

Tell us.

Denis Ledoux:

So as I said earlier, we have a master and slave mechanism. If the master goes down, we can switch to the slave quickly. And what happened one time is that we were really unlucky: the master and the slave were hosted in the same rack, on the same switch, on the same router, and the failure was the router. So the switch goes down and then both the master and slave go down. We tried our best to find a good mechanism and then, just out of bad luck, both go down at the same time. So it happened. Thankfully, that time the switch was back on in ten minutes, so it was an interruption of ten minutes. But it is still funny to tell: you think you are good because you have a master and a slave, and then suddenly only that rack of the data center goes down, and your master and slave are on the same switch. And so they both go down at the same time. It's just bad luck.
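One way to catch this kind of co-location before it causes an outage is to check where each pair is placed. The sketch below is a hypothetical illustration: the `Placement` fields and the identifiers are assumptions, not a description of how Odoo tracks its servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    server: str
    datacenter: str
    rack: str
    switch: str


def shared_failure_points(primary: Placement, replica: Placement) -> list[str]:
    """List the infrastructure components that both servers depend on."""
    if primary.datacenter != replica.datacenter:
        # Rack and switch identifiers only mean something within one data center.
        return []
    shared = ["datacenter"]
    shared += [
        field
        for field in ("rack", "switch")
        if getattr(primary, field) == getattr(replica, field)
    ]
    return shared


# The unlucky layout from the story: same rack, same switch.
unlucky = shared_failure_points(
    Placement("master", "dc-1", "rack-12", "switch-3"),
    Placement("slave", "dc-1", "rack-12", "switch-3"),
)
# A safer layout: the replica lives in another data center.
safer = shared_failure_points(
    Placement("master", "dc-1", "rack-12", "switch-3"),
    Placement("slave", "dc-2", "rack-04", "switch-1"),
)
print(unlucky, safer)
```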

Olivier Colson:

As we were saying, a data center is a huge structure with a huge complexity behind it, a lot of cables and of things to handle, and everything can go wrong. So just to be clear to people, because we are saying, oh, this failed with the data center, and this failed as well: of course it does, because it's super complex, and not everything can go fine all the time, everywhere in the whole thing. So yeah, it happens. That's it.

Denis Ledoux:

Hardware failures are just inevitable. So you need to be prepared, and that's it.

Olivier Colson:

Do you maybe have a story about an attack that occurred on our servers that you can talk about, or is it confidential?

Denis Ledoux:

There are some parts that are confidential, but the big picture, I can.

Olivier Colson:

Okay. So tell us what's not top secret. Yeah.

Denis Ledoux:

So during the pandemic, we had a hacker that tried to DDoS us, and he didn't just try. He successfully did it.

Olivier Colson:

So could you maybe remind people what a DDoS is, just to be sure everybody gets it?

Denis Ledoux:

Okay, so in this case, it's sending a lot of requests to your web server, so many requests that you cannot handle them. So the server goes down out of performance: it slows down because it just receives too many requests. And in this case, it was a botnet. Basically, that hacker was able to take control of a lot of computers in different parts of the world. It could be your mom's computer, you don't know. And then he starts to send a lot of requests to your web server, and you just cannot ban the computers sending requests because they come from all kinds of IP ranges from all over the world. So you cannot just ban those IPs. This is one of the hardest types of attack to counter because you cannot do a lot of things: it could be actual customers, you cannot know. Only the amount of requests you receive is unusual.

Olivier Colson:

And so as your infrastructure is not expecting that much traffic, it just can't respond anymore to regular customers and actual people wanting to connect to it. Exactly. And so what happened?

Denis Ledoux:

And so what happened? The first thing is that we received a tweet. Just a tweet, I think. I don't think it was a direct message. So it's a guy telling us, okay, I have seen that you are under attack. I will be there to rescue you.

Olivier Colson:

So. Wow, Superman.

Denis Ledoux:

Superman. He says that he is the hero in the story, but he will help us in exchange for $3,000 in Bitcoin.

Olivier Colson:

Less Superman.

Denis Ledoux:

Superman, it becomes more shady. So obviously, he was the attacker, and he wasn't attacking only us. Just before us, he attacked another company somewhere else in the world. And that other company contacted us to tell us, okay, we've just been attacked by this guy, here is what you can do. This is already very nice of that other company.

Olivier Colson:

Very good that it did that.

Denis Ledoux:

And one of the things that company said is that you can talk to that hacker, you can try to reverse social engineer him directly. So basically, while other people were trying to figure out a technical solution to the problem, I was in charge of just talking to the hacker and trying to make him stop.

Olivier Colson:

And did it work?

Denis Ledoux:

It worked. So I said on Twitter, Hi, what are you doing? He hadn't checked what kind of company we were. He didn't see that we were a computer company doing software and everything, so he didn't expect us to have computer engineers. So I just pretended that we were a simple company with no knowledge of computer science.

Olivier Colson:

You were playing the idiot.

Denis Ledoux:

I played the idiot and it worked. So at some point, he was telling, You have to pay me Bitcoin. I was like, What? What is bitcoin? I don't know how to pay Bitcoin. So how does it work? And then at some point, I even sent him a quote from Odoo directly. So can you sign this quote please?

Olivier Colson:

How long did it last?

Denis Ledoux:

It lasted five hours.

Olivier Colson:

It's crazy because it reminds me that there are a bunch of YouTubers doing this kind of thing. You know, on the phone, calling spammers and just making it last as long as possible. And it takes hours all the time. It's crazy how long it can take and how long they can just keep on trying. If they just took some distance, they could see that the guy is clearly making fun of them. And so here it's good, because while he's talking with you, he's not trying to improve his attack or react to what you're doing or whatever.

Denis Ledoux:

So at some point I said, I need proof that it's you who is doing that. And I said, Can you stop your attack? So he stopped for 30 minutes, something like that. And during those 30 minutes we were able to get back to our servers. Olivier and Julien, other employees at Odoo, were looking for a technical solution, and thanks to that interruption, they were able to find the actual issue, which was a bit technical. Basically, the load balancer was configured to handle requests on only one core, while it could use multiple cores. So once they enabled the server to handle requests using multiple cores, it was solved. And whatever he was doing afterwards, whatever number of requests he was sending, the server could handle it just fine.

Olivier Colson:

Crazy.

Denis Ledoux:

So it's thanks to that interruption of 30 minutes to one hour, something like that.

Olivier Colson:

And afterwards, when he realized that you had found a way to just stop the attack, did he send something more, or did he tell you, Oh, you took me for an idiot? Or was it just no message at all anymore?

Denis Ledoux:

The thing is that it was in the middle of the night, and we think that guy was German. So at some point he just went to sleep. And when he got back from his sleep, the problem was solved. So he just said hi, hi, hi. And I never responded. And that's it.

Olivier Colson:

Oh, you lost a friend?

Denis Ledoux:

Yeah, I lost a friend.

Olivier Colson:

So it's interesting because when you think about attacks, you don't think about the human side of them. So here it's clearly there was a guy behind it and just using it as a diversion and talking with him helped a lot. Really?

Denis Ledoux:

Yes, it did. And this is thanks to that other company that told us that we could talk to that hacker and try to make it stop.

Olivier Colson:

Well, actually, yeah, because it was really kind of them to just warn you because they just could have.

Denis Ledoux:

And I don't think that without that other company, we would have thought to talk to the guy. We were just trying to find a technical solution, not to talk to the hacker directly.

Olivier Colson:

But next time it happens, it's the first thing you're going to try, right?

Denis Ledoux:

Maybe.

Olivier Colson:

Okay. Now that we've explained what Odoo is doing about that, would you maybe have advice for people who handle a bunch of servers? Their service is going well, it's growing, they need more servers and they need to deploy things. They know that they need to do something about damage control to ensure that the service can go smoothly if, as we say, something goes wrong, or rather when something goes wrong. Would you have advice for them?

Denis Ledoux:

I think that the first piece of advice is to take the matter seriously and not neglect it. Most of the time, because you do not think it's an imminent threat, like it won't happen today, you will think, "Okay, I will do it later, because today I need to reply to that customer or I need to get that customer to pay for Odoo." You don't think about the future and what could happen, so you procrastinate and put the fix off until later. And this is not something you should do. You should do it now because it will happen. You know it will happen. It's not because it's not happening today that it won't happen in the near future.

Olivier Colson:

So it's not because something seems more urgent that it actually is.

Denis Ledoux:

It doesn't seem urgent because it won't happen today, but it is urgent that you take measures. And then another piece of advice is to do what we do with our monitoring service. When you have only one server, it's okay, because if it goes down, you see immediately that it went down. But if you have ten servers, 100 servers, 1000 servers, you cannot know which server is going down. So you need mechanisms to be able to know when a server goes down or when something happens, like poor performance and things like that. And one other example I have in mind is a data center that just burned: the OVH data center burned.
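
As an illustration of the monitoring point above (not something shown in the episode, and not Odoo's actual monitoring service), here is a minimal Python sketch of one possible mechanism: every server reports a heartbeat, and any server that has been silent for too long is flagged. The `Heartbeat` type, the server names, and the 60-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    server: str        # hypothetical server name
    last_seen: float   # Unix timestamp of the last successful report


def find_down_servers(heartbeats, now, timeout_seconds=60.0):
    """Return the sorted names of servers silent for longer than the timeout."""
    return sorted(
        hb.server for hb in heartbeats if now - hb.last_seen > timeout_seconds
    )


# Three hypothetical servers; web-02 last reported 110 seconds ago.
fleet = [
    Heartbeat("web-01", 1000.0),
    Heartbeat("web-02", 900.0),
    Heartbeat("db-01", 995.0),
]
print(find_down_servers(fleet, now=1010.0))  # ['web-02']
```

With one server you would notice the outage yourself. With hundreds, a check like this, run on a schedule, is what tells you which one went quiet.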

Olivier Colson:

That was a big story.

Denis Ledoux:

That was a big story at the time. And there were a lot of people who didn't do any backups. They were trusting the data center and didn't do any backups. So that day a lot of people lost their data because of that fire in that data center. So this is one example of why you should do backups. A lot of people don't think about those backups, but you should do them. And in this case, it even shows that you need to do backups in other data centers, because if you did the backup in the same data center, it would have burned as well. So you need to do backups in different places, ideally on different continents.
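
To make the "different places" idea concrete, here is a small Python sketch, offered as an illustration only and not as something described in the episode. It copies one backup file to several destinations and verifies each copy with a checksum. The destinations are plain directories standing in for storage in other data centers, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(source, destination_dirs):
    """Copy one backup file to every destination and verify each copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in destination_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        # A copy that does not match the original is not a usable backup.
        if sha256_of(target) != expected:
            raise IOError(f"corrupted copy at {target}")
        copies.append(target)
    return copies
```

In a real setup each destination would be remote storage in another data center rather than a local directory. The part worth keeping is that every copy is checked rather than assumed to be good.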

Olivier Colson:

Yeah. At the end of the day, we could summarize it by saying that people maybe need to keep in mind that, you know, we're always talking about the cloud, and it's a marketing term that works really well. But in practice, the cloud is not as resilient or as magical as the image of a cloud of data suggests. It's a huge set of machines working together, with all the things that can go wrong, including, yeah, fire.

Denis Ledoux:

And you wouldn't think that a data center can catch fire, but it can. It happened. And then it makes me think of another example. It's the GitLab example: they made a mistake and they lost their production database. And so they had to use their backups. They thought they had backups, like us: "We have backups, so if something happens, we can just restore them." The thing is, they couldn't use those backups. They couldn't restore them. So they thought they had backups, but in reality, they couldn't restore them.

Olivier Colson:

How come? Was there a bug in their script, or something wrongly encrypted in the backups, I guess?

Denis Ledoux:

I don't remember the...

Olivier Colson:

It was really something they couldn't do. It was not just a dot missing in a file. It was really the backups themselves.

Denis Ledoux:

The backups themselves were not restorable. That's it. I don't remember the full details, but that's it. They couldn't use those backups to restore the services. And so that day they lost 300 GB of data. So this is quite something. And so one piece of advice is that even if you have backups, you need to train to see that you can restore them, because backups don't matter if you cannot restore them.
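
One way to picture such a restore drill is the following Python sketch. It is an illustration only: it is not GitLab's or Odoo's procedure, and it uses a small SQLite database from the standard library as a stand-in for a production database. The function names, the table name, and the expected row count are hypothetical.

```python
import sqlite3


def make_backup(live, backup_path):
    """Write a consistent copy of the live database to backup_path."""
    target = sqlite3.connect(backup_path)
    try:
        live.backup(target)
    finally:
        target.close()


def restore_drill(backup_path, table, expected_rows):
    """Open the backup as if restoring it and check the data is really there.

    `table` is a trusted, hard-coded name in this sketch, never user input.
    """
    restored = sqlite3.connect(backup_path)
    try:
        ok = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        count = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    except sqlite3.DatabaseError:
        # Unreadable file or missing table: the backup is not restorable.
        return False
    finally:
        restored.close()
    return ok and count == expected_rows
```

The drill only passes if the backup can be opened, passes SQLite's integrity check, and contains the expected number of rows. A backup file that merely exists on disk proves nothing, which is the point made above.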

Olivier Colson:

And it's something that might be true for, I guess, all the failure plans you can have, not only backups. I mean, you have to check that they actually work.

Denis Ledoux:

Exactly.

Olivier Colson:

Are there other companies that do this kind of training that you know about, and what do they do for that? Because it's easy to say you need to train. Okay, but how? Because you're not going to break your own servers, right? You're not going to set fire to the data center just to check, "Oh, is it working?"

Denis Ledoux:

So the first thing I will give as an example is us. We have a disaster recovery plan and we try it regularly. For instance, we have an employee who is in charge of restoring a whole server using the backups. So we have that kind of disaster recovery training, and this is important to have. And then another example is the Chaos Monkey of Netflix.

Olivier Colson:

The name is super catchy.

Denis Ledoux:

It's super catchy. And so basically, for the Chaos Monkey at Netflix, you can picture a monkey with a hammer just hitting the servers. This is the idea behind the Chaos Monkey. Basically, it's in charge of shutting down servers in the production environment. It just shuts things down, and the services must be able to recover from it by themselves. So it's permanent training in production. Netflix is doing that. We don't do it yet, but we would like to have some kind of chaos monkey.
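
The idea can be sketched as a toy simulation in Python. This is not Netflix's Chaos Monkey and does not touch real servers. It is an in-memory illustration in which one replica is shut down at random each round and the service is expected to keep answering. All class and function names are hypothetical.

```python
import random


class Server:
    def __init__(self, name):
        self.name = name
        self.alive = True


class Service:
    """A toy service that stays up as long as one replica is alive."""

    def __init__(self, servers):
        self.servers = servers

    def handle_request(self):
        for server in self.servers:
            if server.alive:
                return f"served by {server.name}"
        raise RuntimeError("service unavailable: every replica is down")

    def heal(self):
        """Restart dead replicas, as an orchestrator would; return how many."""
        dead = [s for s in self.servers if not s.alive]
        for server in dead:
            server.alive = True
        return len(dead)


def chaos_monkey(servers, rng):
    """Shut down one randomly chosen live server; return it, or None."""
    candidates = [s for s in servers if s.alive]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


def run_drill(service, rounds, rng):
    """Kill one server per round and check the service keeps answering."""
    for _ in range(rounds):
        chaos_monkey(service.servers, rng)
        service.handle_request()  # raises if the service did not survive
        service.heal()
    return True
```

Calling `run_drill(Service([Server("a"), Server("b"), Server("c")]), 20, random.Random(0))` completes without error because two replicas always remain. The same drill on a service with a single server raises immediately, which is exactly the weakness this kind of exercise is meant to expose.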

Olivier Colson:

Okay. So you sabotage your own thing. It's, uh... But I mean, it makes sense.

Denis Ledoux:

Yes, it makes sense. And that chaos, in reality, can come from humans. It can come from hackers. So human error, hackers, or system failures can cause chaos in your enterprise, and you need to train and be prepared for them.

Olivier Colson:

Okay. So we are reaching the end of the episode. So would you have maybe a phrase, a sentence, a quote to summarize everything we just learned today?

Denis Ledoux:

I would say, quoting John Cena, if you don't learn from your mistakes, then they become regrets. And for instance, this is true for that burning data center. Those people who didn't learn that they need to make backups lost their data that day. And this is a great, great regret.

Olivier Colson:

Okay. So we learned a lot of things today, from data centers catching fire to attackers talking a lot on Twitter. It was super interesting to see behind the scenes of what's going on with all those magical clouds. Thank you for your answers, Denis.

Denis Ledoux:

Thank you for hosting me, Olivier. Bye, everyone.

Olivier Colson:

I hope you grasped the importance of how good damage control can avoid a total apocalypse of your service. If you would like to learn more about how you can protect your infrastructure, stay tuned as we will soon be releasing an exclusive episode on software security. Hopefully this episode gets backed up on multiple servers across the world so that you can share it with a maximum number of people. Until then, see you next time. Cheers.