KubeCon 2023: Cloud-Native Object Storage

Speaker 1: This is Techstrong TV.

Mitch Ashley: Hey, welcome back. We are here at KubeCon in Chicago 2023, and it’s been a busy day. We’ve had a great day. A lot of great people that we’ve talked to. We’re just on our last conversation. And we’ll be here tomorrow, so don’t go away. We’ll be back. But I’ve had the pleasure of being joined by Daniel Valdivia. Daniel is an engineer with MinIO.

Daniel Valdivia: That is correct.

Mitch Ashley: Good to be talking with you.

Daniel Valdivia: Yeah, thanks for having me.

Mitch Ashley: And not to give you more titles, but you’re one of the founders, the originators of the technology, right? Inventor?

Daniel Valdivia: Well, I’m responsible for Kubernetes products, and that could be understood as I’m one of the ones who wrote the operator for MinIO. So I’m responsible for Kubernetes stuff. So you could argue anything that has to do with Kubernetes, I wrote it.

Mitch Ashley: Okay. I know you’re going to try and claim all the credit, but give credit where credit is due, right?

Daniel Valdivia: Yeah.

Mitch Ashley: So if there are folks that don’t know about MinIO, talk about what MinIO does.

Daniel Valdivia: So MinIO is cloud native object storage. For those who don’t know what object storage is, it’s just the simplest form of storage that’s meant to be consumed over the internet. MinIO is built for scale and speed. So when you see enterprises and companies that need to store large amounts of data, they look to object storage, and particularly to MinIO because it’s so simple to run, so simple to operate, and that’s what we are going after. We want companies and teams to be comfortable running storage. So they want to look for a solution like MinIO that says it’s so easy, mind-blowingly simple. I just put my files out there and then I just consume them. And that’s it. That’s what object storage is, right? A solution for storing a large amount of data.

Mitch Ashley: It’s almost like NoSQL for programming, object storage for storage. We used to think of it as blob. Now it’s objects, right?

Daniel Valdivia: That is correct.

Mitch Ashley: It can contain anything, right?

Daniel Valdivia: Yes.

Mitch Ashley: And very large.

Daniel Valdivia: And very large, yeah.

Mitch Ashley: It’s not limited to a certain character size or data size.

Daniel Valdivia: And then, one way to reason about it, why are we not calling it File? Because when File started, it was locally on your drive. And Object is meant to be stored in a distributed environment over the internet. So that’s probably the main difference.

Mitch Ashley: You can’t talk about applications, the cloud, virtually anything without the data. That’s what runs everything, right?

Daniel Valdivia: Runs everything.

Mitch Ashley: The fuel for our applications.

Daniel Valdivia: The way I like to see it is, everyone’s building all these great products and I see them as cars. We build roads because everyone needs storage. Even if it’s a NoSQL database, they need to back up that data somewhere. And object storage is the place to put those backups.

Mitch Ashley: Very good. So in the Kubernetes world, we’re obviously here, there’s a lot of interest in data and databases, even databases sometimes in Kubernetes. Talk about how you fit into the cloud native ecosystem of applications, cloud native, database technology, distributed, processing and data.

Daniel Valdivia: So from day one, we built MinIO to run in a distributed environment. But not so much as to only make it run in containers. The MinIO binary is so small, a hundred megabyte binary, that you can run it on bare metal, on virtual machines, or in containers. And the idea was that, okay, we don’t want to force you to say, oh, if you want to run this storage, you have to buy this appliance from us. When it comes to the cloud, you don’t know what you’re going to be running on. If it’s on a cloud provider, it’s on their infrastructure. If it’s your own Kubernetes, it’s on the hardware that you decided to buy, and you buy the servers that you want.

So we want a storage solution that’s truly cloud native. That means it will run on anything that you have available. So let’s say you went with the hardware vendor that gave you the best deal. You want to be able to set up your own storage that way. So you go and set up Kubernetes, set up the MinIO Operator, and start your own object storage company if you want. We want to make it that simple, and that’s what it means to actually be cloud native: be agnostic of the infrastructure, not tying you down to an appliance or any particular hardware.

Mitch Ashley: Yeah, very different than the “I buy this NAS and so I manage the data a very certain way,” or whatever product. It seems also that what we containerize can be any variety of things, right? Data, applications, infrastructure, full apps, microservices, and some of that information could be pulled out of a database, out of object storage as well. So you mentioned the analogy of the file server. It seems like object storage sort of abstracts the underlying environment that this is running on, whether it’s a file server or a NAS or whatever in the world you want behind it. Or a cloud service, it’s still object storage to you.

Daniel Valdivia: It’s still object storage. Yeah, exactly. And the applications are being built with that in mind. At some point, if you’re running, let’s say, some big data pipeline on top of Spark, Spark really doesn’t care if it’s pulling data from HDFS or some object storage. In this case it could be MinIO, it could be a cloud provider object store. So these pipelines are agnostic. They just want to load the data and then go do their own job. So that’s where MinIO comes into play, in letting you set up the infrastructure yourself.

Mitch Ashley: Now you work with other database technologies, not necessarily as a replacement, right?

Daniel Valdivia: What we’re seeing is data scaling at an unprecedented pace. So database vendors, as you know with relational databases, start having problems when they go beyond a certain terabyte capacity. But what we’re seeing is a lot of database vendors actually embracing object storage. What they’re doing is they’re offloading their tables into object storage. And when you need to run certain types of queries or analytic workloads that are meant to run on petabytes and petabytes of data, the database will actually go and retrieve its own write-ahead logs straight from object storage, do its own query, and then give you the result. And we see that trend coming. Databases are actually evolving to work with these massive amounts of data, the scale of the data. And that wouldn’t be possible if you didn’t have a place to put the data at that scale.

Mitch Ashley: Interesting. It’s almost like the tiered approach to hardware data storage, but from a software standpoint, for the database, right? In memory, on disk, whatever the disk is, and then object.

Daniel Valdivia: Exactly.

Mitch Ashley: Pull in large amounts of data.

Daniel Valdivia: That’s definitely the right analogy. So just as people were doing on top of hardware, now they’re doing it on the data side. So you see also, for example, now that all the AI world is taking off, you see these companies that are like, okay, we have all these data sets, but the data sets are massive. They keep growing month to month. But we only need three months to train the latest and greatest machine learning model. And when the data is aging, we need to tier it off to some other type of storage.

So object storage can be both the hot tier and the warm tier, in the sense that you can set up a MinIO tenant with the latest and greatest NVMe drives that satisfy your training needs. When you’re training a machine learning algorithm, it needs to go through every single part of the dataset. But as the data is aging, you want to be able to tier that into another object storage. It can also be MinIO that’s built for cheap and deep storage on top of hard disk drives. And all of this needs to be transparent to the applications, because the application doesn’t want to know that it needs to go to two separate places. I just want to read. Right? I just want to read my files, and this is what’s actually becoming the trend with modern data lakes.
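The age-based tiering described above can be illustrated with a minimal sketch. It is not MinIO’s lifecycle API; the `StoredObject` type, the `choose_tier` and `plan_transitions` helpers, and the 90-day cutoff (taken loosely from the “three months” mentioned in the interview) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical cutoff: roughly the "three months" of training data mentioned above.
HOT_WINDOW = timedelta(days=90)


@dataclass
class StoredObject:
    """Illustrative stand-in for an object's metadata (not a real MinIO type)."""
    key: str
    last_modified: date


def choose_tier(obj: StoredObject, today: date) -> str:
    """Return 'hot' for objects inside the window and 'warm' for older ones."""
    age = today - obj.last_modified
    return "hot" if age <= HOT_WINDOW else "warm"


def plan_transitions(objects, today: date):
    """List the keys that have aged out of the hot tier and should move to warm."""
    return [o.key for o in objects if choose_tier(o, today) == "warm"]
```

In a real deployment the application would keep reading from one endpoint while the storage layer applies a rule like this in the background, which is the transparency the speaker is describing.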

Mitch Ashley: Makes a lot of sense. Why object storage? I mean, we’re all very used to structured data in many forms. And AI systems can work with structured and unstructured data. Why is less structured or unstructured data important?

Daniel Valdivia: So the interesting thing is that roughly 80% of the data that we see placed on object storage is unstructured data. It’s machine generated. Companies have all this telemetry, all these logs, all this data about their business, and they don’t know what to do with it yet. So they start storing it. That’s why most of this data is unstructured, right? All the structured data still relies on relational databases, and transactional operations still belong to the databases. But with all this unstructured data, sometimes companies are like, “Ah, I wish I had the data so I could come and predict this behavior so I could improve my e-commerce business.” And if they kept the data in object storage, then they can be like, “Oh, now I can write a pipeline and go and deduce this new metric that I want to deduce.” So that’s why companies are just dumping all this data into object storage. And either they have a valid use case for it now, or they don’t know yet and want to develop a new business case for it later.

Mitch Ashley: I think it’d be easy and inaccurate to conclude that just because there’s a large amount of data in object storage, it’s slow. You can actually get high performance out of that data as well. How do you do that? How do you deliver the kind of performance that applications demand today?

Daniel Valdivia: So I was mentioning that by nature we built MinIO to be highly distributed. So that already tells you that we are actually aggregating the throughput of all these machines, of all these drives. And on top of that, we went to great lengths to optimize the parts that matter. For example, to protect the data, MinIO uses erasure coding. This means you’ll give me some file and I’ll chunk it into data, then generate some parity and store it separately.

To do that, that’s computationally expensive. But we actually went to great lengths to write everything in assembly for these heavy-lifting operations, encryption, TLS, and erasure coding, so that the hardware is not in the way when we are actually trying to store data or retrieve data at very high speeds. So the advantage of doing that is we can grab all these nice instructions. For example, on x86 architecture, we have [inaudible 00:10:07] instructions. On Arm, we have NEON instructions. And all these instructions, they compute very fast. So we wanted to be able to leverage those so that when you actually want to retrieve a file, we can pull the file as fast as the drive is letting us pull it.
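The idea of chunking a file into data and generating parity can be illustrated with a small sketch. This swaps in the simplest possible scheme, a single XOR parity shard that can rebuild exactly one lost shard; production erasure codes such as Reed-Solomon tolerate multiple losses and are far more optimized. The function names `encode` and `recover` are hypothetical and are not MinIO’s API.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int):
    """Split data into k equal-size shards (zero padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards, parity


def recover(shards, parity: bytes, lost: int) -> bytes:
    """Rebuild the shard at index `lost` from the surviving shards and the parity."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    # XOR of parity with every surviving shard cancels them out, leaving the lost one.
    return reduce(xor_bytes, survivors, parity)
```

If each shard and the parity live on a different drive, losing any one drive still leaves enough information to rebuild the file, which is the property the speaker relies on when he later talks about healing data onto a replacement server.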

Mitch Ashley: Almost just pulling data off a disk essentially, right?

Daniel Valdivia: Yes.

Mitch Ashley: What data do you want to pull off? What do you see happening in your space? I mean, over the next… I don’t want a five-year prediction, but what do you think over the next six months or more? What’s happening in your part of the market and what kind of things should we be on the lookout for?

Daniel Valdivia: So we’re seeing two main trends. First, companies are now trying to consolidate their data lakes, like, why do I need to have different storage technologies for different problems when I can just have one object store that works great? A single namespace, everything I need is there, all the security, all the tiering, all the encryption. And then just use that. So first we see companies consolidating their data lakes for analytics, for AI, for databases as well, for all sorts of systems. We even see streaming services building on top of object storage now because they discovered that it’s actually cheaper to run it themselves. And the second one, we see the repatriation of data from the cloud back on-premises, right? And Kubernetes is driving that because people are noticing, now I can just throw in a bunch of servers, put Kubernetes on top of them, and then orchestrate everything very simply.

And our being native to Kubernetes as well makes it trivial for people to say, “Okay, now I can run the show. Now I can bring my data.” Because on cloud providers, when you go beyond a certain scale, into the petabyte scale, the cost becomes prohibitively high, right? You could be buying a lot of servers every month with that bill. So when companies find this, they want to build their own data lake, and this is part of the first point. And then they start repatriating some of the data and keeping it in house because they discovered how easy it actually is to run your own storage infrastructure.

Mitch Ashley: You mentioned distribution; distributed data is one of the ways you get high performance. There’s a lot of emphasis on resiliency, taking in more than just uptime and five nines and mean time to recovery, but really trying to build it into the architecture or the infrastructure layer, the cloud service or the application or all the layers, finding ways to make it so that our applications or infrastructure don’t fall down when some fault occurs, or a cyber attack, or a bug in code. How does MinIO help with providing a more resilient infrastructure or application or both?

Daniel Valdivia: So we help you protect the data in many ways. For example, starting with data resiliency, there’s the fact that we are cloud native and hardware agnostic. What if one of your servers goes up in flames? You want to replace it as soon as possible. And because we’re not telling you to buy our box, you can just buy whatever box fits your need, put it in place, plug it back into the cluster, and we’ll start healing the data on top of that machine. So that’s one. Being hardware agnostic really, really helps in case you have these kinds of scenarios where you need to replace hardware or just expand. You don’t have to wait for us to procure it. You go with whoever can give you the servers that you need at the moment that you need them and then just expand your infrastructure like that.

So that’s one way we help you protect against individual hardware failure. When it comes to, for example, ransomware attacks, with object storage, because we have complete control over the data, you can actually set it up, and we have, for example, object locking to prevent data deletion in case there’s a ransomware attack. The attacker can try to go and encrypt everything, but if an object has locking or versioning, you can always go and restore the other version, or the file was not able to be deleted or changed at all. All data in object storage is immutable. So if a ransomware attacker wanted to really damage you, they would have to download the file, encrypt it, and upload it. That would be time consuming, so they will just try to delete it. But if you have versioning enabled, you can just restore the other version.
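The versioning behavior described above can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not MinIO’s implementation: the `VersionedBucket` class and its methods are hypothetical, a delete is modeled as a marker appended to the history, and object locking and retention are deliberately left out.

```python
class VersionedBucket:
    """Toy versioned bucket: overwrites and deletes append to history, never erase it."""

    def __init__(self):
        # key -> list of versions; None represents a delete marker
        self._history = {}

    def put(self, key: str, data: bytes) -> None:
        """Store a new version; older versions stay in the history."""
        self._history.setdefault(key, []).append(data)

    def delete(self, key: str) -> None:
        """Hide the object by appending a delete marker instead of removing data."""
        self._history.setdefault(key, []).append(None)

    def get(self, key: str) -> bytes:
        """Return the current version, or raise KeyError if missing or deleted."""
        versions = self._history.get(key)
        if not versions or versions[-1] is None:
            raise KeyError(key)
        return versions[-1]

    def restore(self, key: str, version_index: int) -> None:
        """Make an older version current again by appending a copy of it."""
        data = self._history[key][version_index]
        if data is None:
            raise ValueError("cannot restore a delete marker")
        self._history[key].append(data)
```

In this model an attacker who overwrites an object with encrypted bytes and then deletes it has only added two entries to the history; the clean version at index 0 can still be made current again with `restore`.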

And what we see also is, when you see all these companies building their own data lakes, backing up a hundred petabyte lake is very… Not impossible, but it’s very hard. So we see companies building multiple sites now. Sometimes they like building two data lakes, having them replicate active-active, an East Coast, West Coast kind of deployment, and building a third site just for data recovery. It’s just for recovery in case there’s such a type of attack. And then they set up the replication, and MinIO makes it trivial to set up the active-active plus a DR site kind of setup, so that you don’t have to worry about, okay, now I need to back up my hundred petabytes. There’s no backup. There’s another site with an exact replica of your data.

Mitch Ashley: How about the developer or the platform engineer experience? If you aren’t greenfield, you’re already running an application, an environment, a database, maybe object storage. How do you add to that environment without being disruptive?

Daniel Valdivia: So that’s a great question, because most of the MinIO adoption has been driven by the developers and the infrastructure people themselves, right? In traditional IT there’s always a team owning storage, and they’re very jealous of it. And that’s because of course they bought these expensive appliances, so they want to restrict how it’s used. But when it comes to a developer, he usually starts on his laptop and runs MinIO and says, “This works. I developed my application, tested it against the API locally, and now I want to deploy it.” Deploying MinIO could just be a single pod and a single PVC, so they can go to infrastructure and be like, “Deploy me this application.”

For the infrastructure people, this is just one more application, because it doesn’t need specialized hardware or anything. And that’s how it starts, through shadow IT. And then slowly companies are realizing, “Oh, we are relying too much on that service. What is that? Oh, that’s MinIO, that’s object storage.” And then they start formalizing it in their companies. So for infrastructure people and developers, the fact that they can run a copy of the software themselves, locally or inside their CI/CD pipeline, to actually harden the software and be like, okay, I’m testing against it and it’s working as expected all the time, is how we have actually gotten one foot in the door most of the time.

Mitch Ashley: Very cool. Lots of good stuff happening. It’s great to see. I remember when object databases were kind of a new thing and now object storage. So if somebody wants to kick the tires and try it out, Min.io? Is that correct?

Daniel Valdivia: Min.io. That is correct.

Mitch Ashley: Is there like a free account or tier that you can test things on or sandbox?

Daniel Valdivia: Yeah, actually we are open source. So you can go to our website and download the binary itself and run it on your laptop. We run on all kinds of architectures. And that binary is the one that you will use in production. And if you were to acquire an enterprise license, that would be the same binary that you run. Why would we do that? We are an upstream-first company, and the open source community actually contributes a lot of fixes. If we were to break something, within 30 minutes someone on GitHub will be complaining, “You guys broke this obscure application that needs this obscure corner case.”

Mitch Ashley: Now it affects everything. I mean, could have a huge impact.

Daniel Valdivia: And we immediately patch it, make a new release. We make releases multiple times a week if needed. And then that benefits everyone, both enterprise customers and the community. So that’s the best way to start. To go to our website, get the binary and start using it.

Mitch Ashley: Developers love free open source. No sales pressure, right? Let me try it out. Let me figure it out. If I like it…

Daniel Valdivia: Try it. Try it. Exactly.

Mitch Ashley: Good. Well Daniel, it’s been a lot of fun talking with you.

Daniel Valdivia: Thank you.

Mitch Ashley: Catching up. I think it was Valencia the last time I talked with someone from your company.

Daniel Valdivia: Yes.

Mitch Ashley: A lot of good things have happened. So have a good rest of the show. And people, if you want to check out MinIO, M-I-N-.-I-O. Super easy to get to. Sounds like it’s super easy to download. Pick the binary and poof, you’re running it. So thanks again.

Daniel Valdivia: Yeah, thanks for having me.

Mitch Ashley: It’s been great having everybody here, all of our guests, people like Daniel and from other great companies as well as MinIO. I hope you’ll be here tomorrow. I think we’re starting at 10:00 AM tomorrow. Looking over at our production crew, giving me the big thumbs up. That’s Central Time. So join us here on Techstrong TV and across all the Techstrong sites: DevOps.com, Container Journal, Security Boulevard, and Techstrong.TV. So it’s been our pleasure bringing this content to you today, these experts and thought leaders and information. We hope it’s been helpful. Reach out to us and let us know. And we have two more days of great stuff lined up. So we’ll see you back here. Same bat channel. Same bat station. We’ll see you tomorrow.


About the Author: Mitch Ashley