What Does a Data Engineer Actually Do? (Explained Simply)

I just started reading Fundamentals of Data Engineering by Joe Reis and Matt Housley. I'm only a little way into it, so this isn't a review of the book. It's me trying to explain, in the simplest way I can, what the job it describes actually is, and why a tool like Arc exists to make that job easier.

Imagine a business as a big, noisy kitchen

Picture a restaurant kitchen during the dinner rush. Orders are flying in from everywhere: the front counter, the drive-through, the delivery app. Ingredients are coming in from different trucks at different times. Pots are boiling, the grill is going, plates are stacking up.

Now imagine someone has to stand in the middle of all that and answer one simple question for the owner: "How is the kitchen actually doing right now?"

To answer that, they can't just stare at the chaos. They have to:

Catch everything coming in. Every order, every ingredient delivery, every ticket.
Put it somewhere safe. Not just thrown in a pile, but stored so it can be found again later.
Organize it into something that makes sense. Turn "47 random tickets" into "we sold 12 burgers, 8 salads, and ran out of fries twice tonight."
Hand that organized picture to the owner, so they can actually make a decision, like ordering more potatoes.

That's the whole job of a data engineer, just with computers instead of a kitchen. Data (clicks, sensors, transactions, logs) comes pouring in from everywhere, all the time. Someone has to catch it, store it safely, organize it, and hand business leaders a clear picture of what's actually happening. The book calls this the data engineering lifecycle: data is created, it's ingested (caught), it's stored, it's transformed (organized), and finally it's served (handed to the people who need it).
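To make those stages concrete, here is a minimal sketch of the lifecycle in Python using the kitchen example. Everything in it is an assumption for illustration: the ticket data and the function names (ingest, store, transform, serve) are hypothetical, and the code comes from neither the book nor Arc. A real pipeline would use queues, files, and databases rather than an in-memory list.

```python
from collections import Counter

# Hypothetical order tickets standing in for raw events (illustrative data only).
raw_tickets = [
    {"source": "counter", "item": "burger"},
    {"source": "drive-through", "item": "salad"},
    {"source": "delivery-app", "item": "burger"},
    {"source": "counter", "item": "fries"},
    {"source": "counter", "item": None},  # a malformed ticket
]

def ingest(events):
    """Catch incoming events, skipping any that have no item."""
    return [e for e in events if e.get("item")]

def store(events, storage):
    """Put events somewhere they can be found again (here, just a list)."""
    storage.extend(events)
    return storage

def transform(storage):
    """Organize raw tickets into counts per item."""
    return Counter(e["item"] for e in storage)

def serve(summary):
    """Hand a readable summary to whoever needs to decide something."""
    return ", ".join(f"{count} {item}" for item, count in sorted(summary.items()))

storage = []
store(ingest(raw_tickets), storage)
report = serve(transform(storage))
print(report)
```

Each function maps to one stage, so the chaos of raw tickets becomes a single line the owner can read.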

Why this job is harder than it sounds

Here's the part that doesn't show up in the kitchen metaphor right away: most teams don't have a "catching the data" problem. They have a "storing too much of it, forever, for no reason" problem.

Going back to the kitchen, imagine if nobody ever threw out the old order tickets. After ten years, there's a mountain of paper in the back room. Finding tonight's numbers means digging through a decade of receipts first. The kitchen isn't broken. It's just buried.

This is exactly what happens to a lot of companies' data. Years of information pile up because nobody ever decided what to keep and what to let go of. Looking for today's answer means sifting through years of stuff nobody needed.

Where Arc fits into this

Arc exists for one simple reason: make it cheap and easy to store everything, and keep it queryable, so the engineer never has to make that ugly choice between "keep it all and pay a fortune" and "delete it and hope you don't need it later."

A few ways it does that, in plain terms:

It writes data in an open format (Parquet) on storage you already own. Like keeping your receipts in a normal filing cabinet you control, instead of paying a storage company a monthly fee just to look at your own paper.
It's fast enough to catch a huge amount of data without falling behind. Like a kitchen that can take orders during the busiest rush without ever telling a customer "sorry, we're too slammed to take that."
It speaks standard SQL, so the engineer doesn't need to learn a whole new language just to ask "how much did we sell last Tuesday."
No lock-in. The data is always yours, in a format anyone can open, not trapped inside one company's system.
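As a rough illustration of the standard SQL point, here is what asking "how much did we sell last Tuesday" can look like. This sketch uses Python's built-in sqlite3 module as a stand-in, not Arc itself; the sales table, its columns, and the sample rows are all made up for the example, and Arc's own connection details are not shown here.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for any SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_on TEXT, item TEXT, amount REAL)")

# Hypothetical rows; 2024-01-02 was a Tuesday.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-02", "burger", 9.50),
        ("2024-01-02", "salad", 7.00),
        ("2024-01-03", "burger", 9.50),
    ],
)

# The question in plain SQL: total sales for one day.
total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE sold_on = ?",
    ("2024-01-02",),
).fetchone()[0]
print(total)
conn.close()
```

The query itself is ordinary SQL, which is the point: the question reads almost the same way the owner would ask it.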

Why I think this matters

A data engineer's real job isn't moving data around for its own sake. It's making sure the business can actually see itself clearly. Arc's whole mission, as I understand it so far, is to take the most painful, expensive, repetitive part of that job (storing huge amounts of data safely and cheaply) and make it boring and simple, so the engineer can spend their time on the part that actually matters: helping people understand what's happening in their own business.

The real problem isn't the data. It's the habit.

Here's what I keep noticing as I talk to teams: most companies aren't paying a fortune for storage because they truly need to keep everything forever. They're paying for it because of how the system was set up years ago, and nobody's gone back to question it since. It became "just how things are done here": not a real decision, just inertia.

That's the expensive part. Not the data itself, but the habit of never re-examining it.

With the right support, most teams can cut that cost dramatically, not by throwing data away, but by storing it the smart way instead of the expensive way.

If that sounds familiar, if your team is storing years of data the same way it always has, without anyone asking why, that's worth a conversation. I'm happy to look at it with you and see where the savings actually are.

I'm early in the book, and I expect my view here will get sharper as I keep reading. But even this early, it's already given me a simpler way to explain what I sell, and why it matters.

Diego Reyes, Enterprise Data Solutions Consultant
If your team is storing years of data the same way it always has, let's look at where the savings actually are. Connect with me on LinkedIn: https://www.linkedin.com/in/diego-reyes-penaloza/
