Concepts you’ll need to get a grasp of within your first months
Ever walked around exploring an area you have never been to and then you somehow end up in a street you know? I’m sure you know that feeling. It’s like unlocking new areas of a map in a game. That’s exactly how I think about theoretical concepts you’ve heard before or read in passing and then one day someone explains it to you and relates it to a practical use case and you have that satisfying “Aha” moment.
As a software engineer you would have worked with user-facing or backend applications and at most had some interaction with a database, maybe you’ve sent events/messages to a stream or an API but in general, the whole data pipeline and how data is used is a black box to you, an abstraction you don’t need to understand. You might have heard about warehouses, lakes, Kafka etc, but there are enough languages, frameworks, and user requirements to think about so why dig deeper into something that is someone else’s responsibility.
When I took a role as a data engineer I had a lot of those Aha moments where I was piecing things I’ve heard before with actual use cases and a deeper understanding of what it is.
Here are some of the concepts that crystallized in my head as I was learning on the job.
I have seen those concepts explained in a way where it’s presented as a choice — you can use either and there are just pros and cons to each.
However, that’s not 100% true. In an ideal scenario, in a big data-led company you’ll have all, you’ll need all and each has a specific purpose. They cannot be used interchangeably, well they canbut it will be a bit forced.
This is a word that has morphed into an all-encompassing term — for business/non-tech people it’s a magical place where the data is and the data itself. For software engineers that’s the software that stores and manages the information.
- Databases are used to store operational data ie data that you need for your application to work every day.
- Databases are OLTP (Online transactional processing), which means they allow real-time execution of large numbers of transactions by many people simultaneously.
- Databases are record-based storage , which means that even if you select a single column in your query the entire record/row must be opened on disk. They are optimized for frequent updates and writesnot as much for reads.
Within the database category, there’s a choice of multiple subcategories open-source vs proprietary ones, relational vs no-SQL ones, in the cloud vs on-premise. I am not going to go into details about this since it’s deserving of at least a whole article in itself but sticking to the basics let’s look at this:
- You’d use relational databases for structured data with a predefined schema. Some widely used and familiar examples are Oracle, Postgres, MySQL, and CloudSQL.
- You’d use no-SQL databases for semi-structured or unstructured data. Within this category itself, there are multiple subcategories like
— — Key-value databases are simple storage of key-value pairs, usually used to store session data and user preferences. Examples are Dynamo DB, Redis, BerkleyDB
— — Document stores keep data in the form of structured documents like XML or JSON. Document stores make it easy to map Objects in the object-oriented software and they are usually used to store user profiles and content. Examples are MongoDB, CouchDB, Lotus Notes.
— — Graph databases serve the purpose of storing graph-oriented data structures like social network information, recommendation engine, and risk assessments. Examples are Neo4j, OrientDB, AllegroGraph.
It’s used to store large amounts of structured data from multiple sources in a centralized place. The process of creating a data warehouse requires a bit more work than a database. It needs to be planned out and designed carefully. The data in a data warehouse is usually preprocessed, clean and enriched.
- Data warehouses are used for analytical purposes, business intelligence, reposting or for ML models to make forecasting and there is no promise of real-time and completeness.
- Data warehouses are OLAP (Online Analytical Processing), which means they are meant to be used for multi-dimensional analysis at high speeds on large volumes of data.
- Data warehouses are optimized for reading and appending data, they don’t, generally, perform as good for frequent writes and updates
Some popular warehouses are Amazon Redshift, Snowflake, IBM Db2, BigQuery, Vertica.
A data lake supports all types of data – structured, semi-structured and unstructured. Raw data from a variety of sources is dumped into the lake without any processing or schema and only on retrieving you’ll apply a structure in a format you need for the particular use case.
- Data lakes store the data in the native format -blobs and files
- Data lakes are scalable and allow processing, ingesting, storing and analyzing any type and volume of data. It’s easy to add new types of data since it requires no additional work
- Data lakes’ raw data is usually supposed to eventually be processed and put into the data warehouse. Mostly, archives, backups, snapshots are the types of data that lives in the lake permanently
To sum up: in a full-fledged data pipeline you’ll have all 3, each serving a purpose and they would be working in symbiosis.
You’ve probably spotted the acronyms in the above figure. As a software engineer, I had a brief encounter with one of those but never had the time or need to dig deeper into it.
- This variation is used only when the source and target formats are perfectly aligned and the source data is clean, needs no processing
- easiest to implement — just grab the data from the source and put in the target
- usually used for a batch load of historical data or scheduled periodic loads of log files
- used when the source and target schemas do not match or the source needs clean up.
- the data is first loaded in the target into temporary tables or tables with generic columns where you can put a whole json payload for example or a permanent table and then create a view on top of it with materialized columns transforming on the fly with a view
- this is used when the transformation can be done as a SQL query in the warehouse (to use it’s optimizations and save costs and complexities)
- this is also a good option when you are not sure what transformation is needed
- this is the right choice when the transformation required is more complex and cannot be done as SQL query— like translation (calling different API)
- this option requires using additional tools for the transformations — an application or streaming framework, scheduling etc
- This is used when data needs to be enriched
- this is used when integration with CI/CD and testing on all components needed
Example transformations are:
- improving data quality
- checking validity — whether data is conforming to business rules
- Checking accuracy —whether data is conforming to an objective truth
- checking completeness — whether any data is missing
- checking consistency — whether the computations from the data are correct
- removing poor quality rows in views
- handling nulls and blanks, etc.
I am going to split this into a couple of parts since it’s a bit overwhelming to throw a bunch of concepts at once and will also give you the chance to go and read more about these if you want and need to.
Thanks for reading!