Join us
@data-engineer-365 ・ Jan 30,2022 ・ 15 min read ・ 1275 views ・ Originally posted on medium.com
SQL is one of the most-sought after tech skills in 2022 for data analysis and data engineering.
In this SQL story, we will look at the SQL programming language, the SQL basics and fundamentals in a practical manner. We will start with a brief introduction to the SQL programming language followed by an Overview of Databases and the Relational Database Management System — RDBMS.
This story isn’t death by powerpoint. We will learn by doing. We will install Microsoft SQL Server in a Docker Container to start learning. This installation section will provide you with a brief introduction to Docker.
Docker is an open platform for developing, shipping, and running applications. It provides an isolated environment to run applications.
Once we’re done with the SQL Server installation in Docker, we will then dive into heart of this story — Write SQL queries to access and analyze data.
Using a Healthcare System use case, we will create a database, create tables, and finally insert some data into those tables. This is where the fun begins — performing SQL operations on the data to learn SQL in a practical manner. We will cover SQL syntax, and SQL DML operations used to access, modify or retrieve the data from the database.
At the end of this story, you will be able to speak eloquently about SQL to your colleagues, and during your next interview and perform SQL operations to retrieve, update and insert data into a database.
SQL is one of the most-sought after tech skills in 2022, and is in high demand for data analysis and data engineering. With the growth of Cloud computing and generation of petabytes of data, SQL has become one of the de-facto programming languages to access and manipulate data stored in the Cloud. In addition, Data Engineers are required to know SQL. And according to Glassdoor.com, the average base pay for a Data Engineer is $102,864 / yr.
I suppose if you’ve read my bio, chances are that you may have an idea about my background. But, if you don’t, let me briefly summarize. I’m a Master Data Management Engineer and Healthcare IT Subject Matter Expert (SME) and have worked for fortune 500 companies such as GE and Caradigm for the last 10 years to design and build data products utilizing ETL, HL7 v2.x, eMPI, IHE on disparate RDBMS/OS platforms. I’ve worked with terabytes of healthcare data running in clustered environments in Amazon Web Services (AWS) and Microsoft Azure to enable secure, reliable and connected healthcare.
I’ve worked with complex datasets requiring building reliable patient matching applications for statewide data exchange. These large data applications require an understanding of all data characteristics and aspects. Prior to loading and processing large datasets (ETL), the data needs to be loaded into a database, and analyzed for consistency and accuracy and the SQL programming language is a go-to language to understand this data.
Rest assured, you’re in perfect hands
Applications such as Facebook, Google, and Netflix store information about users and products in data stores know as relational databases. A relational database is made up of collections of objects or relations that store the data. This collection of related objects are stored in a database table. Put differently, a database is a collection of two-dimensional tables. The table is the basic storage structure of a RDBMS. Each table is composed of rows and columns. Data in these rows is manipulated using SQL. The rows represent a single transaction, record or data entry. The rows are column values in a table. The tables are connected to each other using relationships. And that is why they are called Relational Databases.
User → Relational Database Management System (RDBMS) → Tables (relational tables) → Data
In a database table, the columns represent the properties of that data and the column values make up the row(s) in a table. In the same context, a field is the intersection between a row and column. It may or may not contain data (is null or empty).
RDBMS stands for Relational Database Management System — the software that manages relational databases. RDBMS manages the SQL code execution between the database and the computer system application. There are different types of vendors that provide RDBMS software and each flavor of RDMS is slightly different but its implementation is the same.
A few examples of RDMS are :
For our purposes, we will use the SQL Server database provided by Microsoft. There are free editions available which I will include in the video description. The same SQL principles should apply to other RDMS software used because SQL is an ANSI standard. All major RDBMS support some flavor of SQL.
Types of SQL Databases
There are two main types of databases.
A relational database is made up of collections of objects or relations that store the data. Related collection of objects are stored in a database table. Examples of relational databases are SQL Server and IBM DB2.
As the term implies, NoSQL databases store data as documents. They do not have relations. One example of a NoSQL database is Hadoop. The rise of Web 2.0 companies made NoSQL databases very popular due to the storage of unstructured data such as from IoT devices, entertainment and AI/ML. As datasets handled by Internet companies grew bigger in size, a new approach to designing databases came to the fore. The strict schema of a relational database was avoided in favor of a schema-less database. NoSQL databases come in different forms and address different use cases. The spectrum includes :
(1) Key-Value stores (Redis, Amazon Dynamo DB)
(2) Column stores (HBase, Cassandra)
(3) Document stores (MongoDB, Couchbase)
(4) Graph databases (Neo4J)
(5) Search engines (Solr, ElasticSearch, Splunk)
The primary distinction between NoSQL databases and SQL databases is the absence of a rigid schema in the former. Relational databases store structured data which conforms to a rigid schema. Whereas NoSQL databases store unstructured, and semi-structured that have a dynamic schema which fall under the purview of Big Data.
Healthcare System Database Use Case
Let’s take a look at a practical example of a hospital system. When a patient arrives at medical facility seeking medical treatment or for an appointment, a nurse practitioner or the help desk typically checks them in. The patient look up or new patient registration occurs in a healthcare application that stores patient data. When a patient registration or event happens, the Healthcare system assigns what is called a Medical Record Number — MRN. Each patient has a unique MRN in the registration application. A patient can have an insurance id number and an insurance group number. Thus, this establishes a relationship. A patient can have one or more medication(s). Thus, this establishes a relationship.
Healthcare System Tables
A Healthcare system database contains the following sample tables — a Patient table, a Medications table, or an Insurance table. These tables are related through certain columns — relations. A patient can have one insurance and a patient can have one or more medications.
In RDBMS software, the concept of Primary Key (PK) and Foreign Key (FK) denotes the relationship between entities or objects in a database.
The relationships are established through certain common columns. In the case below, a medical record number is a Primary Key (PK) in the Patient table. This PK is a Foreign Key (FK) in the Insurance and Medication table.
Insurance — Patient — Medication
Let’s take a look at a sample Patient table in a healthcare system.
The Patients table above has a primary key column. A Primary Key (denoted as PK) uniquely identifies each row in a table. It allows for fast access and data retrieval because each row is unique to that key. For example, when a patient is registered in a patient registration system, a medical record number often referred to as MRN is assigned. The medical_record_number in the Patient table is the PK. Each patient has a single (one) medical_record_number that uniquely identifies them — the PK. This same identifier column (PK) can be used in the Insurance or Medication table to identify the patients. However, it be a Foreign Key (FK) in these tables. Alternatively, a PK can be made up of more than one column. It can be a combination of two columns. The combination of the medical_record_number and personid columns could also be used to form a Primary Key (PK).
What is SQL?
SQL pronounced ‘SEQUEL’ stands for Structured Query Language. SQL was originally created by two IBM engineers working on querying data on an IBM relational database. SQL allows you to access and manipulate data stored in a database.
SQL is an ANSI standard language for accessing and manipulating data stored in a database. The American National Standards Institute (ANSI) is a private, non-profit organization that administers and coordinates the U.S. voluntary standards and conformity assessment system. Since SQL is an ANSI-standard language, it has been accepted as the standard language for accessing and manipulating data stored in a database.
To access data in the database, you run SQL commands that interact with the relational database. The table in our example is the Patients table. That table has rows and columns. The columns represent the properties of a Patient such as name, age, date of birth (dob), gender, address and so forth. The Patient table is thus a collection of data representing the same entity — patient.
Our focus in this SQL story will be working with data utilizing SQL in the databases.
The SQL query below returns the number (count) of all male patients from the Patients table.
SQL Fundamentals
In this section, we will look at the SQL query building blocks and SQL command categories. This section gives you a foundation for writing basic SQL queries to filter, analyze and summarize your data.
The Basic SQL Query Block
A SQL query is made up four basic clauses.
SQL command categories
There are FOUR common categories of SQL query statements.
DDL — Data Definition Language
DDL commands are used to :
Examples
These are a few examples. We will use the CREATE TABLE sql statement later in this story to build our sample table.
DML — Data Manipulation Language
DML (Data Manipulation Language) commands are used to modify the data in the database.
Examples
DQL — Data Query Language
DQL commands are used to access and retrieve data in a database. A main focus of this story will be on the SELECT statement
Examples
DCL — Data Control Language
These commands offer privileges or access rights for database user to perform certain actions in a database based on their roles.
Examples
GRANT — gives a user access privileges to the database
REVOKE — removes user access privileges to the database
Now that we’ve learn about SQL and performed a database overview, let’s dive right in and learn by practice.
Setting up your Environment — Getting Started — Install SQL Server Database
This is a practical story. To make it more interactive, we will need to learn by doing — practicing in an environment.
This isn’t death by PowerPoint.
Depending on your Operating System (OS), you will need to install SQL server in different ways. I have included links to download the required software to install SQL Server on :
For this story, I will use Microsoft SQL Server running in Docker on a Mac. All the material should work on a Windows and also Linux running SQL Server.
How To Install SQL Server on Docker
Refer to my medium article shown below on how to install SQL Server on Docker. It’s fairly straight forward and it should take you less than 10 minutes. https://faun.pub/how-to-install-sql-server-on-docker-cb6eb8e1c657
After a successful installation, you should see the following Azure Data Studio status page pointing to localhost.
Up to this point, we have successfully installed SQL Server on Docker, and installed Azure Data Studio (GUI) in order to interact with the SQL Server 2019 RDBMS. From this point onwards, you should be able to create databases, create tables, insert data into the tables, and query the data inside the tables.
Learn SQL Step by Step — Hands on Tutorial
Before you begin — Prerequisites
Download the attached SQL scripts Create the sample tables and insert data required to the database. Create a new database.
Create a new database on SQL Server Management Studio (SSMS)
Create a new database on Azure Data Studio
Azure Data Studio (formerly SQL Operations Studio) is a free GUI management tool that you can use to manage SQL Server on your Mac or PC. You can use it to create and manage databases, write queries, backup and restore databases, and more.
A SQL Server GUI for your Mac — Azure Data Studio
The Azure Data Studio dashboard.
Open Azure Data Studio and Click File → New Query
Set your connection — Connect to the localhost and Execute the SQL query shown below to create a HealthCareDB
After the new database has been created, Go to the file menu and open a new SQL script and execute the following SQL statements in the order provided. The SQL statement will do the following :
After the new database has been created, open a new SQL script and run (execute) the following SQL statements in the order provided. The SQL statement will do the following :
Create the Patient table
Click New Query and Run the SQL query shown below to insert data into the dummy Patients table.
Insert new patient rows into table ‘Patients’
You have now successfully created a sample database, created a sample table and inserted data into that table.
SQL COMMANDS
Here comes the fun part. We will now run various SQL queries for data analysis in our sample database.
SQL SELECT statement
The SQL SELECT statement retrieves data from the database. The basic SELECT statement syntax is as follows :
SQL SELECT TOP
The SQL SELECT TOP limits the number of rows returned from the sql statement. This is very important for large tables where you want to limit the output result in order to avoid impacting the database performance. In MySQL, there is a feature to limit by 10, 50 etc. built into the MySQL Workbench. In SQL Server, its important to set this limit in order to avoid impacting the application performance and disrupt services to the end user. This is because a SELECT * call retrieves all records from a database table and this is a very expensive operative for the database server.
SQL WHERE clause
The WHERE clause limit or restricts the SQL query to rows that meet a certain criteria or condition. The filter criteria can be column names, expressions or comparison operators (more on these later on).
SQL SELECT Distinct
The Distinct keyword returns unique results in a data set and it applies to ALL columns in the SELECT. You can specify columns needed after the DISTINCT qualifier and note that the result does not show duplicate rows. You can select or include multiple columns to return the desired result — include one or more columns.
SQL Order by
Sort rows with the ORDER BY clause in either ASC — ascending order, or DESC — descending order. The ORDER BY clause comes last in the SELECT statement. The ordering by position — You can use a column alias or a number for the column position in the ORDER BY clause
SQL GROUP BY
The GROUP BY statement returns a set of rows to give one result per group. The column(s) groups appear in the SELECT statement.
If you’ve reached this point, you are a rock star!
Enroll In Lifetime Access for FREE to this complete Udemy Course — SQL for Beginners 2022 — Learn SQL Step by Step — for the first 100 students 😃
https://www.udemy.com/course/sql-for-beginners-2022-learn-sql-step-by-step/?couponCode=MEDIUMDATA365
USE THE COUPON — MEDIUMDATA365
Thank you for reading this article. You should now be able to speak eloquently about SQL to your colleagues, and or during your next technical interview and perform SQL operations to retrieve, update and insert data into a database. 🙏
Join other developers and claim your FAUN account now!
Influence
Total Hits
Posts
Only registered users can post comments. Please, login or signup.