Hi, in this blog we will look at how to manage your tabular data in two ways:
- Using a Relational Database and since I’m a fan of Microsoft Tech we’ll go with Microsoft SQL Server
- Using the new thing from the Power Platform, Common Data Service (CDS)
If you know a bit about these two, you’re probably wondering why I picked them for a parallel tutorial on managing your data, so here are my reasons:
- One is the “old champ” and the other is the “rising star” in the Microsoft Ecosystem for managing Tabular data (and if we look at their latest capabilities, not only structured)
- For one we can go with the classic ways of modelling relational data (star, snowflake, etc.), while the other will probably hold way more data in one place, so the approaches differ
- Sometimes you have to use one, sometimes the other and sometimes you can combine both in order to obtain optimal performance for your solution
- One can be accessed from pretty much anything (console app, function, native app, you name it), while the other still has very limited access options (it’s new, so this post will get a follow-up when the time comes)
What we’ll cover
For Both
- Determine the purpose of your DB
- How to access your data from an app
For SQL Server
- How data needs to be structured in a blueprint
- Find and organize the information required
- Divide the information into tables
- Specify primary/secondary keys + Table relationships
- Refine design + Normalization rules
- Make it scalable
For CDS
- Create entities and fields
- Create relationships/lookup fields
- Views and Business rules
- When and Why to use option sets
- Make it scalable
Setup
Our Dataset
In order to have a real-world scenario we will be using a decent dataset that can be found here (thank you, Daniel Lemire). We will go with Census1881, which provides us with:
- 4,277,807 rows
- 7 columns (I did my own naming for them; maybe we will also use them for some ML in the future)
- The first column has a cardinality of 152,878 (related to the name of the person)
- The second column has a cardinality of 152,363 (related to the name of the person)
- The third column has a cardinality of 137 (related to the name of the person, weird, right?)
- The fourth column has a cardinality of 8,798 (related to religion)
- The fifth column has a cardinality of 2,722 (related to location)
- The sixth column has a cardinality of 24,223 (related to occupation)
- The seventh column has a cardinality of 2,127 (a label column)
SQL Server
For SQL Server we will go with a local setup, as it is much the same as working with Azure SQL but incurs no extra cost. In order to have the optimal setup for this task I recommend:
- SQL Server Developer Edition (Latest Version – As I am writing this blog I am using 2019)
- SQL Server Management Studio (Latest Version)
In order to save some time I have created a database version of the dataset that can be found here. You can just download it and restore it. Once the restore is done, you can verify it by running the following script and comparing the numbers with the ones I have listed above.
SELECT (
SELECT COUNT(DISTINCT [FirstName])
FROM [Census1881DB].[dbo].[RawData]
) [FirstNameCardinality]
,(
SELECT COUNT(DISTINCT [LastName])
FROM [Census1881DB].[dbo].[RawData]
) [LastNameCardinality]
,(
SELECT COUNT(DISTINCT [Age])
FROM [Census1881DB].[dbo].[RawData]
) [AgeCardinality]
,(
SELECT COUNT(DISTINCT [Religion])
FROM [Census1881DB].[dbo].[RawData]
) [ReligionCardinality]
,(
SELECT COUNT(DISTINCT [Location])
FROM [Census1881DB].[dbo].[RawData]
) [LocationCardinality]
,(
SELECT COUNT(DISTINCT [Ocupation])
FROM [Census1881DB].[dbo].[RawData]
) [OcupationCardinality]
,(
SELECT COUNT(DISTINCT [Class])
FROM [Census1881DB].[dbo].[RawData]
) [ClassCardinality]
,(
SELECT COUNT(*)
FROM [Census1881DB].[dbo].[RawData]
) [NumberOfRows]
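By the way, if you prefer T-SQL over the SSMS restore wizard for the restore step above, a minimal sketch would look like the one below; the backup path and the logical file names are assumptions for your machine (RESTORE FILELISTONLY will show you the actual logical names).
-- Sketch only: adjust the path and the logical names to your environment.
-- RESTORE FILELISTONLY FROM DISK = N'C:\Backups\Census1881DB.bak' lists the real logical names.
RESTORE DATABASE [Census1881DB]
FROM DISK = N'C:\Backups\Census1881DB.bak'
WITH MOVE N'Census1881DB' TO N'C:\Data\Census1881DB.mdf'
	,MOVE N'Census1881DB_log' TO N'C:\Data\Census1881DB_log.ldf'
	,RECOVERY
	,STATS = 10;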
CDS Setup
Just as there is a data import in SQL Server, there is one in the Power Platform. You have to go to Entities and click on Get data. Then we point it to our CSV; I am storing mine in OneDrive.
Then we will get the Transform Data screen, where we will do nothing and simply proceed. (In case you are seeing the header as a row, go to Transform -> Use first row as headers.)
When reaching the Map entities screen we should see something like this:
As for the refresh, we will go with Manual, as our file is not changing; a few seconds after we are done, a new entity will appear.
SQL Server
In order not to add some Python to this post (although it would be fun), we will do some data exploration directly in the SQL database to understand our data better.
We have seen in the “Setup\Our Dataset” part how our data looks, numerically speaking, but now we have to also understand it and how we can leverage its structure for each chosen DB.
One small detail for when to pick NVARCHAR(255) vs NVARCHAR(MAX):
- With a fixed size your data structure will be optimized, but you need to know the actual limits; setting the size to something really big throws away all the benefit of the optimization (a quick check is sketched right after this list)
- Anything stored as NVARCHAR(MAX) is generally a pain, so keep it only for optional fields
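To actually know those limits before picking a size, a quick check of the longest value per text column in the raw data helps; this is just a sketch over the columns we already have:
-- Longest value per text column, so the NVARCHAR sizes are an informed choice, not a guess.
SELECT MAX(LEN([FirstName])) [MaxFirstNameLength]
,MAX(LEN([LastName])) [MaxLastNameLength]
,MAX(LEN([Religion])) [MaxReligionLength]
,MAX(LEN([Location])) [MaxLocationLength]
,MAX(LEN([Ocupation])) [MaxOcupationLength]
FROM [Census1881DB].[dbo].[RawData]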
Understanding the Data
Let’s do a quick SELECT and see some of our data:
SELECT TOP (10) [FirstName]
,[LastName]
,[Age]
,[Religion]
,[Location]
,[Ocupation]
,[Class]
FROM [Census1881DB].[dbo].[RawData]
WHERE [Ocupation] IS NOT NULL -- We filtered because it is not mandatory to have this field
We already knew from the names/numbers that some of these are categorical data, but let’s take each column and see how we can make more sense of it in terms of relational DBs, and how we can use each column or group of columns:
- FirstName and LastName need to go together in an Entity called Person because they only make sense together. For this entity, we will add the info required to add value when identifying people.
- Age is a categorical number but it makes no sense without being attached to the Person. However, we can create a new Entity called AgeGroup where we can store the inferior limit, the superior limit, and some details in order to add more power to our data.
- Religion, like Age, is categorical data, but in this case, we don’t necessarily need the explicit value attached to the Person entity. For this, we should create a different entity and link those 2.
- Location and Occupation (mind the typo in the column name) need the same treatment as the Religion column. They need separate entities linked to the Person entity and adding extra info to them for future use.
- The Class column is just a label and, depending on your needs, you can even discard it.
In data science adding extra information based on the current information is called “feature engineering” and this is what we will also aim for.
Building Relationships for your Data
In general, you will start with the referenced (lookup) tables and work towards the tables that reference them, meaning you will first build the tables that have no foreign keys and finish with those that do. When it comes to independent tables the order doesn’t matter, so we will just pick one. That being said, here is what we will do:
AgeGroup
CREATE TABLE [AgeGroups] (
[ID] INT PRIMARY KEY
,[MinAge] INT NOT NULL
,[MaxAge] INT NOT NULL
,[GroupDescription] NVARCHAR(255)
,[Notes] NVARCHAR(MAX)
);
And now to populate it. I picked some general age groups rather than deriving them from our actual data:
INSERT INTO [Census1881DB].[dbo].[AgeGroups]
SELECT 1 [ID]
,0 [MinAge]
,3 [MaxAge]
,'Really Small Children' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 2 [ID]
,4 [MinAge]
,7 [MaxAge]
,'Small Children' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 3 [ID]
,8 [MinAge]
,15 [MaxAge]
,'Children' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 4 [ID]
,16 [MinAge]
,21 [MaxAge]
,'Teenagers' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 5 [ID]
,22 [MinAge]
,30 [MaxAge]
,'Young Adults' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 6 [ID]
,31 [MinAge]
,60 [MaxAge]
,'Adults' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 7 [ID]
,61 [MinAge]
,123 [MaxAge]
,'Old People' [GroupDescription]
,'' [ExtraNotes]
UNION
SELECT 8 [ID]
,124 [MinAge]
,999 [MaxAge]
,'Outliers' [GroupDescription]
,'' [ExtraNotes]
Now let’s see how our data looks through the lens of the age groups:
SELECT AG.[GroupDescription]
,COUNT(AG.[GroupDescription]) [NumberOfIndividuals]
,AG.[MinAge]
,MIN(RD.[Age]) [ActualMinAge]
,AG.[MaxAge]
,MAX(RD.[Age]) [ActualMaxAge]
,(AG.[MaxAge] - AG.[MinAge]) [IntervalSize]
,(COUNT(AG.[GroupDescription]) / (AG.[MaxAge] - AG.[MinAge])) [AverageDistribution]
FROM [Census1881DB].[dbo].[RawData] RD
INNER JOIN [Census1881DB].[dbo].[AgeGroups] AG ON COALESCE(RD.[Age], 999) BETWEEN AG.[MinAge]
AND AG.[MaxAge]
GROUP BY AG.[GroupDescription]
,AG.[MinAge]
,AG.[MaxAge]
ORDER BY AG.[MinAge]
As we can see, in general, we don’t have the data evenly distributed for those age groups but that is not what we are looking for. Depending on your task at hand you will probably go for a different split but as long as you keep it structured you can get powerful insights out of your data.
Religion
CREATE TABLE [Religion] (
[ID] INT IDENTITY(1, 1) PRIMARY KEY
,[Name] NVARCHAR(255)
,[Notes] NVARCHAR(MAX)
);
Since this will feed off our data, we need that identity set and we will insert everything that we have:
INSERT INTO [Census1881DB].[dbo].[Religion] ([Name])
SELECT DISTINCT (upper(substring([Religion], 1, 1)) + lower(substring([Religion], 2, len([Religion]) - 1)))
FROM [Census1881DB].[dbo].[RawData]
And to have some insights:
SELECT R.[Name] [NameOfReligion]
,COUNT(R.[Name]) [NumberOfFollowers]
FROM [Census1881DB].[dbo].[RawData] RD
INNER JOIN [Census1881DB].[dbo].[Religion] R ON R.[Name] = (upper(substring(RD.[Religion], 1, 1)) + lower(substring(RD.[Religion], 2, len(RD.[Religion]) - 1)))
GROUP BY R.[Name]
ORDER BY [NumberOfFollowers] DESC
We can see that our data is not cleaned. We could do some cleaning (probably in a future post), but this is also how real data looks: not everyone inputs it in the same manner.
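Since that upper/lower casing expression will keep reappearing in the joins below, one option is to wrap it in a scalar function. This is just a sketch (dbo.ToTitleCase is a name I made up, and the rest of the post keeps the inline expression), and keep in mind that scalar functions over 4 million rows can be slower than the inline version:
-- Hypothetical helper: upper-case the first letter, lower-case the rest (NULL in, NULL out).
CREATE FUNCTION [dbo].[ToTitleCase] (@Value NVARCHAR(255))
RETURNS NVARCHAR(255)
AS
BEGIN
	RETURN upper(substring(@Value, 1, 1)) + lower(substring(@Value, 2, len(@Value) - 1));
END
With it, the insert above would simply read SELECT DISTINCT dbo.ToTitleCase([Religion]) FROM [Census1881DB].[dbo].[RawData].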
Location
Same treatment as the Religion entity, and we will see once again that the data needs pre-processing:
CREATE TABLE [Location] (
[ID] INT IDENTITY(1, 1) PRIMARY KEY
,[Name] NVARCHAR(255)
,[Notes] NVARCHAR(MAX)
);
INSERT INTO [Census1881DB].[dbo].[Location] ([Name])
SELECT DISTINCT (upper(substring([Location], 1, 1)) + lower(substring([Location], 2, len([Location]) - 1)))
FROM [Census1881DB].[dbo].[RawData]
SELECT L.[Name] [NameOfLocation]
,COUNT(L.[Name]) [NumberOfResidents]
FROM [Census1881DB].[dbo].[RawData] RD
INNER JOIN [Census1881DB].[dbo].[Location] L ON L.[Name] = (upper(substring(RD.[Location], 1, 1)) + lower(substring(RD.[Location], 2, len(RD.[Location]) - 1)))
GROUP BY L.[Name]
ORDER BY [NumberOfResidents] DESC
Occupation
And one last time:
CREATE TABLE [Occupation] (
[ID] INT IDENTITY(1, 1) PRIMARY KEY
,[Name] NVARCHAR(255)
,[Notes] NVARCHAR(MAX)
);
INSERT INTO [Census1881DB].[dbo].[Occupation] ([Name])
SELECT DISTINCT (upper(substring([Ocupation], 1, 1)) + lower(substring([Ocupation], 2, len([Ocupation]) - 1)))
FROM [Census1881DB].[dbo].[RawData]
SELECT O.[Name] [NameOfOccupation]
,COUNT(O.[Name]) [NumberOfPeopleWithThatOccupation]
FROM [Census1881DB].[dbo].[RawData] RD
INNER JOIN [Census1881DB].[dbo].[Occupation] O ON O.[Name] = (upper(substring(RD.[Ocupation], 1, 1)) + lower(substring(RD.[Ocupation], 2, len(RD.[Ocupation]) - 1)))
GROUP BY O.[Name]
ORDER BY [NumberOfPeopleWithThatOccupation] DESC
Now the data is close, but somewhat localized; again, some cleaning would be nice.
Putting it all together
The idea is to write the query that creates the table in the same way you will later query it, or at least close to it:
SELECT (upper(substring([FirstName], 1, 1)) + lower(substring([FirstName], 2, len([FirstName]) - 1))) + ' ' + (upper(substring([LastName], 1, 1)) + lower(substring([LastName], 2, len([LastName]) - 1))) [FullName]
,[Age]
,R.[ID] [ReligionID]
,L.[ID] [LocationID]
,O.[ID] [OccupationID]
INTO [Census1881DB].[dbo].[CleanData]
FROM [Census1881DB].[dbo].[RawData] RD
INNER JOIN [Census1881DB].[dbo].[Religion] R ON R.[Name] = (upper(substring(RD.[Religion], 1, 1)) + lower(substring(RD.[Religion], 2, len(RD.[Religion]) - 1)))
INNER JOIN [Census1881DB].[dbo].[Location] L ON L.[Name] = (upper(substring(RD.[Location], 1, 1)) + lower(substring(RD.[Location], 2, len(RD.[Location]) - 1)))
INNER JOIN [Census1881DB].[dbo].[Occupation] O ON O.[Name] = (upper(substring(RD.[Ocupation], 1, 1)) + lower(substring(RD.[Ocupation], 2, len(RD.[Ocupation]) - 1)))
WHERE (upper(substring([FirstName], 1, 1)) + lower(substring([FirstName], 2, len([FirstName]) - 1))) + ' ' + (upper(substring([LastName], 1, 1)) + lower(substring([LastName], 2, len([LastName]) - 1))) IS NOT NULL
And add some Foreign Keys:
ALTER TABLE [dbo].[CleanData] ADD CONSTRAINT FK_Location_ID FOREIGN KEY ([LocationID]) REFERENCES [dbo].[Location] (ID);
ALTER TABLE [dbo].[CleanData] ADD CONSTRAINT FK_Religion_ID FOREIGN KEY ([ReligionID]) REFERENCES [dbo].[Religion] (ID);
ALTER TABLE [dbo].[CleanData] ADD CONSTRAINT FK_Occupation_ID FOREIGN KEY ([OccupationID]) REFERENCES [dbo].[Occupation] (ID);
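One thing SELECT ... INTO does not give us is a primary key on CleanData. If you want a surrogate key, here is a small sketch (nothing later in the post depends on it):
-- SELECT ... INTO creates the table without a key, so we add a surrogate one.
ALTER TABLE [dbo].[CleanData] ADD [ID] INT IDENTITY(1, 1) NOT NULL;
ALTER TABLE [dbo].[CleanData] ADD CONSTRAINT PK_CleanData_ID PRIMARY KEY ([ID]);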
And now to get the entire data from the first table, a little bit cleansed:
SELECT CD.[FullName]
,COALESCE(CD.[Age], 999) [Age]
,AG.[GroupDescription]
,R.[Name] [ReligionName]
,L.[Name] [LocationName]
,O.[Name] [OccupationName]
FROM [Census1881DB].[dbo].[CleanData] CD
INNER JOIN [Census1881DB].[dbo].[Location] L ON L.[ID] = CD.[LocationID]
INNER JOIN [Census1881DB].[dbo].[Religion] R ON R.[ID] = CD.[ReligionID]
INNER JOIN [Census1881DB].[dbo].[Occupation] O ON O.[ID] = CD.[OccupationID]
INNER JOIN [dbo].[AgeGroups] AG ON COALESCE(CD.[Age], 999) BETWEEN AG.[MinAge]
AND AG.[MaxAge]
And that gives us some random-looking data. After this step, we can enrich our DB with more columns and more data in order to meet the goal.
Make it scalable
Our luck is that we are working with an Azure-ready structure, so it is really easy to scale the performance of our DB up or down with just a few clicks in the portal.
We can also help the process by doing either of the following:
- Adding indexes on the most used columns and rebuilding them regularly (see the sketch right after this list)
- Cleaning the data periodically
- Preprocessing data so we do not have useless values
- Optimizing the allocation size for different columns
- Writing queries as specific as they can be
- When logging is a frequent thing, pull the logging to a different DB
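As an illustration of the first point, indexes on the lookup ID columns of CleanData could look like the sketch below (the index names are my own picks):
-- Nonclustered indexes to speed up the joins on the lookup IDs.
CREATE NONCLUSTERED INDEX IX_CleanData_ReligionID ON [dbo].[CleanData] ([ReligionID]);
CREATE NONCLUSTERED INDEX IX_CleanData_LocationID ON [dbo].[CleanData] ([LocationID]);
CREATE NONCLUSTERED INDEX IX_CleanData_OccupationID ON [dbo].[CleanData] ([OccupationID]);
-- And the "rebuilding them regularly" part, for one of them.
ALTER INDEX IX_CleanData_ReligionID ON [dbo].[CleanData] REBUILD;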
CDS
I will start with some content copied from the docs and then move on with the standard structure.
What is CDS? Why CDS?
Common Data Service lets you securely store and manage data that’s used by business applications. Data within Common Data Service is stored within a set of entities. An entity is a set of records used to store data, similar to how a table stores data within a database. Common Data Service includes a base set of standard entities that cover typical scenarios, but you can also create custom entities specific to your organization and populate them with data using Power Query. App makers can then use Power Apps to build rich applications using this data.
Standard and custom entities within Common Data Service provide a secure and cloud-based storage option for your data. Entities let you create a business-focused definition of your organization’s data for use within apps. If you’re not sure if entities are your best option, consider these benefits:
- Easy to manage – Both the metadata and data are stored in the cloud. You don’t need to worry about the details of how they’re stored.
- Easy to secure – Data is securely stored so that users can see it only if you grant them access. Role-based security allows you to control access to entities for different users within your organization.
- Access your Dynamics 365 Data – Data from your Dynamics 365 applications is also stored within the Common Data Service allowing you to quickly build apps that leverage your Dynamics 365 data and extend your apps using Power Apps.
- Rich metadata – Data types and relationships are leveraged directly within Power Apps.
- Logic and validation – Define calculated fields, business rules, workflows, and business process flows to ensure data quality and drive business processes.
- Productivity tools – Entities are available within the add-ins for Microsoft Excel to increase productivity and ensure data accessibility.
Create entities and fields
Comparing CDS to an RDBMS we get:
- Tables become entities (minus the normalization rules)
- Columns become fields
Creating those is straightforward and can be done from the portal. When it comes to “good practices” there are currently none, as the CDS is still evolving and we cannot relate its performance to a classical schema.
Let us check the structure of our newly created entity, the census:
As for the meta-data of the fields we can see:
- Their Display name, what you can see in a friendly way about the meaning of the field
- Their Name, what the systems/APIs/non-humans (meaning devs) see when they try to do some manipulation over this data
- Their Data type
- Their Type, meaning either Custom (Created by user) or Standard (Created by the system)
- If they are Customizable
- If they are Required
- If they are Searchable, meaning they will get something close to an index
Now moving to the actual fields we can see that aside from our imported fields we also have some extra fields which are standard for the CDS and help us throughout the entire development and reporting process, for example:
- Created by and Created on help us see who initially created that record and when
- Name is the entity’s primary name field, the closest thing here to our classic primary key
- Modified by and Modified on help us see who made the last edit and when
Create relationships/lookup fields
If you’re thinking about the classical relationship model from the RDBMS, you’re in luck because here it is something similar; but, luckily, there is no need to manage a primary key yourself, you can directly add a new field that references a field in some other entity.
In order to add a lookup field you just need to add a new field, set its type to Lookup, and then set the related entity and you’re all set.
Views and Business rules
Views are very similar to the traditional DB notion, meaning you create another view of your data based on some rules, permissions, and so on. Since this data model is meant for business people more than devs, it is easier to do what you want, and it is already connected to multiple data sources such as Azure AD, so you can inherit a lot of metadata.
Business rules are something cool that you don’t have in an RDBMS, but you can build an equivalent with a trigger on some tables (sketched a bit further down). The idea is simple:
- When SOMETHING => DO
And for this notion there are some standard scenarios: when a user joins, limit their access and have them add or edit some data.
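As a rough SQL Server parallel (a sketch only, with a made-up rule), a trigger enforcing a simple “When SOMETHING => DO” rule on the Religion table we created earlier could look like this:
-- Made-up rule for illustration: when a Religion row arrives without notes, flag it for review.
CREATE TRIGGER [dbo].[TR_Religion_FlagForReview] ON [dbo].[Religion]
AFTER INSERT
AS
BEGIN
	UPDATE R
	SET R.[Notes] = 'Needs review'
	FROM [dbo].[Religion] R
	INNER JOIN inserted I ON I.[ID] = R.[ID]
	WHERE R.[Notes] IS NULL;
END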
When and Why to use option sets
Remember the enums from programming classes, those standardized values that make sense when you see the name but behind the scenes are just a number? Here they are called option sets.
When and why to use them:
- You need a standardized way of answering a question, for example: What type of phone number is this? Options = Mobile / Work / Other
- You need to limit the freedom of others, for example: Pick one of the following toppings for your pizza…
- You need the data in a clean (categorical) form: Please input your country…
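For comparison, the closest SQL Server equivalent would be a CHECK constraint (or a small lookup table). Here is a sketch using the phone number example above; it is not part of our census schema:
-- Option-set-like column: only the listed values are allowed.
CREATE TABLE [PhoneNumbers] (
	[ID] INT IDENTITY(1, 1) PRIMARY KEY
	,[Number] NVARCHAR(50) NOT NULL
	,[PhoneType] NVARCHAR(20) NOT NULL CHECK ([PhoneType] IN ('Mobile', 'Work', 'Other'))
);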
Make it scalable
We have talked about CDS, how easy it is to do things, and how much you can do with just a few clicks, which means that we have reached the section where we get some downsides.
As of now, you can’t scale the CDS (as it is linked to Dynamics 365 and the Power Platform) yourself. But Microsoft does a good job keeping everything close to optimal parameters.
My opinion
As you have seen, both tools are really powerful and each is optimized for a certain type of use, so here is how I would pick one, and why:
- If I need some data that is already linked with something in the CDS or will be because of a Power Platform related process I would go for the CDS approach.
- If I need a fresh approach with low latency, perks of optimizations, and the need for scaling things I would go for SQL Server.
SQL Server is powerful but requires some work, it is not directly connected to some data sources and you need to do some configuration on your own. It would be the best choice for a developer.
CDS is like a black box: you have some control over the input and some control over the resulting output, but if you want to fix something inside… good luck! This is meant for non-developers.
You know the saying: “Use the right tool for the right job”. There is no way of arguing which one is better, since they are not meant to do exactly the same thing. If you just want to store data in whatever way, yes, you can argue. But if storing data is the whole purpose of the app, use blobs, they are cool.
Thank you for reading!
If you have any comments, suggestions, or requests for the future please leave a comment and help me grow, thanks!