Using Group By to Rollup Data
This is a helper post to understand that denormalized data created with joins and in particular left joins or cross apply can be rolled up with the proper use of GROUP BY. In the following example I will use the example of redundant data created with LEFT JOIN of a table to itself to demonstrate.
Creating an example table with the following structure
CREATE TABLE DailyActivity
(
AccountNumber INT NOT NULL,
Name VARCHAR(100),
Activity VARCHAR(100),
DurationInMinutes INT
)
Here I am going to be some simple activity time tracking. Fill the table with the following data
INSERT INTO DailyActivity (AccountNumber, Name, Activity, DurationInMinutes)
VALUES
(1, 'John M', 'Sleep', 480),
(1, 'John M', 'Cocktails', 60),
(1, 'John M', 'Play Animal Crossing 2', 480),
(1, 'John M', 'Exercise', 30),
(1, 'John M', 'Meditate', 30),
(2, 'Bob B', 'Sleep', 480),
(2, 'Bob B', 'Lunch', 60),
(2, 'Bob B', 'Exercise', 30),
(2, 'Bob B', 'Play with Dog', 30),
(3, 'Kylo R', 'Sleep', 480),
(3, 'Kylo R', 'Lunch', 60),
(3, 'Kylo R', 'Crush Rebellion', 500),
(3, 'Kylo R', 'Exercise', 30),
(4, 'Astrid Z', 'Work', 800),
(4, 'Astrid Z', 'Lunch', 10),
(4, 'Astrid Z', 'Exercise', 120),
(5, 'Kat Y', 'Work', 800),
(5, 'Kat Y', 'Sleep', 480),
(5, 'Kat Y', 'Exercise', 90);
AccountNumber | Name | Activity | DurationInMinutes |
---|---|---|---|
1 | John M | Sleep | 480 |
1 | John M | Cocktails | 60 |
1 | John M | Play Animal Crossing 2 | 480 |
1 | John M | Exercise | 30 |
1 | John M | Meditate | 30 |
2 | Bob B | Sleep | 480 |
2 | Bob B | Lunch | 60 |
2 | Bob B | Exercise | 30 |
2 | Bob B | Play with Dog | 30 |
3 | Kylo R | Sleep | 480 |
3 | Kylo R | Lunch | 60 |
3 | Kylo R | Crush Rebellion | 500 |
3 | Kylo R | Exercise | 30 |
4 | Astrid Z | Work | 800 |
4 | Astrid Z | Lunch | 10 |
4 | Astrid Z | Exercise | 120 |
5 | Kat Y | Work | 800 |
5 | Kat Y | Sleep | 480 |
5 | Kat Y | Exercise | 90 |
Now to do a query where I determine who did the following Activities :
Work
Lunch
Both
Neither
SELECT DISTINCT Original.Name,
CASE WHEN Computed.Work > 0 THEN 'Yes' ELSE 'No' END as Work,
CASE WHEN Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as Lunch,
CASE WHEN Computed.Work > 0 AND Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as WorkAndLunch,
CASE WHEN Computed.Work = 0 AND Computed.Lunch = 0 THEN 'Yes' ELSE 'No' END as Neither
FROM DailyActivity as Original
INNER JOIN
(
SELECT AccountNumber, MAX(DidWork) as Work, MAX(AteLunch) as Lunch
FROM
(
SELECT DailyActivity.AccountNumber,
CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
FROM dbo.DailyActivity
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Work'
) AS WorkerTime
ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Lunch'
) AS LunchTime
ON DailyActivity.AccountNumber = LunchTime.AccountNumber
) as DayActivity
GROUP BY DayActivity.AccountNumber, DayActivity.DidWork, DayActivity.AteLunch
) as Computed
ON Original.AccountNumber = Computed.AccountNumber
Gives the this where I have used the GROUP BY to rollup the denormalized data and then by setting the match to be a 1 or 0 I can use the MAX function to get the distinct values from the set to indicate if the activity occurred for that person.
As you can see Astrid
is the only person to take both a work and lunch (she worked over 13.3 hours in the day so definitely deserved) and John M
did neither in the activities tracked
Name | Work | Lunch | WorkAndLunch | Neither |
---|---|---|---|---|
Astrid Z | Yes | Yes | Yes | No |
Bob B | No | Yes | No | No |
John M | No | No | No | Yes |
Kat Y | Yes | No | No | No |
Kylo R | No | Yes | No | No |
What's happening
Breakdown it from inside out like so take the inner query that manipulates the denormalized data
SELECT DailyActivity.AccountNumber,
CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
FROM dbo.DailyActivity
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Work'
) AS WorkerTime
ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Lunch'
) AS LunchTime
ON DailyActivity.AccountNumber = LunchTime.AccountNumber
AccountNumber | DidWork | AteLunch |
---|---|---|
1 | 0 | 0 |
1 | 0 | 0 |
1 | 0 | 0 |
1 | 0 | 0 |
1 | 0 | 0 |
2 | 0 | 1 |
2 | 0 | 1 |
2 | 0 | 1 |
2 | 0 | 1 |
3 | 0 | 1 |
3 | 0 | 1 |
3 | 0 | 1 |
3 | 0 | 1 |
4 | 1 | 1 |
4 | 1 | 1 |
4 | 1 | 1 |
5 | 1 | 0 |
5 | 1 | 0 |
5 | 1 | 0 |
Now group the results by the account number and get if the action occurred or not with the MAX function in the group on the activity
SELECT AccountNumber, MAX(DidWork) as Work, MAX(AteLunch) as Lunch
FROM
(
SELECT DailyActivity.AccountNumber,
CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
FROM dbo.DailyActivity
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Work'
) AS WorkerTime
ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Lunch'
) AS LunchTime
ON DailyActivity.AccountNumber = LunchTime.AccountNumber
) as DayActivity
GROUP BY DayActivity.AccountNumber, DayActivity.DidWork, DayActivity.AteLunch
Now the the data is matched up to each account and if the Work, Lunch
occurred during the day
AccountNumber | Work | Lunch |
---|---|---|
1 | 0 | 0 |
2 | 0 | 1 |
3 | 0 | 1 |
4 | 1 | 1 |
5 | 1 | 0 |
Now do some wrapping and join back to the original denormalized data table us a DISTINCT and some CASE WHENs to make the data pretty and human readable or at least directly bindable to an output view
SELECT DISTINCT Original.Name,
CASE WHEN Computed.Work > 0 THEN 'Yes' ELSE 'No' END as Work,
CASE WHEN Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as Lunch,
CASE WHEN Computed.Work > 0 AND Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as WorkAndLunch,
CASE WHEN Computed.Work = 0 AND Computed.Lunch = 0 THEN 'Yes' ELSE 'No' END as Neither
FROM DailyActivity as Original
INNER JOIN
(
SELECT AccountNumber, MAX(DidWork) as Work, MAX(AteLunch) as Lunch
FROM
(
SELECT DailyActivity.AccountNumber,
CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
FROM dbo.DailyActivity
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Work'
) AS WorkerTime
ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
LEFT JOIN (
SELECT AccountNumber, name, Activity, DurationInMinutes
FROM dbo.DailyActivity
WHERE Activity = 'Lunch'
) AS LunchTime
ON DailyActivity.AccountNumber = LunchTime.AccountNumber
) as DayActivity
GROUP BY DayActivity.AccountNumber, DayActivity.DidWork, DayActivity.AteLunch
) as Computed
ON Original.AccountNumber = Computed.AccountNumber
Results are looking good
Name | Work | Lunch | WorkAndLunch | Neither |
---|---|---|---|---|
Astrid Z | Yes | Yes | Yes | No |
Bob B | No | Yes | No | No |
John M | No | No | No | Yes |
Kat Y | Yes | No | No | No |
Kylo R | No | Yes | No | No |
The execution plan though shows the complexity of the steps
We can fix the table scans by adding a clustered index on AccountNumber
we are allowed to do this because there is no primary key on this table otherwise the Primary Key would take that place.
CREATE CLUSTERED INDEX [IX_AccountNumber] ON [dbo].[DailyActivity]( [AccountNumber] ASC )
And also add a non clustered index on the Activity
column as it needs it for this part of the WHERE clause WHERE Activity = 'Work'
and WHERE Activity = 'Lunch'
CREATE NONCLUSTERED INDEX [IX_Activity] ON [dbo].[DailyActivity]( [Activity] ASC )
Now the query plan looks like this
Awesome much improved and performant on the denormalized original table.