Using Group By to Rollup Data

This is a helper post to understand that denormalized data created with joins and in particular left joins or cross apply can be rolled up with the proper use of GROUP BY.  In the following example I will use the example of redundant data created with LEFT JOIN of a table to itself to demonstrate.

Creating an example table with the following structure

CREATE TABLE DailyActivity
( 
	AccountNumber INT NOT NULL,
	Name VARCHAR(100),
	Activity VARCHAR(100),
	DurationInMinutes INT
)

Here I am going to be some simple activity time tracking.  Fill the table with the following data

INSERT INTO DailyActivity (AccountNumber, Name, Activity, DurationInMinutes)
VALUES
	(1, 'John M', 'Sleep', 480),
	(1, 'John M', 'Cocktails', 60),
	(1, 'John M', 'Play Animal Crossing 2', 480),
	(1, 'John M', 'Exercise', 30),
	(1, 'John M', 'Meditate', 30),
	(2, 'Bob B', 'Sleep', 480),
	(2, 'Bob B', 'Lunch', 60),
	(2, 'Bob B', 'Exercise', 30),
	(2, 'Bob B', 'Play with Dog', 30),
	(3, 'Kylo R', 'Sleep', 480),
	(3, 'Kylo R', 'Lunch', 60),
	(3, 'Kylo R', 'Crush Rebellion', 500),
	(3, 'Kylo R', 'Exercise', 30),
	(4, 'Astrid Z', 'Work', 800),
	(4, 'Astrid Z', 'Lunch', 10),
	(4, 'Astrid Z', 'Exercise', 120),
	(5, 'Kat Y', 'Work', 800),
	(5, 'Kat Y', 'Sleep', 480),
	(5, 'Kat Y', 'Exercise', 90);
AccountNumber Name Activity DurationInMinutes
1 John M Sleep 480
1 John M Cocktails 60
1 John M Play Animal Crossing 2 480
1 John M Exercise 30
1 John M Meditate 30
2 Bob B Sleep 480
2 Bob B Lunch 60
2 Bob B Exercise 30
2 Bob B Play with Dog 30
3 Kylo R Sleep 480
3 Kylo R Lunch 60
3 Kylo R Crush Rebellion 500
3 Kylo R Exercise 30
4 Astrid Z Work 800
4 Astrid Z Lunch 10
4 Astrid Z Exercise 120
5 Kat Y Work 800
5 Kat Y Sleep 480
5 Kat Y Exercise 90

Now to do a query where I determine who did the following Activities :

  • Work
  • Lunch
  • Both
  • Neither
SELECT DISTINCT Original.Name,
	CASE WHEN Computed.Work > 0 THEN 'Yes' ELSE 'No' END as Work,
	CASE WHEN Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as Lunch,
	CASE WHEN Computed.Work > 0 AND Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as WorkAndLunch,
	CASE WHEN Computed.Work = 0 AND Computed.Lunch = 0 THEN 'Yes' ELSE 'No' END as Neither
FROM DailyActivity as Original
INNER JOIN
(
	SELECT AccountNumber, MAX(DidWork) as Work, MAX(AteLunch) as Lunch
	FROM 
	(
		SELECT DailyActivity.AccountNumber,
			CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
			CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
		FROM dbo.DailyActivity
		LEFT JOIN (
			SELECT AccountNumber, name, Activity, DurationInMinutes
			FROM dbo.DailyActivity
			WHERE Activity = 'Work'
		) AS WorkerTime
		ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
		LEFT JOIN (
			SELECT AccountNumber, name, Activity, DurationInMinutes
			FROM dbo.DailyActivity
			WHERE Activity = 'Lunch'
		) AS LunchTime
		ON DailyActivity.AccountNumber = LunchTime.AccountNumber	
	) as DayActivity
	GROUP BY DayActivity.AccountNumber, DayActivity.DidWork, DayActivity.AteLunch
) as Computed
ON Original.AccountNumber = Computed.AccountNumber

Gives the this where I have used the GROUP BY to rollup the denormalized data and then by setting the match to be a 1 or 0 I can use the MAX function to get the distinct values from the set to indicate if the activity occurred for that person.

As you can see Astrid is the only person to take both a work and lunch (she worked over 13.3 hours in the day so definitely deserved) and John M did neither in the activities tracked

Name Work Lunch WorkAndLunch Neither
Astrid Z Yes Yes Yes No
Bob B No Yes No No
John M No No No Yes
Kat Y Yes No No No
Kylo R No Yes No No

What's happening

Breakdown it from inside out like so take the inner query that manipulates the denormalized data

SELECT DailyActivity.AccountNumber,
	CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
	CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
FROM dbo.DailyActivity
LEFT JOIN (
	SELECT AccountNumber, name, Activity, DurationInMinutes
	FROM dbo.DailyActivity
	WHERE Activity = 'Work'
) AS WorkerTime
ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
LEFT JOIN (
	SELECT AccountNumber, name, Activity, DurationInMinutes
	FROM dbo.DailyActivity
	WHERE Activity = 'Lunch'
) AS LunchTime
ON DailyActivity.AccountNumber = LunchTime.AccountNumber	
AccountNumber DidWork AteLunch
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
2 0 1
2 0 1
2 0 1
2 0 1
3 0 1
3 0 1
3 0 1
3 0 1
4 1 1
4 1 1
4 1 1
5 1 0
5 1 0
5 1 0

Now group the results by the account number and get if the action occurred or not with the MAX function in the group on the activity

SELECT AccountNumber, MAX(DidWork) as Work, MAX(AteLunch) as Lunch
FROM 
(
	SELECT DailyActivity.AccountNumber,
		CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
		CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
	FROM dbo.DailyActivity
	LEFT JOIN (
		SELECT AccountNumber, name, Activity, DurationInMinutes
		FROM dbo.DailyActivity
		WHERE Activity = 'Work'
	) AS WorkerTime
	ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
	LEFT JOIN (
		SELECT AccountNumber, name, Activity, DurationInMinutes
		FROM dbo.DailyActivity
		WHERE Activity = 'Lunch'
	) AS LunchTime
	ON DailyActivity.AccountNumber = LunchTime.AccountNumber	
) as DayActivity
GROUP BY DayActivity.AccountNumber, DayActivity.DidWork, DayActivity.AteLunch

Now the the data is matched up to each account and if the Work, Lunch occurred during the day

AccountNumber Work Lunch
1 0 0
2 0 1
3 0 1
4 1 1
5 1 0

Now do some wrapping and join back to the original denormalized data table us a DISTINCT and some CASE WHENs to make the data pretty and human readable or at least directly bindable to an output view

SELECT DISTINCT Original.Name,
	CASE WHEN Computed.Work > 0 THEN 'Yes' ELSE 'No' END as Work,
	CASE WHEN Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as Lunch,
	CASE WHEN Computed.Work > 0 AND Computed.Lunch > 0 THEN 'Yes' ELSE 'No' END as WorkAndLunch,
	CASE WHEN Computed.Work = 0 AND Computed.Lunch = 0 THEN 'Yes' ELSE 'No' END as Neither
FROM DailyActivity as Original
INNER JOIN
(
	SELECT AccountNumber, MAX(DidWork) as Work, MAX(AteLunch) as Lunch
	FROM 
	(
		SELECT DailyActivity.AccountNumber,
			CASE WHEN WorkerTime.Activity IS NOT NULL THEN 1 ELSE 0 END as DidWork,
			CASE WHEN LunchTime.Activity IS NOT NULL THEN 1 ELSE 0 END as AteLunch
		FROM dbo.DailyActivity
		LEFT JOIN (
			SELECT AccountNumber, name, Activity, DurationInMinutes
			FROM dbo.DailyActivity
			WHERE Activity = 'Work'
		) AS WorkerTime
		ON DailyActivity.AccountNumber = WorkerTime.AccountNumber
		LEFT JOIN (
			SELECT AccountNumber, name, Activity, DurationInMinutes
			FROM dbo.DailyActivity
			WHERE Activity = 'Lunch'
		) AS LunchTime
		ON DailyActivity.AccountNumber = LunchTime.AccountNumber	
	) as DayActivity
	GROUP BY DayActivity.AccountNumber, DayActivity.DidWork, DayActivity.AteLunch
) as Computed
ON Original.AccountNumber = Computed.AccountNumber

Results are looking good

Name Work Lunch WorkAndLunch Neither
Astrid Z Yes Yes Yes No
Bob B No Yes No No
John M No No No Yes
Kat Y Yes No No No
Kylo R No Yes No No

The execution plan though shows the complexity of the steps

We can fix the table scans by adding a clustered index on AccountNumber we are allowed to do this because there is no primary key on this table otherwise the Primary Key would take that place.

CREATE CLUSTERED INDEX [IX_AccountNumber] ON [dbo].[DailyActivity]( [AccountNumber] ASC )

And also add a non clustered index on the Activity column as it needs it for this part of the WHERE clause WHERE Activity = 'Work' and WHERE Activity = 'Lunch'

CREATE NONCLUSTERED INDEX [IX_Activity] ON [dbo].[DailyActivity]( [Activity] ASC )

Now the query plan looks like this

Awesome much improved and performant on the denormalized original table.