Find invalid characters in SQL strings

Sometimes due to circumstances beyond your control you get garbage in your database.  Optimally you would filter this data out from getting to and residing in your data storage, but decisions are made and things are built to allow for partial mistaken data rather than complete perfect data.  Today I came across such a situation where a VARCHAR field has been accepting keyboard control characters from the data stream to be saved to the table rows.

Example data

Id Account
1 abcdef
2 a!bcdef
3 ab_cdef
4 ab\r\ncdef

In the above the account with the carriage return line feed is the one that should be excluded all the others are valid account numbers and should remain.

I need to convert each char value into an ASCII integer code and then exclude the control characters within the index

As in I want row 4 data account ab\r\ncdef to give me this

Character ASCII Code
a 97
b 98
13
10
c 99
d 100
e 101
f 102

Now I need to exclude any ASCII Code from 0 - 32 and 127 and delete the account row from the original data table

How to efficiently break apart the VARCHAR string into each character as a row data so I can do SQL set operations on it?

Build a table of index positions and then join it against the string split by SUBSTRING.  My goal is to have a table like this

id
0
1
2

Split SQL Example

SELECT [i].[id], SUBSTRING([word].[letter], [i].[id]+1, 1) [Letter], ASCII(SUBSTRING([word].[letter], [i].[id]+1, 1)) [AsciiCode]
FROM
(
	SELECT 'ab' + char(13) + char(10) + 'cdef' [letter]
) as [word]
INNER JOIN [#temp_index] [i] on [i].[id] < LEN([word].[letter])

Results

id Letter AsciiCode
0 a 97
1 b 98
2 13
3 10
4 c 99
5 d 100
6 e 101
7 f 102

Here I am selecting the VARCHAR field with a SELECT statement which will split the field up by each character to a row by the nice trick of using the id of the mapping table (filled with consecutive integers and then taking that index as the substring to reference on the original input varchar)

Hint

SELECT *
FROM
(
	SELECT 'ab' + char(13) + char(10) + 'cdef' [letter]
) as [word]
INNER JOIN [#temp_index] [i] on [i].[id] < LEN([word].[letter])

Will give you

letter id
ab cdef 0
ab cdef 1
ab cdef 2
ab cdef 3
ab cdef 4
ab cdef 5
ab cdef 6
ab cdef 7

So apply the SUBSTRING method to pull out the character at the matching index and then use the ASCII function to convert it.

Before I go into the complete example to generate a data range in table format nicely use a recursive common table expression like so

DROP TABLE IF EXISTS #temp_index
CREATE TABLE #temp_index(id INT primary key);

WITH t(n) AS (
  SELECT 0
  UNION ALL
  SELECT n+1 FROM t WHERE n < 500
)
INSERT INTO #temp_index
SELECT n
FROM t
OPTION (MAXRECURSION 0);

SELECT *
FROM #temp_index

This is something I have explained how to do in another post here

Complete Example

CREATE TABLE [#temp_index] ([id] INT primary key);
WITH t(n) AS (
  SELECT 0
  UNION ALL
  SELECT n+1 FROM t WHERE n < 200
)
INSERT INTO [#temp_index]
SELECT n
FROM t
OPTION (MAXRECURSION 0);
-- Skip ASCII control charactes 0 - 31 and 127
-- https://theasciicode.com.ar/extended-ascii-code/lowercase-letter-a-acute-accent-ascii-code-160.html
-- Sadness it came to this
DROP TABLE IF EXISTS [#ascii_allowed_chars]
CREATE TABLE [#ascii_allowed_chars] ([id] INT primary key);
WITH t(n) AS (
  SELECT 32
  UNION ALL
  SELECT CASE WHEN n = 126 THEN n + 2 ELSE n + 1 END FROM t
  WHERE n BETWEEN 32 AND 254
)
INSERT INTO [#ascii_allowed_chars]
SELECT n
FROM t
OPTION (MAXRECURSION 0);

DROP TABLE IF EXISTS #the_tests	
CREATE TABLE #the_tests
(
	id int IDENTITY(1,1) PRIMARY KEY NOT NULL,
	tocheck VARCHAR(500)
)

insert into [#the_tests]
(
	tocheck
)
values
	('QWERTY ' + char(13) + ' AnotherWord'),
	('this i €s fine'),
	('© hahahÿ'),
	('© hahahÿ' + char(127)),
	('© hahahÿ' + char(15) + 'abc'),
	('© hahahÿ' + char(255));

DELETE
FROM [#the_tests]
WHERE Id IN
(
	select a.id
	from
	(
		SELECT id, tocheck
		FROM [#the_tests]
	) a
	INNER JOIN [#temp_index] v on v.id < len(a.tocheck)
	WHERE ASCII(SUBSTRING(a.tocheck, v.id+1, 1)) NOT IN
	(
		SELECT id FROM [#ascii_allowed_chars]
	)
)

SELECT *
FROM [#the_tests]

Results

id tocheck
2 this i €s fine
3 © hahahÿ
6 © hahahÿÿ

So here the rows 1,4,5 have been removed as they had invalid control characters in them as you can see in the original

RowId Text
1 'QWERTY ' + char(13) + ' AnotherWord'
2 'this i €s fine'
3 '© hahahÿ'
4 '© hahahÿ' + char(127)
5 '© hahahÿ' + char(15) + 'abc'
6 '© hahahÿ' + char(255)