Find invalid characters in SQL strings
Sometimes due to circumstances beyond your control you get garbage in your database. Optimally you would filter this data out from getting to and residing in your data storage, but decisions are made and things are built to allow for partial mistaken data rather than complete perfect data. Today I came across such a situation where a VARCHAR field has been accepting keyboard control characters from the data stream to be saved to the table rows.
Example data
Id | Account |
---|---|
1 | abcdef |
2 | a!bcdef |
3 | ab_cdef |
4 | ab\r\ncdef |
In the above the account with the carriage return line feed is the one that should be excluded all the others are valid account numbers and should remain.
I need to convert each char value into an ASCII integer code and then exclude the control characters within the index
As in I want row 4 data account ab\r\ncdef
to give me this
Character | ASCII Code |
---|---|
a | 97 |
b | 98 |
13 | |
10 | |
c | 99 |
d | 100 |
e | 101 |
f | 102 |
Now I need to exclude any ASCII Code from 0 - 32
and 127
and delete the account row from the original data table
How to efficiently break apart the VARCHAR string into each character as a row data so I can do SQL set operations on it?
Build a table of index positions and then join it against the string split by SUBSTRING. My goal is to have a table like this
id |
---|
0 |
1 |
2 |
Split SQL Example
SELECT [i].[id], SUBSTRING([word].[letter], [i].[id]+1, 1) [Letter], ASCII(SUBSTRING([word].[letter], [i].[id]+1, 1)) [AsciiCode]
FROM
(
SELECT 'ab' + char(13) + char(10) + 'cdef' [letter]
) as [word]
INNER JOIN [#temp_index] [i] on [i].[id] < LEN([word].[letter])
Results
id | Letter | AsciiCode |
---|---|---|
0 | a | 97 |
1 | b | 98 |
2 | 13 | |
3 | 10 | |
4 | c | 99 |
5 | d | 100 |
6 | e | 101 |
7 | f | 102 |
Here I am selecting the VARCHAR
field with a SELECT
statement which will split the field up by each character to a row by the nice trick of using the id of the mapping table (filled with consecutive integers and then taking that index as the substring to reference on the original input varchar)
Hint
SELECT *
FROM
(
SELECT 'ab' + char(13) + char(10) + 'cdef' [letter]
) as [word]
INNER JOIN [#temp_index] [i] on [i].[id] < LEN([word].[letter])
Will give you
letter | id |
---|---|
ab cdef | 0 |
ab cdef | 1 |
ab cdef | 2 |
ab cdef | 3 |
ab cdef | 4 |
ab cdef | 5 |
ab cdef | 6 |
ab cdef | 7 |
So apply the SUBSTRING method to pull out the character at the matching index and then use the ASCII function to convert it.
Before I go into the complete example to generate a data range in table format nicely use a recursive common table expression like so
DROP TABLE IF EXISTS #temp_index
CREATE TABLE #temp_index(id INT primary key);
WITH t(n) AS (
SELECT 0
UNION ALL
SELECT n+1 FROM t WHERE n < 500
)
INSERT INTO #temp_index
SELECT n
FROM t
OPTION (MAXRECURSION 0);
SELECT *
FROM #temp_index
This is something I have explained how to do in another post here
Complete Example
CREATE TABLE [#temp_index] ([id] INT primary key);
WITH t(n) AS (
SELECT 0
UNION ALL
SELECT n+1 FROM t WHERE n < 200
)
INSERT INTO [#temp_index]
SELECT n
FROM t
OPTION (MAXRECURSION 0);
-- Skip ASCII control charactes 0 - 31 and 127
-- https://theasciicode.com.ar/extended-ascii-code/lowercase-letter-a-acute-accent-ascii-code-160.html
-- Sadness it came to this
DROP TABLE IF EXISTS [#ascii_allowed_chars]
CREATE TABLE [#ascii_allowed_chars] ([id] INT primary key);
WITH t(n) AS (
SELECT 32
UNION ALL
SELECT CASE WHEN n = 126 THEN n + 2 ELSE n + 1 END FROM t
WHERE n BETWEEN 32 AND 254
)
INSERT INTO [#ascii_allowed_chars]
SELECT n
FROM t
OPTION (MAXRECURSION 0);
DROP TABLE IF EXISTS #the_tests
CREATE TABLE #the_tests
(
id int IDENTITY(1,1) PRIMARY KEY NOT NULL,
tocheck VARCHAR(500)
)
insert into [#the_tests]
(
tocheck
)
values
('QWERTY ' + char(13) + ' AnotherWord'),
('this i €s fine'),
('© hahahÿ'),
('© hahahÿ' + char(127)),
('© hahahÿ' + char(15) + 'abc'),
('© hahahÿ' + char(255));
DELETE
FROM [#the_tests]
WHERE Id IN
(
select a.id
from
(
SELECT id, tocheck
FROM [#the_tests]
) a
INNER JOIN [#temp_index] v on v.id < len(a.tocheck)
WHERE ASCII(SUBSTRING(a.tocheck, v.id+1, 1)) NOT IN
(
SELECT id FROM [#ascii_allowed_chars]
)
)
SELECT *
FROM [#the_tests]
Results
id | tocheck |
---|---|
2 | this i €s fine |
3 | © hahahÿ |
6 | © hahahÿÿ |
So here the rows 1,4,5
have been removed as they had invalid control characters in them as you can see in the original
RowId | Text |
---|---|
1 | |
2 | 'this i €s fine' |
3 | '© hahahÿ' |
4 | |
5 | |
6 | '© hahahÿ' + char(255) |