Language Phenotypes: C#, SQL, Python - Part 5

Doing a count uniques function and the choice of language really shows some deep decisions on the part of the designers of said language.  In this case I will show the answers in C# and Python

Given the problem

Return the set of elements in a string with the frequency of the occurrence.  The method signature would look like this

public static Dictionary<string, int> CountFrequency(string input)

What needs to happen is that you should loop through the elements of the input string and count the frequency.  In the case I split the string by turning it into a CharArray

Quick and dirty implementation is like this

public static Dictionary<string, int> CountFrequency(string input)
{
	var results = new Dictionary<string, int>();
	foreach (var element in input.ToCharArray())
	{
		var key = element.ToString();
		if (results.ContainsKey(key))
		{
			results[key] += 1;
			continue;
		}
		results[key] = 1;
	}
	return results;
}

This can definitely be improved, but it works reliably well and is clear in intent for someone to understand.  The complete program would look like the following where it checks to match input to expected output.  Copy and paste it into dotnetfiddle.net for testing

using System;
using System.Linq;
using System.Collections.Generic;
					
public class Program
{
	public static Dictionary<string, int> CountFrequency(string input)
	{
		var results = new Dictionary<string, int>();
		foreach (var element in input.ToCharArray())
		{
			var key = element.ToString();
			if (results.ContainsKey(key))
			{
				results[key] += 1;
				continue;
			}
			results[key] = 1;
		}
		return results;
	}
	
	public static void Main()
	{
		var expected = new Dictionary<string, int>
		{
			{ "a", 4},
			{ "b", 1}
		};
		var output = CountFrequency("aaaab");
		var matches = 0;
		foreach (var expectedItem in expected)
		{
			if (expectedItem.Value == output[expectedItem.Key]){
				matches++;
			}
		}
		Console.WriteLine(string.Format("Expected matches {0} actual matches {1}", expected.Keys.Count(), matches));
	}
}

and you get the following: Expected matches 2 actual matches 2

Cool, but if you put in a unicode type character in the string (🦥) things go bad like so:

var expected = new Dictionary<string, int>
{
	{ "a", 4},
	{ "b", 1},
	{ "🦥", 1}
};
var output = CountFrequency("aaaab🦥");

and you get the following:

Run-time exception (line 35): The given key was not present in the dictionary.

Stack Trace:

[System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.]
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Program.Main() :line 35

What key is failing? Add this at 35 - Console.WriteLine(expectedItem.Key); and you find it is the emoji unicode sloth 🦥

a
b
🦥
Run-time exception (line 36): The given key was not present in the dictionary.

What happened the string got split and iterated on.  If you stick a Console.WriteLine up at line 12 like so Console.WriteLine(element); you will see that the sloth gets printed out as a weird key of ? or ?? because the unicode is getting chopped off as strings in C# are 2 byte (16 bit) length but this particular element is a 4 byte unicode (32 bit) character and it confuses and blows up the system.  Also this seems to cause an infinite print loop so be careful.

How can we fix this quickly to count our sloth characters? If you are saying characters all have lookup codes 97 = a you would be right and it seems to work but it doesn't match right because the function ToCharArray is giving to characters for sloth 🦥 = 55358 first 2 bytes and 56741 last 2 bytes.

So my expected set doesn't match now

var expected = new Dictionary<int, int>
{
	{ Convert.ToInt32("a"[0]), 4},
	{ Convert.ToInt32("b"[0]), 1},
	{ Convert.ToInt32("🦥"[0]), 1}
};

You could go down the path of doing some encoding swaps to put the string into unicode and then read the bytes and so forth.  The int codes solution does work with a little adjustment though here.

Blah no thanks I got quick way to do this.  Let sql handle data better as it knows the difference between unicode and ansi etc much better.

The plan

What does that look like

public class Counter
{
	public string Element { get; set; }
	public int Counts { get; set; }
}
public static Dictionary<string, int> CountFrequency(string input)
{
	using var db = new SQLiteConnection("Data Source=:memory:");
	return db.Query<Counter>(@"WITH t(pos, element) AS (
				  SELECT 1, substr(@inputText, 1, 1)
				  UNION ALL
				  SELECT pos+1, substr(@inputText, pos+1, 1)
				  FROM t
				  WHERE pos < length(@inputText)
				)
				SELECT element, count(*) as counts
				FROM t
				GROUP BY element", new { inputText = input })
		.ToDictionary(counter => counter.Element, counter => counter.Counts);
}

The key is to split each index position with the recursive CTE and then union the results back up into a dataset then map it back to a class and convert to dictionary.

And the final code looks like this

using System;
using System.Linq;
using System.Collections.Generic;
using System.Data.SQLite;
using Dapper;	

public class Program
{
	public class Counter
	{
		public string Element { get; set; }
		public int Counts { get; set; }
	}
	public static Dictionary<string, int> CountFrequency(string input)
	{
		using var db = new SQLiteConnection("Data Source=:memory:");
		return db.Query<Counter>(@"WITH t(pos, element) AS (
					  SELECT 1, substr(@inputText, 1, 1)
					  UNION ALL
					  SELECT pos+1, substr(@inputText, pos+1, 1)
					  FROM t
					  WHERE pos < length(@inputText)
					)
					SELECT element, count(*) as counts
					FROM t
					GROUP BY element", new { inputText = input })
			.ToDictionary(counter => counter.Element, counter => counter.Counts);
	}
	
	public static void Main()
	{
		var expected = new Dictionary<string, int>
		{
			{ "a", 3},
			{ "b", 1},
			{ "🦥", 1}
		};
		var output = CountFrequency("aaab🦥");
		var matches = expected.Count(expectedItem => expectedItem.Value == output[expectedItem.Key]);
		Console.WriteLine($"Expected matches {expected.Keys.Count} actual matches {matches}");
	}
}

Testing it out you get: Expected matches 3 actual matches 3

Success!!

Python implementation

Well python supports unicode from the start so this whole exercise boils down to the following

input = "aaabc🦥"
expected_output = {"a": 3, "b": 1, "c": 1, "🦥": 1}
print(f"Input and Expected Output are equal: {dict(Counter(input)) == expected_output}")

Python wins again and so easily as well (🙇‍♂️)

Second question

Identify and remove all duplicates from a table with sql.  Now this is use a group by and min/max to pick the elements to keep.  Using SQLite again start with a table and data like this

drop table if exists Thing;
create table Thing (Id integer primary key AUTOINCREMENT, Name varchar(256));
 
insert into Thing (Name) values
('Foo'),
('Foo'),
('Bar'),
('Baz'),
('Boo'),
('Boo'),
('Foo');

You have

Id Name
1 Foo
2 Foo
3 Bar
4 Baz
5 Boo
6 Boo
7 Foo

Use a CTE to identify and delete elements like so

WITH dups 
AS
(
  select MIN(ID) as KeepId
  from Thing
  GROUP BY Name 
)
DELETE
FROM Thing
WHERE Id NOT IN dups;

And results are

Id Name
1 Foo
3 Bar
4 Baz
5 Boo

SQL capabilities coming to the forefront here again.  The idea is the closer you can push the operation code to the data the more powerful things you can do.  Now you weigh this against flexibility, maintenance, and requirements but yeah.