Data Modeling: SQLite + Python

Working through a parsing problem of text files and finding the relationships between them.

In this case there are three relevant files

  • UserIP - Holds the IP and UserId tracking information
  • Purchased - Holds the UserIds of every user that makes a purchase
  • AdClick - Contains various information, but the key portions are the IP of the AdClick and the text of the Ad

Question: How to find the total purchases and total clicks of a presented Ad from the various files.

In this case we can start with parsed out files with a small sample of data for example

    ad_clicks_raw = [
        "1.1.1.1, 2025-04-05, Shortbread Cookies",
        "1.1.1.2, 2025-04-06, Shortbread Cookies",
        "1.1.1.3, 2025-04-06, Shortbread Cookies",
        "1.1.1.4, 2025-04-07, Bagels",
    ]
    purchased_raw = [123, 124]

    user_ips_raw = [
        "1.1.1.1, 123",
        "1.1.1.3, 124",
        "1.1.1.4, 125",
    ]

In the above we have two ads Shortbread Cookies, Bagels of that 4 different IPs were presented with the ads however only 2 of the presented ads were purchased as you can see from matching the purchased_raw ids -> user_ips_raw -> ad_clicks_raw.

Therefore from the above input you want to get the following output. Shortbread Cookies was presented 3 times and purchased 2 times and Bagels presented 1 time and purchased 0 times.

Description Purchased Total
Bagels 0 1
Shortbread Cookies 2 3

Normally this would be text processing and string in memory replacement, but there is clearly a strong relational structure to the data presented so I suggest that loading all the data into SQLite tables and performing ad-hoc queries will give you a clear and understandable way to process and inspect the data.

In this example we should create 3 tables: AdClick, UserIP, Purchased for simplicity we will not formalize the keys (primary, foreign) for brevity as the duplication of data would possibly restrict the data handling on larger more complex inputs.

The SQL statements look like this

drop table AdClick;
CREATE TABLE AdClick(ip text, description text);
drop table UserIP;
CREATE TABLE UserIP(ip text, user_id int);
drop table Purchased;
CREATE TABLE Purchased(user_id int);

insert into AdClick (ip, description) VALUES ('1.1.1.1', 'Shortbread Cookies');
insert into AdClick (ip, description) VALUES ('1.1.1.2', 'Shortbread Cookies');
insert into AdClick (ip, description) VALUES ('1.1.1.3', 'Shortbread Cookies');
insert into AdClick (ip, description) VALUES ('1.1.1.4', 'Bagels');
insert into UserIP (ip, user_id) VALUES ('1.1.1.1', 123);
insert into UserIP (ip, user_id) VALUES ('1.1.1.3', 124);
insert into UserIP (ip, user_id) VALUES ('1.1.1.4', 125);
insert into Purchased(user_id) VALUES (123);
insert into Purchased(user_id) VALUES (124);

We need to group the AdClicks by description to get the total clicks and then we need to trace back the purchases by description as well. In this case I think that is two separate statements

Total Clicks

select a.description, count(*) total
from AdClick a
group by a.description

Purchases

select AC.description, COUNT(*) total
from Purchased p
inner join UserIP UI on p.user_id = UI.user_id
inner join AdClick AC on UI.ip = AC.ip
group by AC.description

Now put them together by description, you will need to left join the purchases as not all clicks will have purchases

SELECT sums.description, COALESCE(purch.total, 0) purchased, COALESCE(sums.total, 0) clicks
FROM (
	select a.description, count(*) total
	from AdClick a
	group by a.description
) sums
LEFT JOIN
(
	select AC.description, COUNT(*) total
	from Purchased p
	inner join UserIP UI on p.user_id = UI.user_id
	inner join AdClick AC on UI.ip = AC.ip
	group by AC.description
) purch
on sums.description = purch.description
order by sums.description

SQL Results

Description Purchased Total
Bagels 0 1
Shortbread Cookies 2 3

Correct, now translate to python with in memory sqlite database. I will use dataclasses to represent the raw data converted to table like objects

import dataclasses
import sqlite3
from icecream import ic


@dataclasses.dataclass
class AdClick:
    ip: str
    timestamp: str
    text: str

@dataclasses.dataclass
class UserIP:
    ip: str
    user_id: int

@dataclasses.dataclass
class Purchased:
    user_id: int

@dataclasses.dataclass
class AdSummary:
    text: str
    purchased: int
    clicked: int

def parse_ad_click(input_str: str) -> AdClick:
    items = input_str.split(',')
    return AdClick(ip=items[0], timestamp=items[1], text=items[2])

def parse_user_ips(input_str: str) -> UserIP:
    items = input_str.split(',')
    return UserIP(ip=items[0], user_id=int(items[1]))


def run_analysis(ad_clicks: list[AdClick], user_ips: list[UserIP], purchased: list[Purchased]) -> list[AdSummary]:
    con = sqlite3.connect(':memory:')
    con.execute("CREATE TABLE AdClick(ip text, description text);")
    [con.execute("INSERT INTO AdClick (ip, description) VALUES (?, ?)", (ad_click.ip, ad_click.text)) for ad_click in ad_clicks]
    con.execute("CREATE TABLE UserIP(ip text, user_id int);")
    [con.execute("INSERT INTO UserIP (ip, user_id) VALUES (?, ?)", (user_ip.ip, user_ip.user_id)) for user_ip in user_ips]
    con.execute("CREATE TABLE Purchased(user_id int);")
    [con.execute("INSERT INTO Purchased (user_id) VALUES (?)", ([purchase.user_id])) for purchase in purchased]
    results = con.execute("""
        SELECT sums.description, COALESCE(purch.total, 0) purchased, COALESCE(sums.total, 0) clicks
        FROM (
            select a.description, count(*) total
            from AdClick a
            group by a.description
        ) sums
        LEFT JOIN
        (
            select AC.description, COUNT(*) total
            from Purchased p
            inner join UserIP UI on p.user_id = UI.user_id
            inner join AdClick AC on UI.ip = AC.ip
            group by AC.description
        ) purch
        on sums.description = purch.description
        order by sums.description
        """)
    summaries = []
    for result in results:
        summaries.append(AdSummary(text=result[0], purchased=result[1], clicked=result[2]))
    return summaries


if __name__ == '__main__':

    ad_clicks_raw = [
        "1.1.1.1, 2025-04-05, Shortbread Cookies",
        "1.1.1.2, 2025-04-06, Shortbread Cookies",
        "1.1.1.3, 2025-04-06, Shortbread Cookies",
        "1.1.1.4, 2025-04-07, Bagels",
    ]
    purchased_raw = [123, 124]

    user_ips_raw = [
        "1.1.1.1, 123",
        "1.1.1.3, 124",
        "1.1.1.4, 125",
    ]

    for analysis_result in run_analysis([parse_ad_click(x) for x in ad_clicks_raw],
                           [parse_user_ips(x) for x in user_ips_raw],
                           [Purchased(user_id=p) for p in purchased_raw]):
        ic(f'{analysis_result.purchased} of {analysis_result.clicked} {analysis_result.text.strip()}')

Python Results

ic| f'{r.purchased} of {r.clicked} {r.text.strip()}': '0 of 1 Bagels'
ic| f'{r.purchased} of {r.clicked} {r.text.strip()}': '2 of 3 Shortbread Cookies'

Correct and I would say this is easier to understand and extend because it allows for modeling the data relationships explicitly while allowing inspection and querying.