Data transforms with Sqlite - Part 1

Introduction

Working through the question of given text line inputs of a text line with a format of "{City name},{Date},{Max temperature in Celsius}" parse the input lines if possible and find the maximum average temperature for a city in a particular month of a target year. If there are ties display all maximums.

"London,2018-04-01,4.4",
"London,2018-04-02,5.4",
"Paris,2018-04-03,3.4",
"Paris,2018-04-03,{sdflajsdf}",
"Toyko,2018-04-01,4.9"

Therefore in the above input London, Tokyo will both have an average maximum temperature of 4.9 Celsius and Paris has an ignored value because of invalid temperate date reading where specified {sdflajsdf}

Thinking this over and the default assumption is take the list of string, parse the values and then use a hash table or dictionary and perform a transform of an average across applicable dates in a year/month. There are several steps here that make this somewhat tedious in my opinion because you have to do group by of the following fields city name, year, month and then gather those values as an input to your average function. Simply using a dictionary means you will have embedded dictionaries like so

city_temps = { 'London': { 201804: [4.4, 5.4] } }

In this case you have to keep adding values to the inner dictionary depending on the combination key of yearmonth = 201804 of you, but this seems painful to me when I think you should rely on database concepts because all this data is related and python has native support for sqlite3 and therefor an in memory data store would free you up to perform simpler relational queries.

Sqlite Database Plan

That means I want to have a database table that holds the above records like so

With this data structure I can write a query like this to get the max temperatures of said cities for a target year/month

WITH avg_temps_per_city AS (
    SELECT cw.name, AVG(cw.temperature) AS avg_temp
    FROM CityWeather cw
    WHERE CAST(strftime('%Y', cw.measure_date) AS int) = 2018 AND
        CAST(strftime('%m', cw.measure_date) AS int) = 04
    GROUP BY cw.name
)
SELECT ac.name, ac.avg_temp
FROM avg_temps_per_city ac
WHERE ac.avg_temp = (SELECT MAX(avg_temp) FROM avg_temps_per_city);

Here I am creating a Common Table Expression (CTE) to average out a particular cities data for the year/month in question and then using that again to find the max value of said values to give me any tie values. Note in the above I use the strftime function to extract the Year = ('%Y') and Month = ('%m') portions of the datetime I stored in the database.

Implementation

I need to do the following

  • Create the database in memory
  • Parse and load the records into the table
  • Run the query to extract the appropriate data

Here I want to create a dataclass to hold the result data

@dataclasses.dataclass
class CityTemp:
    city: str
    average_temperature: decimal.Decimal
    time_interval: datetime
    def __str__(self) -> str:
        return f'{self.city} - {self.average_temperature} for {self.time_interval.year} - {self.time_interval.month}'

In memory creation of the sqlite3 database con = sqlite3.connect(':memory:')

Then create the table with appropriate data types con.execute("CREATE TABLE CityWeather(name text, measure_date datetime, temperature real);")

Now a for loop to parse and insert the data

    for input_line in city_weather_inputs:
        data_input = parse("{:w},{:ti},{:F}", input_line)
        if data_input is None:
            continue
        city = data_input[0].lower()
        con.execute("INSERT INTO CityWeather (name, measure_date, temperature) VALUES (?, ?, ?)", (city, data_input[1], float(data_input[2])))

Pull it all together as a method

def highest_temps(city_weather_inputs: list[str], input_date: datetime) -> list[CityTemp]:
    con = sqlite3.connect(':memory:')
    con.execute("CREATE TABLE CityWeather(name text, measure_date datetime, temperature real);")
    for input_line in city_weather_inputs:
        data_input = parse("{:w},{:ti},{:F}", input_line)
        if data_input is None:
            continue
        city = data_input[0].lower()
        con.execute("INSERT INTO CityWeather (name, measure_date, temperature) VALUES (?, ?, ?)", (city, data_input[1], float(data_input[2])))
    results = con.execute("""
        WITH avg_temps_per_city AS (
            SELECT cw.name, AVG(cw.temperature) AS avg_temp
            FROM CityWeather cw
            WHERE CAST(strftime('%Y', cw.measure_date) AS int) = :year AND
                CAST(strftime('%m', cw.measure_date) AS int) = :month
            GROUP BY cw.name
        )
        SELECT ac.name, ac.avg_temp
        FROM avg_temps_per_city ac
        WHERE ac.avg_temp = (SELECT MAX(avg_temp) FROM avg_temps_per_city);
        """, { 'year': input_date.year, 'month': input_date.month })
    return [CityTemp(result[0], result[1], input_date) for result in results]

Call with the above inputs

ic(highest_temps(["London,2018-04-01,4.4",
			   "London,2018-04-02,5.4",
			   "Paris,2018-04-03,3.4",
			   "Paris,2018-04-03,{sdflajsdf}",
			   "Toyko,2018-04-01,4.9"], datetime(2018, 4, 1).date()))

Results

[
CityTemp(
		city='london',
		average_temperature=4.9,
		time_interval=datetime.date(2018, 4, 1)
	),
CityTemp(
		city='toyko',
		average_temperature=4.9,
		time_interval=datetime.date(2018, 4, 1)
	)
]

Conclusion

To me this approach seems more inline and natural to what a final code solution would be because the database approach allows you to scale the data into tried and tested performant data structures and is flexible to do ad hoc querying as well with minimal effort.

Key takeaway here is:

💡
The selection of your data structures assists or impedes the algorithms you are trying to implement