Data transforms with Sqlite - Part 1
Introduction
Working through the question of given text line inputs of a text line with a format of "{City name},{Date},{Max temperature in Celsius}"
parse the input lines if possible and find the maximum average temperature for a city in a particular month of a target year. If there are ties display all maximums.
"London,2018-04-01,4.4",
"London,2018-04-02,5.4",
"Paris,2018-04-03,3.4",
"Paris,2018-04-03,{sdflajsdf}",
"Toyko,2018-04-01,4.9"
Therefore in the above input London, Tokyo
will both have an average maximum temperature of 4.9
Celsius and Paris has an ignored value because of invalid temperate date reading where specified {sdflajsdf}
Thinking this over and the default assumption is take the list of string, parse the values and then use a hash table or dictionary and perform a transform of an average across applicable dates in a year/month. There are several steps here that make this somewhat tedious in my opinion because you have to do group by of the following fields city name, year, month
and then gather those values as an input to your average function. Simply using a dictionary means you will have embedded dictionaries like so
city_temps = { 'London': { 201804: [4.4, 5.4] } }
In this case you have to keep adding values to the inner dictionary depending on the combination key of yearmonth
= 201804
of you, but this seems painful to me when I think you should rely on database concepts because all this data is related and python has native support for sqlite3 and therefor an in memory data store would free you up to perform simpler relational queries.
Sqlite Database Plan
That means I want to have a database table that holds the above records like so

With this data structure I can write a query like this to get the max temperatures of said cities for a target year/month
WITH avg_temps_per_city AS (
SELECT cw.name, AVG(cw.temperature) AS avg_temp
FROM CityWeather cw
WHERE CAST(strftime('%Y', cw.measure_date) AS int) = 2018 AND
CAST(strftime('%m', cw.measure_date) AS int) = 04
GROUP BY cw.name
)
SELECT ac.name, ac.avg_temp
FROM avg_temps_per_city ac
WHERE ac.avg_temp = (SELECT MAX(avg_temp) FROM avg_temps_per_city);
Here I am creating a Common Table Expression (CTE) to average out a particular cities data for the year/month in question and then using that again to find the max value of said values to give me any tie values. Note in the above I use the strftime function to extract the Year = ('%Y')
and Month = ('%m')
portions of the datetime I stored in the database.
Implementation
I need to do the following
- Create the database in memory
- Parse and load the records into the table
- Run the query to extract the appropriate data
Here I want to create a dataclass to hold the result data
@dataclasses.dataclass
class CityTemp:
city: str
average_temperature: decimal.Decimal
time_interval: datetime
def __str__(self) -> str:
return f'{self.city} - {self.average_temperature} for {self.time_interval.year} - {self.time_interval.month}'
In memory creation of the sqlite3 database con = sqlite3.connect(':memory:')
Then create the table with appropriate data types con.execute("CREATE TABLE CityWeather(name text, measure_date datetime, temperature real);")
Now a for loop to parse and insert the data
for input_line in city_weather_inputs:
data_input = parse("{:w},{:ti},{:F}", input_line)
if data_input is None:
continue
city = data_input[0].lower()
con.execute("INSERT INTO CityWeather (name, measure_date, temperature) VALUES (?, ?, ?)", (city, data_input[1], float(data_input[2])))
Pull it all together as a method
def highest_temps(city_weather_inputs: list[str], input_date: datetime) -> list[CityTemp]:
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE CityWeather(name text, measure_date datetime, temperature real);")
for input_line in city_weather_inputs:
data_input = parse("{:w},{:ti},{:F}", input_line)
if data_input is None:
continue
city = data_input[0].lower()
con.execute("INSERT INTO CityWeather (name, measure_date, temperature) VALUES (?, ?, ?)", (city, data_input[1], float(data_input[2])))
results = con.execute("""
WITH avg_temps_per_city AS (
SELECT cw.name, AVG(cw.temperature) AS avg_temp
FROM CityWeather cw
WHERE CAST(strftime('%Y', cw.measure_date) AS int) = :year AND
CAST(strftime('%m', cw.measure_date) AS int) = :month
GROUP BY cw.name
)
SELECT ac.name, ac.avg_temp
FROM avg_temps_per_city ac
WHERE ac.avg_temp = (SELECT MAX(avg_temp) FROM avg_temps_per_city);
""", { 'year': input_date.year, 'month': input_date.month })
return [CityTemp(result[0], result[1], input_date) for result in results]
Call with the above inputs
ic(highest_temps(["London,2018-04-01,4.4",
"London,2018-04-02,5.4",
"Paris,2018-04-03,3.4",
"Paris,2018-04-03,{sdflajsdf}",
"Toyko,2018-04-01,4.9"], datetime(2018, 4, 1).date()))
Results
[
CityTemp(
city='london',
average_temperature=4.9,
time_interval=datetime.date(2018, 4, 1)
),
CityTemp(
city='toyko',
average_temperature=4.9,
time_interval=datetime.date(2018, 4, 1)
)
]
Conclusion
To me this approach seems more inline and natural to what a final code solution would be because the database approach allows you to scale the data into tried and tested performant data structures and is flexible to do ad hoc querying as well with minimal effort.
Key takeaway here is: