Time Series Data Part 4: A Full Stack Use Case
Roommate Spending Ledger Visualization
Using Pandas + Plotly + SQLite to show a full stack use case of Time Series Data.
Analyze spending over time per person (could be adapted to categories / tags / etc).
See: Github Repo
One time series that any financially independent person should pay attention to is their spending over time.
Tracking your spending gets complicated with roommates and budgets for different categories. It also gets harder to understand your data at a glance, which is where charts and graphs can help.
There are many personal budgeting and even group budgeting apps, but I wanted to make the simplest one possible and stick to the Data and Visualizations as an MVP.
One way to get this data is from a CSV export from a bank account or credit card company. Part 3 has an app that uses this method on general time series data. Upload a CSV with some column for time stamps and some column to forecast and tune your own prediction model!
The main drawbacks to this paradigm:
- Can't share data between people / sessions
- Can't persist data
- Can't incrementally add data
- Can't update data
Another way is the CRUD paradigm explored in my Streamlit Full Stack Post. With this method we'll be able to operate on individual data points and let our friends / roommates add to it!
(Of course the CSV paradigm could be blended with this)
Now for each aspect of the app, from Backend to Front.
There's not much real DataOps in this project since the data is self-contained.
That said, there are some DevOps aspects that are important in the time series world:
- Deployment: having a webserver accessible to multiple users
- Integration: how to get updated code into the deployment(s)
Leaning on Streamlit Cloud sharing checks both of these boxes with ease.
By including a requirements.txt and specifying a Python version for the image, we get a free CI/CD pipeline from any push to a GitHub branch (more providers to come).
It'll provide us with an Ubuntu-like container that installs all requirements and tries to perform streamlit run streamlit_app.py, yielding a live webserver accessible to the public for public repos!
The Data Engineering aspect involves a bit of data design and a bit of service writing.
I decided the minimum data needed to track an expense is:
- The day on which the purchase was made
- The name of the person who made the purchase
- Price of the purchase. Tracked in cents to avoid floating point perils
Relying on SQLite, we'll have to represent the date as a string, but pandas will help us transform it to a date / datetime object.
A table creation routine with SQLite for this might look like:
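The original snippet did not survive on this page, so here is a minimal sketch of what such a routine could look like using the standard library's sqlite3 module. The table name expenses, the column name purchased_by, and the helper create_table are my own assumptions; only price_in_cents and purchased_date are named in the text.

```python
import sqlite3

# Hypothetical schema: table name and purchased_by are assumptions.
# The date is stored as TEXT (ISO format) and the price as whole cents.
CREATE_EXPENSES = """
CREATE TABLE IF NOT EXISTS expenses (
    price_in_cents INTEGER NOT NULL,
    purchased_date TEXT NOT NULL,
    purchased_by   TEXT NOT NULL
);
"""


def create_table(connection: sqlite3.Connection) -> None:
    """Create the expenses table if it does not exist yet."""
    with connection:  # commits on success, rolls back on error
        connection.execute(CREATE_EXPENSES)


connection = sqlite3.connect(":memory:")
create_table(connection)

# PRAGMA table_info returns one row per column; index 1 is the column name
columns = [row[1] for row in connection.execute("PRAGMA table_info(expenses)")]
print(columns)
```

No id column is declared here on purpose, since SQLite supplies a rowid for ordinary tables.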
We also get a free autoincrementing rowid from SQLite, which will differentiate any purchases by the same person on the same day for the same amount!
Python Object Model
That's all well and good for a DBA, but what about the Python glue?
dataclasses is my preferred way to make Python classes that represent database objects or API responses.
Splitting the Model into a child class for the internal application usage and parent class for database syncing is one way to handle auto-created id's and optional vs. required arguments.
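One possible reading of that split, sketched with the standard library's dataclasses. The class names BaseExpense and Expense and the field purchased_by are assumptions for illustration, not necessarily what the repo uses.

```python
from dataclasses import asdict, dataclass


@dataclass
class BaseExpense:
    """Fields a user supplies when creating an expense (no id yet)."""

    price_in_cents: int
    purchased_date: str  # ISO date string such as "2022-01-31"
    purchased_by: str


@dataclass
class Expense(BaseExpense):
    """A stored expense, which also carries the rowid SQLite assigned."""

    rowid: int


# A new expense has no id until the database hands one back
new_expense = BaseExpense(
    price_in_cents=1250, purchased_date="2022-01-31", purchased_by="Alice"
)
saved_expense = Expense(**asdict(new_expense), rowid=1)
print(asdict(saved_expense))
```

The parent has only required fields, so the child can add rowid as another required field without running into dataclass default-ordering errors.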
To get some values for playing around with and demonstrating Create / Update, here's a snippet for seeding the database.
200 times we'll create and save an Expense object with a hardcoded id and some random values for date, purchaser, and price. (Note the randomized days max out at 28 to avoid headaches with February being short. There's probably a builtin to help with random days; maybe a timedelta with a random amount is easier.)
kwarg placeholders of the form :keyname let us pass the dictionary / JSON representation of our Python object instead of specifying each individual field in the correct order.
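A rough sketch of that seeding loop, assuming a table called expenses and made-up roommate names. The seed function, the price range, and the fixed year are all hypothetical choices to keep the example reproducible.

```python
import random
import sqlite3
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class Expense:
    rowid: int
    price_in_cents: int
    purchased_date: str
    purchased_by: str


def seed(connection: sqlite3.Connection, n: int = 200, seed_value: int = 0) -> None:
    """Insert n random expenses with hardcoded ids 1..n."""
    rng = random.Random(seed_value)
    with connection:
        for rowid in range(1, n + 1):
            expense = Expense(
                rowid=rowid,
                price_in_cents=rng.randint(100, 100_000),
                # Day capped at 28 so every month produces a valid date
                purchased_date=date(
                    2022, rng.randint(1, 12), rng.randint(1, 28)
                ).isoformat(),
                purchased_by=rng.choice(["Alice", "Bob"]),
            )
            # :name placeholders are filled from the dict keys, in any order
            connection.execute(
                "INSERT INTO expenses "
                "(rowid, price_in_cents, purchased_date, purchased_by) "
                "VALUES (:rowid, :price_in_cents, :purchased_date, :purchased_by)",
                asdict(expense),
            )


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE expenses "
    "(price_in_cents INTEGER, purchased_date TEXT, purchased_by TEXT)"
)
seed(connection)
row_count = connection.execute("SELECT COUNT(*) FROM expenses").fetchone()[0]
print(row_count)
```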
The rest of the CRUD operations follow a similar pattern. Reading is the only hacky function to allow filtering / querying at database level before pulling ALL records into memory.
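To give a feel for the database-level filtering, here is a sketch of a read function that builds its WHERE clause from optional arguments. The function name read_expenses and its parameters are hypothetical; the repo's version may look quite different.

```python
import sqlite3


def read_expenses(connection, start_date=None, end_date=None, purchased_by=None):
    """Return matching expenses as a list of dicts, filtered in SQL."""
    clauses = []
    params = {}
    if start_date is not None:
        clauses.append("purchased_date >= :start_date")
        params["start_date"] = start_date
    if end_date is not None:
        clauses.append("purchased_date <= :end_date")
        params["end_date"] = end_date
    if purchased_by is not None:
        clauses.append("purchased_by = :purchased_by")
        params["purchased_by"] = purchased_by

    query = "SELECT rowid, price_in_cents, purchased_date, purchased_by FROM expenses"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    query += " ORDER BY purchased_date, rowid"

    cursor = connection.execute(query, params)
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE expenses "
    "(price_in_cents INTEGER, purchased_date TEXT, purchased_by TEXT)"
)
connection.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [(1250, "2022-01-03", "Alice"), (800, "2022-02-10", "Bob"), (4000, "2022-03-15", "Bob")],
)
bob_rows = read_expenses(connection, start_date="2022-02-01", purchased_by="Bob")
print(bob_rows)
```

Comparing the dates as strings works here only because ISO formatted dates sort the same way alphabetically as they do chronologically.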
Reshaping the Data
The Data Science aspect of this involves massaging the data into something useful to display.
The data as it stands is not actually the well formed time series you might have thought.
Sure the date stamps are all real, but what value do we read from them?
The goal is to track spending (price_in_cents in the database).
But what if we have multiple purchases on the same day? Then we might start treating all the purchases on a given day as stochastic samples and that is not our use case. (But that might fit your use case if you are trying to model behaviour based off of many people's purchases)
Enter the Pandas
Utilizing Pydantic to parse / validate our database data then dumping as a list of dictionaries for Pandas to handle gets us a dataframe with all the expenses we want to see.
selections will limit the data to certain time range and
For this analysis I mainly care about total purchase amount per day per person. This means the rowid doesn't really matter to me as a unique identifier, so let's drop it.
(Selecting columns by name like this also re-orders them, whether or not you want that)
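As a small sketch of that step, assuming the records arrive as a list of dictionaries (the sample rows and the column name purchased_by are made up):

```python
import pandas as pd

# Hypothetical records, shaped like rows read back from the database
records = [
    {"rowid": 1, "price_in_cents": 1250, "purchased_date": "2022-01-03", "purchased_by": "Alice"},
    {"rowid": 2, "price_in_cents": 300, "purchased_date": "2022-01-03", "purchased_by": "Alice"},
    {"rowid": 3, "price_in_cents": 800, "purchased_date": "2022-01-05", "purchased_by": "Bob"},
]

expense_df = pd.DataFrame(records)
# SQLite handed us strings, so convert them to real datetimes
expense_df["purchased_date"] = pd.to_datetime(expense_df["purchased_date"])
# Selecting by name drops rowid and sets the column order in one go
expense_df = expense_df[["purchased_date", "purchased_by", "price_in_cents"]]
print(expense_df)
```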
To handle the summation of each person's purchase per day, pandas pivot_table provides us the grouping and sum in one function call.
This will get us roughly columnar shaped data for each person.
Looking more like a time series!
pivot_table had the minor side effect of adding a multi index, which can be popped off if not relevant
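A sketch of the pivot on made-up data. Passing values as a list is my assumption about how the extra column level appeared; the guard makes the level drop safe either way.

```python
import pandas as pd

expense_df = pd.DataFrame(
    {
        "purchased_date": pd.to_datetime(
            ["2022-01-03", "2022-01-03", "2022-01-03", "2022-01-05"]
        ),
        "purchased_by": ["Alice", "Alice", "Bob", "Bob"],
        "price_in_cents": [1250, 300, 800, 4000],
    }
)

# One row per day, one column per person, summed purchases as the cells
pivot_df = expense_df.pivot_table(
    index="purchased_date",
    columns="purchased_by",
    values=["price_in_cents"],
    aggfunc="sum",
    fill_value=0,
)

# Pop off the outer "price_in_cents" column level if it is there
if pivot_df.columns.nlevels > 1:
    pivot_df.columns = pivot_df.columns.droplevel(0)

print(pivot_df)
```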
I also added a feature where "All" is a valid selection in addition to all
The "All" spending per day is the sum of each row!
(We can sanity check this by checking for rows with 2 non-zero values and sum those up to check the All column)
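The "All" column and the sanity check could be sketched like this, on a small hand-made pivot (names and values are made up):

```python
import pandas as pd

people = ["Alice", "Bob"]
pivot_df = pd.DataFrame(
    {"Alice": [1550, 0, 200], "Bob": [800, 4000, 0]},
    index=pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-06"]),
)

# "All" is the row-wise sum across every person
pivot_df["All"] = pivot_df[people].sum(axis=1)

# Sanity check: on days where both people spent, All equals their sum
both_spent = (pivot_df[people] > 0).all(axis=1)
checked = pivot_df.loc[both_spent, people].sum(axis=1)
all_matches = bool((checked == pivot_df.loc[both_spent, "All"]).all())
print(pivot_df)
print(all_matches)
```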
To fill in date gaps (make the time series have a well defined period of one day), one way is to build your own range of dates and then reindex the time series dataframe with the full range of dates.
Grabbing the min and max of the current index gets the start and end points for the range. Filling with 0 is fine by me since there were no purchases on those days.
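A minimal sketch of the gap filling with pd.date_range and reindex, on made-up data with a two day hole in it:

```python
import pandas as pd

pivot_df = pd.DataFrame(
    {"Alice": [1550, 0], "Bob": [800, 4000]},
    index=pd.to_datetime(["2022-01-03", "2022-01-06"]),
)

# Every calendar day from the first purchase to the last one
full_range = pd.date_range(
    start=pivot_df.index.min(), end=pivot_df.index.max(), freq="D"
)

# Days with no purchases get a row of zeros
daily_df = pivot_df.reindex(full_range, fill_value=0)
print(daily_df)
```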
To get the cumulative spend up to each point in time, pandas provides
And to analyze percentage contributed to the whole group's cumulative spending we can divide by the sum of each cumulative row.
(We included the "All" summation already, so this case is actually slightly over-complicated)
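The method name got lost from the sentence above; cumsum() is one pandas method that fits, so this sketch assumes it. The data and names are made up, and the percentages are computed over the individual people only (no "All" column).

```python
import pandas as pd

people = ["Alice", "Bob"]
daily_df = pd.DataFrame(
    {"Alice": [1000, 0, 2000], "Bob": [3000, 1000, 0]},
    index=pd.date_range("2022-01-03", periods=3, freq="D"),
)

# Running total per person up to and including each day
cumulative_df = daily_df.cumsum()

# Each person's share of the group's running total, as a percentage
row_totals = cumulative_df[people].sum(axis=1)
percent_df = cumulative_df[people].div(row_totals, axis=0) * 100
print(cumulative_df)
print(percent_df)
```

If the first day had no spending at all, the row total would be 0 and the shares would come out as NaN, so that case may need its own handling.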
Grabbing the totals of each spender might be a nice metric to display.
This could also be grabbed from the end of the cumulative data
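Both routes to the totals, sketched on made-up data:

```python
import pandas as pd

daily_df = pd.DataFrame(
    {"Alice": [1000, 0, 2000], "Bob": [3000, 1000, 0]},
    index=pd.date_range("2022-01-03", periods=3, freq="D"),
)

# Route 1: sum each person's column directly
totals = daily_df.sum()

# Route 2: the last row of the cumulative data holds the same numbers
totals_from_cumulative = daily_df.cumsum().iloc[-1]

# Cents to dollars only at display time
totals_in_dollars = totals / 100
print(totals_in_dollars)
```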
Pandas also provides a convenient rolling() function for applying transformations on moving windows.
In this case let's get the cumulative spending per 7 days per person.
Notice that the value will stay the same on days when the person made $0.00 of purchases (as long as no older purchase has dropped out of the window), since x + 0 = x!
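A sketch of the 7 day rolling sum on a gap-free daily series. The integer window and min_periods=1 are my choices for the example, and the data is made up so that one purchase visibly ages out of the window.

```python
import pandas as pd

# Alice spends 500 on day 1, nothing for a week, then 300 on day 9
daily_df = pd.DataFrame(
    {"Alice": [500, 0, 0, 0, 0, 0, 0, 0, 300, 0]},
    index=pd.date_range("2022-01-01", periods=10, freq="D"),
)

# Each row sums the current day and the 6 days before it;
# min_periods=1 lets the first rows use however many days exist so far
rolling_sum_df = daily_df.rolling(7, min_periods=1).sum()
print(rolling_sum_df)
```

The total holds steady through the zero-spend days, then drops on day 8 once the day 1 purchase is more than 7 days old.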
We don't have to sum the rolling values though. Here we grab the biggest purchase each person made over each 30 day window.
Notice that a given value will stick around for up to 30 days, but will get replaced if a bigger purchase occurs!
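A sketch of the 30 day rolling max. One assumption to flag: this runs on the daily totals, so "biggest purchase" here really means the biggest single day of spending in the window. The data is made up.

```python
import pandas as pd

index = pd.date_range("2022-01-01", periods=41, freq="D")
daily_df = pd.DataFrame({"Alice": [0] * 41}, index=index)
daily_df.loc[index[0], "Alice"] = 1000   # a purchase on day 0
daily_df.loc[index[10], "Alice"] = 5000  # a bigger one on day 10

# Largest daily value seen in the current day plus the 29 days before it
rolling_max_df = daily_df.rolling(30, min_periods=1).max()
print(rolling_max_df.iloc[[0, 9, 10, 39, 40]])
```

The 1000 is replaced as soon as the 5000 shows up, and the 5000 itself falls out of the window 30 days later.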
Now Make it Pretty
Since we did most of the work in pandas already to shape the data, the Data Analysis of it should be more straightforward.
We'll use a helper function to do one final transformation that applies to almost all our datasets.
(The resulting dataframe has 4312 rows × 3 columns.)
Seems like it's undoing a lot of work we've already done, but this Long Format is generally easier for plotting software to work with.
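A sketch of such a helper using melt. The function name to_long_format and the name of the third column (purchased_by) are assumptions on my part.

```python
import pandas as pd


def to_long_format(wide_df, index_name="purchased_date", var_name="purchased_by"):
    """Turn a date-indexed, person-per-column frame into long format."""
    return (
        wide_df.rename_axis(index=index_name)
        .reset_index()  # the date index becomes a regular column
        .melt(id_vars=index_name, var_name=var_name, value_name="value")
    )


daily_df = pd.DataFrame(
    {"Alice": [1000, 0, 2000], "Bob": [3000, 1000, 0]},
    index=pd.date_range("2022-01-03", periods=3, freq="D"),
)
long_df = to_long_format(daily_df)
print(long_df)
```

Each (date, person) pair becomes its own row, so 3 days × 2 people gives 6 rows here.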
In this case we keep purchased_date as a column (not index), get a value column called value, and a column we can use for trend highlighting which is
After that, Plotly Express provides the easiest (but not most performant) visualizations in my experience.
For more of the plotting and charting, check it out live on streamlit!
Created: June 7, 2023