Stock Market Anomaly Detector

This is a writeup of my not too successful stock market anomaly detector. I was inspired by unusualwhales.com and how they created a service on options flow data with filtering. Another inspiration was from massive.com’s blog post on anomalies in the stock market. A simple explanation would be that if a big hedge fund / big account was very confident on a certain company within a certain timeframe (ignoring legalities) they would either buy shares of the company OR buy a leveraged position in the company in the form of options contracts.

This post is not meant to dive into the intricacies of options contracts but rather the software architecture, problems and difficulties I ran into in building a similar service.

Plan

The basic plan was as follows:

Find data provider with options trade data - historical and live
Insert data into a database
Transform data into something I can use along with some sort of maintainable filtering rules
Create a visualization so I can see big moves in the market

Data Provider

I chose massive.com as my data provider because it had the easiest documentation, has flat files for historical analysis, and an easy to use API for live injection of trades via websockets and a RESTful API for quotes. They also have a great support team!

Architecture

Postgres as my database with timescaledb enabled, along with pgAdmin as my administration layer
Go as my websocket server
Python and Bash as my database seeding
Docker compose as my organization

I chose the above because I was mostly comfortable with each above. I also experimented with several software projects when I was testing implementations and optimizing workloads.

I would intake the new historical analysis flat files every morning by injecting the options trade data, do calculations based on 30 day windows to start off with and would calculate baseline thresholds of every contract on the market. If let’s say the volume of contracts in a day is considered 1000 traded per day, divided by 5 minute increments and my last 5 minutes of live trades broke that threshold, the application would alert me that something’s up. This was a very simple calculation as a PoC.

Deployment

Docker compose file that managed postgres with timescaledb enabled and pgadmin.

go run . for the web socket server.

Results

I was able to view live trades on my terminal, compared live trades against my calculated thresholds and was able to make an educated guess if the big trades were bullish, bearish or neutral depending on if the executed trade was near the ask or bid.

Difficulties

Database

I created a docker compose file that managed most things well, however the project rapidly became very storage intensive and my 2 TB drive became very full eventually. My docker volume for the postgres database went up to a terabyte in size.
Timescaledb did things fine. After I indexed the tables within a 1 day time frame window, performance drastically improved from a 30 minute query to under 2 minutes for historical viewing of trades.
Initial seeding of 2 year of trades took an entire night and had to be inserted in batches.
Managing the database was easy with pgAdmin however seeding the database after some time because unmanageable. I had several tables, triggers and functions created to work with the data and I did not have these in code. It would have been nice to find a solution to sync the database’s schema, functions, triggers etc. into code on demand. I was doing this manually for a period of time but eventually stopped as it became too cumbersome.
Date conversions and comparisons.

Web Socket Server (Intake)

Live quotes of the options contracts were the major issue here. I created a 5/10 minute smoothing window of the most recent quotes of a options contract that may or may not have had live data.
Pooling of connections and batching inserts into the database.
Goroutines and debugging multiple channels.
Floating point number calculations.
LLMs are still not good enough in Go and integrating between services in simple docker-compose setups. INSERT BIG LLM was unable to create a simple docker-compose setup with Go and inserting into a bare bones loki setup.

Filtering / Calculations and Maintainability of them

I initially tried the Grafana approach and got some decent historical data visualization, however it was slow and unmaintainable for 10,000+ stocks.
I then tried Loki for tagging each trade in the database by underlying ticker. Performance was putrid and yelled at me for having that many tags.
I then tried using Apache Flink, and I was unable to get this working due to several reasons with starting the service and complexities getting Kafka and ZooKeeper in a docker-compose project. This was a project on the market, not over burdening myself with Apache products I was unfamiliar with. The original idea was to replicate something like this: Real-time anomaly detection on streaming data using Random Cut Forest on Apache Flink
Data provider was unable to give me live quotes per trade after speaking with the support team.
Data provided to me was rarely inaccurate, wasn’t a big deal for me because support was great with working with me and my questions.

Visualizations

I tried metabase and it looked fine. This was before I did database optimization so I was running into major problems with date comparisons and expensive queries. This was also before I was confident on what the project should’ve been visually and I ultimately settled on a simple live terminal based approach.

Future Work / Recommendations for Reader

Try QuestDB instead of timescaledb.
Try Clickhouse it may or may not be appropriate for something like this as its more applicable to metrics rather than trades.
Try metabase again.
Try pgbouncer for pooling postgres database connections.
Try a columnar based approach instead of a row based project. I don’t have enough experience on data lakes and it may be more appropriate for historical based queries.
Implement a dumb caching method to intercept RESTful API requests so we don’t bog down the website with crazy amount of requests. I’m currently working on a simple open source solution for this called dumb-cache.
mise worked well for managing tool dependencies - I used it to install mc, just and jq.
uv is always nice.
Guide on jdk is a very good read when I was exploring using Flink.
Time series theory and mathematical theory behind if certain events in the market are actually causing movements in the market. Some follow up reading:
Use decimal packages for floating point number calculations shopspring/decimal
bubbletea was going to be my tui and it looked very cool.
Log everything
For serious sql projects I recommend sqlfluff.
Batch your sql inserts and pool your connections.