I’m spending most of my time this year getting into AI, GPU optimization, and figuring out how to build scalable machine learning clusters with Kubernetes (or whatever the new kid on the block is). I want to understand what’s actually happening at the hardware level when you run these workloads, not just throw YAML at a cluster and hope for the best. There’s a lot of interesting stuff around scheduling GPU workloads, memory management, and making sure you’re not leaving performance on the table. Lots to learn.
Some of the literature I’ve been reading:
- Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (4th Edition)
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron
- Pattern Recognition and Machine Learning by Christopher M. Bishop
- AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch by Chris Fregly