Preface

Who should read this book

This book is intended for readers who have at least a basic understanding of topics in linear algebra such as vector spaces, eigenvalues, eigenvectors, the matrix inverse and the generalized (pseudo) inverse. The flow of the book is motivated by my personal journey - from an engineer (2008 - Present) to a statistician (2015 - Present) to a scientist (2020 - Present). However, this is not a unique trait - every book, if seen with a detective’s eye, is the culmination of one or more personal journeys. To add to this, the book will have questions for which I don’t have answers. Therefore, this book may be a futile effort towards compiling a coherent data science story. It may end up as an unstructured collection of blog posts.

(Updated on 06/16/2020): The planned content has a lot in common with the book Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal and Cheng Soon Ong. It also has a lot in common with Regression Diagnostics by John Fox. The difference is that this book tries to establish a connection between basic mathematical statistics and topics in machine/deep learning.

(Updated on 10/02/2020): Daniel Friedman is writing a similar book that is freely available as an online book as well as in PDF format. The resemblance of the URL is purely coincidental. The key difference is that this book will focus more on mathematical rigor (such as proofs of convergence, research papers, etc.) and the whys (not just the hows), and less on implementations in high-level languages such as R and Python. Most of the code will be written in C / C++ / Shell to achieve higher speed (although it may not be as fast as vectorized code in base R or in NumPy in Python). I recommend Daniel’s book for people who are interested in a fine balance of theory and practice.

Note from the author: Science is a journey. Writing a book is a different type of journey. Only as recently as 2018 did I start to understand an author’s journey when reading a book. Now it’s my turn to share my journey. Not everyone will relate to it - that’s OK. I expect a dropout rate of almost 100%. The intent of this book may be to appeal to a small number of people who are interested in connecting their line of work with data science, similar to how I connected to data science from aircraft engineering (especially computational fluid dynamics).

Who should not read this book

This book is not intended for people who have no background in data storage / retrieval, computing, linear algebra, statistics or machine learning and want a 1-hour summary of topics that will make them the ‘best data scientist in the world’. This book does not intersect with Kaggle competitions because they were not a part of my journey.

Approach to Data Science

The primary objective of this book is to provide one of many paths to data science. In my opinion, data science is about uncovering science from data, not just about accurate curve fitting. Of course, there is no denying that empirically accurate curve fitting is a reasonable way to do data science. People who choose this path may find inspiration in scientific greats like Michael Faraday and Ernest Rutherford.

Readers of this book are assumed to be motivated by the theoretical aspects of pattern recognition and to want either to explain insights from experimental analysis using theory, or to use theory to design experiments that uncover scientific insights. Such readers may find inspiration in scientific greats like James Clerk Maxwell and Niels Bohr.

This book almost surely will not provide elegant solutions to real-life problems in data science. Therefore, the concepts in this book are not comparable to Maxwell’s equations or quantum mechanics. Given such an objective scientific statement, why should one read beyond this paragraph? The answer lies in the following statement by one of the greatest statisticians, George Box.

Industry needs people who can get the job done using libraries, because the number of open business problems is so large that fundamentals can take a back seat. This book will avoid such an approach and focus on the details.

Golden rules

There are a few golden rules that appear across topics in this book:

  1. There is no free lunch
  2. Not all stories have a fairytale ending; some stories don’t have an ending
  3. Pattern analysis may be an art, but we will stick to science. This gives us great power, and with great power comes great responsibility. We try to separate known-knowns from known-unknowns, unknown-knowns and unknown-unknowns
  4. Here’s a cliched one: hard work has no substitute because I’m not smart enough to provide smart solutions to smart people

Note: Murphy’s law states that “Anything that can go wrong will go wrong”, given infinite opportunities. This is a finite book, but it has a large number of opportunities for error, bias and misleading ideas. A few of these experiences (biases) are intentional and will be marked clearly with possible justifications.