Metrics Driven Development

In order to gain insight into application usage, we need to log useful information. But what is useful information? How do analytics and business intelligence platforms extract information?

6 January 2020

# Background

When developing a product, analytics often gets left behind as a separate, non-developer concern, much like operations used to be. And much like operations, perhaps it is time analytics gained first-class status as part of regular development. After all, if the product fails, there's nothing to develop! Let's have a look at what analytics is all about, and how developers can start thinking about it now.

# Dimensions, metrics, & events

Let's start with some definitions.

Dimensions are properties of your application, for example details of your customers like location and age, or details of your products like quantity and price. There are N dimensions, where N can run into the hundreds.
Metrics are the values measured within a given dimension, for example the age of the customer, the price of the product, and so on. Metrics are usually numerical (including the boolean values true/false), but they can be strings too. In that case, the strings are often tokenised and converted into enums, and thus ultimately into numbers as well.
Events are sample points within this N-dimensional space. They inherently have scope, for example "user", "session", or "request" scope, and it is only sensible to combine and compare dimensions within the same scope. Scopes can be represented as additional dimensions. We want to collect as many events as possible: thousands, even millions, of them.
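
To make the "strings become enums become numbers" step concrete, here is a minimal sketch in Python. The function name `encode_categories` and the sample locations are made up for illustration; real analytics tools will have their own encoding schemes.

```python
def encode_categories(values):
    """Map each distinct string to a small integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical "location" metrics taken from four events.
locations = ["Sydney", "London", "Sydney", "Tokyo"]
encoded, codes = encode_categories(locations)
print(encoded)  # [0, 1, 0, 2]
print(codes)    # {'Sydney': 0, 'London': 1, 'Tokyo': 2}
```

The integer codes carry no ordering or distance; they only make the string metric usable as a grouping key.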

# N-dimensional what?

It really is not rocket science. Think of it as a data table where every dimension is a column, every event is a row, and metrics are the values within the cells. The table is sparse, as not every cell is populated. Rows are also not evenly populated: different rows can have different subsets of columns populated.

Here's an example table with some rudimentary dimensions (columns). We see that some dimensions are scope related, while others are natural dimensions of the application.

| user id (scope) | session id (scope) | request id (scope) | location (natural) | age (natural) | quantity (natural) | price (natural) |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … |
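
As a rough sketch of how such a sparse table could be assembled, the snippet below turns a list of events into columns and rows, leaving unpopulated cells as `None`. The events and the helper `to_table` are hypothetical; the column names simply mirror the example table.

```python
# Hypothetical events; each one records only the dimensions it knows about.
events = [
    {"user id": "u1", "session id": "s1", "location": "Sydney", "age": 34},
    {"user id": "u1", "session id": "s1", "request id": "r1", "quantity": 2, "price": 19.99},
    {"user id": "u2", "session id": "s2", "request id": "r2", "quantity": 1, "price": 5.00},
]


def to_table(events):
    """Return (columns, rows); a cell is None when the event lacks that dimension."""
    columns = []
    for event in events:
        for name in event:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    rows = [[event.get(name) for name in columns] for event in events]
    return columns, rows


columns, rows = to_table(events)
print(columns)
for row in rows:
    print(row)
```

The first row has no request, quantity, or price, and the other two have no location or age, which is exactly the uneven population described above.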

# Business dimensions

In addition to natural dimensions, we often have business dimensions: for example, the number of enquiries made, products purchased, products returned, etc.

These business-specific dimensions should be defined early to drive the development. We can't build and optimise what we're not measuring. Contrast this with the inverse strategy, "store-everything-and-analyse-later". Unfortunately, that strategy is leaned on far too often.

# Analysis of the space

Up to this point, we have 3 classes of dimensions: scope, natural and business, on which we can perform some analytics.

We can compute basic statistics like mean, standard deviation, etc. in one dimension, answering questions like "what's the average purchase price?". We can also group the statistics by another dimension, answering questions like "what's the average purchase price by gender?". When combined with business dimensions, we can answer questions like "what's the demographic (combination of location, age, gender) that is most likely to buy higher margin products?".
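
The first two questions can be sketched with nothing more than Python's standard library. The events below are invented sample data, and the dimension names `gender` and `price` are assumptions for the sake of the example.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical purchase events; the last one is sparse and has no price.
events = [
    {"gender": "F", "price": 20.0},
    {"gender": "M", "price": 10.0},
    {"gender": "F", "price": 40.0},
    {"gender": "M", "price": 30.0},
    {"gender": "F"},
]

# One dimension: "what's the average purchase price?"
prices = [e["price"] for e in events if "price" in e]
overall_mean = mean(prices)
overall_stdev = stdev(prices)  # sample standard deviation

# Grouped by another dimension: "what's the average purchase price by gender?"
groups = defaultdict(list)
for e in events:
    if "price" in e and "gender" in e:
        groups[e["gender"]].append(e["price"])
by_gender = {gender: mean(values) for gender, values in groups.items()}

print(overall_mean)  # 25.0
print(by_gender)     # {'F': 30.0, 'M': 20.0}
```

Events that lack the dimension being analysed are skipped rather than treated as zero, which matters in a sparse table.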

Mathematically, these questions are really about identifying the shape of event clusters in our N-dimensional space. The shape (or lack thereof) informs us which dimensions are correlated (or not) with which.
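
One simple way to probe that shape is the Pearson correlation coefficient between two dimensions. It only detects linear relationships, so treat it as a first look rather than the whole story. The data and the helper `pearson` below are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical metrics from four events, one value per event.
ages = [20, 30, 40, 50]
prices = [10.0, 20.0, 30.0, 40.0]  # rises in step with age
quantities = [1, 3, 3, 1]          # no linear relationship with age

print(pearson(ages, prices))      # 1.0
print(pearson(ages, quantities))  # 0.0
```

A value near 1 or -1 suggests the events line up along those two dimensions; a value near 0 suggests no linear relationship, although a non-linear one may still exist.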

# Control dimensions

There is a 4th class of dimensions, which I'll call control dimensions. These are additional properties that shape the elasticity of demand: features that make the application more appealing to users, for example branding, messaging, call-to-action triggers, and so on.

These control dimensions are usually not measured unless a test window is currently open. Once a variation is proven successful, it is usually merged into the application and the dimension is eliminated.

When implemented correctly, these can be the most valuable dimensions in tuning your application for success.
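
As a sketch of what measuring a control dimension might look like, the snippet below compares purchase rates across two variants of a call-to-action. The dimension name `cta_variant`, the outcome `purchased`, and the events are all hypothetical, and a real test would need far more events plus a significance check before declaring a winner.

```python
from collections import defaultdict

# Hypothetical session-scoped events recorded while a test window is open.
events = [
    {"session id": "s1", "cta_variant": "A", "purchased": True},
    {"session id": "s2", "cta_variant": "A", "purchased": False},
    {"session id": "s3", "cta_variant": "B", "purchased": True},
    {"session id": "s4", "cta_variant": "B", "purchased": True},
    {"session id": "s5", "purchased": True},  # outside the test window: no control dimension
]


def conversion_by_variant(events, control="cta_variant", outcome="purchased"):
    """Fraction of events with a truthy outcome, per value of the control dimension."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        if control not in event:
            continue  # only events inside the test window carry the control dimension
        totals[event[control]] += 1
        if event.get(outcome):
            successes[event[control]] += 1
    return {variant: successes[variant] / totals[variant] for variant in totals}


rates = conversion_by_variant(events)
print(rates)  # {'A': 0.5, 'B': 1.0}
```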

# Derived dimensions

Not all of the 4 classes of dimensions described (scope, natural, business, and control) need to be stored directly in their final form. Indeed, it can be more useful to store them in raw or intermediate forms, then transform them into their final form during analysis.

For example, we can store a customer's birth date, but we may choose to analyse their age at time of purchase instead. By storing the birth date, we leave open the possibility of calculating their current age as well. If we had stored their age at time of purchase instead, we'd have lost that flexibility.
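
A minimal sketch of this derivation, assuming the birth date and purchase date are stored as plain calendar dates (the helper `age_on` and the sample dates are made up for illustration):

```python
from datetime import date


def age_on(birth_date, on_date):
    """Age in whole years on a given date."""
    years = on_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


birth_date = date(1990, 6, 15)     # stored raw dimension
purchase_date = date(2020, 1, 6)   # taken from the purchase event

age_at_purchase = age_on(birth_date, purchase_date)
age_later = age_on(birth_date, date(2024, 6, 15))

print(age_at_purchase)  # 29
print(age_later)        # 34
```

Passing `date.today()` as the second argument gives the current age, which is the flexibility that storing only the age at purchase would have thrown away.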

Use your judgement though, as you can easily fall into the "store-everything-and-analyse-later" trap with this approach.

# Tools

If this sounds familiar, it is probably because this is a common and solved problem. There already exists a range of business intelligence and analytics tools to perform the collection and analysis of data. On the backend, there are tools like Splunk, Sumo Logic, and more. On the frontend, there are tools like Google Analytics, Adobe Analytics, and more.

Or, indeed, you can roll your own! All you need is a database or a spreadsheet to store the N-dimensional data, and you can use whatever tools are necessary to extract the shape of the event clusters. This is what the aforementioned tools do underneath the shiny, user-friendly interface.
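
For a sense of how little is needed to roll your own, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The table layout and sample rows are assumptions that mirror the earlier example columns; it is not how any of the tools named above actually store their data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One column per dimension; NULL marks an unpopulated cell in the sparse table.
conn.execute(
    "CREATE TABLE events ("
    "user_id TEXT, session_id TEXT, location TEXT, "
    "age INTEGER, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("u1", "s1", "Sydney", 34, 2, 40.0),
        ("u2", "s2", "London", None, 1, 10.0),
        ("u3", "s3", "Sydney", 28, None, None),  # browsed but did not buy
        ("u4", "s4", "London", 41, 3, 30.0),
    ],
)

# "What's the average purchase price by location?" AVG skips NULL prices.
rows = conn.execute(
    "SELECT location, AVG(price) FROM events "
    "GROUP BY location ORDER BY location"
).fetchall()
print(rows)  # [('London', 20.0), ('Sydney', 40.0)]
conn.close()
```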

# Conclusion

Analytics need not be a black box that gets relegated to the marketing team. It is integral to application development. I hope that by casting the problem as a mathematical one, it becomes obvious what needs to be collected and how, and subsequently what needs to be analysed and how.