Data Story — How to Get Detailed Data Requirements Right

A good user story comes with sufficient detail about what the user wants and how she expects the story to play out. Sufficient detail minimizes misunderstanding; after all, stories exist as a communication tool. Note, however, that the detail should cover only the what, not the how.

This post explains which details you need in order to complete your data requirement and, in the end, to avoid misunderstanding. It is a long post, so grab a cup of coffee and let’s get started.

A data story can apply to a table a data engineer builds for an analyst, or to data that needs to be exposed as a service to the whole company. Each section below explains one component of the story, so you can reach agreement on every important aspect of it.

Business Intention — Understand it Deeply

Data Schema — Elaborate the Definition

Realtime or Batch — Don’t Go Realtime Unless you Really, Really Need to

A good question to ask your user is: let’s say you get the insight, how quickly could you act on it? Could you act on it in realtime, and if you did, would it have business impact? Usually the answer is that realtime is just nice and cool to have, and having the data ready at 8 AM when your user comes to work in the morning is good enough. Another good question is: how frequently does the data change? Some data, such as user segmentation, changes only rarely.

Doing something in realtime is cool, but it’s not cool when you are the one who has to maintain it. The downtime SLA is usually tight because the business user has high expectations even if she might not really need them, and you might end up getting up in the wee hours for nothing.

In addition, when something breaks, a near-realtime data pipeline is a lot harder to fix. Realtime data usually comes from streams, which are unbounded and unordered, while a fix usually deals with time-bounded data, e.g. “I need to repair the data for date X” or “between timestamps A and B”. Rerunning or backfilling the operation is very painful unless you have invested time in good tooling around it.
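For a batch pipeline, that kind of time-bounded rerun can stay very simple. Below is a minimal sketch, assuming daily partitions and an idempotent `run_partition` job; both names are hypothetical, not from this post.

```python
from datetime import date, timedelta

def backfill(start: date, end: date, run_partition) -> None:
    """Re-run a batch job for every daily partition in [start, end]."""
    day = start
    while day <= end:
        # Assumes run_partition is idempotent: it overwrites the output for `day`
        run_partition(day)
        day += timedelta(days=1)

# Example: repair the data between two dates after an upstream bug fix
# backfill(date(2021, 3, 1), date(2021, 3, 7), run_partition=my_daily_job)
```

With a stream you rarely get such a clean boundary to replay against, which is exactly why the backfill tooling has to be built up front.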

Some realtime requirements that do make sense in my experience: fraud detection, alerts, and operational data for customer support. When you have to go realtime, make sure you have proper tooling.

Data Freshness — Your Processing Time Budget

This is very important because 1) you can manage your user’s expectations better, and 2) you need to know how much time budget you have for processing. The time budget, along with the processing frequency, largely determines how much cost you will incur.

In this cloud age, with a scalable processing engine you can shorten processing time at roughly the same cost (assuming it scales linearly) by spawning more instances, which means more CPU cores and memory. However, you need to consider how much it will cost and how frequently it will run. Also, how many retries fit in the budget if the process fails, and how long your operations team has to mitigate a failure manually before users need the data.
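To make the time budget concrete, here is a back-of-the-envelope sketch. Every number in it (the 8 AM deadline, the landing time of raw data, the retry and mitigation buffers, the cluster size, and the hourly rate) is an illustrative assumption, not a figure from this post.

```python
from datetime import timedelta

deadline          = timedelta(hours=8)     # users expect the data at 08:00
upstream_ready    = timedelta(hours=5)     # raw data usually lands around 05:00
retry_buffer      = timedelta(hours=1)     # room for one full retry
manual_mitigation = timedelta(minutes=30)  # time ops needs to step in manually

processing_budget = deadline - upstream_ready - retry_buffer - manual_mitigation
print(processing_budget)  # 1:30:00 -> the job must finish within 90 minutes

# Rough monthly cost if the job runs daily on a small cluster (assumed numbers)
instances, hourly_rate = 4, 0.50           # instance count and $/instance-hour
runtime_hours = processing_budget.total_seconds() / 3600
print(30 * instances * hourly_rate * runtime_hours)  # ~ $90 per month
```

The point is not the exact figures but the exercise: once the budget is written down, both you and your user can see whether “faster” is worth the extra instances.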

The Interface — API, SQL, or Semantic Model

After all, it’s just a communication tool…

Disclaimer: The views expressed here represent my own and not those of my employer.

