Data Story — How to Get Detailed Data Requirements Right
A good user story comes with sufficient detail on how the user wants her story to go. Sufficient detail minimizes misunderstanding. After all, stories are made as a communication tool. However, please note that we need detail only on the what, not the how.
This post will explain which details you need to complete your data requirement, in the end, to avoid misunderstanding. This is a long post, so grab a cup of coffee and let’s get started.
A data story can apply to a table a data engineer needs to build for an analyst, or to data that needs to be exposed as a service for the whole company. Each section below explains one component of the story, so that you can reach agreement on its important aspects.
Business Intention — Understand it Deeply
I like to call it an intention because sometimes how we achieve it can differ from what was requested of us. The intention is what matters. Being creative about achieving the intention might mean a lot less work while maximizing the impact. However, this requires an understanding of the business context. You might want to spend some story mapping sessions to understand the intention. A good reference for this is the User Story Mapping book by Jeff Patton and Peter Economy.
Data Schema — Elaborate the Definition
In the end, when you are talking about a data requirement, it’s all about the schema: what data is required, and what the definitions are. A good schema has at least a column name, description, data type, and sample value. It is very important to get the data descriptions right; elaborate on them, because this is where misunderstanding usually happens. You might also want to understand the expected unique identifiers of the data, the “primary keys”, to understand how granular the data should be.
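As a minimal sketch, here is one way such a schema agreement could be written down. The table, column names, descriptions, and sample values are all invented for illustration:

```python
# A hypothetical schema spec for an "orders" table; every column name,
# description, type, and sample value here is made up for illustration.
orders_schema = [
    {"name": "order_id",
     "description": "Unique identifier of the order",
     "type": "STRING", "sample": "ORD-2023-0001"},
    {"name": "order_ts",
     "description": "Timestamp the order was placed, in UTC",
     "type": "TIMESTAMP", "sample": "2023-05-01T08:30:00Z"},
    {"name": "amount",
     "description": "Total order value in IDR, tax included",
     "type": "DECIMAL(18,2)", "sample": "150000.00"},
]

# The "primary keys" define the expected granularity: one row per order.
primary_keys = ["order_id"]
```

Even a plain table in a shared document works; what matters is that every column carries a description precise enough that both sides read it the same way.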
Realtime or Batch — Don’t Go Realtime Unless You Really, Really Need To
For the sake of definition, let’s say realtime means less than 10 seconds (it’s not system-critical realtime), and batch is usually scheduled (hourly, daily, etc). If your user requests something realtime, you really have to dig deeper into whether it truly has to be realtime. Often, it’s just something that’s nice to have.
A good question to ask your user is: let’s say you get the insight, how quickly could you act upon it? Could you act on it in realtime, and if you did, would it have business impact? Usually the answer is that it’s just nice and cool to have something realtime. Usually it’s OK for the data to be available at 8AM when your user comes to work in the morning. Another good question is: how frequently does the data change? Sometimes it changes rarely, such as user segmentation.
Doing something in realtime is cool, but it’s not cool when you are the one who maintains it. The downtime SLA is usually pressing because the business user has high expectations, even if she might not really need it. You might have to get up in the wee hours for nothing.
In addition, when something breaks, a near-realtime data pipeline is a lot harder to fix. Realtime data usually comes from streams, which are unbounded and unordered. Meanwhile, a fix usually deals with time-bounded data, e.g. “I need to fix the data at date X”, or “between timestamps A and B”. It is very painful to rerun or backfill the operation unless you have invested time in good tooling around it.
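To see why bounded data is easier to repair, here is a minimal backfill sketch. The `process_day` callable is a stand-in for whatever your real batch job is; the point is that with batch, fixing history is just a bounded loop:

```python
from datetime import date, timedelta

def backfill(process_day, start, end):
    """Rerun a daily batch job for every day in [start, end], inclusive.

    `process_day` is a placeholder for the actual batch job; the range
    is bounded, so the rerun is mechanical and easy to reason about.
    """
    day = start
    while day <= end:
        process_day(day)
        day += timedelta(days=1)

# Example: rebuild five days of data after a bug fix.
fixed = []
backfill(fixed.append, date(2023, 5, 1), date(2023, 5, 5))
# fixed now holds the five dates that were reprocessed, in order
```

With an unbounded, unordered stream there is no equivalent “loop over the broken range”; you need replay tooling, which is exactly the investment the paragraph above warns about.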
Some realtime requirements that usually do make sense, in my experience: fraud detection, alerts, and operational data for customer support. When you have to do it in realtime, ensure you have proper tooling.
Data Freshness — Your Processing Time Budget
This detail is usually forgotten when you decide to go with batch. For example, your user needs the data to be ready at 8AM. But data up until when? Until the end of the previous day (i.e. 23.59 yesterday), or until 7.59AM today?
This is very important because 1) you manage your user’s expectations better, and 2) you need to know how much time budget you have for processing. The time budget, along with processing frequency, largely determines how much cost you will incur.
In this cloud age, with a scalable processing engine you can minimize processing time at the same cost (assuming it scales linearly) by spawning more instances, which means more CPU cores and memory. However, you need to consider how much it will cost and how frequently it will need to be executed. Also, how many retries can be done if the process fails? And how long does your operations team have to mitigate a failure manually before the user needs the data?
The Interface — API, SQL, or Semantic Model
This one is simple: how you will serve the data, and how your user will retrieve it. An API is usually called by a machine. SQL is usually written by an analyst or scientist. A semantic model is usually built for business users on BI tools, for data democratization.
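To make the contrast concrete, here is an illustrative sketch of the same “daily orders” data behind each interface; the endpoint, table name, and query are all hypothetical:

```python
# Hypothetical examples of the three interfaces for the same data.

# 1) API: a machine calls an endpoint (the URL is made up):
api_call = "GET https://data.example.com/v1/orders/daily?date=2023-05-01"

# 2) SQL: an analyst queries the warehouse table directly:
sql = """
SELECT order_date, COUNT(*) AS orders
FROM analytics.daily_orders
WHERE order_date = DATE '2023-05-01'
GROUP BY order_date
"""

# 3) Semantic model: a business user drags "Orders" and "Date" in a
#    BI tool, and the tool generates SQL much like the above for them.
```

The same underlying table can back all three; the story just needs to state which interface the user will actually use.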
After all, it’s just a communication tool…
The details above are just a guideline. You might want to add many more things that suit your organization and use case better. There is no right or wrong for these requirement details. You are doing it right when you and your user have a shared understanding of what you want to achieve.
Disclaimer: The views expressed here represent my own and not those of my employer.