I've just finished up with a prototype implementation of a supervised learning algorithm, automatically assigning categorical tags to all the items in our company database (roughly 5 million items).
The results look good, and I've been given the go-ahead to plan the production implementation project.
I've done this kind of work before, so I know how the functional components of the software. I need a collection of web crawlers to fetch data. I need to extract features from the crawled documents. Those documents need to be segregated into a "training set" and a "classification set", and feature-vectors need to be extracted from each document. Those feature vectors are self-organized into clusters, and the clusters are passed through a series of rebalancing operations. Etc etc etc etc.
So I put together a plan, with about 30 unique development/deployment tasks, each with time estimates. The first stage of development -- ignoring some advanced features that we'd like to have in the long-term, but aren't high enough priority to make it into the development schedule yet -- is slated for about two months worth of work. (Keep in mind that I already have a working prototype, so the final implementation is significantly simpler than if the project was starting from scratch.)
My manager said the plan looked good to him, but he asked if I could reorganize the tasks into user stories, for a few reasons: (1) our project management software is totally organized around user stories; (2) all of our scheduling is based on fitting entire user stories into sprints, rather than individually scheduling tasks; (3) other teams -- like the web developers -- have made great use of agile methodologies, and they've benefited from modelling all the software features as user stories.
So I created a user story at the top level of the project:
As a user of the system, I want to search for items by category, so that I can easily find the most relevant items within a huge, complex database.
Or maybe a better top-level story for this feature would be:
As a content editor, I want to automatically create categorical designations for the items in our database, so that customers can easily find high-value data within our huge, complex database.
But that's not the real problem.
The tricky part, for me, is figuring out how to create subordinate user stories for the rest of the machine learning architecture.
Case in point... I know that the algorithm requires two major architectural subdivisions: (A) training, and (B) classification. And I know that the training portion of the architecture requires construction of a cluster-space.
All the Agile Development literature I've read seems to indicate that a user story should be the "smallest possible implementation that provides any business value". And that makes a lot of sense when designing a piece of end-user software. Start small, and then incrementally add value when users demand additional functionality.
But a cluster-space, in and of itself, provides zero business value. Nor does a crawler, or a feature-extractor. There's no business value (not for the end-user, or for any of the roles internal to the company) in a partial system. A trained cluster-space is only possible with the crawler and feature extractor, and only relevant if we also develop an accompanying classifier.
I suppose it would be possible to create user stories where the subordinate components of the system act as the users in the stories:
As a supervised-learning cluster-space construction routine, I want to consume data from a feature extractor, so that I can exist.
But that seems really weird. What benefit does it provide me as the developer (or our users, or any other stakeholders, for that matter) to model my user stories like that?
Although the main story can be easily divided along architectural-component boundaries (crawler, trainer, classifier, etc), I can't think of any useful decomposition from a user's perspective.
What do you guys think? How do you plan Agile user stories for sophisticated, indivisible, non-user-facing components?