2.1.6.1 Software Development Plan

The definition and specification of FlowDx is covered in other sections of this document.

The vision statement sets out the goals and specific aims of the project.
The Infrastructure Document explains the database behind the scenes and all the tools it uses.
XML Schemas and validation resources are listed in Data Structures & File Formats.

2.1.6.1.1 Software Requirements & Specification Document

2.1.6.1.2 Collaborative Process

2.5 DB Infrastructure Plan

2.6 Workflow Document explains the flow of data through the software analysis


Summary

The desired outcome of FlowDx is an analytical and experimental environment that supports quantitative comparisons of classification methods on the same set of data. Since the gold standard is human experts, manual gating is specifically included as a classification method. Many experiments will compare the results of current (manual) analysis against an automated method, testing the hypothesis that the automated method can do an equivalent or better job.

Recent publications demonstrate that we are not the only ones interested in this problem, and we should encourage collaboration among the parties. This raises the thorny question of how different systems can compare results without a common file format. Rather than lobbying BD, NIST, or ISAC to adopt our formats, we have designed a pipeline architecture that makes it easy for external tools to express their results in a variety of desired formats, which gives us confidence that the pieces can be connected.

This modularity is modeled after the Unix pipelining syntax, but nowadays a database takes the place of the pipe. Our tools all have the common interface of connecting to the repository to ask for files of a specific input role, fetch each file, process it, and write the result with its output role back to the database. This independent polling mechanism is very efficient, tunable, and parallelizable, and can ensure the pipe won't get clogged by a bad data set. It provides multiple entry points for integration and, just as importantly, provides multiple checkpoints for a formal specification. The database necessitates early schema development, serves to rationalize and document the data flow as it is built, and provides the potential for any level of scaling.

The next tool in the schematic pipeline takes these result roles and does some normalization, transformation, calculation, etc., and writes its results back to the database.

We use sequential diagrams because the steps are serial and a linear layout fits the page. A hub-and-spoke diagram, with the database involved in every transaction, would be closer to the actual implementation, but the linear pipeline makes a better illustration.

Dedicated Database

As the project plan grew, it became evident that we needed a specialized database research tool for our use case data. We want to be able to execute classification algorithms across all data sets. As our experiment grew in dimension as well as range, we added the capability for the data repository to compare clustering results using multiple metric formulae. After proposing and investigating multiple libraries, content management systems, and LIMS options, we found a very good pattern to follow in a protein classification database from the International Centre for Genetic Engineering and Biotechnology (ICGEB) [16].


Experimenter asks a question: How many T-cells?
Experimenter sets up a SOP: Gate for CD3+ lymphocytes on Dataset 1-3.

Experimenter sends workspaces containing data and SOP to a sampling of experts.
Experts define clusters to approximate classes.
Experts return workspaces to experimenter for meta-analysis.

Clusters are exported as separate files.
Exported populations are reintegrated with the original file to create masks.
Masks are combined to express consensus.
Standard classification is produced.

Experimenter can query database to construct standard sets dynamically.
Consensus is visualized, and experimenter/experts review areas of disagreement.
Unsupervised algorithmic classifiers applied to standard.
Standard used to score / train supervised classifiers.
Standard used to score / train human classifiers.
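The mask and standard-classification steps above might be sketched as follows. The sketch rests on two assumptions not stated in this document: that each exported event can be traced back to its row index in the original file, and that the standard is taken by simple majority. All function names are hypothetical.

```python
def build_mask(n_events, exported_event_ids):
    """Return a 0/1 list with one entry per event in the original file."""
    mask = [0] * n_events
    for event_id in exported_event_ids:
        if not 0 <= event_id < n_events:
            raise ValueError(f"event id {event_id} is not in the original file")
        mask[event_id] = 1
    return mask


def majority_standard(masks):
    """Mark an event as in the standard when more than half the masks include it.

    Majority voting is an assumption made for illustration; another
    consensus rule could be substituted here.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    if len({len(m) for m in masks}) != 1:
        raise ValueError("all masks must cover the same events")
    n = len(masks)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*masks)]
```

For example, three experts who export events `[0, 1, 2]`, `[1, 2, 3]`, and `[2]` from a five-event file agree by majority only on events 1 and 2.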

Figure 1. Database Schema and example files

Agents are persons interacting with the database.

FCS files are listmode files produced by a cytometer and reflecting its measurements.

.wsp files are Workspace files produced by FlowJo software, containing multiple FCS files and any analysis applied to them.

Assays are workspace files containing analysis based on a particular SOP.

TargetPopulations are prose descriptions of the parameters of events of particular interest to the assay.

.popmask files are files of all events in the sample showing inclusion or exclusion from the target population by the classification of either an expert or an algorithm.

Tally Files are files of the total sample showing the probability of inclusion of each event in the target population (the consensus gate).

Figure 2. Example files from one workspace analyzed by Agent A and Agent B

Figure 3. Database housekeeping - additional utilities perform classification calculations and target population comparison measures.

In Figure 3:

  • FlowJo user retrieves .wsp with rudimentary gates and metadata regarding the experiment.
  • User analyzes data according to experimental SOP.
  • SAVE utility updates the metadata with a description of activity. It records the incoming .wsp to the database, export utility, and/or file server.
  • Export utility separates target populations into .FCS files and/or .popmask files, and stores them.
  • Tally utility generates combined gates and adds average score per event.
  • Additional utilities perform classification calculations and target population comparison measures.

We use FlowJo as a server-side processing tool to perform some of the calculations and as a user interface to generate special FlowJo workspaces. These special workspaces contain instructions for the user to complete tasks and provide a special XML return address so that when the user closes or saves the workspace, it is returned to the database rather than written to the local file system.

Table 1. Database Utilities

Name           Input               Output        Description
-------------  ------------------  ------------  ------------------------------------------------------------
Exporter       Workspace           {FCSPop}      Write POIs to individual FCS files
Tally          {FCSPop}            FCSVote       Combine multiple FCSPops into one with an average score per event
Assess         FCSVote, FCSPop     MatchRating   Apply match ratio or other metric(s) to one gated population against the consensus of 1 or more other wsp
Renamer        Workspace           Workspace     Impose canonical naming conventions on all files in WS
Gater          FCS, GatingML       {FCSPop}      Apply gates from separate files to yield subsets
StatExtractor  Workspace, Columns  Table         Pull named statistics from workspaces in repository

A FlowDx use case:

1. FlowDx workspace generation

By querying the database, a .wsp file is generated that embodies the sample files, and possibly some initial gates, to identify populations of interest. The workspace includes additional meta-info so that a user of the workspace will connect to our remote engines where the data files are hosted (alternatively, an .acs file can be generated so the user's local engine can operate on local data from the .acs file). The meta-info also enumerates the population of interest.

This could be defined by a template, Administrator, or "seed" workspace.

2. Users perform analysis and save the workspace.
When the workspace is saved, it is written to the local file system and, in addition, the entire workspace XML is transmitted to the FlowDx server with additional meta-info.

3. FlowDx processes the workspace.
When the workspace is received, a script is invoked to process the workspace. The database is updated to reflect the new workspace. The workspace is opened (by command line), the populations of interest are exported to popmask file format, and the database is updated to reflect the new popmask files.

4. FlowDx processes the popmask.
When new popmask info is received, FlowDx determines whether all other required popmask info is available and, if so, a script is invoked to tally the popmasks (using info from the database to determine which popmasks define the consensus). This produces a voting results file, and the database is updated to reflect the new voting files.

We should build in the ability to select which agents are included in the consensus group. We may want to analyze the popmask files in different ways, for example to compare one group of agents against another or to run an intermediate analysis before all the experts turn in their results.
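A minimal sketch of the tally step with a selectable consensus group follows. It assumes popmasks are held as 0/1 lists keyed by agent name; the function name `tally` and the `require_all` flag are hypothetical, chosen only to illustrate waiting for all experts versus running an intermediate analysis.

```python
def tally(popmasks, consensus_agents, require_all=True):
    """Average the selected agents' masks into one score per event.

    popmasks maps an agent name to a 0/1 list with one entry per event.
    Returns None when require_all is set and a selected agent has not
    reported yet, or when no selected agent has reported at all.
    """
    missing = [agent for agent in consensus_agents if agent not in popmasks]
    if missing and require_all:
        return None
    selected = [popmasks[a] for a in consensus_agents if a in popmasks]
    if not selected:
        return None
    if len({len(mask) for mask in selected}) != 1:
        raise ValueError("popmasks must cover the same events")
    # The score is the fraction of selected agents that included the event.
    return [sum(votes) / len(selected) for votes in zip(*selected)]
```

With masks from agents A and B only, `tally(popmasks, ["A", "B", "C"])` returns `None` until C reports, while passing `require_all=False` yields an intermediate tally from A and B alone.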

5. FlowDx processes the voting files.
When voting files have been generated, a script is invoked to assess the voting files. This script calculates Match Ratio, Mallows Distance, V-Measure, etc., and updates the database with the results.
Results could be based on the groups and consensus that the user selects via a GUI rather than being hard-coded into the DB.
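Of the metrics named above, V-Measure has a compact standard definition: the harmonic mean of homogeneity and completeness, both derived from conditional entropies. The sketch below implements that textbook definition with equal weighting. It is not the FlowDx assessment script, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given), computed from the joint label counts."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(reference, predicted):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("label vectors must be non-empty and equally long")
    h_ref = entropy(reference)
    h_pred = entropy(predicted)
    # By convention a term is 1 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_ref == 0 else 1.0 - conditional_entropy(reference, predicted) / h_ref
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, reference) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

Here `reference` could be the standard classification per event and `predicted` the class vector from one agent or algorithm. The score does not depend on how the labels are named, so a classifier that swaps the two label names still scores 1.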

Jay Almarode presented the architecture of The FlowDx Analysis Tool at CDW 2009 in Asilomar.

Automated generation and processing of workspaces using human classification will enable larger scale experiments into the nuances of gating. Functionality within FlowJo has enabled us to prototype different levels of automation in gating, but the procedure does not support R scripts or MATLAB analyses, which we recognize as useful. This tool will incorporate external algorithms in batches.

The database tool requires an interface through which we can manage files, dictate the contents of experimental workspaces, adjust the parameters of our clustering algorithms, select and apply comparison metrics to the output of selected algorithms, and view and report the results. Below is a design mockup of the interface for tuning an Artificial Neural Network.

Figure 4. Configure a Neural Network Training Run (Prior to Pressing 'Calculate')

Researcher selects training populations and the chosen classifier learns the classes and applies them to designated data. Colored plots, and the ability to export a vector of classes with or without the data, are the needed outputs.

Researcher can choose one of the built-in items from a list or can select "import" to call a script from another program, run it in a parallel client and record in the database/repository a vector of classes for further analysis.

We will include the ability to export the vector of classes so that investigators can do further manipulations in R or MATLAB. When the researcher runs multiple algorithms, he or she can overlay the results and plot them as a heat map or as a pseudocolor plot, where brighter color marks cells that are included more frequently and dimmer color marks cells that are included less frequently.
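The export of a class vector, with or without the data, could look like the sketch below. CSV is assumed here only because both R and MATLAB read it easily; the function name, the `class` column header, and the fallback column names `P1`, `P2`, ... are hypothetical.

```python
import csv
import io


def export_classes(classes, out, data=None, columns=None):
    """Write one row per event: optional parameter values, then the class."""
    writer = csv.writer(out, lineterminator="\n")
    if data is None:
        # Class vector only.
        writer.writerow(["class"])
        writer.writerows([label] for label in classes)
        return
    if len(data) != len(classes):
        raise ValueError("data and classes must describe the same events")
    width = len(data[0]) if data else 0
    header = list(columns) if columns else [f"P{i + 1}" for i in range(width)]
    writer.writerow(header + ["class"])
    for row, label in zip(data, classes):
        writer.writerow(list(row) + [label])


# Usage: write to an in-memory buffer; a file opened with newline="" works too.
buffer = io.StringIO()
export_classes(["T", "B"], buffer, data=[[1.5, 2.0], [3.0, 4.5]],
               columns=["FSC-A", "SSC-A"])
```

Keeping the class label in the last column means the same file serves both cases: a reader that wants only the class vector takes the final column.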

An important reason to use an automated classifier is to classify data using more than 3 dimensions; therefore a multigraph overview platform is linked with the "M" button. It is possible to track a cell through various parameters and color it by population consistently. Users can scroll through, change parameters, and construct a graphical report.

Figure 5. Results Display After Calculation

Figure 5 shows a static picture explaining what the classifier is and what the inputs and outputs are. Drop-down menus allow users to adjust parameters but are prepopulated with default values. We also need to allow users to select a subset of the parameters.

Figure 6. Diagram of different analysis agents working with a common data set. Increased interest in this classification problem by bioinformaticians has opened the architecture.

Infrastructure: Repository

The study contains much more information than the tables in a database. We are generating dozens of procedural documents, requirements, and specifications, and that trend will continue as validation documentation and end-user documentation are created. IBM’s tools for the task are too complicated and expensive for a project this size, and content management solutions are ubiquitous.