1.2 Vision Statement
Flow cytometry is used to rapidly gather large quantities of data on cell type and function. The manual process of classifying hundreds of thousands of cells forms a bottleneck in diagnostics, high-throughput screening, clinical trials, and large-scale research experiments. The process currently requires a trained technician to identify populations on a digital graph of the data by manually drawing regions. As the complexity of the data increases, this gating task becomes more lengthy and laborious, and it is increasingly clear that elimination of human processing is essential to increasing throughput and consistency. In clinical tests and diagnostic environments, automated gating would eliminate a complex set of human instructions and decisions in the Standard Operating Procedure (SOP) and thereby reduce error and speed results to the doctor; an automated system is often able to order additional tests without the delay required for a doctor to look at the first report. Currently no software performs complex multi-parameter analyses in an automated and rigorously validated manner. FlowDx will fill an important gap in the evolution of the technology and pave the way for ever-larger phenotypic studies and for translation of this research process to a clinical environment.
In order to study this problem, Tree Star has created a system that can apply classification algorithms to well-constructed data sets in high enough volume to study the quality of classifiers. The problem is greatly complicated by the lack of a "ground truth" set of data that has known characteristics, and by the large variance in immune characteristics between subjects. To provide a flexible structure for combining tools, Tree Star has created a set of intermediate file formats that map to the steps applied by the tools created as part of the project. The emphasis on modularity and extensibility is covered in the 2.5 Database Infrastructure Document.
Use Cases
To develop the foundation for initial analysis of algorithms, Tree Star has constructed synthetic data sets that exhibit different levels of noise across different distributions. We wrote Python scripts to build and combine sets, so that any number of simulated classes can be mixed into increasingly complex sets. The goal is to measure quantitatively how well algorithms or humans recognize the expected patterns.
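As an illustration only, the mixing of simulated classes described above might look like the following minimal sketch. It is not the project's actual script. The function names, the choice of Gaussian populations, the uniform background noise, and the 0 to 1023 channel range are all assumptions made for this example.

```python
import random


def make_population(label, center, spread, n_events, rng):
    """Draw n_events labeled events from an axis-aligned Gaussian (assumed shape)."""
    return [
        (tuple(rng.gauss(mu, spread) for mu in center), label)
        for _ in range(n_events)
    ]


def mix_populations(specs, noise_fraction=0.0, value_range=(0.0, 1023.0), seed=0):
    """Combine simulated classes and add uniform background noise.

    specs: list of (label, center, spread, n_events) tuples (hypothetical layout).
    noise_fraction: noise events added, as a fraction of the non-noise events.
    """
    rng = random.Random(seed)  # seeded so that a data set can be regenerated
    events = []
    for label, center, spread, n_events in specs:
        events.extend(make_population(label, center, spread, n_events, rng))

    n_dims = len(specs[0][1])
    n_noise = int(round(noise_fraction * len(events)))
    low, high = value_range
    for _ in range(n_noise):
        point = tuple(rng.uniform(low, high) for _ in range(n_dims))
        events.append((point, "noise"))

    rng.shuffle(events)  # remove any ordering that would reveal class membership
    return events


if __name__ == "__main__":
    data = mix_populations(
        [("A", (200.0, 300.0), 20.0, 100), ("B", (700.0, 600.0), 40.0, 50)],
        noise_fraction=0.1,
        seed=1,
    )
    print(len(data), "events")
```

Because each event keeps its generating label, the known labels can stand in for the missing "ground truth" when scoring a classifier or a human analyst on the synthetic set.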
Synthetic data are quite useful in developing metrics but have only limited applicability in simulating unexpected flow cytometry data sets. Too much of the analysis is tied to the quality of the preparation and acquisition for us to model bias in the collection of the data. Our synthetic data generation is not rich enough to reverse-engineer the creation of compensation and calibration controls and thereby model the user's ability to compensate correctly. Longitudinal studies are important for detecting changes within samples collected on a single instrument. Cross-instrument and especially cross-laboratory comparisons add too many additional dimensions to vary at this point. We are aware of a large-scale Immune Tolerance Network study that is working to normalize across experimental conditions, but its results confirm our expectations of additional complexity.
We are working with academic collaborators on two specific case studies: GvHD and SIV. Both are time-course studies with multiple time points, treatments, and subjects. The studies contain a variety of quality control problems and limitations in the results, but we wanted to work with data that reflect the real-world problems of contemporary flow cytometry. We are encouraged by collaborators and are pressing ahead on the standardization of instrumental runs, which serves to improve the results of automated classifiers, but our short-term goal remains analysis of these pre-chosen data sets.
Classifiers
To these data sets we are applying established classification algorithms. Our original proposal included five specified families of supervised and unsupervised classifiers. What we have learned from the early development iterations is that the quantitative tools we are developing to measure classifiers' concordance with experts are also well suited to determining the sets of events on which experts converge and diverge in their gating. Using this measure of consensus, we are finding that we can isolate training sets that convey the assay to a supervised classifier. We can also amplify or filter certain areas to model an individual agent's predispositions. We can create workspaces for training purposes and score trainees on their similarity to experts. We can score the quality of training on an assay by measuring the convergence of expert opinion before and after training. We can create signatures based on where an agent diverges from consensus. Most importantly, we can provide an objective and impartial metric for ranking and comparing the upcoming generation of flow cytometry classifiers.
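The consensus and divergence ideas above can be sketched as follows. This is a hedged illustration, not the project's metric: it assumes each gate is represented as a set of event indices, treats consensus as the events every expert included, and uses the F-measure (harmonic mean of precision and recall) as a stand-in concordance score. All function names are hypothetical.

```python
def f_measure(reference, candidate):
    """F-measure overlap between two gates given as sets of event indices."""
    reference, candidate = set(reference), set(candidate)
    if not reference and not candidate:
        return 1.0  # two empty gates agree trivially
    overlap = len(reference & candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def consensus_split(expert_gates):
    """Return (agreed, contested) events across a list of expert gates.

    agreed: events that every expert placed in the gate.
    contested: events that some, but not all, experts placed in the gate.
    """
    gates = [set(g) for g in expert_gates]
    agreed = set.intersection(*gates)
    contested = set.union(*gates) - agreed
    return agreed, contested


def divergence_signature(agent_gate, expert_gates):
    """Describe where one agent departs from the expert consensus."""
    gates = [set(g) for g in expert_gates]
    agreed, _ = consensus_split(gates)
    agent = set(agent_gate)
    return {
        "missed": agreed - agent,             # consensus events the agent left out
        "extra": agent - set.union(*gates),   # events no expert included
    }


if __name__ == "__main__":
    experts = [{1, 2, 3, 4}, {2, 3, 4, 5}, {2, 3, 4, 6}]
    classifier = {2, 3, 7}
    agreed, contested = consensus_split(experts)
    print("agreed:", sorted(agreed), "contested:", sorted(contested))
    print("score vs consensus:", f_measure(agreed, classifier))
    print("signature:", divergence_signature(classifier, experts))
```

Under these assumptions the same score applies unchanged to a trainee, an expert, or an automated classifier, which is what allows a single ranking across agents.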
Collaborators
The key dynamic of this project, and the reason its significance has grown since project inception, is the emergence of a broad academic, government, and industrial interest in solving this classification problem. We now have a user base of over 10,000 flow cytometry analysts, and they are overwhelmed by the requirements of processing the gigabytes of data in large experiments.
A distinct set of projects has arisen to apply bioinformatics processes to this domain. Virtually all are using the R programming language, Bioconductor packages, Flowcyt, flowCore, etc. to apply statistical processes. We have prototyped the pipeline to R scripts and have a project in the timeline to automate the application of third-party classifiers within our framework. Cytobank is a repository with analysis capabilities that have a different set of emphases. We found MATLAB ANN implementations and were able to work with classifiers run there. Changes made during the elaboration of the project have defined a set of three intermediate file formats that make it easy to transfer classifier results between tools, or into and out of the FCS format, which is needed to compare results to the raw collection file.
Growth of flow cytometry continues unabated by financial crises: Beckman Coulter acquired MoFlo, Becton Dickinson acquired Cytopeia, and Miltenyi, Partec, and Accuri are releasing new benchtop instruments. Sony has announced its intention to develop laser diagnostic instruments, a move that is symbolic of the translational stage cytometry has entered. The library of human target proteins that have fluorescent markers now numbers over 700 and is growing. Successful classification software would have commercial impact on drug and target discovery, vaccine development, and any number of other translational problems.
Significance
- FlowDx fits the “translational medicine” model of the NIH Roadmap. FlowDx will reduce error in the diagnosis of diseases.
- FlowDx will speed results to physicians, offering the opportunity for patients to learn the outcome more quickly and facilitating faster therapeutic intervention.
- FlowDx will better accommodate large-scale research by allowing greater volumes of complex data to be much more quickly examined, compared, and quantified.
- FlowDx will reduce the expense of analysis. Tree Star estimates a reduction of fifty percent in the cost of cell analysis, based purely on triaging out the first 90% of normal results.
- FlowDx will bring the algorithms for population selection to the customer in a friendly and customizable way. Many clustering algorithms are written in the R programming language, which is not accessible to the majority of cytometry labs or researchers in the clinical environment.
Cytometry is a key component of many biomedical studies, yet software that can classify biological populations based on cytometry data does not yet exist.