4.2.1 Synthetic Data Discussion
The synthetic data experiments show that every automated classifier we tested can separate two populations that are discernible to the eye with greater than 99% accuracy. This is an important result for FlowDx users, who would run the software in a high-throughput setting to make many relatively simple classifications automatically and rely on it to flag samples whose measurements fall outside the normal range.
Synthetic data sets have already proved useful for testing how well algorithms differentiate clusters as the clusters begin to overlap. The emergent value of flowsim is its ability to compose synthetic data whose distributions model real-life data, and to add synthetic noise to actual data to model the confounding issues seen in real clinical samples.
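As an illustration of adding synthetic noise to actual data, the following is a minimal sketch and not the tool's actual implementation. The function name `add_synthetic_noise`, its parameters, the 10-bit channel range, and the choice to model debris as low-signal events in every channel are all assumptions made for the example.

```python
import random

def add_synthetic_noise(events, debris_fraction=0.05, jitter_sd=0.0,
                        channel_max=1023.0, seed=0):
    """Return (noisy_events, is_synthetic) for a list of per-event channel tuples.

    Hypothetical helper: every real event receives Gaussian jitter with
    standard deviation `jitter_sd`, and extra low-signal "debris" events are
    appended. `channel_max` assumes a 10-bit acquisition range.
    """
    rng = random.Random(seed)
    n_channels = len(events[0])
    noisy, is_synthetic = [], []

    # Jitter the real events and clip them to the valid channel range.
    for event in events:
        jittered = tuple(
            min(channel_max, max(0.0, value + rng.gauss(0.0, jitter_sd)))
            for value in event
        )
        noisy.append(jittered)
        is_synthetic.append(False)

    # Append debris: events drawn uniformly from the lowest 10% of each channel.
    n_debris = round(debris_fraction * len(events))
    for _ in range(n_debris):
        noisy.append(tuple(rng.uniform(0.0, 0.1 * channel_max)
                           for _ in range(n_channels)))
        is_synthetic.append(True)

    return noisy, is_synthetic

if __name__ == "__main__":
    real_events = [(500.0, 400.0, 300.0)] * 100  # placeholder for actual data
    noisy, flags = add_synthetic_noise(real_events, debris_fraction=0.1, jitter_sd=5.0)
    print(len(noisy), sum(flags))
```

Because the function also returns which events are synthetic, the known answer is preserved and a classifier's handling of the added noise can be scored directly.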
Synthetic data sets were developed as a way to rate classification algorithms. Matching particular data sets to the algorithms best suited to their characteristics should improve the quality of the resulting classifications. Our first data set is only the first of many models that we expect to create and evaluate in the next phase of the project. Maciej Simm defined a set of additional types of synthetic data to model the different biological distributions that we want to analyze with algorithms.
| Example | Importance | Likelihood of Error | Cluster | Magnetic Gating | Polyvariate |

With synthetic data we can dictate the sample size, specify the level of noise, and determine the theoretical limits of clustering, which lets us evaluate objectively the performance of both conventional manual gating and algorithmic clustering. The synthetic data will need to include files that contain noise, such as dead cells, artifacts, incorrect acquisition-defined compensation, and wrong stains (i.e., population ratios and numbers do not match those expected), as well as other cases in which a technician would manually fail a cytometry analysis because of a collection or analysis error.
We will need to create data files with one-dimensional data, such as one stain (with the addition of FSC and SSC), as well as two-, three-, and four-dimensional data files with Gaussian distributions for one, two, three, or more populations. Our data will have to range from easy-to-find populations with perfect Gaussian distributions to data files that include smeared or indistinct populations, overlapping population distributions, and progressively higher troughs, that is, less distinct breaks between one population and the next.
The important accomplishment in this area is an understanding of the breadth of data files required to adequately test algorithms for each use case that may be encountered. Additionally, once this set of synthetic data is compiled, it will be very useful to other groups that are testing algorithms and need data sets in which the populations are known (as opposed to manual gating, which requires a consensus to determine the "right" gating for clinical or research data).
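The idea of controlling sample size and overlap while knowing the theoretical limit can be sketched as follows. This is an illustrative example under simplifying assumptions, not the project's generator: two equal-size, isotropic two-dimensional Gaussian populations with a shared standard deviation, for which the best achievable accuracy is Φ(d / 2σ), where d is the distance between the population centers. All function names are hypothetical.

```python
import math
import random

def make_two_populations(n_per_pop, separation, sigma=1.0, seed=0):
    """Return (points, labels) for two isotropic 2-D Gaussian populations.

    Population 0 is centered at (0, 0) and population 1 at (separation, 0);
    both use the same standard deviation `sigma` on each axis (an assumption).
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, center_x in ((0, 0.0), (1, separation)):
        for _ in range(n_per_pop):
            points.append((rng.gauss(center_x, sigma), rng.gauss(0.0, sigma)))
            labels.append(label)
    return points, labels

def theoretical_accuracy(separation, sigma=1.0):
    """Best achievable accuracy for this equal-size, equal-variance case."""
    z = separation / (2.0 * sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z

def midpoint_accuracy(points, labels, separation):
    """Accuracy of the optimal rule for this setup: split at the x-axis midpoint."""
    threshold = separation / 2.0
    correct = sum((x > threshold) == bool(label)
                  for (x, _), label in zip(points, labels))
    return correct / len(labels)

if __name__ == "__main__":
    # Shrinking the separation raises the trough between the populations.
    for separation in (6.0, 4.0, 2.0, 1.0):
        points, labels = make_two_populations(5000, separation)
        print(separation,
              round(midpoint_accuracy(points, labels, separation), 3),
              round(theoretical_accuracy(separation), 3))
```

Comparing an algorithm's measured accuracy against `theoretical_accuracy` shows how much of its error comes from the overlap itself and how much from the algorithm.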
Outlying samples that result from poor preparation or acquisition are crucially important to include in training data sets, but they are often edited out by a well-intentioned operator who recognizes a problem. Therefore, as we continue our efforts, we will work to obtain use case data and to create synthetic data that represent the outliers and the samples rejected by the quality-checking process in our collaborators' labs. Recognition of bad data by an initial screening analysis is an important step in the workflow.
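One simple form such an initial screen could take is sketched below. It is an assumption for illustration only, since the text does not specify a screening method: each sample is summarized by a single measure (for example, the fraction of events in a population), and a sample is flagged when that measure lies more than `k` scaled median absolute deviations from the median of a reference set. The function name and threshold are hypothetical.

```python
import statistics

def flag_outlying_samples(measures, reference, k=3.0):
    """Return names of samples whose measure is far from the reference median.

    Hypothetical screen: "far" means more than k scaled median absolute
    deviations (MAD) from the median of the `reference` values.
    """
    center = statistics.median(reference)
    mad = statistics.median([abs(value - center) for value in reference])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        raise ValueError("reference values show no spread; cannot screen")
    return [name for name, value in measures.items()
            if abs(value - center) > k * scale]

if __name__ == "__main__":
    # Placeholder values, not measured data.
    reference = [0.30, 0.31, 0.29, 0.32, 0.28, 0.30]
    samples = {"A": 0.31, "B": 0.45, "C": 0.29}
    print(flag_outlying_samples(samples, reference))
```

A median-based spread is used here so that the reference set can itself contain a few bad samples without shifting the threshold much.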