Table 1

Overview of the most commonly used machine learning (ML) design components and potential choices together with the associated implications for the ML system

Design component (definition)	Common choices	Implications
Input abstraction (Raw or modified data that are the input for the ML algorithm)	Hand-crafted features	Enables introduction of domain expertise by forcing inference from low-dimensional signals derived from high-dimensional volumes. This approach is generally considered to be more ‘interpretable’ since features are hand-crafted, but the performance of ML algorithms operating on this type of input is usually also more limited
	No abstraction	Training ML algorithms directly on pixel intensities is difficult because the high dimensionality of the inputs makes it hard to discover semantic correlations between intensity patterns and the desired output during training. If successful, this approach gives rise to state-of-the-art performance at the cost of reduced interpretability
Function approximation (ML algorithm that computes significance of the input variables)	Support vector machine (SVM)	A linear learning model and associated training algorithm that can be made non-linear via a technique termed the ‘kernel trick’, which implicitly maps the inputs to a higher dimensional feature domain. SVM operates on feature vectors, which are n-dimensional vectors of numerical features that represent the object of interest in the images. All images represented as feature vectors then define an n-dimensional vector space, also called feature space. The feature values are assigned weights to construct a linear predictor function. SVM requires n to be small, that is, the feature space must be low dimensional (input abstraction); under this condition, it performs well in practice
	Random forest learning (RFL)	Random forests are an ensemble learning method that aggregates predictions of multiple decision trees that process a feature vector extracted from the input sample. Random forests are appealing since they possess a natural way of ranking the importance of features for the classification or regression problem
	Deep neural networks (DNN)	A non-linear learning method and associated training algorithm that consists of multiple, stacked layers. Values in every layer are computed by weighted summation of all inputs from the previous layer, the result of which is passed through a non-linear activation function. Neural networks are universal function approximators but in the fully connected approach (any value depends on all values in the previous layer) require input abstraction to be practical
	Deep convolutional neural networks (CNN or DCNN)	A variant of deep neural networks that assumes translational invariance of input patterns and therefore enables weight sharing across the spatial domain, which yields substantially more compact models (online supplementary figure 1). Consequently, these models do not require input abstraction and can be applied directly to pixel intensities. Training of these models, often via backpropagation, is complicated and requires large amounts of data: although more compact than fully connected deep neural networks, these models still have millions of trainable parameters
Output format (Fitting of the input data to an estimated outcome)	Instance-level classification/regression	Since the output is constrained to the instance level, the ML algorithm performs heavy dimensionality reduction, which may make the learning algorithm susceptible to overfitting
	Pixel-level classification/regression	Spatially resolved output effectively increases the overall number of training samples at the cost of larger learning models
Supervisory signal (Amount of human guidance and training given to the function approximation algorithm during its development and/or validation study)	Un-/Self-supervised	If no ground truth labels are available for training, learning models can potentially still be trained by optimizing for encodings that, for example, form clusters in the encoded domain or can be decoded to reproduce the input. While such approaches may be able to extract good representations of the input data, they cannot correlate these representations with semantic labels unless such labels are provided by the user retrospectively
	Supervised, structured	In supervised learning, every instance has an associated ground truth label (outcome) that is to be predicted using the learning model. If the annotation is not perfectly accurate, one usually refers to a gold (or reference) standard rather than ground truth. Supervised learning generally yields models with the highest performance, but reliable annotations are usually expensive to obtain
	Supervised, unstructured	In some cases, gold standard annotations are available only in unstructured form, for example, as medical records. Model training then requires methods that can process unstructured data. In the context of medical records, natural language processing can be used to extract structured annotations
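
The contrast drawn in the table between fully connected and convolutional networks can be illustrated with a simple parameter count. The following is a minimal sketch, not taken from the source: the image size (256 × 256, single channel), the 16 output feature maps, the 3 × 3 kernel, and the helper names `dense_params` and `conv2d_params` are all hypothetical choices made for illustration.

```python
def dense_params(n_inputs, n_outputs):
    """Trainable parameters (weights plus biases) of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv2d_params(in_channels, out_channels, kernel_size):
    """Trainable parameters (weights plus biases) of a 2D convolutional layer
    with square kernels. The count does not depend on the image size because
    the same kernel weights are shared across all spatial positions."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# Hypothetical setting: a 256 x 256 single-channel image mapped to
# 16 feature maps of the same spatial size.
height = width = 256
channels_in, channels_out, kernel = 1, 16, 3

# Fully connected: every output value depends on every input pixel.
fc = dense_params(height * width * channels_in, height * width * channels_out)

# Convolutional: one 3 x 3 kernel per (input channel, output channel) pair.
conv = conv2d_params(channels_in, channels_out, kernel)

print(f"fully connected layer: {fc:,} parameters")
print(f"convolutional layer:   {conv:,} parameters")
```

Under these assumptions, the single fully connected layer already needs tens of billions of parameters, whereas the convolutional layer needs only a few hundred. This is why the fully connected approach is impractical on raw pixel intensities without input abstraction. A complete CNN stacks many such layers with more channels, which is how the total still reaches millions of parameters.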