Exploring the hidden dimensions of an optical extreme learning machine

Extreme Learning Machines (ELMs) are a versatile Machine Learning (ML) algorithm whose main advantage is the possibility of a seamless implementation with physical systems. Yet, despite the success of physical implementations of ELMs, there is still a lack of fundamental understanding of their optical implementations. In this context, this work uses an optical complex medium and wavefront shaping techniques to implement a versatile optical ELM playground and gain deeper insight into these machines. In particular, we present experimental evidence of the correlation between the effective dimensionality of the hidden space and its generalization capability, thus casting the inner workings of optical ELMs in a new light and opening paths toward future technological implementations of similar principles.


Introduction
Over the last decades, Artificial Neural Networks (ANNs) have become established as a powerful computing architecture across numerous fields of science and technology [1,2]. Part of their success is linked to the scalability and versatility of the neuromorphic architecture which, with the impending plateau of Moore's law, is now driving the development of novel computing hardware capable of bypassing the limits of electronics miniaturization [3]. Indeed, as the number of distinct mathematical operations involved in such algorithms is small, mainly matrix multiplications and non-linear activation functions, the development of hardware accelerators for ANNs has become an attractive topic of research [4].
In this context, optical-based implementations appear particularly promising, offering non-trivial advantages when compared with electronic devices. Indeed, with the ability to handle information at the speed of light allied to multiplexing capabilities, optical information processing systems have the native potential for fast, massively parallelizable, and energy-efficient approaches. Nonetheless, realizing conventional ANNs with optics requires establishing precise neuron connections which can be quite hard to achieve, often limited by fabrication procedures, materials or device imperfections. This, in turn, makes the already intensive training procedures largely ineffective.
For all these reasons, architectures that can bypass the tuning of all the weights have been increasingly explored for hardware development, among which we highlight the implementations using Reservoir Computing (RC) [5] and Extreme Learning Machines (ELMs) [6]. In simple terms, the underlying concept of both models is to use a fixed reservoir to non-linearly project the input information onto a high-dimensional hidden space. The training process then occurs only between the hidden layer and the output layer, which strongly reduces the computational complexity and relaxes the requirements for hardware deployment.
In particular, optical ELMs have already been demonstrated through the use of complex optical media [7] and multimode fibers [8,9], and in principle, many other optical phenomena can also be used to achieve such architectures. For instance, ELMs based on $\chi^{(3)}$ materials have been demonstrated numerically [10,11] and experimentally [9]. Still, most of these works remain largely empirical, lacking a fundamental understanding of such machines. In this work, we study and implement an optical ELM based on strongly scattering media that is able to process information encoded in the spatial distribution of either the amplitude or the phase of a continuous wave optical beam. Introducing a simple model for the amplitude case, we study the dimensionality of the hidden space experimentally, as a function of different encoding schemes with linear and nonlinear intensity measurements. Benchmarking the device on standard ML regression and classification tasks, our results demonstrate the important role played by nonlinearity in the deployment of effective optical extreme learning machines.

EOSAM 2022 Guest editors: Patricia Segonds, Gilles Pauliat and Emiliano Descrovi

Theoretical framework
In simplified terms, an ELM takes an $N_I$-dimensional input $X$ and feeds it to an untrained hidden (or reservoir) layer, recording its output. Thus, for each $X^{(i)}$ of the dataset, we obtain an $N_o$-dimensional vector $Y^{(i)}$ with components given by

$$Y^{(i)}_j = G\left(w_j \cdot X^{(i)}\right), \quad j = 1, \ldots, N_o, \tag{1}$$

where $G$ describes the dynamics of the hidden layer and is commonly referred to as the activation function, and $w_j$ is a vector of weights for each output channel $j$. The ELM strategy is then to multiply this output by an output weight vector $\beta$ to obtain a prediction for a given task as

$$O^{(i)} = Y^{(i)} \beta. \tag{2}$$

Put this way, it is straightforward to see that training an ELM to perform a task reduces to computing the linear transformation performed by the output weight vector $\beta$. One way to do this while preventing overfitting of the model is to fit $\beta$ via Ridge regression, by minimizing a regularized loss function

$$\mathcal{L} = \left\| H\beta - T \right\|^2 + \lambda \left\| \beta \right\|^2, \tag{3}$$

where $\|\cdot\|$ denotes the Frobenius norm and $\lambda$ the regularization parameter. To perform this minimization, we take each input and associated target of a training dataset, say the pairs $\{X^{(i)}, T^{(i)}\}_{i=1}^{N}$, and construct the matrix $H$ by stacking all the output states as rows,

$$H = \begin{bmatrix} Y^{(1)} \\ \vdots \\ Y^{(N)} \end{bmatrix}. \tag{4}$$

Constructing the matrix $T$ with the associated targets stacked in the same way, equation (3) then has an analytical solution given by (for […])

$$\beta = \left( H^{T} H + \lambda I \right)^{-1} H^{T} T, \tag{5}$$

where $I$ is the identity matrix.

In theory, and as happens in neural network architectures, the performance of an ELM is intrinsically connected with the dimensionality of the hidden space and its activation function. Indeed, it has been shown mathematically that as long as (i) the weights are drawn from a random distribution and (ii) the activation function $G$ is a nonlinear piecewise continuous function, the ELM features universal approximation capabilities on a hidden space of dimension equal to or below the dimensionality of the training dataset [6].
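As a minimal numerical sketch of the training procedure described above, the following NumPy example builds a hidden-state matrix from a fixed random projection and fits only the output weights with the closed-form ridge solution. The `tanh` activation, the toy target, the constant bias column, and all sizes are illustrative assumptions, not the optical setup studied in this work.

```python
import numpy as np

def hidden_layer(X, W, activation=np.tanh):
    """Fixed, untrained projection: one row of hidden states per input sample."""
    return activation(X @ W)

def train_ridge(H, T, lam):
    """Closed-form ridge solution: beta = (H^T H + lam * I)^-1 H^T T."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)

rng = np.random.default_rng(0)
N, N_o = 200, 50                                  # samples, hidden channels (illustrative)
x = rng.uniform(-1.0, 1.0, size=(N, 1))
X = np.hstack([x, np.ones_like(x)])               # constant column acts as a bias input (assumption)
T = np.sin(np.pi * x)                             # toy target, shape (N, 1)

W = 2.0 * rng.normal(size=(X.shape[1], N_o))      # random weights, drawn once and never trained
H = hidden_layer(X, W)                            # hidden-state matrix, shape (N, N_o)
beta = train_ridge(H, T, lam=1e-6)                # the only trained parameters

train_mse = float(np.mean((H @ beta - T) ** 2))
print(f"training MSE: {train_mse:.2e}")
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the more stable choice when the hidden states are strongly correlated.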
Yet, we should note that fulfilling these conditions does not by itself guarantee a working algorithm that generalizes well for the task or that is robust to external noise. As in most neural network architectures, the generalization performance is typically task-specific and should be discussed for each case individually, taking into consideration the nature of the activation functions.

Implementation of an optical ELM
Our optical implementation of an ELM is based on wavefront shaping techniques and is schematically described in Figure 1, which also establishes the connection with the ELM framework. In short, we first use a Digital Micromirror Device (DMD), capable of both amplitude and phase modulation through Lee holography [12], as the optical encoder that creates the input state. The light is then coupled, via a standard fiber collimator for our working wavelength, into a multimode fiber, which acts as the reservoir where the information is mixed. At the exit, the optical field is a speckle pattern that is known to possess Gaussian circular statistics [13], which guarantees the randomness required by an ELM. This pattern is then measured with a high-speed CMOS camera in both the linear and the non-linear regime, and constitutes our hidden layer. With correct synchronization, the system can operate at kHz rates, limited by the detection and digital processing steps.
In particular, when using amplitude encoding with two distinct encoding regions, as depicted in Figure 1, we can make use of the properties of the optical transmission matrix $M$ to express the output field at the camera image plane as detected in the macropixel $i = (l, m)$, with $l \in \{1, \ldots, N_x\}$, $m \in \{1, \ldots, N_y\}$ and $N_x \times N_y = N_o$. Furthermore, the camera detection function $F$ can be either linear, $F(I) = I$ (no saturation, low exposure time), or nonlinear, $F(I) = I/(I + I_{sat})$ (saturation, higher exposure time), thus corresponding to distinct activation functions.
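To illustrate how the two detection regimes act as different activation functions, the sketch below simulates a toy hidden layer. It assumes, purely for illustration, that the field at each macropixel is a linear combination of two encoded amplitudes weighted by random complex Gaussian transmission coefficients, and that the camera records the intensity; the names `speckle_hidden_layer`, `M` and `I_sat`, and all numerical values are hypothetical and not taken from the experiment.

```python
import numpy as np

def speckle_hidden_layer(A, M, I_sat=None):
    """Toy model: A is (n_samples, n_regions), real amplitudes in [0, 1];
    M is (n_regions, N_o), complex transmission coefficients."""
    field = A @ M                          # complex field at each macropixel
    intensity = np.abs(field) ** 2         # the camera records intensity
    if I_sat is None:
        return intensity                   # linear detection, F(I) = I
    return intensity / (intensity + I_sat) # saturable detection, F(I) = I / (I + I_sat)

rng = np.random.default_rng(1)
n_regions, N_o, n_samples = 2, 64, 50      # illustrative sizes
# Circular complex Gaussian coefficients, mimicking speckle statistics (assumption)
M = (rng.normal(size=(n_regions, N_o)) + 1j * rng.normal(size=(n_regions, N_o))) / np.sqrt(2.0)
A = rng.uniform(0.0, 1.0, size=(n_samples, n_regions))

H_lin = speckle_hidden_layer(A, M)
H_sat = speckle_hidden_layer(A, M, I_sat=1.0)

# Numerical rank of the noise-free hidden-state matrices
rank_lin = np.linalg.matrix_rank(H_lin, tol=1e-8)
rank_sat = np.linalg.matrix_rank(H_sat, tol=1e-8)
print(rank_lin, rank_sat)
```

In this noise-free toy model, linear detection can span at most three dimensions, because every macropixel intensity is a combination of the monomials $A_1^2$, $A_2^2$ and $A_1 A_2$, whereas the saturable response is not a finite polynomial and yields a higher numerical rank.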

Results and discussion
To understand the capabilities of our setup, we tested it on standard regression and classification tasks. Specifically, for the regression task we used a dataset of points randomly sampled from the function $f(x) = \sin(2\pi x)/(2\pi x)$. For the classification task we used a dataset of points based on the curves $x_1(\theta) = (2\theta + \pi)\cos(\theta)$, $x_2(\theta) = (-2\theta - \pi)\cos(\theta)$, $y_1(\theta) = (2\theta + \pi)\sin(\theta)$ and $y_2(\theta) = (-2\theta - \pi)\sin(\theta)$, where a sample $j$ from class $i$ consists of the pair of points $\{x_i(\theta_j) + N_j,\; y_i(\theta_j) + N_j\}$, with $\theta_j$ sampled from a uniform distribution $U(0, 2\pi)$ and $N_j$ random noise drawn from a distribution $U(0, 1)$. For both tasks, we used a total of 300 samples, training with 80% of the dataset and testing the performance on the remaining 20%. In both cases, in order to encode the information in the optical domain, we defined the amplitude $A(q_i) = (q_i - q_{min})/(q_{max} - q_{min})$, where $q_i$ is a generalized coordinate, and $q_{max}$ and $q_{min}$ are the greatest and lowest coordinates within the dataset, respectively. Within the scope of this manuscript we only analyse the results for amplitude modulation, obtained by aggregating groups of DMD pixels to produce various modulation levels.
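The datasets and the amplitude encoding described above can be sketched as follows. The sampling range of $x$ for the regression task, the random class assignment, and the per-coordinate normalization are assumptions made for illustration, since the text does not specify them; the helper names are hypothetical.

```python
import numpy as np

def sinc_dataset(n, rng, x_range=(-1.0, 1.0)):
    """Regression data from f(x) = sin(2 pi x) / (2 pi x); x_range is an assumption."""
    x = rng.uniform(*x_range, size=n)
    # np.sinc(t) = sin(pi t) / (pi t), so np.sinc(2 x) matches f(x)
    return x, np.sinc(2.0 * x)

def spiral_dataset(n, rng):
    """Two-class points on the curves (+/-(2 theta + pi) cos theta, +/-(2 theta + pi) sin theta)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    label = rng.integers(0, 2, size=n)              # random class assignment (assumption)
    radius = np.where(label == 0, 1.0, -1.0) * (2.0 * theta + np.pi)
    noise = rng.uniform(0.0, 1.0, size=n)           # same N_j added to both coordinates
    points = np.column_stack([radius * np.cos(theta) + noise,
                              radius * np.sin(theta) + noise])
    return points, label

def encode_amplitude(q):
    """Min-max normalization A(q) = (q - q_min) / (q_max - q_min), applied per coordinate."""
    q = np.asarray(q, dtype=float)
    return (q - q.min(axis=0)) / (q.max(axis=0) - q.min(axis=0))

def holdout_split(n, rng, train_frac=0.8):
    """Random index split for the 80-20% holdout strategy."""
    idx = rng.permutation(n)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]

rng = np.random.default_rng(2)
x, y = sinc_dataset(300, rng)
points, labels = spiral_dataset(300, rng)
A_reg = encode_amplitude(x)                         # amplitudes in [0, 1]
A_cls = encode_amplitude(points)                    # one amplitude per coordinate
train_idx, test_idx = holdout_split(300, rng)
```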
In Figure 2 we present the results for the regression task. First, it is clear that the saturation regime improves the performance for both the training and test datasets. This observation matches our empirical expectation and can be confirmed by making a connection with the dimensionality of the hidden space. To this end, we computed the rank of the output matrix $H$ from its singular value spectrum. We must, however, take into consideration the effect of experimental noise, which can artificially increase the dimensionality of the hidden space. Based on Weyl's inequality [14], we did this by counting the number of singular values of $H$ above the highest singular value of the noise matrix of the $i$-th experiment, $H_i - \langle H \rangle$, where $\langle H \rangle$ represents the average over 100 experiments.
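A possible numerical version of this noise-aware rank estimate is sketched below. It assumes that the noise of each repetition is its deviation from the mean over repetitions, that the singular values being counted are those of the averaged matrix, and that the threshold is the largest noise singular value over all repetitions; these choices, the function name `effective_rank`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def effective_rank(H_runs):
    """Count singular values of the mean hidden-state matrix above the noise level.

    H_runs has shape (n_runs, N, N_o): repeated measurements of the same inputs.
    """
    H_mean = H_runs.mean(axis=0)
    # ord=2 gives the largest singular value of each run's deviation from the mean
    noise_top = max(np.linalg.norm(H_i - H_mean, ord=2) for H_i in H_runs)
    singular_values = np.linalg.svd(H_mean, compute_uv=False)
    return int(np.sum(singular_values > noise_top))

# Synthetic check: a rank-3 matrix measured 20 times with weak additive noise
rng = np.random.default_rng(3)
N, N_o, true_rank, n_runs = 40, 20, 3, 20
H_clean = rng.normal(size=(N, true_rank)) @ rng.normal(size=(true_rank, N_o))
H_runs = H_clean + 1e-3 * rng.normal(size=(n_runs, N, N_o))

rank = effective_rank(H_runs)
print(rank)
```

A naive rank count on any single noisy run would report full rank here, since noise lifts every singular value above zero; thresholding at the noise level recovers the dimensionality of the underlying signal.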
As can be seen in Table 1, the ELM performance increases with the rank. This happens because, although both activation functions are nonlinear, the non-saturated regime only provides a second-degree polynomial, whereas the saturation regime features a saturable response that can only be approximated by a higher-order polynomial, effectively increasing the dimensionality of the hidden space.
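The polynomial argument can be made explicit with a short worked example. As an illustrative assumption (not the model stated in this work), suppose the field at one macropixel is a combination of two real encoded amplitudes $A_1$ and $A_2$ with complex transmission coefficients $m_1$ and $m_2$:

```latex
\begin{align}
  E &= m_1 A_1 + m_2 A_2, \\
  I &= |E|^2 = |m_1|^2 A_1^2 + |m_2|^2 A_2^2
       + 2\,\mathrm{Re}\!\left(m_1^{*} m_2\right) A_1 A_2, \\
  \frac{I}{I + I_{sat}} &= \sum_{n=1}^{\infty} (-1)^{n+1}
       \left(\frac{I}{I_{sat}}\right)^{n}, \qquad I < I_{sat}.
\end{align}
```

Under this assumption the linear detection $F(I) = I$ is a second-degree polynomial in the amplitudes, built from only three monomials, while the saturable response contains powers of $I$ of every order and therefore monomials of arbitrarily high even degree in $A_1$ and $A_2$. The series converges only for $I < I_{sat}$, but the saturable function remains non-polynomial for all intensities.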
Regarding the classification task, a benchmark result is shown in Figure 3, together with a summary in Table 2. Again, as expected, camera saturation increases the dimensionality of the hidden space, allowing us to achieve higher accuracy. It is also interesting to see that the methodology provides good generalization performance, separating the regions as intended.

Fig. 1.
[…] where the optical signal is sent to a multimode fiber, producing a speckle pattern which is collected with a digital camera, constituting the hidden reservoir layer. The weights are then calculated digitally and applied to the hidden layer to obtain a prediction.

Fig. 2.
Regression performance under amplitude modulation. In addition to the results of the 80-20% holdout strategy, we also show a robustness test of the implementation on a dataset with 5% of additional white noise at the end of the hidden layer.

Finally, to test the performance of the optical ELM in more complex tasks such as processing and classifying images, we tested the setup on the classification of handwritten digits through the MNIST dataset (1797 images, with the same 80-20% holdout strategy) [15]. Overall, we obtained accuracies around 93%, with a confusion matrix depicted in Figure 4.

Final remarks
In this work, we demonstrated the implementation of an optical extreme learning machine that is able to process information encoded in the wavefront of an optical beam by making use of a multimode fiber and a camera detector. Using both standard regression and classification tasks, we have shown that the setup is capable of achieving good computing performance. Furthermore, by studying the dimensionality of the hidden space and comparing it against performance and generalisation capability, we have demonstrated a correlation between the two that aligns with the theoretical predictions. In particular, the performance can be increased by including physical nonlinearities within the system, here obtained through the saturation of the detection system. Put into perspective, these findings confirm optical ELMs as a promising platform for versatile non-von Neumann analog computing, while simultaneously paving the way for a better understanding of such devices.