This section gives an overview of the technologies handling the information sensed by and/or actuated for a Metaverse Instance that is related to humans, viz. audio, visual, touch, smell, taste, and brainwaves. For each information type, when meaningful, four aspects are considered: how the human body perceives the information (physiology), how the information that the human senses can be digitally represented, and the technologies for sensing and for actuating the information.
A postulated condition to obtain realistic Metaverse Experiences is that the 3D audio and visual fields be represented and rendered in such a way that Users do not perceive the Metaverse Experience as different from the one they would experience in the Universe. Three decades of digital audio and video standards have given rise to several often-competing information representation technologies and standards for sensing, storing/transmitting, and actuating audio and visual information. Products and services based on them are now ubiquitous and used by billions of people every day. The same cannot be said for audio and visual information with a “3rd dimension”, which is at a stage where new technologies come to the fore every other day and whatever standard appears finds it hard to achieve global recognition. Proposing and adopting standards is even more difficult for other information types such as touch, smell, taste, and brainwaves.
The scope of this Section in its current form is only to identify and characterise a field whose development is of vital importance to the success of the Metaverse vision.
In this subsection, Audio is used to indicate a signal perceived by the human hearing system. By processing the information that the two sensors (ears) have pre-processed, the brain can create a good internal representation of a 3D audio field in the frequency range of 16 Hz to 16 kHz (approximately).
The focus of this subsection is on:
- Sensors suitable for sensing audio scenes populated by humans and other sound-generating objects for transmission to a Metaverse Instance and/or for local or remote processing to create Audio Scene Descriptions.
- Actuators suitable for actuating Audio Scene Descriptions intended for human consumption.
Some Digital Twin applications may use sound information not intended for human consumption; for instance, a Connected Autonomous Vehicle uses ultrasound, typically in the 40-250 kHz range. The current version of this document does not consider these non-human perceptible sensing/actuation technologies.
Sound waves reach the outer ear, are guided through a canal, and hit a thin membrane called the Eardrum, whose oscillations are propagated to three tiny bones, called the Hammer, the Anvil, and the Stirrup, which amplify the oscillations. The Stirrup hits another membrane called the Oval Window, which gives access to the inner ear containing the organ of equilibrium and the Cochlea. The latter contains three canals filled with a liquid: the first conducting the liquid to the tip of the Cochlea, the second taking it back, and the third containing the organ of hearing, whose bottom is covered by hair cells. The Cochlea performs a function similar to a frequency analysis: the base of the Cochlea detects high frequencies, and lower frequencies are detected by parts that are farther away from the base. The wave of the liquid causes the hair cells to move, and their bending activates a neural response in the auditory nerve fibres of the eighth cranial nerve to the brain.
Sensing devices sense the sound field at the points where they are located. There are different audio information representation formats depending on the sensing device used, e.g., mono, stereo, multichannel and the purpose for which they are used, e.g., compression for storage and transmission.
The format of an Audio Data stream is typically divided in two parts:
- The data generated by conversion of the audio signal(s).
- The metadata that includes:
- Spatial Attitude of the sensing device.
- Sensing characteristics of the microphone(s) used (e.g., cardioid).
- Microphone array geometry (in the multichannel case).
- Sampling frequency.
- Number of bits/sample.
Acquisition of all the data coming from a microphone array may produce a very high-rate bit stream. Different forms of compression are typically used to enable the transmission of the sensed sound field to a Metaverse Instance. A Universe or Metaverse Environment can perform other types of processing, e.g., to extract relevant information for further processing (e.g., the Audio Scene Description, or extraction of the text from speech).
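The impact of the stream parameters listed above on the raw bitrate can be illustrated with a simple calculation; the array size and capture format below are hypothetical figures, not taken from the document:

```python
def raw_audio_bitrate(channels: int, sampling_hz: int, bits_per_sample: int) -> int:
    """Raw (uncompressed) bitrate in bit/s of a multichannel audio capture."""
    return channels * sampling_hz * bits_per_sample

# A hypothetical 32-capsule microphone array at 48 kHz / 24 bit:
rate = raw_audio_bitrate(32, 48_000, 24)
print(rate / 1e6, "Mbit/s")  # 36.864 Mbit/s
```

Even this modest array produces tens of Mbit/s before compression, which is why compressed representations are normally used for transmission to a Metaverse Instance.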
MPEG-H 3D Audio is a standard for spatial audio coding developed by the Moving Picture Experts Group (MPEG). The standard supports low-latency, high-quality, and localisable audio requirements; the quality of the sound after decoding scales with the bitrate, and the standard provides a universal representation of encoded 3D sound in channel-based, scene-based, and object-based formats. While the channel-based and scene-based formats are part of the standard mainly for backward-compatibility reasons, the main novelty is the object-based format, which enables unprecedented flexibility in rendering spatial sound: audio objects are carried in channels and mixed at the renderer to create the sensation of scene-based audio through the speakers.
Spherical microphone arrays are useful tools to capture scene-based audio thanks to their ability to provide multi-channel, full azimuth and elevation coverage in capturing real-life conditions. Higher Order Ambisonics (HOA) is an encoding method that uses such microphone arrays to represent the scene as spherical harmonic coefficients, taking the bandwidth limitations for transmission into consideration. Representation of the captured acoustic field in HOA also simplifies the Audio Scene Representation. MPAI-CAE is a standard specifying AI-based technologies for audio. In this standard, the Sound Field Description Composite AIM is the technology that transforms the Data from the microphone array into a Spherical Harmonic Decomposition (SHD). It also transcodes the scene-based format into object-based representations with their metadata, thus enabling the recreation of the Audio Scene by using the relevant objects’ spatial attributes.
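For an HOA representation of order N, the number of spherical harmonic coefficients (and hence of channels to transmit) is (N+1)², which is the standard relation for ambisonics; a quick sketch:

```python
def hoa_channels(order: int) -> int:
    """Number of spherical-harmonic coefficients (channels) for HOA of a given order."""
    return (order + 1) ** 2

# Channel count grows quadratically with the ambisonic order:
for n in range(1, 5):
    print(n, hoa_channels(n))
# order 1 -> 4, order 2 -> 9, order 3 -> 16, order 4 -> 25 channels
```

This quadratic growth is one reason bandwidth limitations constrain the usable order in practice.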
Transmission of Audio data from a Metaverse to a Universe Environment for rendering as 3D Audio is typically subject to the same bitrate constraints as the opposite transmission direction. The signal rendered to the ears of a human should change based on the actual physical movement of the human.
To mimic the auditory scenes we hear in real life, immersive audio techniques are used to provide a “life-like” sound experience well beyond what traditional methods can provide. Immersive audio surrounds the listener, giving the audience a credible auditory imitation of source arrival directions, distances, and orientations.
Audio Scene Description is a format to represent the Audio Objects with their Spatial Attitudes; it is rendered by analysing the spatial attributes of the Audio Objects and managing the resulting experience. The renderer is aware of the number of speakers and their positions in the room, or renders binaural audio instead.
In this subsection, Visual is used to indicate a signal perceived by the human visual system.
The focus is on:
- Sensors suitable for sensing visual scenes populated by humans for transmission to a Metaverse Instance and/or for local or remote processing to create a Visual Scene Description.
- Actuators suitable for actuating Visual Scene Descriptions intended for human consumption.
Some Digital Twin applications may use visual information not for human consumption, e.g., a Connected Autonomous Vehicle can use RADAR devices in the frequency range of a few tens of GHz or LiDAR devices in the frequency range close to the visible range. The relevant sensing/actuation technologies are not considered here.
The human retina includes ~5 million photoreceptor cells for colour vision (cones) sensitive to the electromagnetic field in the 400 to 700 nm wavelength range (approximately) and ~100 million rods for vision at low light levels. The eye performs several low-level processing steps to reduce the amount of information transmitted to the brain: edges, temporal changes, moving objects, brightening/dimming of the scene, etc. This data reduction is necessary because the number of receptor cells is two orders of magnitude larger than the number of axons of the optic nerve and, in any case, the brain would not have the capacity or the structure to process such an amount of raw information.
Sensing of visual information with a human-made device has close to two centuries of history: from static 2D images captured using chemical principles (photography), to dynamic 2D images captured using chemico-mechanical principles (cinematography), to dynamic 2D images captured using electronic principles (television), to dynamic 3D images captured with pixel-based depth information or camera arrays.
The format of a Visual Data stream is typically divided in two parts:
- The data stream generated by digitising the visual signals captured by the sensors.
- The metadata that includes:
- The time.
- The Spatial Attitude.
- The camera geometry (in the camera array case).
- The colour space (colours can be reproduced by properly combining RGB colours).
- The number of pixels in the horizontal and vertical direction for each stream (RGB or other).
- The depth information for each pixel.
- The frame frequency.
- The number of bits/sample.
Wholesale acquisition of the data coming from a camera (array) with modern resolutions produces a bitrate of hundreds of Mbit/s to tens of Gbit/s. Different forms of compression are typically used to enable the transmission of the sensed electromagnetic field to a Metaverse Instance. A Universe or Metaverse Environment can perform other types of processing to extract the relevant information for further processing (e.g., the Visual Scene Description, Object recognition, etc.).
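The quoted figures follow directly from the stream parameters listed above; for example, with UHD values assumed purely for illustration:

```python
def raw_video_bitrate(width: int, height: int, components: int,
                      bits_per_sample: int, fps: int) -> int:
    """Raw (uncompressed) video bitrate in bit/s."""
    return width * height * components * bits_per_sample * fps

# A hypothetical 3840x2160 camera, 3 colour components, 10 bit, 60 frames/s:
rate = raw_video_bitrate(3840, 2160, 3, 10, 60)
print(rate / 1e9, "Gbit/s")  # ~14.93 Gbit/s
```

A single such camera already approaches 15 Gbit/s raw, and a camera array multiplies this by the number of sensors, hence the need for compression.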
Some visual sensing technologies are:
- A 2D sensor provides information that can be processed to extract the Objects of the Visual Scene, but the result is often not satisfactory.
- A depth sensor is added to substantially improve the creation of a Visual Scene.
- Two 2D sensors placed at slightly offset positions produce images that are processed to create a Visual Scene.
- A 3D scanner captures data from a physical object’s surface and digitally represents its shape in a 3D format. A laser 3D scanner projects a laser line along the surface of an object while the sensor records the distance and the coordinates of each point. The result is a “point cloud” representing the object.
- A structured-light scanner projects a precise shifting fringe pattern on the surface of the object to be scanned, and two sensors capture the geometry of the object surface based on the pattern distortion and calculate the 3D coordinates by triangulation.
- Motion capture (mocap) captures the movement of a human, e.g., a performer, using sensors and markers attached to them. The captured information is used to animate an avatar model, e.g., in computer animation.
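The two-offset-sensor approach in the list above recovers depth via the classic pinhole-stereo relation z = f·B/d (focal length times baseline over disparity); a minimal sketch with hypothetical rig parameters:

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Classic pinhole-stereo relation: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 1000 px focal length, 10 cm baseline, 25 px disparity:
print(depth_from_disparity(1000, 0.10, 25), "m")  # 4.0 m
```

The smaller the disparity between the two images, the farther the point, which is why depth precision degrades with distance.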
The design of a Head Mounted Display should consider the following data:
- The human eye can typically resolve 1 pixel per arc minute, i.e., 60 pixels/º.
- The typical field of view of a human is somewhat larger than 180º.
- The human eye can typically detect flicker up to ~90 Hz for normal scenes.
- Humans typically perceive an environment to be “right” if the Motion-to-Photon latency is <20ms.
The first two issues can be resolved by increasing the number of pixels generating light to the human wearing an HMD, and the third issue requires an increase of the frame rate and hence of the bitrate. The fourth issue involves most elements of a Metaverse Experience.
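The first two data points above translate into a pixel budget as follows; the vertical field of view is an assumed figure, not stated in the text:

```python
def hmd_pixels_per_eye(h_fov_deg: float, v_fov_deg: float,
                       acuity_px_per_deg: float = 60.0):
    """Pixel count needed to match human acuity over the given field of view."""
    w = h_fov_deg * acuity_px_per_deg
    h = v_fov_deg * acuity_px_per_deg
    return int(w), int(h), int(w * h)

# 180º horizontal FOV, 135º vertical FOV (assumed), at 60 pixels/degree:
print(hmd_pixels_per_eye(180, 135))  # (10800, 8100, 87480000)
```

Nearly 90 million pixels per eye is far beyond current display panels, which illustrates why HMD design remains a compromise between field of view and angular resolution.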
The operation of a Visual Actuator is generally based on the projection of the electromagnetic energy generated in correspondence to the Visual Scene Representation produced by a Metaverse Instance.
The somatosensory system provides the sense of touch using a range of receptors located at various points and depths in the skin and other organs:
- Mechanoreceptors in the upper layers of the skin sense pressure, texture, and vibration, and those in the lower layers and along tendons and joints sense vibrations, skin tension, and limb movement.
- Thermoreceptors sense hot and cold.
- Pain receptors (nociceptors) urge the body to move away from the cause of the pain stimulus.
- Proprioceptors sense tiny variations in muscle tension and length information from their locations in tendons, joint capsules, and muscles to enable the brain to have a representation of the body in space.
The density of nerve endings at human fingertips is so large that their discrimination capability is almost as good as that of human eyes. Therefore, a Metaverse Instance providing an Experience that is just audio-visual is very far from what humans can have in the Universe where, by physically interacting with objects, they can perceive more profoundly and meaningfully through tactile experiences and receive stronger emotional responses.
As described above, the tactile experience is thus highly multidimensional. One proposed classification reduces it to five dimensions: macro and fine roughness, warmness/coldness, hardness/softness, and friction (moistness/dryness, stickiness/slipperiness).
The haptic landscape is currently highly fragmented: most products with haptic functionality are vertically integrated, with components that cannot interoperate. Efforts are under way in ISO to develop a standard coded representation of haptic signals and, potentially, a standard coded representation of interactive haptic experiences.
Humans use the sense of touch to interact, explore, manipulate, and extract the object properties indicated above and others, such as shape. This information is captured by receptors of various types unevenly distributed all over the body and located at different layers of the skin.
Tactile sensors are data acquisition devices sensing tactile object properties via direct physical contact based on a range of different technologies, such as:
- Capacitive sensors measure the variations of capacitance from an applied load over a parallel plate capacitor.
- Piezoresistive sensors measure the changes in the resistance of a contact when force is applied.
- Optical sensors transduce mechanical contact, pressure, or directional movement into changes in light intensity or refractive index that are detected by visual sensors.
- Magnetic sensors detect changes in magnetic flux caused by the application of a force using the Hall effect, magnetoresistive or magnetoelastic sensors.
- Binary sensors detect on/off events caused by mechanical contact.
- Piezoelectric sensors produce an electric charge proportional to an applied force, pressure or deformation.
- Hydraulic sensors convert fluid pressure into mechanical motion.
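As an illustration of the capacitive principle in the list above, the sketch below shows how a load that narrows the dielectric gap of a parallel-plate capacitor raises its capacitance; all figures are illustrative, not taken from the document:

```python
EPS0 = 8.854e-12  # vacuum permittivity, F/m

def plate_capacitance(area_m2: float, gap_m: float, eps_r: float = 1.0) -> float:
    """Parallel-plate capacitance: C = eps0 * eps_r * A / d."""
    return EPS0 * eps_r * area_m2 / gap_m

# A load that compresses a 1 cm^2 sensor's gap from 100 um to 80 um raises C:
c0 = plate_capacitance(1e-4, 100e-6)
c1 = plate_capacitance(1e-4, 80e-6)
print(c1 / c0)  # ratio ≈ 1.25
```

Measuring this capacitance change (e.g., via an oscillator frequency shift) is what lets the sensor infer the applied load.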
Many haptic sensing devices can be borrowed from the expanding field of robotics.
The user experience of a Metaverse Instance can be augmented by the use of haptic actuators. Some types of haptic actuators are:
- Eccentric Rotating Mass actuators create a vibration by acting on a small magnetic DC motor that spins an eccentric unbalanced weight.
- Linear Resonant Actuators create a vibration by acting on a voice coil.
- Solenoid actuators create a vibration by acting on a solenoid.
- Piezo Haptic actuators create a vibration using piezoelectric material mounted in a cantilever beam configuration.
- Thermoelectric Device actuators use a thermoelectric device as their thermal source to transform an electrical current into a heat flux based on the Peltier or Seebeck effect.
- Ultrasonic actuators use speakers or integrated 3D ultrasound sensors to transfer tactile effects onto a user’s hands.
- Pneumatic actuators provide haptic actuation by acting on small motors that use air pressure.
According to the most accredited theory, the process giving rise to human odour perception originates in the neurons terminating in the olfactory epithelium, where odour molecules bind to them. The process of human olfaction begins when the hair-like projections of the olfactory sensory neurons located in the nasal cavity are activated by in-air molecules. Activated proteins of the olfactory receptors trigger biochemical reactions. The olfactory bulb picks up the signals coming from receptor cells sensitive to the stimulating molecules. The signals travel to a specific portion of the cortex (piriform) and from there to various other parts of the brain, where they are combined with other inputs and eventually interpreted as an odour by another part of the cortex (orbitofrontal).
Olfaction enables humans to sense the chemical composition of their environment which may transmit different sources of information about, e.g., food, other humans, danger signalled by smell, etc.
Currently there is no recognised “Odour Representation” that would offer a digital representation of odours. Such a representation would enable, e.g., odour classification or generation of a specific odour identified by a code. Proprietary solutions designed to satisfy specific needs typically use a trained neural network model to classify odours, making up for the absence of a standard format but inhibiting interoperability.
A machine able to sense odour is called an electronic nose or e-nose. This is an array of sensors able to detect, identify, and measure air-borne molecules. Each odour may be the result of a combination of possibly complex molecules thus making the number of individual odours potentially very large. The current three main classes of sensor technologies are metal-oxide gas sensors, piezoelectric sensors, and conducting polymer sensors.
Odour actuation, i.e., the generation of molecules able to stimulate the human nose, is done by combining basic olfactants. Traditional Principal Component Analysis applied to mass spectrometry data of a large number of essential oils can be used to define the basic olfactants, i.e., a much smaller number of odour components. A specific odour can thus be obtained by a weighted linear combination of those basic odours. Assuming that this is the best way to define the basic olfactants and that most odours can be obtained by weighted linear combination, the process to define a standard for odour synthesis could be developed with relative ease.
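The postulated synthesis by weighted linear combination can be sketched as follows; the basis matrix (three hypothetical basic olfactants, each described by four spectral features) and the weights are purely illustrative:

```python
import numpy as np

# Hypothetical basis: 3 "basic olfactants", each described by 4 spectral features.
basis = np.array([
    [1.0, 0.2, 0.0, 0.1],
    [0.0, 0.9, 0.3, 0.0],
    [0.1, 0.0, 0.8, 0.5],
])

def synthesise_odour(weights):
    """Weighted linear combination of basic olfactants, as postulated in the text."""
    w = np.asarray(weights)
    return w @ basis

# Mixing the three basic olfactants with weights 0.5, 0.3, 0.2:
print(synthesise_odour([0.5, 0.3, 0.2]))
```

Under this model, standardising odour synthesis would amount to standardising the basis and the weight encoding, since the combination itself is a plain matrix-vector product.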
It is traditionally assumed that the taste receptors in the mouth can sense five taste modalities (sweet, salty, sour, bitter, and savoury/umami) and that different receptors are dedicated to sense a specific modality out of the five basic ones. However, the existence of other basic taste modalities has also been postulated. Other receptors present in the mouth are trigeminal nerve endings sensing tactile sensations (texture), thermoception (temperature), and nociception (pain). Therefore, what humans call taste is in fact a combination of different experiences from different sources, e.g., smell, food texture, and temperature. The gustatory receptors may very well not be the major contributors to the sense of taste.
The information sensed by the taste receptors is relayed to the brain. The insula is the primary cortical substrate involved in the perception of taste in the mammalian brain. According to some reports, the insula in rodents is organised in distinct regions that selectively respond to one of the five basic tastes. Some other reports state that the cortical neurons processing gustatory information of monkeys respond to multiple tastes, and tastes are not represented in discrete spatial locations.
The five taste modalities are a very basic form of standardisation. However, the complex nature of the gustation sense makes it difficult to identify a path to full standardisation of the gustation sense in the short-to-medium term.
The equivalent of the e-nose is the e-tongue, currently used to replace humans who may not wish to engage in sampling different materials such as water (quality), or beverage (counterfeiting), etc. An e-tongue typically measures the voltage of multichannel electrodes where each electrode responds to certain combinations of molecules. A training process may be used to determine the meaning of a particular set of voltages.
Examples of actuation devices are:
- Electrodes used to simulate the taste and feel of real food in the mouth.
- The National University of Singapore’s digital lollipop (2012) able to transmit to the tongue four of the basic taste sensations (not umami).
- The Norimaki Synthesizer is a rod-shaped device able to simulate any flavour represented by the five basic taste sensations. The device uses five gel nodules made of dissolved electrolytes. The user feels a taste by licking it.
- A taste display reproducing tastes by using data obtained from taste sensors.
The ~85 billion neurons of the brain are electrical devices operating on chemical principles. An approximate description is:
- The channels in the membrane of a cell allow positive and negative ions to flow into and out of the cell.
- The resting potential inside a cell is more negative than the outside by ~70 mV (i.e., about -70 mV), but the membrane potential is not constant because of the different inputs from the dendrites.
- Excitatory inputs raise the neuron’s membrane potential and inhibitory inputs make it more negative thus promoting or inhibiting the generation of communication units between neurons.
- If the sum of all excitatory and inhibitory inputs brings the neuron’s membrane potential to ~-50 mV, the neuron fires a spike.
- The spike moves from one neuron to another thanks to chemical processes taking place across synapses.
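The firing behaviour described in the list above can be sketched as a toy integrate-and-fire model; the reset-to-rest behaviour after a spike is a simplifying assumption, and the input values are illustrative:

```python
def integrate_and_fire(inputs_mv, rest_mv=-70.0, threshold_mv=-50.0):
    """Toy neuron: sum excitatory (+) and inhibitory (-) inputs onto the
    resting potential; fire a spike and reset when the threshold is reached."""
    v = rest_mv
    spikes = 0
    for dv in inputs_mv:
        v += dv
        if v >= threshold_mv:
            spikes += 1
            v = rest_mv  # simplifying assumption: return to rest after the spike
    return spikes

# 20 mV of net excitation crosses the -50 mV threshold once:
print(integrate_and_fire([10, 10, 5, -3, 2]))  # 1
```

The model captures the essential point: only the summed effect of excitatory and inhibitory inputs, not any single input, determines whether a spike is fired.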
The activity of the so-called pyramidal neurons in the cortical brain regions (occipital, temporal, parietal, and frontal cortices) is the best placed to reach the scalp because these neurons are oriented perpendicularly to the cortical surface, with the cell bodies pointing towards the grey matter and the dendrites towards the surface.
Thousands of simultaneously activated neurons are required to generate a signal strong enough to travel across the meninges, the skull, and the scalp (each layer with its own conduction properties) and be detected by electrodes on the scalp. The captured values oscillate between positive and negative at frequencies ranging from ~0.1 Hz to ~30 Hz. The higher the frequency, the higher the attenuation.
Signals are classified depending on the dominant frequency ƒ:
- delta (ƒ < 4 Hz)
- theta (4 Hz < ƒ < 7 Hz)
- alpha (8 Hz < ƒ < 12 Hz)
- beta (12 Hz < ƒ < 30 Hz)
- gamma (ƒ >30 Hz).
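A minimal mapping from a dominant frequency to the band names above can be sketched as follows; the boundary values are rounded so that the mapping is contiguous (the source ranges leave small gaps, e.g., between 7 and 8 Hz):

```python
def eeg_band(freq_hz: float) -> str:
    """Map a dominant frequency to the conventional EEG band name."""
    if freq_hz < 4:
        return "delta"
    if freq_hz < 8:
        return "theta"
    if freq_hz <= 12:
        return "alpha"
    if freq_hz <= 30:
        return "beta"
    return "gamma"

print([eeg_band(f) for f in (2, 6, 10, 20, 40)])
# ['delta', 'theta', 'alpha', 'beta', 'gamma']
```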
Most cognitive processes relevant to the BCI occur within tens to hundreds of milliseconds.
Electroencephalography (EEG) is the most widely used system to capture the electric field of the brain because of its portability, relatively low cost, ease of use, non-invasiveness, and high temporal resolution. EEG is employed in a wide spectrum of biomedical applications, e.g., to operate external devices, control the environment, and interact.
An array of electrodes is used to capture the brain signals. Intracranial electrodes can sense electrical signals directly from the brain using up to a few thousand sensors, while extracranial electrodes do the same from outside the skull. The number of electrodes of the latter can vary from 10 to a few hundred. They are mounted in elastic caps, meshes, or rigid grids to ensure that the captured data are collected from the intended scalp positions.
The captured signals are then amplified and digitised. There are many models of electrode arrays, each characterised by the number and quality of the electrodes, the quality of the digitisation, the quality of the amplifier, and the sampling rate. Minimisation of the impedance between the electrode surface and the scalp is the primary criterion in the design of EEG electrodes, because that impedance is the primary source of signal loss.
Electrodes can be invasive or non-invasive: the latter can be wet, semi-dry, or dry depending on the presence of electrolytes at the electrode-skin interface. The electrodes can be passive or active: the latter have a preamplification stage to reduce the noise from the electrical activity of the environment.
The typical brain signal processing workflow is:
- Brain signals are acquired and pre-processed to enhance signal quality.
- Space and frequency domain features are extracted.
- Classification algorithms are applied to the features to decode the user’s mental state.
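The three-step workflow above can be sketched end-to-end; the sampling rate, the band-power features, and the nearest-centroid classifier are illustrative choices, not prescribed by the text:

```python
import numpy as np

FS = 256  # assumed sampling rate, Hz

def band_power(signal, lo, hi, fs=FS):
    """Step 2: frequency-domain feature -- mean spectral power in [lo, hi) Hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    mask = (freqs >= lo) & (freqs < hi)
    return spectrum[mask].mean()

def features(signal):
    """Alpha and beta band powers as a 2-D feature vector."""
    return np.array([band_power(signal, 8, 12), band_power(signal, 12, 30)])

def classify(feat, centroids):
    """Step 3: nearest-centroid decoding of the user's mental state."""
    labels = list(centroids)
    dists = [np.linalg.norm(feat - centroids[k]) for k in labels]
    return labels[int(np.argmin(dists))]

# Synthetic 1-second "recordings" standing in for pre-processed EEG (step 1):
t = np.arange(FS) / FS
relaxed = np.sin(2 * np.pi * 10 * t)   # alpha-dominant trace
focused = np.sin(2 * np.pi * 20 * t)   # beta-dominant trace
centroids = {"relaxed": features(relaxed), "focused": features(focused)}
print(classify(features(np.sin(2 * np.pi * 11 * t)), centroids))  # relaxed
```

Real pipelines replace the synthetic traces with filtered multi-electrode recordings and the nearest-centroid step with a trained classifier, but the acquire / extract / classify structure is the same.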
Some activities aiming at affecting the brain through physical means are also known.