S7 – FUN-Media – RESTART Foundation

FUN-Media will enable next-generation immersive networked media communications, ensuring the expected QoE, allowing for empathic communications, providing real feel of presence, ensuring the expected content and user authenticity. This is achieved via technology advances in the field of digital twins, multimodal and multisense communications, audio/acoustic user interaction, QoE-aware distribution of trustworthy contents, media generation and representations for humans and machines.

FUN-Media is part of Spoke 4 – Programmable Networks for Future Services and Media

Project PI: Enrico Magli

Main activities carried out to date

Technical advances have been made in several areas, including:

project management and purchases for the Spoke Lab
adaptive metronome algorithms and packet loss concealment for mitigating the impact of latency
methods for detecting audio manipulation
study of the impact of compression and transmission artifacts on dynamic and dense point clouds with subjective tests to explore the users’ QoE with varying combinations of degradations (compression and packet loss)
QoE-aware motion control of a swarm of drones for video surveillance
study of the effect of the adoption of augmented and virtual reality on the quality perceived by the user
learning-based viewport prediction
learning-based compression schemes based on diffusion models
methods for network sparsification and quantization
compression of point clouds and light fields
an approach to asynchronous federated continual learning
definition of the Human CyberTwin to support the management of the QoE
biometrics and related compression techniques.

Most significant results

The project has already generated several practical outcomes, many of which have been consolidated in scientific publications.

These includes:

a content-aware compression and transmission method for automotive Lidar data
a continual learning method for semantic image segmentation
methods for detection of synthetic and manipulated speech
a method for deepfake detection
a method for viewport prediction
a federated continual learning method
a study on the impact of VR on user attention.
stress assessment for AR based on head movements
identification of the leading sensory cue in mulsemedia VR
a VR dataset for network and QoE studies
an aerial multimodal dataset with network measurements and perception data.

Several of these methods are expected to lead to technologies exploitable by the industry during the course of the project, as the related use cases have been chosen in such a way as to be relevant for the market.

The most significant results achieved so far in the field of audio processing are the following

1) Immersive Networked Music Performance based on 5G. One significant result is the exploration of the integration of 5G technology into Networked Music Performances (NMPs) highlighting the need of immersive audio integration and low-latency, high-reliability communication technologies. Two novel architectures have been introduced (embedded computing and MEC-based processing) and designed to meet the stringent requirements of Immersive NMPs, leveraging on the capabilities of 5G networks, including SDN, MEC, and network slicing. These results offer promising solutions for enabling remote, immersive musical performances over 5G networks, paving the way for more accessible and innovative forms of musical collaboration.

2) Hybrid Packet Loss Concealment Methods. Another significant achievement is the development of innovative Packet Loss Concealment methods specifically designed for both music and speech signals in remote interactive applications. These advanced methods feature a parallel structure that combines a linear predictive coder (LPC) branch with a neural network (NN) branch. This dual approach takes advantage of both traditional signal processing and machine learning techniques. In the case of music signals, these methods have shown superior performance compared to state-of-the-art solutions, representing a significant step forward in the field. These methods are particularly promising due to their potential for seamless integration into remote interaction applications. When properly implemented, they can greatly enhance audio quality, even when used with low-cost setups, making high-quality audio experiences more accessible across various platforms and environments.

3) Physics-informed neural network for volumetric audio. We introduced a novel approach for volumetric sound field reconstruction using Physics-Informed Neural Networks (PINNs). This approach allows one to integrate physical wave equation directly into the neural network, allowing the model to reconstruct sound fields with high accuracy even when using fewer microphones and lightweight architectures. This technology is especially relevant for 6 Degrees of Freedom (6DoF) applications, where users can move freely in a 3D space and experience sound from different perspectives. Reconstructing sound fields with fewer microphones reduces equipment costs in fields like VR, AR, and audio production. As 6DoF experiences grow, this technology can attract new audiences to classical music and other cultural areas through immersive, interactive soundscapes. PINNs also enable the digital preservation of unique acoustics from historic sites, like concert halls, allowing their sound environments to be archived and recreated in virtual spaces, preserving musical heritage even if the physical structures change.

The most significant results in the field of media networking and security are the following:

1) A novel algorithm based on non-linear model predictive control has been designed to control and coordinate the motion of a swarm of drones streaming to a ground control station the videos captured by onboard downward-facing cameras to patrol a given area following a desired path. In order to improve situational awareness, the proposed algorithm coordinates the swarm in such a way that

1) the field of view of the cameras overlap by a given percentage to allow video stitching operations at the GCS;
2) the motion of the drones should react to network bandwidth variations so to improve the visual quality of the received videos. For what concerns the societal impact, we argue that the proposed solution can improve security and surveillance capabilities, enabling efficient patrolling of large areas with relevant applications such as disaster response, urban monitoring, thus improving safety and response times. From the point of view of the economic impact, the solution allows to reduce costs associated with manual surveillance and patrolling, as fewer personnel are required to monitor large or dangerous zones. Furthermore, it opens up new opportunities for industries which rely on drone technology among which we cite precision agriculture, where field coverage and real-time data can enhance productivity and decision-making processes.

2) One of the project’s significant results is the development of an innovative audio splicing detection and localization technique based on acquisition device traces. This technique addresses the growing concern of audio manipulation, particularly through splicing speech segments from different sources to create fake or misleading content. By focusing on the unique traces left by different recording devices, the method utilizes a Convolutional Neural Network (CNN) to extract model-specific features, enabling the detection and localization of spliced segments. The impact of this achievement is significant both socially and economically. Socially, it strengthens the fight against misinformation and audio-based forgeries, protecting individuals, organizations, and governments from deceptive audio manipulations that could alter public perception or cause harm. Economically, the technique can be employed by media companies, legal institutions, and forensic investigators to authenticate recordings, reducing risks associated with fraud, defamation, and intellectual property violations, while enhancing trust in digital content.

The most significant results in the field of secure compressed representations are the following:

1) a new JPEG AI framework for compressed domain computer vision applications that have been proven to effectively handle face detection tasks. JPEG AI is a learning-based image codec able to perform computer vision tasks directly on the latent representation and that will attain International Standard status by October 2024. The developed framework combines JPEG AI and a bridge neural network architecture to efficiently perform face detection in a single scale scenario. Ongoing work is underway to extend the proposed framework to multi-scale scenarios.

2) the maximum tolerable distortion on electroencephalography (EEG) signals due to lossy compression has been identified. Several compression techniques for physiological biometric signals, with particular emphasis on EEG, have been investigated and a preliminary work has been conducted, in collaboration with the Working Group 32 (WG-32) of the Digital Imaging and Communications in Medicine (DICOM) standards committee, with the aim of determining the maximum distortion due to lossy compression that can be tolerated on EEG signals. As a result, it was concluded that a percentage root mean square difference (PRD) of 5% between original and reconstructed EEG signals can be accepted by clinicians and experts.

3) the use of wearable devices to perform continuous automatic people recognition during a metaverse session has been investigated. In particular, generative approaches have been proposed and investigated to convert inertial data into electrical measurements of heart activity with the main aim of developing reliable recognition systems.

Societal Impact:

introducing novel and reliable uses of wearable devices would result in multiple benefits, spanning from providing the tools to perform continuous recognition of users during any activity (including experiences in the metaverse), to allowing non-invasive monitoring of heart activity and of the associated health status. Relying on inexpensive consumer devices to do so would make the aforementioned applications more accessible for the potential users.

Economic impact:

Increasing the reliability of the recognition phase of metaverse applications would boost their acceptability and foster their widespread adoption. Introducing novel usages of wearable devices will likely increase their demand and encourage the development of the related technologies, driving innovation and growth in this sector.

Papers:

Daniele Ugo Leonzio, Luca Cuccovillo, Paolo Bestagini, Marco Marcon, Patrick Aichroth, Stefano Tubaro, "Audio Splicing Detection and Localization Based on Acquisition Device Traces", IEEE Transactions on Information Forensics and Security, 2023

M. Mel, A. Gatto, P. Zanuttigh, "Joint Reconstruction and Spatial Super-resolution of Hyper-Spectral CTIS Images via Multi-Scale Refinement", IEEE Transactions on Computational Imaging, 2024

Marco Olivieri, Amy Bastine, Mirco Pezzoli, Fabio Antonacci, Thushara Abhayapala, Augusto Sarti, "Acoustic Imaging With Circular Microphone Array: A New Approach for Sound Field Analysis", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2024

Project innovations

The project develops several technologies that can be the subject of industry collaboration and exploitation. We currently have two active lines of collaboration with industry. One of the project partners, Wind3, provides the business point of view about WP3 activities and WP4 scope, and highlights synergies between different network elements. Moreover, with the cascade calls we have a new industry partner (Xenia Progetti) which will help defining and demonstrating a use case of employment of digital twins for networked music performance.

The use case implemented in the MEET Metaverse project will provide evidence of user emotions, which are detected by an external system and, through this, influence and modify the environment/ interaction among users. The input emotions will be simulated in the first phase of the project to then, possibly, be replaced by the emotion analyzer/provider developed by the partners.

To reach the target of providing evidence of users emotions, two techniques have been identified:

Applying poses and expressions to Meta Avatars;
Managing the order of the songs in the playlist.

Project partners

Program partners:

Cascade calls partners:

Highlights

Recent research is related to restoration of audio signals with missing parts. We have developed methods based on artificial intelligence to recover those parts and provide near-perfect playback quality. We have also addressed the problem of audio splicing, where a malicious user might replace parts of an audio with other ones by the same speaker, altering the meaning of what is being said; Ai can tell us if manipulations have been applied.

Project Progress: KPIs

Publications Total number of publications (including journals and conference papers)

Expected: 36
Accomplished: 70
Readiness: 100%

Joint publications (at least 30% of total number of publications)

Expected: 12
Accomplished: 8
Readiness: 66%

Talk, dissemination and outreach activities (does not include conference presentations)

Expected: 9
Accomplished: 11
Readiness: 100%

Demo/PoC

Expected: 5 PoCs by the end of the project
Accomplished: 0
Readiness: 0% (work according to plan, as demo/PoCs are expected starting from the second year of the project).

Project meetings

Expected: 75
Accomplished: 96
Readiness: 100%

Patents/Innovations

Expected: 5 PoCs by the end of the project
Accomplished: 0
Readiness: 0% (work according to plan, as demo/PoCs are expected starting from the second year of the project).

Open source contributions

Expected: 0% (no open source contribution necessarily expected).
Accomplished: 4
Readiness: 100%

Standardization contributions

Expected: 0
Accomplished: 0
Readiness: 0% (work according to plan, no standardization contribution necessarily expected).

Project Progress: Milestones

M1.1 First release of exploitation, dissemination and impact

Expected M12
Accomplished M12
readiness 100%

M1.2 Second release of exploitation, dissemination and impact monitoring monitoring

Expected M24
Accomplished M12
readiness 100%

M1.3 Third release of exploitation, dissemination and impact monitoring

Expected M36
Accomplished M12
readiness 33%

M3.1 First release of audio and acoustic signal processing system

Expected M12
Accomplished M12
readiness 100%

M3.2 Advanced release of audio and acoustic signal processing system

Expected M24
Accomplished M12
readiness 100%

M3.3 Release of proof-of-concept of audio and acoustic signal processing system

Expected M36
Accomplished M12
readiness 33%

M4.1 First release of experience-aware distribution system for authentic contents

Expected M12
Accomplished M12
readiness 100%

M4.2 Advanced release of experience-aware distribution system for authentic contents

Expected M24
Accomplished M12
readiness 100%

M4.3 Release of proof-of-concept of experience-aware distribution system for authentic contents

Expected M36
Accomplished M12
readiness 33%

M6.1 First release of innovative media generation and representation system

Expected M12
Accomplished M12
readiness 100%

M6.2 Advanced release of innovative media generation and representation system

Expected M24
Accomplished M12
readiness 100%

M6.3 Release of proof-of-concept of innovative media generation and representation system

Expected M36
Accomplished M12
readiness 33%

Researchers involved: 222 Person-Monts

Collaboration proposals:

Provisional list (contact project PI for more info):

a collaboration on networked music performance, which allows musicians to collaborate and perform together in real-time, transcending geographical boundaries. The objective is to develop a more seamless and engaging collaborative musical experience;
a collaboration on efficient viewport-based algorithms for omnidirectional video streaming systems, employing machine learning methods and taking advantage of saliency information;
a collaboration on deepfake detection models for visual information employing deep neural networks;
a collaboration on neural radiance fields and Gaussian splatting for scene rendering;
a collaboration on low-complexity (e.g. binary) neural networks for inference and compression on embedded devices;

For any proposal of collaboration within the project please contact the project PI.

FUN-Media News:

RESTART cascade calls: results of Spoke 4 calls have been published