DESCRIPTION
The first part of the tutorial deals with the electric network frequency (ENF) as a fingerprint for multimedia forensics. The ENF fluctuates around its nominal value of 60 Hz in the United States/Canada or 50 Hz in most other parts of the world. At any time instant, the ENF exhibits almost the same fluctuation across an interconnected power network. Thus, the ENF signal acquired from any power outlet in such a network during a particular time period can serve as a reference signal (i.e., ground truth) against which the ENF extracted from multimedia recordings is matched.
Both ENF detection and estimation will be addressed leveraging not only Statistical Signal Processing but Deep Learning as well. In the former case, non-parametric and parametric spectral estimation methods are reviewed for ENF extraction from the power mains signal and speech recordings. ENF estimation is elaborated as an alternating procedure: Least Absolute Deviation (LAD) regression determines the regression weights, and the objective function, the ℓ1 norm or the sum of ℓ1 norms of the approximation error, is then minimized with respect to frequency. This framework is a direct consequence of assuming Laplacian-distributed noise. Goodness-of-fit tests are reported, indicating that the Laplacian noise hypothesis is more appropriate than the Gaussian one on the benchmark ENF-Wuhan University (ENF-WHU) dataset.
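To make the LAD framework concrete, here is a minimal sketch, not the tutorial's exact algorithm: the LAD regression weights are approximated by iteratively reweighted least squares, and the frequency minimizing the ℓ1 approximation error is found by a grid search around a nominal 50 Hz. All function names and parameter values are illustrative.

```python
import numpy as np

def lad_fit(H, x, n_iter=20, eps=1e-6):
    """LAD regression via iteratively reweighted least squares:
    approximately minimizes the l1 norm of the residual x - H @ w."""
    w = np.linalg.lstsq(H, x, rcond=None)[0]       # least-squares start
    for _ in range(n_iter):
        r = np.abs(x - H @ w) + eps                # residual magnitudes
        Hw = H / r[:, None]                        # rows weighted by 1/|r|
        w = np.linalg.solve(H.T @ Hw, Hw.T @ x)    # weighted LS update
    return w

def lad_enf_frame(frame, fs, f_nom=50.0, span=0.1, n_grid=201):
    """Grid search around the nominal ENF for the frequency whose LAD
    fit attains the smallest l1 approximation error."""
    t = np.arange(len(frame)) / fs
    best_f, best_cost = f_nom, np.inf
    for f in np.linspace(f_nom - span, f_nom + span, n_grid):
        H = np.column_stack((np.cos(2 * np.pi * f * t),
                             np.sin(2 * np.pi * f * t)))
        cost = np.sum(np.abs(frame - H @ lad_fit(H, frame)))
        if cost < best_cost:
            best_f, best_cost = f, cost
    return best_f

# Toy usage: a 50.02 Hz tone in Laplacian noise, one 1 s frame at 400 Hz.
fs = 400
t = np.arange(fs) / fs
frame = np.cos(2 * np.pi * 50.02 * t) + np.random.laplace(scale=0.3, size=fs)
print(lad_enf_frame(frame, fs))   # should be close to 50.02
```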
Timestamping of audio recordings will be demonstrated by extracting the ENF signal and correlating it with the ground truth. ENF-based methods for accurate timestamping of audio recordings, using real-world data from various sources, will be compared with respect to accuracy.
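A minimal sketch of this matching step follows, assuming the extracted and reference ENF sequences share the same sampling rate; the lag maximizing the correlation coefficient serves as the timestamp candidate.

```python
import numpy as np

def timestamp_by_enf(enf_query, enf_ref):
    """Slide the ENF extracted from a recording along a longer reference
    ENF sequence and return the lag (in ENF samples) that maximizes the
    correlation coefficient, together with that maximum."""
    n = len(enf_query)
    best_lag, best_rho = 0, -np.inf
    for lag in range(len(enf_ref) - n + 1):
        rho = np.corrcoef(enf_query, enf_ref[lag:lag + n])[0, 1]
        if rho > best_rho:
            best_lag, best_rho = lag, rho
    return best_lag, best_rho

# Toy usage: the query is a noisy slice of a synthetic reference.
rng = np.random.default_rng(0)
ref = 50.0 + np.cumsum(rng.normal(scale=1e-3, size=3600))  # 1 h at 1 Hz
query = ref[1200:1500] + rng.normal(scale=1e-4, size=300)
print(timestamp_by_enf(query, ref))                        # lag near 1200
```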
An effective Convolutional Neural Network (CNN) framework for ENF estimation, coined DeepENF, will be discussed. DeepENF achieves state-of-the-art ENF estimation accuracy with a CNN that relies on a single ENF harmonic. The single-harmonic approach simplifies the architecture and reduces the computational cost of DeepENF compared to ENF estimation techniques of similar performance, which often require multiple ENF harmonics. It also eliminates the need for fine-tuning the number and combination of harmonics used for ENF estimation. These advantages make DeepENF particularly appealing to practitioners. DeepENF is evaluated on benchmark audio recordings from the ENF-WHU dataset, highlighting its proficiency in extracting the ENF signal from possibly noisy observations.
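The DeepENF architecture itself is not reproduced here; the following is a hypothetical PyTorch sketch of the single-harmonic idea only: a small 1D CNN regresses the ENF value from a frame assumed to be band-pass filtered around one harmonic. Every layer size and name is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SingleHarmonicENFNet(nn.Module):
    """Hypothetical single-harmonic ENF regressor, not DeepENF itself."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.head = nn.Linear(32 * 8, 1)   # regress the ENF value (Hz)

    def forward(self, x):                  # x: (batch, 1, frame_len)
        return self.head(self.features(x).flatten(1))

net = SingleHarmonicENFNet()
frames = torch.randn(4, 1, 1000)           # four dummy band-passed frames
print(net(frames).shape)                    # torch.Size([4, 1])
```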
ENF estimation is extended to static and non-static digital video recordings. The estimation exploits areas with similar characteristics in each video frame, known as superpixels, whose mean intensity exceeds a specific threshold. Spectral estimation techniques are applied to the time series of superpixel intensities, and the maximum correlation coefficient against the ground-truth signal measures the accuracy of ENF estimation.
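A rough sketch of this video pipeline is given below, assuming scikit-image's SLIC for superpixel segmentation; the threshold, segment count, and synthetic flicker are illustrative choices, not the tutorial's settings.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_series(video, thresh=0.3, n_segments=200):
    """video: (n_frames, H, W) grayscale array in [0, 1].  Each frame is
    segmented into superpixels; the means of the superpixels brighter
    than the threshold form one sample of the time series."""
    series = []
    for frame in video:
        labels = slic(frame, n_segments=n_segments, channel_axis=None)
        means = [frame[labels == k].mean() for k in np.unique(labels)]
        bright = [m for m in means if m > thresh]
        series.append(np.mean(bright) if bright else frame.mean())
    return np.asarray(series)

def dominant_frequency(series, fps):
    """Periodogram peak of the detrended series (aliased ENF candidate)."""
    x = series - series.mean()
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    return freqs[np.argmax(spec[1:]) + 1]          # skip the DC bin

# Toy usage: 300 frames at 25 fps with a 10 Hz intensity flicker,
# standing in for the aliased ENF component of a real video.
fps, n = 25.0, 300
flicker = 0.05 * np.sin(2 * np.pi * 10.0 * np.arange(n) / fps)
video = 0.5 + flicker[:, None, None] + 0.01 * np.random.rand(n, 64, 64)
print(dominant_frequency(superpixel_series(video), fps))   # close to 10 Hz
```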
Prior to ENF estimation, one should verify that the ENF can be detected in the multimedia recordings. A binary hypothesis testing problem is formulated: the recording contains only noise versus noise plus an ENF component. Motivated by experimental evidence indicating that the test statistic fits the Laplacian distribution better than the Gaussian one, a LAD-based ENF detector is proposed.
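As a loose illustration, not the proposed detector's exact test statistic, the sketch below reuses lad_fit and lad_enf_frame from the earlier LAD sketch and thresholds the relative ℓ1 error reduction achieved by fitting a tone near the nominal ENF; the threshold value is an arbitrary assumption.

```python
import numpy as np

def lad_detect(frame, fs, f_nom=50.0, threshold=0.05):
    """Compare the l1 fitting error without a tone (under H0 the LAD fit
    of a constant is the median) against the error after fitting a tone
    near the nominal ENF (H1); detect if the relative gain is large."""
    cost_h0 = np.sum(np.abs(frame - np.median(frame)))
    f_hat = lad_enf_frame(frame, fs, f_nom)        # from the earlier sketch
    t = np.arange(len(frame)) / fs
    H = np.column_stack((np.cos(2 * np.pi * f_hat * t),
                         np.sin(2 * np.pi * f_hat * t)))
    cost_h1 = np.sum(np.abs(frame - H @ lad_fit(H, frame)))
    stat = (cost_h0 - cost_h1) / cost_h0           # relative l1 gain
    return stat > threshold, stat

# Toy usage: Laplacian noise with and without a weak 50.01 Hz tone.
fs = 400
t = np.arange(fs) / fs
noise = np.random.laplace(scale=0.3, size=fs)
print(lad_detect(noise, fs))                                # likely False
print(lad_detect(noise + 0.5 * np.cos(2 * np.pi * 50.01 * t), fs))  # True
```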
Sensing devices for capturing the ENF from the power mains as well as from light source fluctuations will be demonstrated.
The second part of the tutorial is devoted to source device identification (SDI). The starting observation is that speech signals convey information not only about the speakers' identity or the spoken language, but also about the acquisition devices used to capture them. This is attributed to the intrinsic traces left behind by the acquisition devices and their associated signal processing chains. SDI is pivotal in multimedia forensics, as it entails recognizing the device that captured a specific audio, image, or video. The typical methodology for mobile phone identification from recorded speech signals is to extract features from the entire signal, model the feature distribution of each phone, and then classify the testing data. Earlier methods for device identification resorted to suitable feature extraction from the speech recording spectrogram, feature selection, and sparse representation of the selected features. Unsupervised and supervised feature selection procedures will be demonstrated. Experiments will be reported on benchmark databases, such as the set of 8 telephone handsets from the Lincoln-Labs Handset Database (LLHDB), distributed by the Linguistic Data Consortium, as well as our publicly released MOBIPHONE database of 21 mobile phones of various models from 7 different brands. Moreover, we will demonstrate that extracting features from non-speech segments, or extracting features from the entire recording and modeling them using a Universal Background Model (UBM) of speech, improves classification accuracy. Experimental results will be disclosed on two benchmark datasets, MOBIPHONE and the Central China Normal University (CCNU) Mobile dataset, demonstrating that non-speech features and UBM modeling yield higher classification accuracy even under noisy recording conditions and amplified speaker variability.
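The following sketch illustrates the UBM idea with scikit-learn Gaussian mixtures on synthetic stand-in features; a real system would use spectrogram-derived features and proper MAP adaptation, so every size and value here is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-device feature matrices (frames x coeffs);
# real features would be extracted from the speech recordings.
train = {d: rng.normal(loc=d, size=(500, 20)) for d in range(3)}
test = rng.normal(loc=1, size=(200, 20))            # truly from device 1

# 1) UBM: a single GMM fit on features pooled across all devices.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(np.vstack(list(train.values())))

# 2) Per-device GMMs warm-started from the UBM parameters (a crude
#    stand-in for MAP adaptation).
models = {d: GaussianMixture(n_components=8, covariance_type="diag",
                             weights_init=ubm.weights_, means_init=ubm.means_,
                             random_state=0).fit(feats)
          for d, feats in train.items()}

# 3) Classify by the maximum average log-likelihood.
pred = max(models, key=lambda d: models[d].score(test))
print("predicted device:", pred)                    # expected: 1
```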
An innovative SDI method will be discussed that uses log-Mel spectrograms extracted from video audio and an optimized ResNet-based model enhanced with Neural Architecture Search, integrated with Gradient-weighted Class Activation Mapping (Grad-CAM) to gain insight into the influential spectrogram regions. Grad-CAM reveals a strong emphasis on the high-frequency components of the audio data, motivating band-pass filtering of the input spectrograms to selectively retain high-frequency information, which yields the highest classification accuracy of camera models. Experiments conducted on the VISION dataset, comprising data from 35 different devices, demonstrate the effectiveness of the proposed method in achieving accurate and interpretable SDI, marking the first application of explainable Artificial Intelligence (xAI) techniques resorting to Grad-CAM in this context. Furthermore, a bootstrap analysis evaluates the impact on classification performance of the proposed methodology with and without the integration of Grad-CAM explanations. By assessing the Grad-CAM-driven method, featuring band-pass filtered log-Mel spectrograms, against state-of-the-art approaches, its high accuracy in SDI is illustrated. Moreover, a framework capable of identifying devices using audio, visual content, or a fusion of the two will be presented. The fusion of visual and audio content occurs at a late stage by applying two fundamental fusion rules: the product rule and the sum rule. Experimental evaluation illustrates that the proposed framework exhibits promising classification performance when using audio or visual content independently. Furthermore, although the fusion results do not consistently surpass both individual modalities, they demonstrate promising potential for enhancing classification performance.
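Since the product and sum rules are stated above, a minimal late-fusion sketch follows; the per-class posteriors and their values are toy placeholders, not results from the tutorial's experiments.

```python
import numpy as np

def late_fusion(p_audio, p_visual, rule="product"):
    """Combine per-class posteriors from the audio and visual branches
    using the product or sum rule, then renormalize."""
    fused = p_audio * p_visual if rule == "product" else p_audio + p_visual
    return fused / fused.sum()

p_audio = np.array([0.6, 0.3, 0.1])    # toy audio-branch posteriors
p_visual = np.array([0.5, 0.4, 0.1])   # toy visual-branch posteriors
print(late_fusion(p_audio, p_visual, "product").round(3))
print(late_fusion(p_audio, p_visual, "sum").round(3))
```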
More information and related papers can be found at https://aegis.web.auth.gr under Publications in the News and Media section.
DETAILS
Course type: Tutorial
Duration: 3 hours of lectures plus two half-hour slots for demos, breaks, and questions (4 hours total)
Institution of lecturer: Aristotle University of Thessaloniki
Notes: Full detailed slides and bibliography will be provided
Course link: will be provided in due time.
LECTURER
Prof. Constantine Kotropoulos
He was born in Kavala, Greece, in 1965. He received the Diploma degree with honors in Electrical Engineering in 1988 and the PhD degree in Electrical & Computer Engineering in 1993, both from the Aristotle University of Thessaloniki. He is currently a Full Professor in the Department of Informatics at the Aristotle University of Thessaloniki. He was a visiting research scholar in the Department of Electrical & Computer Engineering at the University of Delaware, USA, during the academic year 2008-2009, and he conducted research in the Signal Processing Laboratory at Tampere University of Technology, Finland, during the summer of 1993. He has co-authored 70 journal papers and 225 conference papers, and contributed 9 chapters to edited books in his areas of expertise. He is co-editor of the book “Nonlinear Model-Based Image/Video Processing and Analysis” (J. Wiley and Sons, 2001). His current research interests include audio, speech, and language processing; signal processing; pattern recognition; multimedia information retrieval; forensics and biometric authentication techniques; and human-centered multimodal computer interaction.
Prof. Kotropoulos was a scholar of the State Scholarship Foundation of Greece and the Bodossaki Foundation. He is a Senior Member of the IEEE and a member of EURASIP, IAPR, and the Technical Chamber of Greece. He was an Area Editor of IEEE Signal Processing Letters, and he has served on the Editorial Boards of the journals Advances in Multimedia, International Scholarly Research Notices, Computer Methods in Biomechanics and Biomedical Engineering: Imaging and Visualization, Artificial Intelligence Review, MDPI Journal of Imaging, MDPI Signals, and MDPI Methods and Protocols. He was Track Chair for Signal Processing at the 6th Int. Symposium on Communications, Control, and Signal Processing, Athens, 2014; Program Co-Chair of the 4th Int. Workshop on Biometrics and Forensics (IWBF 2016), Limassol, Cyprus, 2016; one of the four Program Committee Chairs of the XXV European Signal Processing Conf., Kos, Greece, 2017; Technical Program Chair of the 5th IEEE Global Conf. on Signal and Information Processing, Montreal, Canada; and one of the four Technical Program Chairs of the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing held in Rhodes, Greece.
Additional information
Kotropoulos' Google Scholar profile: https://scholar.google.com/citations?user=c9Dl7qwAAAAJ&hl=en