Subject
Over recent years, the availability and consumption of video content has risen significantly, driven largely by social platforms and streaming services. Audio is a modality closely interlinked with video. A common problem that arises in situations of limited, overloaded, or unreliable bandwidth is that the audio and video streams drift out of synchronisation. The aim of this project is to discover temporal inconsistencies between audio and video (e.g. lagging audio or playback mismatch).
Kind of work
The work to be carried out for this project will involve the joint use of video and audio models. The primary goal is to discover the correspondence, across time, between visual and auditory features. This will be done by jointly encoding visual and audio embeddings into a common feature space and then studying their cyclic consistency over time. The two embeddings are considered consistent (i.e. temporally aligned) if the projections from one modality correspond directly to the embeddings of the other modality at the same time step. The two embeddings are considered cyclically consistent if this consistency holds in both directions. Finally, by enforcing cyclic consistency as the main training objective, the audio and video streams can be aligned. A minimal sketch of such a cycle-consistency objective is given below.
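The following is a minimal, illustrative sketch of a soft cycle-consistency loss between two per-time-step embedding sequences, in the spirit of the temporal cycle-consistency work of Dwibedi et al. (2019) listed under Framework of the Thesis. It assumes PyTorch and equal-length, same-dimensional embeddings for both modalities; all function and variable names are placeholders, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(video_emb, audio_emb, temperature=0.1):
    """Soft cycle-consistency between two temporally aligned embedding
    sequences of shape (T, D).

    For each video time step we compute its soft nearest neighbour in the
    audio sequence, cycle back to the video sequence, and penalise cycles
    that do not return to the original time index.
    """
    T = video_emb.shape[0]

    # Soft nearest neighbour of each video step in the audio sequence.
    sim_va = -torch.cdist(video_emb, audio_emb) / temperature   # (T, T)
    alpha = F.softmax(sim_va, dim=1)
    soft_audio = alpha @ audio_emb                               # (T, D)

    # Cycle back: soft nearest neighbour of the soft audio in the video sequence.
    sim_av = -torch.cdist(soft_audio, video_emb) / temperature   # (T, T)
    beta = F.softmax(sim_av, dim=1)

    # The expected time index after the cycle should match the starting index.
    idx = torch.arange(T, dtype=video_emb.dtype)
    cycled_idx = beta @ idx                                      # (T,)
    return F.mse_loss(cycled_idx, idx)

if __name__ == "__main__":
    # Toy example: random sequences of 16 time steps with 128-d embeddings.
    v = torch.randn(16, 128)
    a = torch.randn(16, 128)
    print(cycle_consistency_loss(v, a).item())
```

In this sketch, a low loss means that projecting from video to audio and back lands on the same time step, which is exactly the notion of temporal alignment described above; misaligned streams would produce cycles that return to a different index.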
Framework of the Thesis
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P. and Zisserman, A., 2019. Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1801-1810).
Feichtenhofer, C., 2020. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 203-213).
Kazakos, E., Nagrani, A., Zisserman, A. and Damen, D., 2021. Slow-fast auditory streams for audio recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 855-859). IEEE.
Number of Students
1
Expected Student Profile
The project requires proficiency in Python and (some) previous experience in Computer Vision. Audio will be processed in the form of spectrograms (i.e. a 2D representation of frequency × time), so no previous knowledge of audio recognition is required; a sketch of this preprocessing step is shown below.
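As an illustration of the audio preprocessing mentioned above, the following is a minimal sketch of turning a waveform into a log-mel spectrogram. It assumes the torchaudio package and an example file path; the parameter values are illustrative defaults rather than fixed project choices.

```python
import torch
import torchaudio

def log_mel_spectrogram(wav_path: str) -> torch.Tensor:
    """Load an audio file and return a log-mel spectrogram (n_mels x time)."""
    waveform, sample_rate = torchaudio.load(wav_path)        # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)            # mix down to mono

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,
        hop_length=256,
        n_mels=64,
    )(waveform)                                               # (1, n_mels, frames)

    # Convert power to decibels so the 2D "image" has a manageable dynamic range.
    return torchaudio.transforms.AmplitudeToDB()(mel).squeeze(0)

if __name__ == "__main__":
    spec = log_mel_spectrogram("example.wav")  # placeholder path
    print(spec.shape)                          # e.g. torch.Size([64, T])
```

The resulting 2D array can then be treated like an image and fed to standard Computer Vision models, which is why no dedicated audio background is expected.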