This book deals with the creation of the algorithmic backbone that enables a computer to perceive humans in a monitored space, using the same signals that humans process: audio and video. Computers reproduce this kind of perception with sensors and algorithms, detecting and tracking multiple interacting humans through multiple cues such as bodies, faces, and speech. The application domain is challenging because the audio and visual signals are cluttered by both background and foreground objects. First, particle filtering is established as the framework for tracking. Audio, visual, and audio-visual tracking systems are then explained separately; each modality is analyzed starting with the sensor configuration, continuing with detection for tracker initialization, and ending with the trackers themselves. Techniques for fusing the modalities are considered next. Rather than offering a monolithic treatment of the tracking problem, the book also focuses on implementation, providing MATLAB code for every component presented, so that the reader can connect each concept with the corresponding code. Finally, the applications of the various tracking systems in different domains are studied.
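Since particle filtering is named as the tracking framework, a minimal sketch of a bootstrap (SIR) particle filter for a one-dimensional tracking problem may help fix the idea. The book's accompanying code is MATLAB; this sketch is in Python purely for illustration, and all names, noise parameters, and the random-walk motion model are assumptions of this example, not the book's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(observations, n_particles=500,
                    motion_std=1.0, obs_std=2.0):
    """Bootstrap (SIR) particle filter for a 1-D random-walk state.

    Illustrative only: the book develops far richer dynamic and
    observation models for audio, visual and audio-visual tracking.
    """
    # Initialize the particle cloud around the first observation.
    particles = observations[0] + obs_std * rng.standard_normal(n_particles)
    estimates = []
    for z in observations:
        # Predict: propagate each particle through the motion model.
        particles = particles + motion_std * rng.standard_normal(n_particles)
        # Update: weight particles by the Gaussian observation likelihood.
        weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        weights /= weights.sum()
        # Estimate: weighted mean of the particle cloud.
        estimates.append(float(np.dot(weights, particles)))
        # Resample: draw particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return estimates

# Track a target moving at constant velocity, observed in noise.
truth = np.linspace(0.0, 30.0, 60)
obs = truth + 2.0 * rng.standard_normal(truth.size)
est = particle_filter(obs)
print(np.mean(np.abs(np.array(est) - truth)))  # mean absolute tracking error
```

The predict/update/resample loop above is the common skeleton; the chapters then differ mainly in the state space (e.g. image position, direction of arrival) and in the likelihood used to weight the particles.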