Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart-city traffic safety enhancement, and vehicle-to-pedestrian communication. In the computer vision domain, tracking is usually achieved by first detecting subjects and then associating detected bounding boxes across video frames. Typically, frames are transmitted to a remote site for processing, incurring high latency and network costs. To address this, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Wi-Fi Fine Time Measurements). It leverages a transformer's ability to better model long-term time-series data. ViFiT is evaluated on the Vi-Fi Dataset, a large-scale multimodal dataset spanning 5 diverse real-world scenes, including indoor and outdoor environments. Results demonstrate that ViFiT outperforms X-Translator, the state-of-the-art LSTM encoder-decoder approach for cross-modal reconstruction, and achieves a high frame reduction rate of 97.76% with IMU and Wi-Fi data.
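The abstract reports a frame reduction rate of 97.76%, i.e. the fraction of video frames that no longer need to be transmitted because their bounding boxes are reconstructed from phone sensor data instead. The sketch below is an illustrative, hypothetical computation of such a rate (the function name and the exact definition are assumptions, not taken from the paper); it shows how the headline number relates to the split between transmitted and reconstructed frames.

```python
def frame_reduction_rate(total_frames: int, transmitted_frames: int) -> float:
    """Fraction of frames that are reconstructed (not transmitted).

    Hypothetical definition assumed for illustration: the rate is
    1 - transmitted/total, so a higher rate means fewer frames sent
    to the remote site and lower network cost.
    """
    if total_frames <= 0 or transmitted_frames < 0:
        raise ValueError("frame counts must be non-negative, total > 0")
    return 1.0 - transmitted_frames / total_frames


# Example: if only 224 of 10,000 frames are transmitted and the rest are
# reconstructed from IMU and Wi-Fi FTM data, the reduction rate is 97.76%.
rate = frame_reduction_rate(10_000, 224)
print(f"{rate:.2%}")  # 97.76%
```

Under this assumed definition, a 97.76% rate means only about 1 in every 45 frames is actually sent over the network; the remainder are filled in by the reconstruction model.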
CITATION STYLE
Cao, B. B., Alali, A., Liu, H., Meegan, N., Gruteser, M., Dana, K., … Jain, S. (2023). ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time Measurements. In ISACom 2023 - Proceedings of the 2023 3rd ACM MobiCom Workshop on Integrated Sensing and Communication Systems (pp. 13–18). Association for Computing Machinery, Inc. https://doi.org/10.1145/3615984.3616503