The PKU-DAVIS-SOD dataset is a large-scale multimodal neuromorphic object detection dataset that includes challenging scenarios such as low light and high-speed motion blur. It was constructed by the National Engineering Research Center for Visual Technology, Peking University.

Collection Setup. The dataset is recorded with a DAVIS346 event camera. As shown in Fig. 1(a), we install a DAVIS346 camera on the front windshield of a driving car. To capture high-speed objects while also providing a comprehensive perspective of them, we additionally provide some sequences in which the camera is set at the side of the road, recording objects from the flanks. The DAVIS346 camera shown in Fig. 1(b) simultaneously outputs high-temporal-resolution asynchronous events and conventional RGB frames at a resolution of 346 × 260.
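The two output streams described above can be modeled as follows. This is a hypothetical sketch (the names `Event` and `in_bounds` are ours, not from any DAVIS SDK): asynchronous events carry a microsecond timestamp, pixel coordinates, and a polarity, while frames live on the same 346 × 260 pixel grid.

```python
from collections import namedtuple

# Sensor geometry of the DAVIS346.
WIDTH, HEIGHT = 346, 260

# One asynchronous event: timestamp in microseconds, pixel location,
# and polarity (+1 for a brightness increase, -1 for a decrease).
Event = namedtuple("Event", "t_us x y polarity")

def in_bounds(ev):
    """Check that an event's coordinates fall on the sensor array."""
    return 0 <= ev.x < WIDTH and 0 <= ev.y < HEIGHT

ev = Event(t_us=1523, x=120, y=80, polarity=1)
print(in_bounds(ev))  # True
```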

(a) Recording platform  (b) DAVIS346 camera
Fig.1 Setup of data collection

Data Recordings and Annotation. Our PKU-DAVIS-SOD dataset covers three traffic scenarios, chosen with attention to velocity distribution, lighting conditions, category diversity, and object scale (see Fig. 2). We use the DAVIS346 to record 220 sequences comprising RGB frames and DVS events. For each sequence, we collect approximately one minute of raw data, with RGB frames at 25 FPS. To provide manual bounding boxes in challenging scenarios (e.g., high-speed motion and low light), grayscale images are reconstructed from the asynchronous events using E2VID at 25 FPS whenever the RGB frames are of low quality. After temporal calibration, we first select three common and important object classes from daily life (car, pedestrian, and two-wheeler). All bounding boxes are then annotated from the RGB frames or the synchronized reconstructed images by a well-trained professional team.
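The temporal calibration step above can be sketched as follows. This is an illustrative assumption, not the dataset's actual tooling: each 25 FPS annotation timestamp is paired with the asynchronous events in a surrounding window, which is the slice a reconstruction method such as E2VID would consume.

```python
import numpy as np

FPS = 25
frame_ts = np.arange(0, 1.0, 1 / FPS)  # one second of label timestamps

# Synthetic stand-in for a sorted event timestamp stream.
events_t = np.sort(np.random.default_rng(0).uniform(0, 1.0, 10_000))

def events_for_frame(t, half_window=0.5 / FPS):
    """Return the event timestamps within +/- half a frame period of t."""
    lo = np.searchsorted(events_t, t - half_window)
    hi = np.searchsorted(events_t, t + half_window)
    return events_t[lo:hi]

windows = [events_for_frame(t) for t in frame_ts]
print(len(windows))  # 25 windows, one per labeled timestamp
```

Because the windows are disjoint (half a frame period on each side), every event is assigned to at most one labeled timestamp.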

(a) Category diversity  (b) Light change  (c) Object scale  (d) Velocity distribution
Fig.2 Representative examples.

Data Statistics. Manual annotations are provided throughout the recordings at a frequency of 25 Hz. As a result, the dataset has 276k labeled timestamps and 1080.1k labels in total. We split these into 671.3k labels for training, 194.7k for validation, and 214.1k for testing. The precise numbers can be found in Table 1.
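As a quick sanity check on the figures above, the three splits sum to the stated total, and the resulting proportions are roughly 62/18/20 percent:

```python
# Label counts (in thousands) as stated for PKU-DAVIS-SOD.
total = 1080.1
splits = {"train": 671.3, "val": 194.7, "test": 214.1}

# The splits should account for every label.
assert abs(sum(splits.values()) - total) < 1e-6

for name, count in splits.items():
    print(name, round(100 * count / total, 1))  # train 62.2, val 18.0, test 19.8
```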

Table 1 The details of the PKU-DAVIS-SOD dataset.


  1. DVS events, APS frames, and the corresponding annotation results can only be used for ACADEMIC PURPOSES. NO COMMERCIAL USE is allowed.
  2. Copyright © National Engineering Research Center for Visual Technology and Institute of Digital Media, Peking University (PKU-IDM). All rights reserved.


You can download directly from here.

Address: Room 2604, Science Building No.2, Peking University, No.5 Yiheyuan Road, Haidian District, Beijing, P.R.China.

Fax: +86-10-62755965.


The PKU-Retina-Recon dataset is constructed by the National Engineering Laboratory for Video Technology (NELVT), Peking University. The goals of creating the PKU-Retina-Recon dataset include:

  1. providing researchers worldwide in the neuromorphic vision community with a spike/event dataset for evaluating their algorithms;
  2. facilitating the development of reconstruction technologies by providing several spike sequences with different motion speeds or lighting conditions.

Therefore, the PKU-Retina-Recon dataset is now partly made available, for academic purposes only, on a case-by-case basis.

The PKU-Retina-Recon dataset is constructed by the National Engineering Laboratory for Video Technology (NELVT), Peking University, sponsored by the National Basic Research Program of China and the Chinese National Natural Science Foundation. The NELVT at Peking University serves as the technical agent for distribution of the dataset and reserves the copyright of all the sequences in the dataset. Any researcher who requests the PKU-Retina-Recon dataset must sign this agreement and thereby agrees to observe the restrictions listed in this document. Failure to observe the restrictions will result in access being denied for future versions of the PKU-Retina-Recon dataset, and in liability for civil damages in the case of publication of sequences that have not been approved for release.


  •  The spike sequences for download are part of the PKU-Retina-Recon dataset.
  •  The sequences can only be used for ACADEMIC PURPOSES. NO COMMERCIAL USE is allowed.
  •  Copyright © National Engineering Laboratory for Video Technology (NELVT) and Institute of Digital Media, Peking University (PKU-IDM). All rights reserved.

All publications using PKU-Retina-Recon should cite the papers below:

Lin Zhu, Jianing Li, Xiao Wang, Tiejun Huang, Yonghong Tian; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2400-2409


You can download the agreement (PDF) by clicking the DOWNLOAD link. After filling it in, please send the electronic version to our email: pkuml at (Subject: PKU-Retina-Recon Agreement).


The team has made extraordinary contributions to the compression and analysis of visual big data, neuromorphic vision, and brain-inspired computation, with over 200 peer-reviewed top-tier journal and conference publications, more than 80 patents granted and 20 patents pending, two competitive best paper awards, and six national/ministerial awards. Dozens of core techniques have been adopted into international and national standards and commercial products. Four representative achievements are briefly presented as follows:

1 Neuromorphic Vision Chips and Devices

To address the visual imaging challenges of high-speed motion, the team invented a retina-like integral visual imaging and reconstruction technology, inspired by findings from the simulation of a ten-million-scale neural network of a monkey retinal fovea on a supercomputer. Moreover, the team creatively developed a highly efficient compression framework for visual spike streams and designed a full-time fovea-like visual sensing chip with a sampling frequency of 40,000 Hz, along with a spike camera built from the same photosensitive devices as traditional cameras. That is, by utilizing only consumer-level CMOS sensors and integrated circuits, a camera with the fovea-like visual sensing chip, called Vidar, is 1,000x faster than a conventional camera.


The Vidar camera is able to reconstruct the image at any given moment with considerable flexibility in dynamic range. Currently, the team is developing a large field-of-view neuromorphic vision measurement device that can effectively and efficiently detect high-speed moving objects such as tennis balls, bullets, and missiles.

2 Visual Big Data Technologies

The team has made a seminal contribution to developing innovative video analytic algorithms and systems for visual big data applications. They pioneered a visual saliency computation approach framed uniquely from a machine learning perspective, and creatively designed a fine-grained object recognition scheme at a very early stage, both inspired by the neurobiological mechanisms of the human vision system. This work promoted learning-based visual saliency computing and fine-grained object recognition into mainstream research topics in the field. Building on these video analytic algorithms, the team invented a new scalable visual computing framework, called the digital retina, to tackle the visual big data challenges and to replace the old framework that had formed fifteen years earlier with the rise of cloud computing. The team also developed a digital retina server and a city brain system to enable the framework. The framework has proven effective in practice at addressing the challenge of aggregating video streams from hundreds of thousands of geographically distributed cameras into the cloud for big data analysis.


Some of their algorithms and systems have been commercially transferred and applied to urban video surveillance systems and transportation systems in large and medium-sized cities such as Shenzhen, Qingdao, and Guiyang. For instance, their system successfully assisted the police in solving a case in which a fake-plate Volvo car maliciously violated the traffic regulations in Qingdao 433 times in one year. Moreover, their system provides more precise and more timely sensing of the city's traffic status, so that the city brain can adopt the corresponding optimization strategy to reduce traffic congestion and improve traffic flow.

3 Scene-based Video Coding Technology and Standards

In conventional research, video coding was often treated as a signal processing problem with ideal mathematical settings, which inevitably ignored many realistic characteristics of scene videos, e.g., the redundant background in surveillance and conference video. To incorporate such scene characteristics, the team creatively embedded a low-complexity, high-efficiency background modeling module into the video coding loop and established a novel standard-compatible scene-based video coding framework. This framework achieves approximately twice the coding efficiency on surveillance videos while remarkably reducing encoding complexity, relative to all the standard reference software, including H.264/AVC and H.265/HEVC. More importantly, foreground objects can be extracted simultaneously and represented as regions of interest (ROIs). These data are directly usable by intelligent analysis tasks such as object detection and tracking. In other words, the video coding becomes more analysis-friendly, with enhanced support for visual content interaction, and more suitable for video analysis.


In the last several years, the scene-based video coding technology has become the core of the IEEE 1857 standard and China's AVS standard. Hikvision, the leading supplier of video surveillance equipment worldwide, and Hisense, the No. 1 vendor for urban traffic and public transport in China, have embedded this standardized technology into their smart camera products and intelligent transportation systems, earning billions in sales revenue in this rapidly growing market.

4 Pengcheng CloudBrain and its AI Open-Source Platform

In the past two years, as Chief Architect, Dr. Tian has been leading the development of the Pengcheng Cloudbrain, one of the leading AI supercomputers in China for academic research. Cloudbrain-I consists of 1000+ NVIDIA V100 GPUs and self-developed resource management and scheduling software (called Octopus). Cloudbrain-II contains 2048 Kunpeng CPUs and 4096 Ascend NPUs developed by Huawei, delivering a total of 2 PFLOPS of FP64 and 1 EFLOPS of FP16 performance. In Nov 2020, it ranked first in both the full-system and 10-node configurations at the IO500, a comprehensive benchmark suite that enables comparison of high-performance storage and I/O systems. It also topped the AIPerf benchmark in Nov 2020, an end-to-end benchmark suite utilizing automated machine learning (AutoML) that represents real AI scenarios for supercomputers.


The Cloudbrain is currently used in a range of challenging applications, e.g., training large-scale pre-trained NLP models such as GPT-3 and its extended versions, or implementing city-level traffic situation awareness from thousands of traffic cameras. It has also been built as an open-source infrastructure that serves the needs of AI researchers and technologists nationwide within China, and will eventually open to the worldwide community.

Surveillance Video: The Biggest Big Data

This position article is cited from T. Huang, “Surveillance Video: The Biggest Big Data,” Computing Now, vol. 7, no. 2, Feb. 2014, IEEE Computer Society [online];


Big data continues to grow exponentially, and surveillance video has become the largest source. Against that backdrop, this issue of Computing Now presents five articles from the IEEE Computer Society Digital Library focused on research activities related to surveillance video. It also includes some related references on how to compress and analyze the huge amount of video data that’s being generated.

Surveillance Video in the Digital Universe

In recent years, more and more video cameras have been appearing throughout our surroundings, including surveillance cameras in elevators, ATMs, and the walls of office buildings, as well as those along roadsides for traffic-violation detection, cameras for caring for kids or seniors, and those embedded in laptops and on the front and back sides of mobile phones. All of these cameras are capturing huge amounts of video and feeding it into cyberspace daily. For example, a city such as Beijing or London has about one million cameras deployed. Now consider that these cameras capture more in one hour than all the TV programs in the archives of the British Broadcasting Corporation (BBC) or China Central Television (CCTV). According to the International Data Corporation’s recent report, “The Digital Universe in 2020,” half of global big data — the valuable matter for analysis in the digital universe — was surveillance video in 2012, and the percentage is set to increase to 65 percent by 2015.

To understand the R&D activities related to video surveillance, I searched the keywords video and surveillance in IEEE Xplore (within metadata only) and the IEEE CSDL (by exact phrase). The search results showed 6,832 (in Xplore) and 3,111 (in the CS Digital Library) related papers published in IEEE conferences, journals, or magazines. Figure 1 shows the annual histogram of these publications. The sharp increase over the past ten years clearly indicates that research on surveillance video is very active.

Figure 1. Histogram of publications in IEEE Computer Society Digital Library and IEEE Xplore for which metadata contains the keywords video and surveillance. Note: “~1989” shows all articles up to 1989. The numbers for 2013 might also increase as some are still waiting to be archived into the database.

Theme Articles

Surveillance-video big data introduces many technological challenges, including compression, storage, transmission, analysis, and recognition. Among these, the two most critical challenges are how to efficiently transmit and store the huge amount of data, and how to intelligently analyze and understand the visual information inside.

Higher-efficiency video compression technology is urgently needed to reduce the storage and transmission cost of big surveillance data. The state-of-the-art High Efficiency Video Coding (HEVC) standard, featured in the October 2013 CN theme, can compress a video to about 3 percent of its original data size. In other words, HEVC doubles the data compression ratio of the H.264/MPEG-4 AVC approved in 2003. In fact, the latter doubled the ratio of the previous-generation standards MPEG-2/H.262, which were approved in 1993. Despite these advances, this doubling of video-compression performance every ten years is too slow to keep pace with the growth of surveillance video in our physical world, which is now doubling every two years, on average!
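The mismatch in the paragraph above is easy to quantify. Under its stated rates (compression efficiency doubling every ten years, surveillance data doubling every two), the compressed footprint still grows by a factor of 16 per decade:

```python
# Growth mismatch: raw video volume doubles every 2 years, while
# compression efficiency doubles only every 10 years.
years = 10
data_growth = 2 ** (years / 2)        # 32x more raw video per decade
compression_gain = 2 ** (years / 10)  # 2x better compression per decade

# Net growth in stored (compressed) data over the decade.
net_storage_growth = data_growth / compression_gain
print(net_storage_growth)  # 16.0
```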

To achieve a higher compression ratio, the unique characteristics of surveillance video must be factored into the design of new video-encoding standards. Unlike standard video, for instance, surveillance footage is usually captured in a specific place day after day, or even month after month. Yet, previous standards fail to account for the specific residuals that exist in surveillance video (for example, unchanging backgrounds or foreground objects that appear many times). The new IEEE std 1857, entitled Standard for Advanced Audio and Video Coding, contains a surveillance profile that can further remove background residuals. The profile doubles the AVC/H.264 compression ratio with even lower complexity. In “IEEE 1857 Standard Empowering Smart Video Surveillance Systems,” Wen Gao, our colleagues, and I present an overview of the standard, highlighting its background-model-based coding technology and recognition-friendly functionalities. The new approach is also employed to enhance HEVC/H.265 and nearly double its performance as well. (Additional technical details can be found in “Background-Modeling Based Adaptive Prediction for Surveillance Video Coding,” which is available to subscribers via IEEE Xplore.)

Much like the physical universe, the vast majority of the digital universe is so-called digital dark matter — it’s there, but what we know about it is very limited. According to the IDC report I mentioned earlier, 23 percent of the information in the digital universe would be useful for big data if it were tagged and analyzed. Yet, technology is far from where it needs to be, and in practice, only 3 percent of potentially useful data is tagged — and even less is currently being analyzed. In fact, people, vehicles, and other moving objects appearing in millions of cameras will be a rich source for machine analysis to understand the complicated society and world. As guest editor Dorée Duncan Seligmann discussed in CN’s April 2012 theme, video is even more challenging than other data types for automatic analysis and understanding. This month we add three articles on the topic that have been published since then.

Human beings are generally the major objects of interest in surveillance video analysis. In the best paper from the 2013 IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), “Reference-Based Person Re-identification” (available to IEEE Xplore subscribers), Le An and his colleagues propose a reference-based method for learning a subspace in which the correlations among reference data from different cameras are maximized. From there, the system can identify people who are present in different camera views with significant illumination changes.

Human behavior analysis is the next step for deeper understanding. Shuiwang Ji and colleagues’ “3D Convolutional Neural Networks for Human Action Recognition” introduces the deep learning underlying human-action recognition. The proposed 3D convolutional neural networks model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent video frames. Experiments conducted using airport videos achieved superior performance compared to baseline methods.
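The core idea of the 3D CNN above, extracting features along the temporal dimension as well as the spatial ones, can be illustrated with a minimal sketch. This is not the authors' model; it is a naive NumPy 3D convolution (loop-based, for clarity) with a hand-made temporal-difference kernel that responds to change between adjacent frames:

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D convolution over a (T, H, W) video clip."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# A purely temporal kernel: responds to brightness change between frames.
clip = np.zeros((4, 5, 5))
clip[1, 2, 2] = 1.0  # a bright pixel appears in frame 1
kernel = np.zeros((2, 1, 1))
kernel[0, 0, 0], kernel[1, 0, 0] = -1.0, 1.0

response = conv3d_valid(clip, kernel)
print(response.shape)  # (3, 5, 5): motion information across adjacent frames
```

The response is +1 where the pixel turns on (between frames 0 and 1) and -1 where it turns off, which is exactly the kind of motion cue a learned 3D kernel can pick up.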

In “Monocular Visual Scene Understanding: Understanding Multi-Object Traffic Scenes,” Christian Wojek and his colleagues present a novel probabilistic 3D scene model that integrates geometric 3D reasoning with state-of-the-art multiclass object detection, object tracking, and scene labeling. This model uses inference to jointly recover the 3D scene context and perform 3D multi-object tracking, using only monocular video as input. The article includes an evaluation of several challenging sequences captured by onboard cameras, which illustrate that the approach shows substantial improvement over the current state of the art in 3D multiperson tracking and multiclass 3D tracking of cars and trucks on a challenging data set.

Toward a Scene Video Age

This month’s theme also includes a video from John Roese, the CTO of EMC Corp., with his technical insight on this topic.

Much as with surveillance, the number of videos captured in classrooms, courts, and other site-specific settings is increasing quickly as well. This is the prelude to a "scene video" age in which most videos will be captured from specific scenes. In the near future, these pervasive cameras will cover all the spaces the human race is able to reach.

In this new age, the "scene" will become the bridge connecting video coding and computer vision research. Modeling these scenes could facilitate further video compression, as demonstrated by the IEEE 1857 standard. Then, with the assistance of such scene models encoded in the video stream, foreground-object detection, tracking, and recognition become easier. In this sense, the massive growth in surveillance and other kinds of scene video presents big challenges, as well as big opportunities, for the video- and vision-related research communities.

In 2015, the IEEE Computer Society's Technical Committee on Multimedia Computing (TCMC) and Technical Committee on Semantic Computing (TCSEM) will jointly sponsor the first IEEE International Conference on Multimedia Big Data, a premier world forum for leading scholars in the highly active area of multimedia big data research, development, and applications. Interested readers are welcome to join us at this new conference next spring in Beijing for more discussion on the rapidly growing field of multimedia big data.



  • From the application dimension, video sensing technologies and systems have seen initial success in security and traffic management, but applications in city operations (e.g., city-wide situation analysis) and social services (e.g., live streaming of major scenic areas, home care) are only just beginning. How to move from "management and control" to "service" is therefore the key direction for video big data applications.
  • From the data dimension, video sensing and analysis technologies and systems based on a single camera are already practically usable. A typical example is traffic-enforcement camera monitoring and violation detection. However, as urban video surveillance systems keep expanding in scale and application demands grow explosively, handling cross-camera video data, large-scale camera-network data, and even video big data that fuses various kinds of video, imagery, and associated data has become an urgent priority.
  • From the technology dimension, the past 20 years have largely solved the digitization of surveillance camera systems, and the past five years have begun to address the high-definition upgrade of large-scale surveillance camera deployments. At present, however, the massively deployed surveillance camera nodes have little intelligence and cannot process information in real time, which makes the prevailing manual-monitoring mode inefficient and slow to react to emergencies. Intelligent video surveillance will therefore remain a research and application focus for a considerable period. Furthermore, because wide-area cooperative processing and computing technologies for multi-camera video data are still lacking, it is difficult to deeply exploit and mine the rich information about people, objects, behaviors, and even events contained in such wide-area videos. Big-data-level intelligence is thus an inevitable research and development trend for the coming years.

1 The main body of video big data is surveillance video, but video big data in the broad sense also includes conference video, home video, classroom video, courtroom video, and so on. These are typically captured by fixed cameras observing a specific scene (e.g., a home, classroom, meeting room, courtroom, or traffic intersection) over a period of time. Since such videos share relatively fixed scene characteristics and are shot without any predefined script or intent, we call them "scenic video". Compared with other video types (e.g., news, film and television, sports), scenic video is the most promising breakthrough point for machine vision research and the best experimental data for tackling video big data problems. First, scenic video is collected by cameras gazing at a single scene over a long time, so the "changing" and "unchanging" elements of the scene can be effectively modeled, analyzed, and reasoned about, potentially enabling efficient compression, scene understanding, and even accurate recognition. Second, scenic video is enormous in volume and is the main research object within video big data, so advances in scenic-video processing and analysis directly raise the technical level of video big data as a whole.
From a broader perspective, the information technology field is incubating major technological breakthroughs and transformations, and video big data is one of the main driving forces and the best application problem. On the one hand, related fields such as cognitive and brain science and machine vision are undergoing a shift from "quantitative" to "qualitative" change, with theoretical breakthroughs expected within the next several years. On the other hand, in the past year or two, AI technologies represented by deep learning and electronic brains have developed rapidly, producing a series of landmark results including Google Brain, Baidu Brain, and IBM's brain-like chip TrueNorth. These results will provide powerful computing and processing capabilities for scenic-video processing and analysis.

Research Projects

Smart TV

Recent representative results are briefly described as follows:

1)   SalAd: Saliency-driven video advertising system


2)   Cloud computing and smart TV are listed as the two most important emerging technologies of recent years. By seamlessly integrating emerging cloud computing with smart TV, the new multimedia service platform is able to trigger the next round of revolution in digital home services, consequently opening up a whole new world for the entire TV industry and interactive media industry. To address this challenge, this project focuses on key technologies and applications of multimedia cloud services and smart TV clients for three terminal types (TV, tablet PC, and mobile phone).


3)   All these studies can offer important avenues for the multimedia academic community and have great commercial potential in the digital TV, content, and entertainment industries.


4)   VLabler: Sequence multi-labeling system for video annotation.


5)   Obj!CSM: Object segmentation system based on complementary saliency maps.


6)    C-VideoAdvisor: Video advertisement automatic association system.




This project puts its main focus on the challenging research issues and key technologies about multi-camera cooperated object detection, tracking, and event analysis on large-scale surveillance video data. Overall, the long-term objective of this project is to provide key technologies and solutions for the next-generation intelligent video surveillance systems and applications.

Recent representative results are briefly described as follows:

1) Object detection and tracking


2) Event detection



3) Multi-view human detection



4) ESur: Event detection system on surveillance videos



5) XSur: surveillance object localization and tracking system



6) Fire and smoke detection for forest and city videos


7) BVPMeasure: Automatic Webcam-based Human Heart Rate Measurements




In this project, our objective is to develop a series of highly efficient video coding models and methods for surveillance applications. Our coding methods exploit the special characteristics of surveillance videos to achieve higher coding performance than existing coding standards, which are designed for generic videos.

Recent representative results are briefly described as follows:

1) BDC: Background-difference-based coding

We proposed an efficient solution called background-difference-based coding (BDC). BDC follows the traditional hybrid coding framework, but utilizes the original input frames to generate and encode the periodically updated background frame. After that, it calculates the difference frames by subtracting the reconstructed background frame from the input frames, and then codes these difference frames into the code stream.
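The BDC pipeline above can be sketched in a few lines. This is an illustrative toy, not the actual codec (real BDC feeds the differences into a hybrid coding loop); it only shows why the residuals compress well: they are near-zero everywhere the scene matches the background frame.

```python
import numpy as np

def bdc_encode(frames, background):
    """Encode each frame as its signed difference from the
    reconstructed background frame (the BDC residual)."""
    bg = background.astype(np.int16)
    return [frame.astype(np.int16) - bg for frame in frames]

def bdc_decode(diffs, background):
    """Recover the frames by adding the residuals back onto the
    background and clipping to the valid 8-bit pixel range."""
    bg = background.astype(np.int16)
    return [np.clip(d + bg, 0, 255).astype(np.uint8) for d in diffs]

background = np.full((4, 4), 100, dtype=np.uint8)
frames = [background.copy(), background.copy()]
frames[1][1, 1] = 200  # a foreground object appears in frame 1

diffs = bdc_encode(frames, background)
recon = bdc_decode(diffs, background)

# Lossless round trip; the static frame's residual is all zeros.
assert all(np.array_equal(f, r) for f, r in zip(frames, recon))
print(np.count_nonzero(diffs[0]), np.count_nonzero(diffs[1]))  # 0 1
```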


2) AVS Transcoder: A fast and performance-maintained transcoding system

  •  High efficiency: 2-10x the efficiency of H.264 HP
  •  Low transcoding complexity: about 5% of that of a state-of-the-art encoder


Research Funding

Currently, our team is undertaking more than 10 major national academic projects, including a 973 project, a key project from the National Natural Science Foundation, a key project from the National High-Tech Research and Development (863) Programme, and a key project from the Key Technologies R&D Programme. In recent years, the team has won several international competitions and was awarded the First Prize of the Science and Technology Progress Awards 2009 by the Ministry of Education and the Second Prize of the National Science and Technology Progress Awards 2010.

Some of our funded projects are as follows:

  • Multi-camera Cooperative Moving Object Detection, Tracking and Anomalous Behavior Analysis in Surveillance Video (A Key Project Grant from NSFC, 2011-2014).
  • Learning-based Video Attention & Interestingness Computational Methodology for Interactive Video Technology (Grant from NSFC, 2010-2012)
  • Theory and Methodologies of the Correlation Analysis on Salient Moving Objects in the Multi-view Surveillance Video (Grant from M.O.E of China, 2010-2012).
  • Robust Statistical Relational Models and Relational Kernel Methods for Complex Link Data and Relational Data (Grant from NSFC, 2007-2009).
  • Video Retrieval Technology and Content Management System for IPTV (Grant from China 863 Hi-Tech Program, 2007-2008).
  • Content Analysis and Enrichment Technology for IPTV Interactive Video Services (Grant from Huawei Company, 2008-2009).

Research Areas

Our current research activities focus on two areas:

1. Brain-like and Deep Computing 

To investigate new-generation intelligent computing systems by biologically simulating the human vision system and developing brain-like computation models. In particular, we focus on the following topics:

  • Neural inversion computing
  • Deep learning for video analysis
  • Biological simulation of the human vision system

1.1 Neural inversion computing

  • Building deep neural networks to reveal sensing and processing mechanisms of the human visual system (e.g., encoding)
  • Proposing novel brain-inspired visual sensing models and efficient spiking neural network models

Related Papers:

  1. Zhaofei Yu, Jian K. Liu*, Shanshan Jia, Yichen Zhang, Yajing Zheng, Yonghong Tian, Tiejun Huang, Towards the Next Generation of Retinal Neuroprosthesis: Visual Computation with Spikes, Engineering, Volume 6, Issue 4, April 2020, 449-461.
  2. Yichen Zhang, Shanshan Jia, Yajing Zheng, Zhaofei Yu*, Yonghong Tian, Siwei Ma, Tiejun Huang, Jian K. Liu*, Reconstruction of Natural Visual Scenes from Neural Spikes with Deep Neural Networks, Neural Networks, Volume 125, May 2020, Pages 19-30.
  3. Qi Yan, Yajing Zheng, Shanshan Jia, Yichen Zhang, Zhaofei Yu*, Feng Chen, Yonghong Tian, Tiejun Huang, Jian K. Liu*, Revealing Fine Structures of the Retinal Receptive Field by Deep Learning Network, IEEE Transactions on Cybernetics. DOI: 10.1109/TCYB.2020.2972983.
  4. Yajing Zheng, Shanshan Jia, Zhaofei Yu*, Tiejun Huang, Jian K. Liu, Yonghong Tian*, Probabilistic Inference of Binary Markov Random Fields in Spiking Neural Networks through Mean-field Approximation, Neural Networks, Volume 126, June 2020, Pages 42-51.

1.2 Deep learning for video analysis

Related Papers:

  1. Zhengying Chen, Yonghong Tian*, Wei Zeng and Tiejun Huang, Detecting Abnormal Behaviors in Surveillance Videos Based on Fuzzy Clustering and Multiple Auto-Encoders, Proc. Int'l Conf. Multimedia and Expo (ICME 2015), Torino, Italy.
  2. Yemin Shi, Wei Zeng, Tiejun Huang, Yaowei Wang*, Learning Deep Trajectory Descriptor for Action Recognition in Videos using Deep Neural Networks, Proc. Int'l Conf. Multimedia and Expo (ICME 2015), Torino, Italy.
  3. Jilong Zheng, Yaowei Wang, Wei Zeng, and Yonghong Tian, CNN Based Vehicle Counting with Virtual Coil in Traffic Surveillance Video, Proc. IEEE Int'l Conf. Multimedia Big Data (BigMM 2015), Apr 2015, Beijing, China, 280-281.

1.3 Biological simulation of the human vision system

  • Modeling the neurons and circuits in the retina and primary visual cortex of a primate (Macaque monkey), by detecting the responses of retinal ganglion cells and of neurons in the shallow layers of V1 to visual stimulus patterns;
  • Developing software emulating the primate retina, LGN, and V1, to implement their coding functionality as accurately as possible.


2. Multimedia Big Data

To address the technological challenges introduced by multimedia big data, including compression, storage, transmission, analysis, recognition, and security. In particular, we focus on the following topics:

  • Background-based Surveillance Video Coding/Transcoding
  • Machine learning for multimedia content analysis
  • Multi-camera cooperated surveillance video analysis
  • Large-scale content-based copy detection
  • Social multimedia computing

2.1. Ultra-Efficient Surveillance Video Coding/Transcoding

With the exponentially increasing deployment of high-definition surveillance cameras, a major challenge for a real-time video surveillance system is how to effectively reduce bandwidth and storage costs. To address this problem, this study develops a high-efficiency, low-complexity video codec suited to surveillance videos.


Related Papers:

  1. Xianguo Zhang, Yonghong Tian*, Tiejun Huang, Siwei Dong, Wen Gao, Optimizing the Hierarchical Prediction and Coding in HEVC for Surveillance and Conference Videos with Background Modeling, IEEE Transactions on Image Processing, 23(10), Oct. 2014, 4511-4526. DOI: 10.1109/TIP.2014.2352036.
  2. Xianguo Zhang, Tiejun Huang*, Yonghong Tian*, Wen Gao, Background-Modeling Based Adaptive Prediction for Surveillance Video Coding, IEEE Transactions on Image Processing, 23(2), Feb 2014, 769-784. DOI: 10.1109/TIP.2013.2294549.
  3. Wen Gao, Yonghong Tian*, Tiejun Huang, Siwei Ma, Xianguo Zhang, IEEE 1857 Standard Empowering Smart Video Surveillance Systems, IEEE Intelligent Systems, 29(5), Sep.-Oct. 2014, 30-39. DOI: 10.1109/MIS.2013.101.
  4. Tiejun Huang, Siwei Dong, Yonghong Tian*, Representing Visual Objects in HEVC Coding Loop, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Volume 4, Issue 1, March 2014, 5-16. DOI: 10.1109/JETCAS.2014.2298274.
  5. Xianguo Zhang, Tiejun Huang*, Yonghong Tian*, Mingchao Geng, Siwei Ma, Wen Gao, Fast and Efficient Transcoding Based on Low-complexity Background Modeling and Adaptive Block Classification, IEEE Transactions on Multimedia, 15(8), Dec 2013, 1769-1785.
  6. Tiejun Huang, Yonghong Tian*, Wen Gao, IEEE 1857: Boosting Video Applications in CPSS, IEEE Intelligent Systems, 28(5), 24-27, Sept.-Oct. 2013.
  7. Long Zhao, Yonghong Tian*, Tiejun Huang, Background-Foreground Division based Search for Motion Estimation in Surveillance Video Coding, Proc. 2014 IEEE Int’l Conf. Multimedia and Expo, Chengdu, China, 2014.
  8. Peiyin Xing, Yonghong Tian*, Tiejun Huang, Wen Gao, Surveillance Video Coding with Quadtree Partition Based ROI Extraction, Proc. 30th Picture Coding Symposium, Dec 8-11, 2013, San Jose, California, 1-4.
  9. Peiyin Xing, Yonghong Tian*, Xianguo Zhang, Yaowei Wang, Tiejun Huang, A Coding Unit Classification Based AVC-to-HEVC Transcoding with Background Modeling for Surveillance Videos, Proc. 2013 IEEE Int’l Conf. Visual Communication and Image Processing, Kuching, Malaysia, Nov 2013.
  10. Xianguo Zhang, Tiejun Huang, Yonghong Tian, Wen Gao, Overview of the IEEE 1857 Surveillance Groups, Proc. 2013 IEEE Int’l Conf. Image Processing, Melbourne, Australia, 2013, 1505-1509.
  11. Xianguo Zhang, Tiejun Huang, Yonghong Tian, Wen Gao, Hierarchical-and-Adaptive Bit-allocation with Selective Background Prediction for High Efficiency Video Coding (HEVC), Proc. 2013 Data Compression Conference, 535.
  12. Shumin Han, Xianguo Zhang, Yonghong Tian, Tiejun Huang, An Efficient Background Reconstruction Based Coding Method for Surveillance Videos Captured By Moving Camera, Proc. 2012 IEEE Ninth Int’l Conf. Advanced Video and Signal-Based Surveillance, Beijing, China, Sep 18 2012, 160-165.(EI20124515644282)
  13. Mingchao Geng, Xianguo Zhang, Yonghong Tian*, Luhong Liang, Tiejun Huang, A Fast and Performance-Maintained Transcoding Method Based on Background Modeling for Surveillance Video, Proc. 2012 IEEE Int’l Conf. Multimedia and Expo, pp. 61-67, Melbourne, Australia, Jul 2012.(EI20124515636441)

2.2 Machine Learning for Multimedia Content Analysis

Machine learning models and algorithms are widely recognized as "the engine" behind most pattern recognition and multimedia content analysis technologies. This research focuses on the typical learning problems in multimedia content analysis and investigates common statistical machine learning models and methods, thereby providing a theoretical foundation for multimedia intelligent analysis and retrieval.


Related Papers:

  1. Jingjing Yang, Yonghong Tian*, Lingyu Duan, Tiejun Huang, Wen Gao. Group-Sensitive Multiple Kernel Learning for Object Recognition, IEEE Transactions on Image Processing, 21(5), May 2012, 2838-2852.
  2. Yuanning Li, Yonghong Tian*, Lingyu Duan, Jingjing Yang, Tiejun Huang, Wen Gao. Sequence Multi-Labeling: A Unified Video Annotation Scheme with Spatial and Temporal Context. IEEE Transactions on Multimedia, 12(8), Dec. 2010, 814-828.
  3. Jingjing Yang, Yuanning Li, Yonghong Tian*, Lingyu Duan, Wen Gao. Per-Sample Multiple Kernel Approach for Visual Concept Learning. EURASIP Journal on Image and Video Processing, Vol 2010, Article ID 461450, 13 pages.
  4. Yonghong Tian, Qiang Yang, Tiejun Huang, Charles X. Ling and Wen Gao, Learning contextual dependency network models for link-based classification. IEEE Transactions on Knowledge and Data Engineering, 18(11), Nov 2006, 1482-1496.
  5. Yonghong Tian, Tiejun Huang, Wen Gao. Latent Linkage Semantic Kernels for Collective Classification of Link Data. Journal of Intelligent Information Systems, 26(3), May 2006, 269-301.
  6. Yonghong Tian. Context-Based Statistical Relational Learning. AI Communications, 19(3), Sep. 2006, 291-293.
  7. Jingjing Yang, Yuanning Li, Yonghong Tian, Lingyu Duan, Wen Gao. Group-Sensitive Multiple Kernel Learning for Object Categorization. Proc. 12th IEEE Int’l Conf. Computer Vision, Kyoto, Japan, 2009, 436-443. (EI20102312998138)
  8. Jingjing Yang, Yuanning Li, Yonghong Tian, Lingyu Duan, Wen Gao. Multiple Kernel Active Learning for Image Classification. Proc. IEEE Int’l Conf. Multimedia and Expo, Cancun, Mexico, 2009, 550-553. (EI20094712492019)
  9. Jingjing Yang, Yuanning Li, Yonghong Tian, Lingyu Duan, Wen Gao. A New Multiple Kernel Approach for Visual Concept Learning, Proc. 15th Int’l Multimedia Modeling Conf., MMM 2009, LNCS 5371, Sophia-Antipolis, France, 2009, 250-262.(EI20090611898947)

2.3 Multi-Camera Cooperated Surveillance Video Analysis

Video surveillance systems have become one of the most important infrastructures for social security and emergency management applications. This study focuses on the challenging research issues and key technologies of multi-camera cooperated object detection, tracking, and event analysis over large-scale surveillance video data. Its long-term objective is to provide key technologies and solutions for the next-generation intelligent video surveillance systems and applications.


Related Papers:

  1. Peixi Peng, Yonghong Tian*, Yaowei Wang, Jia Li, Tiejun Huang, Robust Multiple Cameras Pedestrian Detection with Multi-view Bayesian Network, Pattern Recognition, Accepted, 9 Dec 2014.
  2. Yonghong Tian, Yaowei Wang, Zhipeng Hu, Tiejun Huang, Selective Eigenbackground for Background Modeling and Subtraction in Crowded Scenes, IEEE Transactions on Circuits and Systems for Video Technology. 23(11), 2013, 1849-1864.
  3. Teng Xu, Tiejun Huang, Yonghong Tian, Survey on Pedestrian Detection Technology for On-board Vision Systems, Journal of Image and Graphics, 18(4), 2013, 359-367. [In Chinese]
  4. Lan Wei, Yonghong Tian*, Yaowei Wang, Tiejun Huang, Swiss-System based Cascade Ranking for Gait-based Person Re-identification, Proc. AAAI 2015, Jan 26, 2015.
  5. Lan Wei, Yonghong Tian*, Yaowei Wang, Tiejun Huang, Multi-view Gait Recognition with Incomplete Training Data, Proc. 2014 IEEE Int’l Conf. Multimedia and Expo, Chengdu, China, 2014.
  6. Jiaqiu Chen, Yaowei Wang, Yonghong Tian, Tiejun Huang, Wavelet Based Smoke Detection Method with RGB Contrast-Image and Shape Constraint, Proc. 2013 IEEE Int’l Conf. Visual Communication and Image Processing, Kuching, Malaysia, Nov 2013.
  7. Chaoran Gu, Luntian Mou, Yonghong Tian*, Tiejun Huang, MPLBoost-based Mixture Model for Effective Human Detection with Deformable Part Model, Proc. 2013 IEEE Int’l Conf. Multimedia and Expo, San Jose, CA, USA, 2013, 1-6.
  8. Xiaoyu Fang, Yonghong Tian*, Yaowei Wang, Chi Su, Teng Xu, Ziwei Xia, Peixi Peng, Wen Gao, Pair-wise Event Detection using Cubic Features and Sequence Discriminant Learning, Proc. 2013 IEEE Int’l Conf. Multimedia and Expo, San Jose, CA, USA, 2013, 1-6.
  9. Xiaoyu Fang, Ziwei Xia, Chi Su, Teng Xu, Yonghong Tian*, Yaowei Wang, Tiejun Huang, A System based on Sequence Learning for Event Detection in Surveillance Video, Proc. 2013 IEEE Int’l Conf. Image Processing, Melbourne, Australia, 2013, 3587-3591.
  10. Peixi Peng, Yonghong Tian*, Yaowei Wang, Tiejun Huang, Multi-camera Pedestrian Detection with a Multi-view Bayesian Network Model, Proc. 2012 British Machine Vision Conf., paper 69, pp. 1-12, Guildford, UK, 2012.
  11. Teng Xu, Peixi Peng, Xiaoyu Fang, Chi Su, Yaowei Wang*, Yonghong Tian*, Wei Zeng, Tiejun Huang, Single and Multiple View Detection, Tracking and Video Analysis in Crowded Environments, Proc. 2012 IEEE Ninth Int’l Conf. Advanced Video and Signal-Based Surveillance, pp. 494-499, Beijing, China, Sep 2012.(EI20124515644337)

2.4 Large-Scale Content-Based Copy Detection

The Internet is revolutionizing multimedia content distribution, offering users unprecedented opportunities to share digital images, audio, and video, but also presenting major challenges for digital rights management (DRM). Based on audio-visual perception theory and mechanisms, this study investigates the theory and methodologies of robust mediaprinting technology, which can be used to efficiently identify media objects with the same or similar content. This technology is expected to play an important role in new-generation multimedia security systems.

mediaprinting (From: T.J. Huang, Y.H. Tian, W. Gao, J. Lu, Mediaprinting: identifying multimedia content for digital rights management, Computer, 43(12), 2010, 28-35.)
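As a toy illustration of the content-identification idea (this is a generic average-hash sketch, not the group's mediaprinting algorithm), a compact fingerprint can be computed from a downsampled grayscale frame and compared by Hamming distance, so that near-duplicate content yields a small distance:

```python
# Toy content-fingerprint sketch (illustrative only; not the mediaprinting
# method from the papers below). An 8x8 grayscale thumbnail is hashed by
# thresholding each pixel against the frame mean; fingerprints of the same
# or similar content then differ in only a few bits.

def fingerprint(gray8x8):
    """64-bit average hash of an 8x8 grayscale thumbnail (list of 64 ints, 0-255)."""
    mean = sum(gray8x8) / len(gray8x8)
    bits = 0
    for px in gray8x8:
        bits = (bits << 1) | (1 if px >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")

original = [i * 4 for i in range(64)]           # synthetic gradient "frame"
copy = [min(255, px + 3) for px in original]    # mild brightness shift (a "copy")
other = [255 - px for px in original]           # very different content

print(hamming(fingerprint(original), fingerprint(copy)))   # small distance
print(hamming(fingerprint(original), fingerprint(other)))  # large distance
```

Real fingerprinting systems replace the average hash with features that are robust to the transformations studied in the papers below (re-encoding, cropping, picture-in-picture), but the match-by-distance structure is the same.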

Related Papers:

  1. Yonghong Tian, Mengren Qian, Tiejun Huang, TASC: A Transformation-Aware Soft Cascading Approach for Multimodal Video Copy Detection, ACM Transactions on Information Systems, Accepted, 21 Oct 2014.
  2. Luntian Mou, Tiejun Huang, Yonghong Tian, Menglin Jiang, Wen Gao, Content-Based Copy Detection through Multimodal Feature Representation and Temporal Pyramid Matching, ACM Trans. Multimedia Comput. Commun. Appl., 10(1), Article 5 (December 2013), 20 pages.
  3. Yonghong Tian, Tiejun Huang, Menglin Jiang, and Wen Gao, Video Copy Detection and Localization with a Scalable Cascading Framework, IEEE Multimedia, 20(3), Sep. 2013, 72-86.
  4. Yonghong Tian, Tiejun Huang, Wen Gao, Multimodal Video Copy Detection using Multi-Detectors Fusion, IEEE COMSOC MMTC E-Letter, 7(5), September 2012, 3-6.
  5. Tiejun Huang, Yonghong Tian*, Wen Gao, Jian Lu. Mediaprinting: Identifying Multimedia Content for Digital Rights Management. Computer, 43(12), Dec. 2010, 28-35.
  6. Mengren Qian, Luntian Mou, Jia Li, and Yonghong Tian*. Video Picture-in-Picture Detection using Spatio-Temporal Slicing. Proc. ICME’2014 Workshop on Emerg. Multimedia Sys. and Appl., Chengdu, China, 2014.
  7. Menglin Jiang, Yonghong Tian*, Tiejun Huang, Video Copy Detection Using a Soft Cascade of Multimodal Features, Proc. 2012 IEEE Int’l Conf. Multimedia and Expo, Melbourne, Australia, 374-379, 2012.(EI20124515636492)
  8. Luntian Mou, Xilin Chen, Yonghong Tian, Tiejun Huang. Robust and Discriminative Image Authentication Based on Standard Model Feature, Proc. 2012 IEEE Int’l Symposium on Circuits and Systems, Seoul, Korea, 1131-1134, 2012.
  9. Yonghong Tian, Menglin Jiang, Luntian Mou, Xiaoyu Fang, Tiejun Huang. A Multimodal Video Copy Detection Approach with Sequential Pyramid Matching, Proc. IEEE Int’l Conf. Image Processing (ICIP 2011), Brussels, Belgium, Sep. 2011, 3690-3693. (EI20120514730536)

2.5 Social Multimedia Computing

Social multimedia and interactive video are becoming two of the most attractive technologies in new media applications. This research focuses on the fundamental theory, models, and methodologies in various social multimedia applications.


Related Papers:

  1. Yonghong Tian, Jaideep Srivastava, Tiejun Huang, and Noshir Contractor. Social Multimedia Computing. Computer, 43(8), Aug. 2010, 27-36. (WOS:000280949000008)(Cover Feature)
  2. Wen Gao, Yonghong Tian*, Tiejun Huang, Qiang Yang. Vlogging: A Survey of Video Blogging Technology on the Web. ACM Computing Surveys, 42(4), Jun. 2010, Article 15, 57 pages.
  3. Yonghong Tian, Shui Yu, Ching-Yung Lin, Wen Gao, Wanlei Zhou, Special Issue on Social Multimedia Computing: Challenges, Techniques, and Applications: Guest Editorial, Journal of Multimedia, 9(1), 2014, 1-3.
  4. Shui Yu, Yonghong Tian*, Song Guo, Dapeng Oliver Wu, Can We Beat DDoS Attacks in Clouds? IEEE Transactions on Parallel and Distributed Systems, 25(9), Sep. 2013, 2245-2254. DOI: 10.1109/TPDS.2013.181.
  5. Zhongfei Zhang, Zhengyou Zhang, Ramesh Jain, Yueting Zhuang, Noshir Contractor, Alexander G. Hauptmann, Alejandro (Alex) Jaimes, Wanqing Li, Alexander C. Loui, Tao Mei, Nicu Sebe, Yonghong Tian, Vincent S. Tseng, Qing Wang, Changsheng Xu, Huimin Yu, Shiwen Yu, Societally connected multimedia across cultures, Journal of Zhejiang University SCIENCE C, 13(12), 2012, 875-880. (WOS:000312185500001)
  6. Amogh Mahapatra, Xin Wan, Yonghong Tian, and Jaideep Srivastava. Augmenting Image Processing with Social Tag Mining for Landmark Recognition. Proc. 17th Int’l Multimedia Modeling Conf., MMM 2011, Jan 5-6, 2011, Taiwan, China, 273-283.(EI20110413622029)

Research Overview

The multimedia learning group at the NELVT lab is dedicated to new theories, cutting-edge algorithms, and core technologies for multimedia content analysis, coding, and protection across a wide spectrum of next-generation multimedia applications. These technologies are expected to play a crucial role in the further development of the digital media industry.

Featured Direction: Multimedia Big Data

Multimedia is increasingly becoming the “biggest big data”, as the most important and valuable source of insights and information, covering everything from individual experiences to events happening around the world. Surveillance video, entertainment and social media, medical images, consumer images, voice and video, to name a few, all become multimedia big data once their volumes grow beyond what traditional multimedia processing and analysis systems can handle effectively. As such, multimedia big data is emerging as the next “must have” competency in our society and is spurring tremendous research and development of related technologies and applications.

Multimedia big data introduces many technological challenges, including compression, storage, transmission, analysis, recognition, and security. Among them, two grand challenges stand out: how to compress the huge amount of data with extreme efficiency so as to facilitate transmission and storage, and how to intelligently analyze, mine, and understand the information buried inside it. Take surveillance video as an example. According to a recent IDC report, by 2020 as much as 5,800 Exabytes of surveillance video will be saved, transmitted, and analyzed, with the data volume roughly doubling every two years. Traditionally, however, the average compression rate in the field of video coding improves by only about 2x every decade. This mismatch will open a huge gap between the two rates over the next several years, presenting an unprecedented challenge for ultra-high-efficiency, low-complexity video coding technology. More importantly, only a small percentage of the data would be useful and valuable even if it were tagged and analyzed. Yet technology is far from where it needs to be: in practice, only 3 percent of potentially useful data is tagged, and even less is currently being analyzed. In this sense, the huge amount of surveillance video generated by thousands of cameras may become a data tsunami.
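The gap between the two growth rates can be made concrete with a back-of-the-envelope calculation, assuming the rates quoted above hold steady:

```python
# Back-of-the-envelope comparison of surveillance-data growth vs. coding gains,
# using the rates quoted above: data volume doubles every 2 years, while
# average compression efficiency doubles only every 10 years.

def growth_factor(years: float, doubling_period: float) -> float:
    """Multiplicative growth after `years`, doubling every `doubling_period` years."""
    return 2 ** (years / doubling_period)

years = 10
data_growth = growth_factor(years, 2)      # raw data volume: 2^5 = 32x
coding_gain = growth_factor(years, 10)     # compression ratio: 2^1 = 2x
effective_gap = data_growth / coding_gain  # bits to store/transmit still grow 16x

print(f"After {years} years: data x{data_growth:.0f}, "
      f"compression x{coding_gain:.0f}, net storage need x{effective_gap:.0f}")
```

Even with compression improving on schedule, the bits that must actually be stored and transmitted grow 16x over a decade, which is the gap motivating the ultra-efficient coding work in Section 2.1.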

As an active and inter-disciplinary research field, multimedia big data also presents a great opportunity for multimedia computing in the big data era. The challenges and opportunities highlighted in this field will foster interesting future developments in multimedia research and applications.

Reading Materials:

1. Surveillance Video: The Biggest Big Data

2. Video Big Data: Challenges and Opportunities [in Chinese]