PhD in Computer Science and Technology
Zhejiang University, 2020
Cross-Modal Conditioned Reconstruction for Language-Guided Medical Image Segmentation
Article
Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning
Article
KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification
Article
Knowledge Integration for Grounded Situation Recognition
Article
Learning Causal Transition Matrix for Instance-dependent Label Noise
Article
Learning Combinatorial Prompts for Universal Controllable Image Captioning
Article
Recent advances in finetuning multimodal large language models
Article
An Efficient and Effective Transformer Decoder-Based Framework for Multi-task Visual Grounding
Conference paper
CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing
Conference paper
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Conference paper
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Conference paper
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
Conference paper
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
Conference paper
Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-scarce Classification
Conference paper
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
Conference paper
Multi-Resolution Decomposable Diffusion Model for Non-Stationary Time Series Anomaly Detection
Conference paper
Open-World Multimodal Understanding and Generation with Efficiently Finetuned Foundation Models
Conference paper
View-Consistent 3D Editing with Gaussian Splatting
Conference paper
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Article
Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
Article
CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention
Article
Decomposed Prototype Learning for Few-Shot Scene Graph Generation
Article
Di2Pose: Discrete Diffusion Model for Occluded 3D Human Pose Estimation
Article
GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning
Article
Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards
Article
In Defense of Clip-Based Video Relation Detection
Article
Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation
Article
LLMs Can Evolve Continually on Modality for X-Modal Reasoning
Article
NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation
Article
Distributionally Generative Augmentation for Fair Facial Attribute Classification
Conference paper
Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning
Conference paper
MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding
Conference paper
Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning
Conference paper
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Conference paper
SCHEMA: State Changes Matter for Procedure Planning in Instructional Videos
Conference paper
Seeing beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer
Conference paper
The 2nd International Workshop on Deep Multi-modal Generation and Retrieval
Conference paper
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
Conference paper
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach
Article
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering
Article
Federated unsupervised representation learning
Article
VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching
Article
Compositional Feature Augmentation for Unbiased Scene Graph Generation
Conference paper
Compositional Prompt Tuning with Motion Cues for Open-Vocabulary Video Relation Detection
Conference paper
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond
Conference paper
Discrepancy-Guided Reconstruction Learning for Image Forgery Detection
Conference paper
Fairness-Aware Contrastive Learning with Partially Annotated Sensitive Attributes
Conference paper
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Conference paper
Iterative Proposal Refinement for Weakly-Supervised Video Grounding
Conference paper
Reading Arbitrary-Shaped Scene Text from Images Through Spline Regression and Rectification
Conference paper
TempCLR: Temporal Alignment Representation with Contrastive Learning
Conference paper
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Conference paper
Video Scene Graph Generation from Single-Frame Weak Supervision
Conference paper
Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models
Conference paper
Deep Learning for Weakly-Supervised Object Detection and Localization: A Survey
Article
Deep Motion Prior for Weakly-Supervised Temporal Action Localization
Article
Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs
Conference paper
Correspondence Matters for Video Referring Expression Comprehension
Conference paper
CrossFormer: A Versatile Vision Transformer Hinging on Cross-Scale Attention
Conference paper
Deconfounded Value Decomposition for Multi-Agent Reinforcement Learning
Conference paper
Explicit Image Caption Editing
Conference paper
Few-Shot Object Detection with Fully Cross-Transformer
Conference paper
Respecting Transfer Gap in Knowledge Distillation
Conference paper
Rethinking Data Augmentation for Robust Visual Question Answering
Conference paper
Rethinking Multi-Modal Alignment in Multi-Choice VideoQA from Feature and Sample Perspectives
Conference paper
Rethinking the Evaluation of Unbiased Scene Graph Generation
Conference paper
Rethinking the Reference-based Distinctive Image Captioning
Conference paper
Rethinking the Two-Stage Framework for Grounded Situation Recognition
Conference paper
The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
Conference paper
Towards Multi-level Fairness and Robustness on Federated Learning
Conference paper
Weakly-Supervised Temporal Article Grounding
Conference paper
A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric
Conference paper
Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework
Conference paper
Accurate Arbitrary-Shaped Scene Text Detection via Iterative Polynomial Parameter Regression
Conference paper
An Adaptive Rectification Model for Arbitrary-Shaped Scene Text Recognition
Conference paper
Boundary Proposal Network for Two-Stage Natural Language Video Localization
Conference paper
Human-like Controllable Image Captioning with Verb-specific Semantic Roles
Conference paper
Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation
Conference paper
Natural Language Video Localization with Learnable Moment Proposals
Conference paper
On Pursuit of Designing Multi-modal Transformer for Video Grounding
Conference paper
Optimizing Federated Learning on Non-IID Data Using Local Shapley Value
Conference paper
Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
Conference paper
Robust Video Text Detection through Parametric Shape Regression, Propagation and Fusion
Conference paper
Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning
Conference paper
Video Relation Detection via Tracklet based Visual Transformer
Conference paper
Counterfactual Samples Synthesizing for Robust Visual Question Answering
Article
Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering
Article
Hierarchical Fashion Graph Network for Personalized Outfit Recommendation
Conference paper
Rethinking the bottom-up framework for query-based video localization
Conference paper
Counterfactual critic multi-agent training for scene graph generation
Conference paper
DebuG: A dense bottom-up grounding approach for natural language video localization
Conference paper
Learning using privileged information for food recognition
Conference paper
Zero-Shot Visual Recognition Using Semantics-Preserving Adversarial Embedding Networks
Conference paper
SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning
Conference paper
Video question answering via attribute-Augmented attention network learning
Conference paper
| COMP4901Z | Reinforcement Learning |
| COMP4971A | Independent Work |
| COMP4971D | Independent Work |
| COMP4981H | Final Year Thesis |
| COMP6922E | Research Project |
| COMP4971A | Independent Work |
| COMP4981H | Final Year Thesis |
| COMP4981 | Final Year Project |
| COMP6411C | Advanced Topics in Multimodal Machine Learning |
| COMP6922E | Research Project |
| COMP6922I | Research Project |
| COMP4901Z | Reinforcement Learning |
| COMP4971A | Independent Work |
| COMP4981 | Final Year Project |
| COMP6922E | Research Project |
| COMP6922I | Research Project |
| COMP4971A | Independent Work |
| COMP4981 | Final Year Project |
| UROP1000 | Undergraduate Research Opportunities |
| UROP1100M | Undergraduate Research Opportunities Series 1 |
| No Teaching Assignments |
CHEN, Hongxu
Computer Science and Engineering
HE, Zhenqi
Computer Science and Engineering
JIANG, Ziqi
Computer Science and Engineering
LI, Hongxiang
Computer Science and Engineering
LIU, Junzhang
Computer Science and Engineering
WANG, Shaodong
Computer Science and Engineering
CHEN, Wei
Computer Science and Engineering
LIU, Jiazhen
Computer Science and Engineering
PHAM, Trung Kien
Computer Science and Engineering
TAN, Chaolei
Computer Science and Engineering
WANG, Yanghao
Computer Science and Engineering
ZHANG, Haoqi
Computer Science and Engineering