Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. Z Lin, C Liu, R Zhang, P Gao, L Qiu, H Xiao, H Qiu, C Lin, W Shao, et al. arXiv preprint arXiv:2311.07575, 2023. Cited by 206.
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. P Xu, W Shao, K Zhang, P Gao, S Liu, M Lei, F Meng, S Huang, Y Qiao, et al. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. Cited by 177.
Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. R Zhang, X Hu, B Li, S Huang, H Deng, Y Qiao, P Gao, H Li. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. Cited by 166.
Multi-modal sensor fusion for auto driving perception: A survey. K Huang, B Shi, X Li, X Li, S Huang, Y Li. arXiv preprint arXiv:2202.02703, 2022. Cited by 140.
Instruct2act: Mapping multi-modality instructions to robotic actions with large language model. S Huang, Z Jiang, H Dong, Y Qiao, P Gao, H Li. arXiv preprint arXiv:2305.11176, 2023. Cited by 120.
Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. D Liu, R Zhang, L Qiu, S Huang, W Lin, S Zhao, S Geng, Z Lin, P Jin, et al. arXiv preprint arXiv:2402.05935, 2024. Cited by 88*.
Tiny lvlm-ehub: Early multimodal experiments with bard. W Shao, Y Hu, P Gao, M Lei, K Zhang, F Meng, P Xu, S Huang, H Li, et al. arXiv preprint arXiv:2308.03729, 2023. Cited by 35.
Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. W Cai, S Huang, G Cheng, Y Long, P Gao, C Sun, H Dong. ICRA 2024, 2023. Cited by 28.
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. Q Lu, W Shao, Z Liu, F Meng, B Li, B Chen, S Huang, K Zhang, Y Qiao, et al. arXiv preprint arXiv:2406.08451, 2024. Cited by 16.
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. X Lu, Q Liu, Y Xu, A Zhou, S Huang, B Zhang, J Yan, H Li. arXiv preprint arXiv:2402.14800, 2024. Cited by 15.
Amex: Android multi-annotation expo dataset for mobile GUI agents. Y Chai, S Huang, Y Niu, H Xiao, L Liu, D Zhang, P Gao, S Ren, H Li. arXiv preprint arXiv:2407.17490, 2024. Cited by 14.
Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large language models. S Huang, I Ponomarenko, Z Jiang, X Li, X Hu, P Gao, H Li, H Dong. IROS 2024. Cited by 12.
SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification. S Huang, B Zhang, B Shi, H Li, Y Li, P Gao. Proceedings of the 31st ACM International Conference on Multimedia, 8644-8652, 2023. Cited by 12.
A3VLM: Actionable Articulation-Aware Vision Language Model. S Huang, H Chang, Y Liu, Y Zhu, H Dong, P Gao, A Boularias, H Li. Conference on Robot Learning (CoRL), 2024. Cited by 9.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. W Lin, X Wei, R An, P Gao, B Zou, Y Luo, S Huang, S Zhang, H Li. arXiv preprint arXiv:2403.20271, 2024. Cited by 6.
Adas: A simple active-and-adaptive baseline for cross-domain 3d semantic segmentation. B Fei, S Huang, J Yuan, B Shi, B Zhang, T Chen, M Dou, Y Qiao. arXiv preprint arXiv:2212.10390, 2022. Cited by 5.
PixWizard: Versatile image-to-image visual assistant with open-language instructions. W Lin, X Wei, R Zhang, L Zhuo, S Zhao, S Huang, J Xie, Y Qiao, P Gao, et al. arXiv preprint arXiv:2409.15278, 2024. Cited by 3.
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. S Huang, L Chen, P Zhou, S Chen, Z Jiang, Y Hu, P Gao, H Li, M Yao, et al. arXiv preprint arXiv:2501.01895, 2025.
A3: Android Agent Arena for Mobile GUI Agents. Y Chai, H Li, J Zhang, L Liu, G Wang, S Ren, S Huang, H Li. arXiv preprint arXiv:2501.01149, 2025.
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models. Z Lin, D Liu, R Zhang, P Gao, L Qiu, H Xiao, H Qiu, W Shao, K Chen, et al. European Conference on Computer Vision, 36-55, 2025.