ViTaMIn-B: A Reliable and Efficient Visuo-Tactile Bimanual Manipulation Interface

Chuanyu Li*1, Chaoyi Liu*1, Daotan Wang1, Shuyu Zhang4,
Lusong Li3, Zecui Zeng3, Fangchen Liu2, Jing Xu1, Rui Chen1
1Tsinghua University, 2University of California, Berkeley
3JD Explore Academy, 4The Hong Kong Polytechnic University
* Equal contribution, Equal advising

If you have any questions, please feel free to contact us at chuanyu.ne79@gmail.com



Abstract

Handheld devices have opened up unprecedented opportunities to collect large-scale, high-quality demonstrations efficiently. However, existing systems often lack robust tactile sensing or reliable pose tracking to handle complex interaction scenarios, especially for bimanual and contact-rich tasks. In this work, we propose ViTaMIn-B, a more capable and efficient handheld data collection system for such tasks. We first design DuoTact, a novel compliant visuo-tactile sensor built with a flexible frame to withstand large contact forces during manipulation while capturing high-resolution contact geometry. To enhance the cross-sensor generalizability, we propose reconstructing the sensor's global deformation as a 3D point cloud and using it as the policy input. We further develop a robust, unified 6-DoF bimanual pose acquisition process using Meta Quest controllers, which eliminates the trajectory drift issue in common SLAM-based methods. Comprehensive user studies confirm the efficiency and high usability of ViTaMIn-B among novice and expert operators. Furthermore, experiments on four bimanual manipulation tasks demonstrate its superior task performance relative to existing systems.

Data Collection System


System Overview

  • DuoTact Visuotactile Sensor
  • Generalizable Tactile Representation
  • Robust Bimanual Pose Tracking

Hardware Design


ViTaMIn-B is a system developed for bimanual visuo-tactile data collection. The system integrates a GoPro Hero 10 camera for visual observation, Meta Quest 3 controllers for 6-DoF bimanual pose acquisition, and two DuoTact sensors for tactile sensing. The gripper width, which has a maximum span of 8 cm, is computed by detecting ArUco markers on the gripper.
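As a minimal sketch of the gripper-width computation, assume an ArUco pose estimator (not shown) already provides the 3D centre of one marker on each finger in the camera frame. The function name `gripper_width` and the calibration constant `marker_offset_m` are hypothetical and do not come from the ViTaMIn-B code; only the 8 cm limit is taken from the text.

```python
import math

MAX_GRIPPER_WIDTH_M = 0.08  # maximum span stated in the text (8 cm)


def gripper_width(left_marker_xyz, right_marker_xyz, marker_offset_m=0.0):
    """Estimate the gripper opening from two marker positions.

    left_marker_xyz / right_marker_xyz: (x, y, z) marker centres in metres,
    expressed in the camera frame (assumed output of an ArUco pose
    estimator; the detection step itself is not shown here).
    marker_offset_m: hypothetical calibration constant, the marker-to-marker
    distance measured when the gripper is fully closed.
    """
    distance = math.dist(left_marker_xyz, right_marker_xyz)
    width = distance - marker_offset_m
    # Clamp to the physically possible range of the gripper.
    return min(max(width, 0.0), MAX_GRIPPER_WIDTH_M)
```

Clamping keeps a noisy detection from reporting a negative opening or one wider than the gripper allows.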

As bimanual manipulation demonstration collection occupies both hands, a foot pedal is used to trigger the start and end of recording, enabling efficient single-operator data collection.
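The pedal-triggered recording can be pictured as a small toggle state machine. The sketch below assumes a single pedal whose presses alternate between starting and ending an episode; the class `PedalRecorder` and its method names are illustrative only and are not taken from the system's software.

```python
class PedalRecorder:
    """Toggle recording on each pedal press and group frames into episodes."""

    def __init__(self):
        self.recording = False
        self.episodes = []   # finished episodes, each a list of frames
        self._current = []   # frames of the episode being recorded

    def on_pedal_press(self):
        if self.recording:
            # Second press: close the current episode.
            self.episodes.append(self._current)
            self._current = []
        self.recording = not self.recording

    def on_frame(self, frame):
        # Frames that arrive while idle are discarded.
        if self.recording:
            self._current.append(frame)
```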

Several improvements were introduced:

1. The novel visuotactile sensors (DuoTact) are developed to produce clearer tactile signals across diverse contact scenarios and to provide better contact support.

2. We replace SLAM-based tracking with Meta Quest 3 controllers, which provide accurate, real-time 6-DoF poses for both handheld devices.

3. The mechanical structure is an original design that improves ergonomics and reduces weight by removing onboard computing (e.g., Raspberry Pi) and interfacing all sensors directly with the host computer.

4. All sensing modalities are latency-calibrated and synchronized to ensure precise spatiotemporal alignment.
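One common way to realize the synchronization in item 4 is to subtract each sensor's calibrated latency from its timestamps and then match every reference frame to the nearest corrected sample. The sketch below illustrates that idea only; the function name, the `max_gap_s` threshold, and the nearest-neighbour strategy are assumptions, not the authors' documented procedure.

```python
import bisect


def align_to_reference(ref_stamps, sensor_stamps, sensor_latency_s, max_gap_s=0.02):
    """For each reference timestamp, return the index of the closest sensor sample.

    sensor_stamps must be sorted in ascending order (seconds).
    sensor_latency_s is the calibrated delay subtracted from every sensor
    timestamp. Entries are None where no sample lies within max_gap_s.
    """
    corrected = [t - sensor_latency_s for t in sensor_stamps]
    matches = []
    for t in ref_stamps:
        i = bisect.bisect_left(corrected, t)
        # Only the neighbours on either side of t can be the nearest sample.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(corrected)]
        best = min(candidates, key=lambda j: abs(corrected[j] - t), default=None)
        if best is not None and abs(corrected[best] - t) > max_gap_s:
            best = None
        matches.append(best)
    return matches
```

Returning `None` for unmatched frames lets downstream code drop or interpolate them explicitly instead of silently pairing data that are far apart in time.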



Tactile Sensor

Structural Composition

  • TPU Frame: Flexible metamaterial structure for shape adaptation
  • Contact Layer: A multi-layer structure consisting of a 1.6-mm-thick transparent PVC film that supports the contact layer, a transparent silicone gel base that deforms to reveal local contact conditions, a reflective layer, and a black coating overlay.
  • Black Rubber Sheet: Black rubber sheets are heat-sealed onto the two sides of the TPU frame to block ambient light.
  • LED Strip Lights: To capture the contact geometry details on the surface, programmable LED strip lights (400 mm × 2.7 mm) are integrated within the TPU frame. Moreover, the illumination can enhance the visibility of the contact layer’s edges in the image, facilitating more accurate point cloud reconstruction.
  • RGB Camera: 640×480 resolution at 30 fps for internal monitoring

Fabrication Process

(1) A PVC film is inserted into the mold cavity and coated with a transparent silicone adhesive. Subsequently, 10 g of transparent silicone gel (Wacker Elastosil® RT 601, A:B = 9:1 by weight) is poured into the mold and cured at 60°C for 30 minutes.


(2) A reflective coating, consisting of Posilicone Translucent silicone (A:B = 1:1 by weight) and white pigment (2:0.1 weight ratio), is applied onto the cured transparent silicone surface. This layer is then cured at 60°C for 20 minutes.


(3) A black coating, consisting of Novocs Matte matting agent, Ecoflex 00-10 (A:B = 1:1 by weight), and black pigment (26:6:1 weight ratio), is uniformly airbrushed over the reflective layer and dried at 60°C for 20 minutes.


(4) For final assembly, the two fabricated contact layers are slid into slots on the TPU frame. A black rubber sheet is then attached to the frame's exterior via heat sealing. Subsequently, the LED strip light is threaded through designated slots on the frame. Finally, the RGB camera and the TPU frame assembly are mounted onto the finger bracket using screws and nuts.



Diagram of the fabrication process for DuoTact
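The weight ratios in steps (1) and (3) translate into component masses by simple proportion. The helper below, `split_by_ratio`, is a hypothetical convenience function, and the 33 g batch size for the black coating is an illustrative assumption chosen so the numbers come out round; only the ratios and the 10 g gel mass come from the text.

```python
def split_by_ratio(total_mass_g, ratio):
    """Split a total mass into parts proportional to the given ratio."""
    ratio_sum = sum(ratio)
    return [total_mass_g * part / ratio_sum for part in ratio]


# Step (1): 10 g of transparent gel at A:B = 9:1 by weight.
gel_a, gel_b = split_by_ratio(10.0, (9, 1))  # 9.0 g of A, 1.0 g of B

# Step (3): matting agent : Ecoflex 00-10 : black pigment = 26:6:1.
# The 33 g batch size is an illustrative assumption, not from the source.
matting, ecoflex, pigment = split_by_ratio(33.0, (26, 6, 1))

# Ecoflex 00-10 itself is mixed at A:B = 1:1 by weight.
ecoflex_a, ecoflex_b = split_by_ratio(ecoflex, (1, 1))
```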



Global Deformation Reconstruction

The captured raw tactile images may vary across sensors due to manufacturing imperfections, which degrades the cross-sensor generalizability of learned policies. Therefore, we propose to reconstruct the globally deformed point cloud of the sensor from a single image by utilizing the edge features and spatial constraints present in the image. This point cloud is then used as the input to the policy network. The reconstruction principle is shown in the figure below.


Principle and result diagram of point cloud reconstruction.
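This page does not spell out the reconstruction algorithm, so the following is not the authors' method. It is a deliberately simplified sketch of the general idea of lifting an image-space edge curve into a 3D point cloud. It assumes that the edge of the contact layer has already been detected as pixel coordinates, that a single pixel-to-metre scale applies, and that the deformation is uniform across the sensor width. The function and all of its parameters are hypothetical.

```python
def profile_to_point_cloud(edge_pixels, metres_per_pixel, sensor_width_m, n_slices=5):
    """Sweep a 2D edge profile along the sensor width to obtain 3D points.

    edge_pixels: list of (u, v) pixel coordinates along the detected edge of
    the contact layer (the edge detector is assumed and not shown).
    Returns a list of (x, y, z) tuples in metres, where y runs across the
    sensor width and (x, z) follow the scaled edge profile.
    """
    if n_slices < 2:
        raise ValueError("n_slices must be at least 2")
    points = []
    for k in range(n_slices):
        y = sensor_width_m * k / (n_slices - 1)
        for u, v in edge_pixels:
            points.append((u * metres_per_pixel, y, v * metres_per_pixel))
    return points
```

Because the output is a geometric quantity rather than raw pixels, a representation of this kind is less tied to the appearance of an individual sensor, which is the motivation stated above.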

BibTeX


      @article{li2025vitamin,
        title={ViTaMIn-B: A Reliable and Efficient Visuo-Tactile Bimanual Manipulation Interface},
        author={Li, Chuanyu and Liu, Chaoyi and Wang, Daotan and Zhang, Shuyu and Li, Lusong and Zeng, Zecui and Liu, Fangchen and Xu, Jing and Chen, Rui},
        journal={arXiv preprint arXiv:2511.05858},
        year={2025}
      }
    
