From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes
Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
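The local-to-global idea above can be illustrated with a minimal sketch: dense patch features from a template are compared against query patches by cosine similarity, and query patches whose best match clears a threshold become candidate points for prompting a segmenter. This is a conceptual toy, not the repository's implementation; feature extraction (DINOv3), candidate selection, and SAM prompting are omitted, and all names are illustrative.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors (plain lists).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def candidate_points(template_feats, query_feats, query_coords, thresh=0.8):
    """Return (x, y) coords of query patches matching any template patch."""
    points = []
    for feat, (x, y) in zip(query_feats, query_coords):
        best = max(cosine(feat, t) for t in template_feats)
        if best >= thresh:
            points.append((x, y))
    return points
```

In the actual pipeline these candidate points are further filtered by a selection module before being passed to the augmented SAM as prompts.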
- Python 3.10
- torch (tested 2.6)
- torchvision
The code has been tested on Ubuntu 20.04.
git clone https://github.com/IRVLUTD/L2G.git
cd L2G
# Create the conda env
conda create -n L2G python=3.10
conda activate L2G
# Install PyTorch
pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 torchaudio==2.6.0+cu118 --index-url https://download.pytorch.org/whl/cu118
# Install other packages
pip install -e .
Please put the downloaded checkpoints into the "checkpoints" folder as follows:
checkpoints/
├── dinov3/
│   └── dinov3_vitl16_pretrain_*.pt
│
├── SAM/
│   └── sam2.1_hiera_large.pt
│
├── Adapter/
│   ├── High_Res_Adapter.pt
│   └── RoboTools_Adapter.pt
│
├── Object_tokens_High_Res/
│   ├── full_mask_tokens_000001.pt
│   ├── full_mask_tokens_000002.pt
│   └── ...
│
└── Object_tokens_RoboTools/
    ├── full_mask_tokens_000001.pt
    ├── full_mask_tokens_000002.pt
    └── ...
Setting Up Detection Datasets
The High_Resolution dataset is divided into 22 scenes (Hard: Scenes 1–10; Easy: Scenes 11–22). Download the dataset:
Please put them into the "data" folder as follows:
data/
│
├── Query/
│   ├── High_Resolution/
│   │   ├── 000001/
│   │   ├── 000002/
│   │   └── ...
│   │
│   └── RoboTools/
│       ├── 000001/
│       ├── 000002/
│       └── ...
│
└── Templates/
    ├── High_Resolution_all/
    │   ├── rgb/
    │   │   ├── 000001/
    │   │   ├── 000002/
    │   │   └── ...
    │   └── mask/
    │       ├── 000001/
    │       ├── 000002/
    │       └── ...
    │
    └── RoboTools_all/
        ├── rgb/
        │   ├── 000001/
        │   ├── 000002/
        │   └── ...
        │
        └── mask/
            ├── 000001/
            ├── 000002/
            └── ...
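Since each template object should appear under both `rgb/` and `mask/`, a quick consistency check can catch incomplete downloads. This helper is illustrative (an assumption, not repo code) and simply lists object IDs present in both subfolders:

```python
from pathlib import Path

def template_object_ids(templates_root):
    """Object IDs present under both rgb/ and mask/ of a templates folder,
    e.g. data/Templates/RoboTools_all."""
    root = Path(templates_root)
    rgb = {p.name for p in (root / "rgb").iterdir() if p.is_dir()}
    mask = {p.name for p in (root / "mask").iterdir() if p.is_dir()}
    # Only IDs with both an rgb and a mask folder are usable as templates.
    return sorted(rgb & mask)
```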
You can directly run the demo:
python run.py --config Demo.yaml
or check inference on the image.
Sample the template images:
cd tools
# --n 8 : Number of templates to sample per object
# --datasets : Dataset name (e.g., RoboTools; High_Resolution)
python sample_templates.py --n 8 --datasets RoboTools
Run L2G on the Benchmark:
python run.py --config RoboTools.yaml #or High_Res.yaml
# then merge the results using tools/utils/merge.py
We provide the ground truth files and our predictions at this link. You can run eval_results.py to evaluate them.
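The exact format handled by `tools/utils/merge.py` is not shown here; the sketch below only illustrates the general pattern of merging per-scene detection files, assuming each scene stores a JSON list of COCO-style detections. File naming and layout are assumptions.

```python
import json
from pathlib import Path

def merge_results(scene_dir, out_file="merged_results.json"):
    """Concatenate per-scene JSON detection lists into one results file.

    Returns the total number of merged detections.
    """
    merged = []
    for path in sorted(Path(scene_dir).glob("*.json")):
        merged.extend(json.loads(path.read_text()))  # one list per scene
    Path(out_file).write_text(json.dumps(merged))
    return len(merged)
```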
Download the backgrounds from the link. Among these, Backgrounds_2048 is constructed by cropping local regions from the original high-resolution background images, resulting in images of size 2048 × 1536.
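The cropping step is just selecting a random 2048 × 1536 window that stays inside the source image. A minimal sketch of that geometry (the image-library crop call itself is left out, and this helper is not part of the repo):

```python
import random

def random_crop_box(img_w, img_h, crop_w=2048, crop_h=1536, rng=random):
    """Return a valid (left, top, right, bottom) crop box, or None if the
    source image is smaller than the crop size."""
    if img_w < crop_w or img_h < crop_h:
        return None
    # Clamp the random top-left corner so the crop stays inside the image.
    left = rng.randrange(img_w - crop_w + 1)
    top = rng.randrange(img_h - crop_h + 1)
    return (left, top, left + crop_w, top + crop_h)
```

The returned box can be passed to any image library's crop routine.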
# Create the template-based training images on RoboTools
python tools/Compose_objects.py \
--objects-root data/Templates/RoboTools_all \
--backgrounds Backgrounds_2048 \
--out-root RoboTools_create \
--bbox-out-root RoboTools_create_bbox \
--start-object-id 1 \
--end-object-id 20
Check the training demo in notebooks.
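Conceptually, composing a training image means pasting a template object onto a background wherever its binary mask is set, then recording the bounding box for the training label. The toy sketch below assumes images as nested lists of pixel values; it illustrates the idea, not the behavior of `Compose_objects.py` itself.

```python
def paste_and_bbox(background, obj, mask, top, left):
    """Paste obj onto background at (top, left) where mask == 1.

    background, obj, mask are 2D nested lists; returns the pasted object's
    bounding box as (x_min, y_min, x_max, y_max).
    """
    xs, ys = [], []
    for r, row in enumerate(mask):
        for c, m in enumerate(row):
            if m:  # copy only masked (foreground) pixels
                background[top + r][left + c] = obj[r][c]
                xs.append(left + c)
                ys.append(top + r)
    return (min(xs), min(ys), max(xs), max(ys))
```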
Click the following image to watch the video.
This project is based on the following repositories:



