Week 5: Computer Vision for Robotics
From Pixels to Semantics
State Estimation gives us the robot's body state. Computer Vision gives us the state of the world around it.
Key Modalities
- RGB: Color information. Great for semantics ("That is a door").
- Depth (D): Geometry information. Great for obstacle avoidance.
- Point Clouds: An unordered set of 3D points representing scene geometry, typically built from depth images or LiDAR scans.
RGB-D Pipelines
The standard "classic" pipeline (a code sketch follows this list):
- Receive Depth Image.
- Project pixels to 3D points using camera intrinsics.
- Downsample and filter noise (voxel grid downsampling, outlier removal).
- Fit planes (e.g., with RANSAC) to find and remove the floor.
- Cluster the remaining points (e.g., Euclidean or DBSCAN clustering) to find candidate objects.
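A minimal sketch of this pipeline, assuming the Open3D library (not specified above) and a 640x480 depth camera; the intrinsics, voxel size, and thresholds are placeholder values you would replace with your own calibration and tuning:

import numpy as np
import open3d as o3d

# Assumed intrinsics (width, height, fx, fy, cx, cy) -- replace with your calibration.
intrinsics = o3d.camera.PinholeCameraIntrinsic(640, 480, 525.0, 525.0, 319.5, 239.5)

# 1. Receive a depth image (here: a synthetic flat frame at 1 m, standing in for the sensor).
depth_raw = np.full((480, 640), 1000, dtype=np.uint16)  # depth in millimeters
depth = o3d.geometry.Image(depth_raw)

# 2. Project pixels to 3D points using the camera intrinsics.
pcd = o3d.geometry.PointCloud.create_from_depth_image(depth, intrinsics, depth_scale=1000.0)

# 3. Downsample to reduce density and noise.
pcd = pcd.voxel_down_sample(voxel_size=0.02)

# 4. Fit a plane with RANSAC to find the floor, then remove it.
plane_model, floor_idx = pcd.segment_plane(distance_threshold=0.02, ransac_n=3, num_iterations=1000)
objects = pcd.select_by_index(floor_idx, invert=True)

# 5. Cluster the remaining points into candidate objects (DBSCAN).
labels = np.array(objects.cluster_dbscan(eps=0.05, min_points=20))
num_clusters = int(labels.max()) + 1 if labels.size else 0
print(f"Found {num_clusters} object clusters")

Everything after the projection step is parameter tuning: the voxel size, plane distance threshold, and cluster eps all depend on your sensor's noise and the scale of the objects you expect.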
Transformers in Vision (ViT)
In many modern perception stacks, Convolutional Neural Networks (CNNs) are being supplemented or replaced by Vision Transformers (ViTs). A ViT splits the image into patches and lets every patch attend to every other patch, so it maintains a global context of the image, which is crucial for understanding spatial relationships.
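To make the "global context" point concrete, here is a minimal sketch assuming PyTorch (all layer sizes are illustrative, not from any particular ViT): the image becomes a sequence of patch tokens, and one self-attention layer already mixes information across the whole frame.

import torch
import torch.nn as nn

# Illustrative sizes: 224x224 RGB image, 16x16 patches, 192-dim embeddings.
img_size, patch_size, embed_dim = 224, 16, 192

# Patch embedding: a strided convolution cuts the image into non-overlapping
# patches and linearly projects each one to a token.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# One transformer encoder layer: self-attention relates every patch to every
# other patch, giving the global context a CNN only builds up layer by layer.
encoder = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=4, dim_feedforward=4 * embed_dim, batch_first=True)

x = torch.randn(1, 3, img_size, img_size)            # dummy camera frame
tokens = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, 192) patch tokens
out = encoder(tokens)                                 # (1, 196, 192) context-mixed tokens
print(out.shape)

A real ViT also adds positional embeddings and a class token; the point here is only how patches become a sequence that attention can relate globally.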
Lab: Object Detection with YOLOv8
We will use the Ultralytics YOLO library to detect objects relevant to humanoids (cups, bottles, chairs).
Step 1: Install
pip install ultralytics opencv-python
Step 2: Real-time Detection
from ultralytics import YOLO
import cv2
# Load a pretrained model (YOLOv8n is 'nano' - fast!)
model = YOLO('yolov8n.pt')
# Open the webcam (device 0)
cap = cv2.VideoCapture(0)
while cap.isOpened():
    success, frame = cap.read()
    if success:
        # Run inference
        results = model(frame)
        # Visualize
        annotated_frame = results[0].plot()
        cv2.imshow("YOLOv8 Inference", annotated_frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    else:
        break
cap.release()
cv2.destroyAllWindows()
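Drawing boxes is nice for debugging, but a robot needs the detections as data. A short sketch of reading class names and pixel coordinates from a single result, keeping only classes a humanoid might care about (the filter set below is our own choice, not part of YOLO):

# Inside the loop, after `results = model(frame)`:
KEEP = {"cup", "bottle", "chair"}  # classes relevant to our humanoid tasks

for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]        # class id -> label string
    if cls_name in KEEP:
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # pixel-space bounding box
        conf = float(box.conf)
        print(f"{cls_name}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")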
Segment Anything Model (SAM)
For grasping, bounding boxes aren't enough. We need Segmentation Masks. The Segment Anything Model (SAM) allows us to prompt the model with a point ("click on the bottle") and get a pixel-perfect mask for grasping.
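A sketch of that point-prompt workflow, assuming Meta's segment-anything package and a ViT-B checkpoint downloaded separately (the checkpoint path, image file, and click coordinates below are placeholders):

import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM with the ViT-B backbone from a local checkpoint (placeholder path).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# Embed the current camera frame once; prompts can then be answered cheaply.
image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# Prompt with a single foreground point ("click on the bottle").
point = np.array([[320, 240]])   # (x, y) pixel of the click -- placeholder
label = np.array([1])            # 1 = foreground point
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True)

# Keep the highest-scoring mask; its True pixels cover the clicked object.
best_mask = masks[np.argmax(scores)]

The resulting binary mask can be intersected with the depth image to get only the 3D points belonging to the object, which is exactly what a grasp planner needs.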