====== Lab 03: Datasets ======
In this lab we’ll build a **small, clean, and reproducible dataset** from end to end. You’ll practice sourcing images from public repositories, labeling efficiently with **Label Studio** (assisted by **SAM2**), running a **pre-processing pipeline** (cleaning, stratified splits, leakage checks, augmentations, class-imbalance fixes), and generating **synthetic data** with **Unreal Engine 4.27 + UnrealCV** (RGB/Depth/Seg) with conversion to **YOLO** format and light **domain randomization**.
As your homework, you’ll submit **50 labeled samples** from your dataset.
===== Environment & Tools =====
* **Python 3.10+** (venv recommended), **git**.
* **Label Studio** (GUI labeling) + optional **SAM2** ML assist.
* **OpenCV, Albumentations, scikit-learn** for pre-processing and augments.
* **Unreal Engine 4.27** with **UnrealCV** for synthetic scenes.
# Create & activate virtual environment
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows (PowerShell)
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install label-studio label-studio-ml albumentations==1.4.8 opencv-python scikit-learn numpy matplotlib tqdm pycocotools
GPU is optional. SAM2-assisted masks run on CPU but are slower; a CUDA GPU accelerates everything.
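A quick way to confirm the environment is usable is to import the core packages and print their versions (a minimal sketch; adjust it to whatever you actually installed):
# sanity_check.py - verify the core packages import and report versions
import cv2, sklearn, numpy, albumentations
print("OpenCV:", cv2.__version__)
print("Albumentations:", albumentations.__version__)
print("scikit-learn:", sklearn.__version__)
print("NumPy:", numpy.__version__)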
===== Dataset skeleton =====
We’ll use a simple, reproducible folder layout to avoid path chaos later.
data/
  raw/                      # downloaded/unfiltered originals
  clean/                    # cleaned + deduplicated images
  labels/
    labels/                 # YOLO .txt files (mirrors images by name)
    images/                 # LS export (for traceability)
    classes.txt             # class names
    notes.json              # class info
  augmentations/
    labels/
    images/
    classes.txt
    notes.json
  synth/
    rgb/
    depth/                  # not used
    seg/
    labels/
  split/
    train/{images,labels}
    val/{images,labels}
    test/{images,labels}
  hw_submission/
scripts/
  01_clean_and_dedup.py
  02_label_studio.sh
  03_train_test_split.py
  04_augment_yolo.py
  05_unreal_capture.py
.venv/
Keep a small **class map** (e.g., ''labels.json'') to canonicalize names (''car'' not ''Car'').
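One lightweight way to enforce canonical names is a tiny mapping file plus a normalization helper. A sketch, assuming a hypothetical ''data/labels.json'' such as ''{"Car": "car", "person": "pedestrian"}'':
# normalize_labels.py - canonicalize free-form class names via a labels.json map (hypothetical file)
import json
with open("data/labels.json", encoding="utf-8") as f:
    CLASS_MAP = json.load(f)   # e.g. {"Car": "car", "person": "pedestrian"}
def canonical(name: str) -> str:
    # fall back to lowercased input if the name is not in the map
    return CLASS_MAP.get(name, name.strip().lower())
print(canonical("Car"))  # -> "car"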
====== Part 1: Sourcing public data ======
Goal: assemble **100–300 raw images** for 1–5 target classes. You can mix sources (Kaggle, Open Images, Roboflow Universe, academic repos).
**Suggested sources (with licenses):**
* [[https://www.kaggle.com/datasets|Kaggle Datasets]] - assorted image datasets (check each dataset’s license).
* [[https://universe.roboflow.com/|Roboflow Universe]] - community datasets with convenient exports; verify terms per project.
* [[https://huggingface.co/datasets|Hugging Face Datasets]] - comprehensive search engine for a vast selection of datasets (check licensing).
* [[https://cocodataset.org/#home|MS COCO]] - detection/segmentation/keypoints (Creative Commons; see site for details).
* [[https://www.cityscapes-dataset.com/|Cityscapes]] - urban scenes, semantic/instance segmentation.
* [[http://www.cvlibs.net/datasets/kitti/|KITTI]] - autonomous driving benchmarks (detection/tracking).
* [[https://paperswithcode.com/datasets?task=object-detection|Papers with Code - Datasets]] - searchable index by task.
* [[https://huggingface.co/datasets?task_categories=task_categories:computer-vision|Hugging Face Datasets (vision)]] - many CV datasets with loaders.
* [[https://data.gov/|Data.gov]] - US open data portal (images mixed in; check licensing).
Checklist:
* Verify **license** (CC0, CC-BY, etc.). Record **URL + license + date accessed** in ''DATA_CARD.md''.
* Prefer diverse **scenes** (lighting, backgrounds, sizes, partial occlusions).
* Save a **manifest** (CSV/JSON: url, license, attribution) alongside downloads; a minimal sketch follows below.
mkdir -p data/raw
# Example: use dataset-specific CLI or curl/wget; store your manifest!
Do not commit raw copyrighted data unless the license permits. Use DVC or Git LFS for large files.
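A manifest can be as simple as one CSV row per downloaded file. A sketch (the column names and the ''data/raw/manifest.csv'' path are just a suggestion):
# write_manifest.py - append provenance rows for downloaded images (illustrative column layout)
import csv, datetime, os
MANIFEST = "data/raw/manifest.csv"
os.makedirs(os.path.dirname(MANIFEST), exist_ok=True)
def add_entry(filename, url, license_name, attribution=""):
    new_file = not os.path.exists(MANIFEST)
    with open(MANIFEST, "a", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        if new_file:
            w.writerow(["filename", "url", "license", "attribution", "date_accessed"])
        w.writerow([filename, url, license_name, attribution, datetime.date.today().isoformat()])
add_entry("000123.jpg", "https://example.com/img/000123.jpg", "CC-BY-4.0", "Example Author")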
====== Part 2: Labeling with Label Studio (SAM2) ======
Start Label Studio locally and create a project with your taxonomy (e.g., ''car, pedestrian''). Use boxes or polygons.
label-studio start
# open http://localhost:8080
Steps:
1. Create project → define labels (consistent singular names).
2. Import images from data/raw/ (or data/clean/ after Part 3.1 if you prefer).
3. (Optional) Connect Studio ML backend using SAM2 for assisted masks.
4. Label a subset thoroughly.
5. Export as YOLO (or COCO) to data/labels/labelstudio_export/.
Prefer polygons/masks for irregular shapes; use boxes if your model will be box-based. Keep it consistent across the set.
====== Part 3: Pre-processing pipeline ======
==== 3.1 Cleaning & Deduplication ====
Remove corrupt files and duplicates to mitigate leakage and bias. The script below drops unreadable images and exact pixel duplicates (it hashes a re-encoded copy of each image); for near-duplicates, see the perceptual-hash sketch after it.
# scripts/01_clean_and_dedup.py
import cv2, os, hashlib, glob, shutil
from tqdm import tqdm
SRC, DST = "data/raw", "data/clean"
os.makedirs(DST, exist_ok=True)
seen = set()
for p in tqdm(glob.glob(f"{SRC}/**/*.*", recursive=True)):
if not (p.lower().endswith((".jpg",".jpeg",".png",".bmp"))):
continue
img = cv2.imread(p)
if img is None:
continue
    # hash a re-encoded copy so identical pixel content matches even if the files differ
    ok, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), 95])
    if not ok:
        continue
    h = hashlib.md5(buf).hexdigest()
if h in seen:
continue
seen.add(h)
shutil.copy2(p, os.path.join(DST, os.path.basename(p)))
print("Kept:", len(seen))
==== 3.2 Convert/export to YOLO ====
Export your annotations from Label Studio in YOLO format. If Label Studio is not already running, start it with:
label-studio start
When connecting [[https://ai.meta.com/sam2/|SAM2]] as the ML-assist backend, use a Labeling Interface that includes brush labels (typically together with keypoint/rectangle labels that serve as prompts).
SAM2 generates ''brush labels'' (pixel masks), which cannot be used directly as YOLO ''segmentation masks'' — convert them to polygons or boxes first (see the sketch below).
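If you end up with per-object binary masks (e.g., exported PNGs), they can be converted to YOLO segmentation polygons with OpenCV contours. A minimal sketch; the paths and class id are placeholders:
# mask_to_yolo_seg.py - turn a single-object binary mask into a YOLO segmentation line (illustrative)
import cv2
def mask_to_yolo_polygon(mask_path, class_id=0):
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    if mask is None:
        return None
    h, w = mask.shape[:2]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
    # YOLO segmentation format: "class x1 y1 x2 y2 ..." with coordinates normalized to [0, 1]
    coords = " ".join(f"{x / w:.6f} {y / h:.6f}" for x, y in contour)
    return f"{class_id} {coords}"
print(mask_to_yolo_polygon("mask_example.png"))  # hypothetical mask file
For box-based models, a bounding rectangle of the mask (as in Part 4) is enough.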
==== 3.3 Stratified split (and group-by-source if possible) ====
Stratify by **labels present**; if you know the **source/scene**, group-split to avoid same-scene leakage (see the GroupShuffleSplit sketch after the script).
# scripts/03_train_test_split.py
from sklearn.model_selection import train_test_split
import os, glob, shutil
IMAGES = "data/clean"
LABELS = "data/labels/yolo/labels"
OUT = "data/split"
def label_count(lbl):
    # number of annotation lines in a YOLO label file (0 if missing/unreadable)
    try:
        return sum(1 for _ in open(lbl, "r", encoding="utf-8"))
    except OSError:
        return 0
imgs = sorted([p for p in glob.glob(f"{IMAGES}/*.*") if p.lower().endswith((".jpg",".jpeg",".png"))])
X, y = [], []
for ip in imgs:
lp = os.path.join(LABELS, os.path.splitext(os.path.basename(ip))[0] + ".txt")
X.append((ip, lp))
y.append(0 if not os.path.exists(lp) or label_count(lp)==0 else 1)
def dump(split, items):
for sub in ["images","labels"]:
os.makedirs(f"{OUT}/{split}/{sub}", exist_ok=True)
for ip, lp in items:
shutil.copy2(ip, f"{OUT}/{split}/images/{os.path.basename(ip)}")
dst = f"{OUT}/{split}/labels/{os.path.splitext(os.path.basename(ip))[0]}.txt"
if os.path.exists(lp):
shutil.copy2(lp, dst)
else:
open(dst,"w").close() # empty labels if no objects
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
dump("train", X_tr); dump("val", X_va); dump("test", X_te)
print("Split done.")
==== 3.4 Augmentations & Imbalance ====
Use label-preserving transforms to increase diversity. Target minority classes for class-balanced sampling.
# scripts/04_augment_yolo.py
import albumentations as A, cv2, os, glob
import numpy as np
AUG = A.Compose([
A.HorizontalFlip(p=0.5),
A.RandomBrightnessContrast(p=0.35),
A.ColorJitter(p=0.3),
A.MotionBlur(blur_limit=5, p=0.2),
    A.Affine(scale=(0.5,1.5), translate_percent=(-0.2,0.2), rotate=(-5,5), p=0.5),  # shift up to ±20%
], bbox_params=A.BboxParams(format="yolo", label_fields=["cls"]))
IN_IMG = "data/split/train/images"
IN_LBL = "data/split/train/labels"
OUT_IMG = "data/split/train/images_aug"
OUT_LBL = "data/split/train/labels_aug"
os.makedirs(OUT_IMG, exist_ok=True); os.makedirs(OUT_LBL, exist_ok=True)
def read_yolo(lbl_path, w, h):
    # parse a YOLO label file -> (normalized [xc, yc, w, h] boxes, class ids); w, h unused here
bxs, cls = [], []
if os.path.exists(lbl_path):
for line in open(lbl_path, "r").read().strip().splitlines():
if not line: continue
c, xc, yc, ww, hh = line.split()
bxs.append([float(xc),float(yc),float(ww),float(hh)])
cls.append(int(c))
return bxs, cls
for ip in glob.glob(f"{IN_IMG}/*.*"):
if not ip.lower().endswith((".jpg",".jpeg",".png")):
continue
base = os.path.splitext(os.path.basename(ip))[0]
lp = os.path.join(IN_LBL, base + ".txt")
    img = cv2.imread(ip)
    if img is None:
        continue
    h, w = img.shape[:2]
    bxs, cls = read_yolo(lp, w, h)
    # images without boxes are still augmented; they simply get an empty label file
    out = AUG(image=img, bboxes=bxs, cls=cls)
    aug_img, aug_bxs, aug_cls = out["image"], out["bboxes"], out["cls"]
cv2.imwrite(os.path.join(OUT_IMG, base + "_aug.jpg"), aug_img)
with open(os.path.join(OUT_LBL, base + "_aug.txt"), "w") as f:
for c,(xc,yc,ww,hh) in zip(aug_cls, aug_bxs):
f.write(f"{c} {xc:.6f} {yc:.6f} {ww:.6f} {hh:.6f}\n")
Strive for a ≤3× ratio between the most and least frequent classes in the **train** split (the sketch below gives a quick check). Use class-aware sampling when training.
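A quick check of the class distribution in the train labels (a sketch; class ids follow ''classes.txt''):
# class_balance_check.py - count YOLO class instances in the train split
import collections, glob
counts = collections.Counter()
for lp in glob.glob("data/split/train/labels/*.txt"):
    with open(lp, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[int(line.split()[0])] += 1
print(counts)
if counts:
    print("max/min ratio: %.1f" % (max(counts.values()) / max(1, min(counts.values()))))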
====== Part 4: Synthetic data (UE 4.27 + UnrealCV → YOLO) ======
We’ll render small synthetic bursts to cover rare poses/backgrounds. You’ll need **UnrealCV** enabled and a minimal scene.
**Domain randomization checklist:**
* Lighting (HDRI/time-of-day/intensity), materials (color/roughness), camera FoV/pose jitter.
* Distractors/backgrounds, slight sensor noise (JPEG, Gaussian; see the noise sketch after this list), gentle lens distortion.
* Scale/pose variety of target meshes.
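Sensor noise and compression artifacts can also be added as a post-processing pass over the rendered frames. A sketch with Albumentations (parameter names per the 1.4.x API; the output folder is just a suggestion):
# synth_noise_pass.py - add mild sensor noise / JPEG artifacts to rendered frames (sketch)
import albumentations as A
import cv2, glob, os
NOISE = A.Compose([
    A.GaussNoise(var_limit=(10.0, 40.0), p=0.5),
    A.ImageCompression(quality_lower=60, quality_upper=95, p=0.5),
    A.OpticalDistortion(distort_limit=0.05, p=0.3),
])
OUT_DIR = "data/synth/rgb_noisy"   # suggested output folder
os.makedirs(OUT_DIR, exist_ok=True)
for p in glob.glob("data/synth/rgb/*_comp.png"):
    img = cv2.imread(p)
    if img is None:
        continue
    cv2.imwrite(os.path.join(OUT_DIR, os.path.basename(p)), NOISE(image=img)["image"])
Keep the distortion gentle, since it shifts pixels slightly relative to labels generated from the original masks. The capture client below then handles the geometric side (camera pose, target jitter, mask-based labels) inside UE.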
# scripts/05_unreal_capture.py - UnrealCV capture client
# UnrealCV dataset capture
# Outputs to: data/synth/{rgb, seg, labels} with optional backgrounds in data/backgrounds
# - Picks a camera that actually moves (A/B diff test)
# - Resolves target actor by prefix
# - Colors target UNIQUE_RGB to isolate in /object_mask
# - Spherical camera sampling + look-at
# - YOLO labels from binary mask
from __future__ import annotations
import os, sys, time, math, random, re
import numpy as np
import cv2
from PIL import Image
# ======== USER CONFIG (minimal) ========
PORT = 9005
TARGET_PREFIX = "bone_actor" # e.g., 'bone_actor_20'
UNIQUE_RGB = (255, 0, 255) # color to paint target for mask extraction
NUM_IMAGES = 20
IMG_W, IMG_H = 1280, 960
FOV_DEG = 60
# Camera randomization (keeps target in view)
DIST_RANGE = (150, 600)
YAW_RANGE = (-180, 180)
PITCH_RANGE = (-25, 25)
ROLL_RANGE = (-10, 10)
# Target jitter & rotation
TARGET_BASE_LOC = (0, 0, 100)
TARGET_JITTER = (-5, 5)
RANDOMIZE_TARGET_ROT = True
TROT_PITCH = (-10, 10)
TROT_YAW = (0, 360)
TROT_ROLL = (-10, 10)
# What to save
SAVE_LIT = True # saves lit render to data/synth/rgb/{id}.png
SAVE_MASK = True # saves binary mask to data/synth/seg/{id}_bin.png
SAVE_COMP = True # composite (lit over random background) to data/synth/rgb/{id}_comp.png
SAVE_YOLO = True # YOLO txt to data/synth/labels/{id}.txt
YOLO_CLASS_ID = 0
# ======== PROJECT PATHS (fit course skeleton) ========
PROJ_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
DATA_ROOT = os.path.join(PROJ_ROOT, "data")
SYNTH_ROOT = os.path.join(DATA_ROOT, "synth")
RGB_DIR = os.path.join(SYNTH_ROOT, "rgb")
SEG_DIR = os.path.join(SYNTH_ROOT, "seg")
LABELS_DIR = os.path.join(SYNTH_ROOT, "labels")
PROBE_DIR = os.path.join(SYNTH_ROOT, "_probes")
BACKGROUND_DIR = os.path.join(DATA_ROOT, "backgrounds") # optional folder for composites
def _ensure_dirs():
for d in (RGB_DIR, SEG_DIR, LABELS_DIR, PROBE_DIR):
os.makedirs(d, exist_ok=True)
# backgrounds are optional
def _abs(*parts): return os.path.abspath(os.path.join(*parts))
def _list_backgrounds():
if not os.path.isdir(BACKGROUND_DIR): return []
exts = (".jpg", ".jpeg", ".png", ".bmp")
return [f for f in os.listdir(BACKGROUND_DIR) if f.lower().endswith(exts)]
# ===== UnrealCV =====
try:
import unrealcv
except ImportError:
print("Missing 'unrealcv' package. Install it in the UE Python environment or your venv.", file=sys.stderr)
sys.exit(1)
client = unrealcv.Client(('localhost', PORT))
# ===== Helpers =====
def vget_and_wait(cmd, out_path, timeout=3.0):
out_path = _abs(out_path)
client.request('%s %s' % (cmd, out_path))
t0 = time.time()
while time.time() - t0 < timeout:
if os.path.isfile(out_path) and os.path.getsize(out_path) > 0:
return True
time.sleep(0.05)
return False
def resolve_target_name(prefix):
objs = client.request('vget /objects') or ""
names = objs.split()
if not names: return None
pat = re.compile(r'^%s(?:_\d+)?$' % re.escape(prefix), re.IGNORECASE)
exact = [n for n in names if pat.match(n)]
if exact: return exact[0]
starts = [n for n in names if n.lower().startswith(prefix.lower())]
return starts[0] if starts else None
# ---------- Camera movement: pick a camera that REALLY moves ----------
def _save_probe(cam_token, xyz, pyr, path):
x,y,z = xyz; pitch,yaw,roll = pyr
client.request('vset /camera/%s/location %f %f %f' % (cam_token, x, y, z))
client.request('vset /camera/%s/rotation %f %f %f' % (cam_token, pitch, yaw, roll))
time.sleep(0.05) # settle
return vget_and_wait('vget /camera/%s/lit' % cam_token, path)
def _img_diff(a_path, b_path):
a = cv2.imread(a_path); b = cv2.imread(b_path)
if a is None or b is None: return 0.0
if a.shape != b.shape:
h = min(a.shape[0], b.shape[0]); w = min(a.shape[1], b.shape[1])
a = a[:h,:w]; b = b[:h,:w]
diff = cv2.absdiff(a, b)
return float(diff.mean())
def pick_movable_camera():
client.request('vset /cameras/spawn')
time.sleep(0.2)
tokens = (client.request('vget /cameras') or "").split()
nums = [t for t in tokens if t.isdigit()]
candidates = nums + [t for t in tokens if t not in nums] + [str(i) for i in range(8)]
seen = set()
candidates = [t for t in candidates if not (t in seen or seen.add(t))]
A_loc = ( 0.0, 0.0, 300.0); A_rot = ( -15.0, 0.0, 0.0)
B_loc = (500.0, 200.0, 150.0); B_rot = ( -10.0, 140.0, 0.0)
for cam in candidates:
pA = _abs(PROBE_DIR, "_probeA_%s.png" % cam)
pB = _abs(PROBE_DIR, "_probeB_%s.png" % cam)
okA = _save_probe(cam, A_loc, A_rot, pA)
okB = _save_probe(cam, B_loc, B_rot, pB)
if not (okA and okB):
for p in (pA,pB):
try: os.remove(p)
except: pass
continue
d = _img_diff(pA, pB)
for p in (pA,pB):
try: os.remove(p)
except: pass
if d > 1.0:
print("[INFO] Movable camera selected:", cam, "(diff=%.2f)" % d)
return cam
return None
# ---------- Mask / Composite / YOLO ----------
def isolate_color_mask(mask_rgb_path, rgb):
img_bgr = cv2.imread(mask_rgb_path)
if img_bgr is None: return None
img = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
r,g,b = rgb
lower = np.array([r,g,b], dtype=np.uint8)
upper = np.array([r,g,b], dtype=np.uint8)
return cv2.inRange(img, lower, upper)
def composite_with_bg(bg_files, lit_path, mask_bin_path, out_path):
    if not bg_files:
        # no backgrounds -> just copy the lit render
        Image.open(lit_path).save(out_path); return
    imgI = Image.open(lit_path).convert("RGB")
    # open a random background; give up after a few broken files
    bg = None
    for _ in range(10):
        try:
            from_path = _abs(BACKGROUND_DIR, random.choice(bg_files))
            bg = Image.open(from_path).convert("RGB")
            break
        except Exception:
            continue
    if bg is None:
        imgI.save(out_path); return
    bg = bg.resize(imgI.size)
    m = Image.open(mask_bin_path).convert("L")
    out = Image.composite(imgI, bg, m)
    out.save(out_path)
def yolo_from_mask(mask_bin, img_w, img_h, cls_id=0):
if mask_bin is None: return None
x, y, w, h = cv2.boundingRect(mask_bin)
if w == 0 or h == 0: return None
cx = (x + w/2.0) / float(img_w)
cy = (y + h/2.0) / float(img_h)
nw = w / float(img_w)
nh = h / float(img_h)
return np.array([[cls_id, cx, cy, nw, nh]], dtype=np.float32)
# ===== Main =====
def main():
_ensure_dirs()
backgrounds = _list_backgrounds()
# Connect
client.connect()
if not client.isconnected():
print('UnrealCV server is not running. Start PIE or packaged game with UnrealCV plugin.')
sys.exit(-1)
print(client.request('vget /unrealcv/status'))
# Pick a camera that actually moves
cam = pick_movable_camera()
if cam is None:
print("[ERROR] Could not find a movable camera. In UE PIE Output Log, run: vset /cameras/spawn")
client.disconnect(); sys.exit(1)
# Camera setup
client.request('vset /camera/%s/size %d %d' % (cam, IMG_W, IMG_H))
client.request('vset /camera/%s/fov %d' % (cam, FOV_DEG))
# Resolve target actor name from prefix
target = resolve_target_name(TARGET_PREFIX)
if not target:
print("[ERROR] Could not resolve target with prefix '%s'." % TARGET_PREFIX)
print("Sample objects:", (client.request('vget /objects') or "")[:500])
client.disconnect(); sys.exit(1)
print("[INFO] Target resolved:", target)
# Paint target for mask isolation
r,g,b = UNIQUE_RGB
client.request('vset /object/%s/color %d %d %d' % (target, r, g, b))
time.sleep(0.2)
for i in range(NUM_IMAGES):
print("IMAGE:", i)
# Randomize target
tx = TARGET_BASE_LOC[0] + random.uniform(*TARGET_JITTER)
ty = TARGET_BASE_LOC[1] + random.uniform(*TARGET_JITTER)
tz = TARGET_BASE_LOC[2] + random.uniform(*TARGET_JITTER)
client.request('vset /object/%s/location %f %f %f' % (target, tx, ty, tz))
if RANDOMIZE_TARGET_ROT:
pt = random.uniform(*TROT_PITCH)
yt = random.uniform(*TROT_YAW)
rt = random.uniform(*TROT_ROLL)
client.request('vset /object/%s/rotation %f %f %f' % (target, pt, yt, rt))
# Sample camera pose around target
dist = random.uniform(*DIST_RANGE)
yaw = random.uniform(*YAW_RANGE)
pitch = random.uniform(*PITCH_RANGE)
roll = random.uniform(*ROLL_RANGE)
cx = tx + dist * math.cos(math.radians(pitch)) * math.cos(math.radians(yaw))
cy = ty + dist * math.cos(math.radians(pitch)) * math.sin(math.radians(yaw))
cz = tz + dist * math.sin(math.radians(pitch))
dx, dy, dz = (tx - cx, ty - cy, tz - cz)
cam_yaw = math.degrees(math.atan2(dy, dx))
hyp = math.sqrt(dx*dx + dy*dy)
cam_pitch = math.degrees(math.atan2(dz, hyp))
client.request('vset /camera/%s/location %f %f %f' % (cam, cx, cy, cz))
client.request('vset /camera/%s/rotation %f %f %f' % (cam, cam_pitch, cam_yaw, roll))
# Paths inside data/synth
stem = f"{i:06d}"
img_lit = _abs(RGB_DIR, f"{stem}.png") # lit
img_mask_rgb = _abs(SEG_DIR, f"{stem}_mask.png")# object_mask (RGB)
img_mask_bin = _abs(SEG_DIR, f"{stem}_bin.png") # binary
img_comp = _abs(RGB_DIR, f"{stem}_comp.png") # composite
yolo_txt = _abs(LABELS_DIR, f"{stem}.txt") # YOLO
# Capture
if SAVE_LIT and not vget_and_wait('vget /camera/%s/lit' % cam, img_lit):
print("[WARN] lit not saved - skipping frame"); continue
if (SAVE_MASK or SAVE_COMP or SAVE_YOLO) and not vget_and_wait('vget /camera/%s/object_mask' % cam, img_mask_rgb):
print("[WARN] object_mask not saved - skipping frame"); continue
# Mask isolate
mask = isolate_color_mask(img_mask_rgb, UNIQUE_RGB)
if mask is None:
print("[WARN] failed to read mask image - skipping frame"); continue
mask_bin = np.where(mask > 0, 255, 0).astype(np.uint8)
        if SAVE_MASK or SAVE_COMP:  # the composite step reads the binary mask from disk
            cv2.imwrite(img_mask_bin, mask_bin)
# YOLO
if SAVE_YOLO:
yolo = yolo_from_mask(mask_bin, IMG_W, IMG_H, cls_id=YOLO_CLASS_ID)
if yolo is not None:
np.savetxt(yolo_txt, yolo, fmt="%.0f %.6f %.6f %.6f %.6f")
else:
open(yolo_txt, 'w').close()
# Composite
if SAVE_COMP and SAVE_LIT:
composite_with_bg(backgrounds, img_lit, img_mask_bin, img_comp)
client.disconnect()
print("Done. Outputs in:", SYNTH_ROOT)
if __name__ == "__main__":
main()
Result:
{{ :courses:becm33mle:tutorials:unrealcv_output.png?nolink&600 |}}
Keep synthetic and real data **distinguishable** in your project (e.g., ''source: real/synth'').
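One lightweight way is to carry a ''source'' column when you pool real and synthetic images for training. A sketch; the file name and columns are illustrative:
# tag_sources.py - record whether each training image is real or synthetic (illustrative)
import csv, glob, os
rows = [("filename", "source")]
rows += [(os.path.basename(p), "real") for p in glob.glob("data/split/train/images/*.*")]
rows += [(os.path.basename(p), "synth") for p in glob.glob("data/synth/rgb/*_comp.png")]
with open("data/train_sources.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
print("wrote", len(rows) - 1, "rows")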
====== Useful links ======
* Label Studio: https://labelstud.io/guide/
* Albumentations: https://albumentations.ai/docs/
* UnrealCV: http://unrealcv.org/
* YOLO formats (Ultralytics): https://docs.ultralytics.com/datasets/segment/ (see Detection/Seg variants)
====== HW03: 50-sample dataset (5p, due in 1 week) ======
Submit via BRUTE a folder ''data/hw_submission/'' containing:
* **50 labeled samples** (i.e., images + their label files)
* **a .txt file** with a very short description of the dataset and the data.