====== Lab 03: Datasets ======
In this lab we’ll build a **small, clean, and reproducible dataset** from end to end. You’ll practice sourcing images from public repositories, labeling efficiently with **Label Studio** (assisted by **SAM2**), running a **pre-processing pipeline** (cleaning, stratified splits, leakage checks, augmentations, class-imbalance fixes), and generating **synthetic data** with **Unreal Engine 4.27 + UnrealCV** (RGB/Depth/Seg) with conversion to **YOLO** format and light **domain randomization**.
As your homework, you’ll submit **50 labeled samples** from your dataset.
===== Environment & Tools =====
* **Python 3.10+** (venv recommended), **git**.
* **Label Studio** (GUI labeling) + optional **SAM2** ML assist.
* **OpenCV, Albumentations, scikit-learn** for pre-processing and augments.
* **Unreal Engine 4.27** with **UnrealCV** for synthetic scenes.
# Create & activate virtual environment
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows (PowerShell)
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install label-studio label-studio-ml albumentations==1.4.8 opencv-python scikit-learn numpy matplotlib tqdm pycocotools
GPU is optional. SAM2-assisted masks run on CPU but are slower; a CUDA GPU accelerates everything.
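A quick way to confirm the environment is usable is to import the core packages and print their versions (a minimal sketch; adjust it to whatever you actually installed):
# sanity_check.py - verify the core packages import and report versions
import cv2, sklearn, numpy, albumentations
print("OpenCV:", cv2.__version__)
print("Albumentations:", albumentations.__version__)
print("scikit-learn:", sklearn.__version__)
print("NumPy:", numpy.__version__)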
===== Dataset skeleton =====
We’ll use a simple, reproducible folder layout to avoid path chaos later.
data/
  raw/                      # downloaded/unfiltered originals
  clean/                    # cleaned + deduplicated images
  labels/
    labels/                 # YOLO .txt files (mirrors images by name)
    images/                 # LS export (for traceability)
    classes.txt             # class names
    notes.json              # class info
  augmentations/
    labels/
    images/
    classes.txt
    notes.json
  synth/
    rgb/
    depth/                  # not used
    seg/
    labels/
  split/
    train/{images,labels}
    val/{images,labels}
    test/{images,labels}
  hw_submission/
scripts/
  01_clean_and_dedup.py
  02_label_studio.sh
  03_train_test_split.py
  04_augment_yolo.py
  05_unreal_capture.py
.venv/
Keep a small **class map** (e.g., ''labels.json'') to canonicalize names (''car'' not ''Car'').
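One lightweight way to enforce canonical names is a tiny mapping file plus a normalization helper. A sketch, assuming a hypothetical ''data/labels.json'' such as ''{"Car": "car", "person": "pedestrian"}'':
# normalize_labels.py - canonicalize free-form class names via a labels.json map (hypothetical file)
import json
with open("data/labels.json", encoding="utf-8") as f:
    CLASS_MAP = json.load(f)   # e.g. {"Car": "car", "person": "pedestrian"}
def canonical(name: str) -> str:
    # fall back to lowercased input if the name is not in the map
    return CLASS_MAP.get(name, name.strip().lower())
print(canonical("Car"))  # -> "car"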
====== Part 1: Sourcing public data ======
Goal: assemble **100–300 raw images** for 1–5 target classes. You can mix sources (Kaggle, Open Images, Roboflow Universe, academic repos).
**Suggested sources (with licenses):**
* [[https://www.kaggle.com/datasets|Kaggle Datasets]] - assorted image datasets (check each dataset’s license).
* [[https://universe.roboflow.com/|Roboflow Universe]] - community datasets with convenient exports; verify terms per project.
* [[https://huggingface.co/datasets|Hugging Face Datasets]] - comprehensive search engine for a vast selection of datasets (check licensing).
* [[https://cocodataset.org/#home|MS COCO]] - detection/segmentation/keypoints (Creative Commons; see site for details).
* [[https://www.cityscapes-dataset.com/|Cityscapes]] - urban scenes, semantic/instance segmentation.
* [[http://www.cvlibs.net/datasets/kitti/|KITTI]] - autonomous driving benchmarks (detection/tracking).
* [[https://paperswithcode.com/datasets?task=object-detection|Papers with Code - Datasets]] - searchable index by task.
* [[https://huggingface.co/datasets?task_categories=task_categories:computer-vision|Hugging Face Datasets (vision)]] - many CV datasets with loaders.
* [[https://data.gov/|Data.gov]] - US open data portal (images mixed in; check licensing).
Checklist:
* Verify **license** (CC0, CC-BY, etc.). Record **URL + license + date accessed** in ''DATA_CARD.md''.
* Prefer diverse **scenes** (lighting, backgrounds, sizes, partial occlusions).
* Save a **manifest** (CSV/JSON: url, license, attribution) alongside downloads; a minimal sketch follows below.
mkdir -p data/raw
# Example: use dataset-specific CLI or curl/wget; store your manifest!
Do not commit raw copyrighted data unless the license permits. Use DVC or Git LFS for large files.
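A manifest can be as simple as one CSV row per downloaded file. A sketch (the column names and the ''data/raw/manifest.csv'' path are just a suggestion):
# write_manifest.py - append provenance rows for downloaded images (illustrative column layout)
import csv, datetime, os
MANIFEST = "data/raw/manifest.csv"
os.makedirs(os.path.dirname(MANIFEST), exist_ok=True)
def add_entry(filename, url, license_name, attribution=""):
    new_file = not os.path.exists(MANIFEST)
    with open(MANIFEST, "a", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        if new_file:
            w.writerow(["filename", "url", "license", "attribution", "date_accessed"])
        w.writerow([filename, url, license_name, attribution, datetime.date.today().isoformat()])
add_entry("000123.jpg", "https://example.com/img/000123.jpg", "CC-BY-4.0", "Example Author")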
====== Part 2: Labeling with Label Studio (SAM2) ======
Start Label Studio locally and create a project with your taxonomy (e.g., ''car, pedestrian''). Use boxes or polygons.
label-studio start
# open http://localhost:8080
Steps:
1. Create project → define labels (consistent singular names).
2. Import images from data/raw/ (or data/clean/ after Part 3.1 if you prefer).
3. (Optional) Connect Studio ML backend using SAM2 for assisted masks.
4. Label a subset thoroughly.
5. Export as YOLO (or COCO) to data/labels/labelstudio_export/.
Prefer polygons/masks for irregular shapes; use boxes if your model will be box-based. Keep it consistent across the set.
====== Part 3: Pre-processing pipeline ======
==== 3.1 Cleaning & Deduplication ====
Remove corrupt files and duplicates to mitigate leakage and bias. The script below drops unreadable images and exact pixel duplicates (it hashes a re-encoded copy of each image); for near-duplicates, see the perceptual-hash sketch after it.
# scripts/01_clean_and_dedup.py
import cv2, os, hashlib, glob, shutil
from tqdm import tqdm
SRC, DST = "data/raw", "data/clean"
os.makedirs(DST, exist_ok=True)
seen = set()
for p in tqdm(glob.glob(f"{SRC}/**/*.*", recursive=True)):
if not (p.lower().endswith((".jpg",".jpeg",".png",".bmp"))):
continue
img = cv2.imread(p)
if img is None:
continue
    # hash a re-encoded copy so identical pixel content matches even if the files differ
    ok, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), 95])
    if not ok:
        continue
    h = hashlib.md5(buf).hexdigest()
if h in seen:
continue
seen.add(h)
shutil.copy2(p, os.path.join(DST, os.path.basename(p)))
print("Kept:", len(seen))
==== 3.2 Convert/export to YOLO ====
Export your annotations from Label Studio in YOLO format. If Label Studio is not already running, start it with:
label-studio start
When connecting [[https://ai.meta.com/sam2/|SAM2]] as the ML-assist backend, use a Labeling Interface that includes brush labels (typically together with keypoint/rectangle labels that serve as prompts).
SAM2 generates ''brush labels'' (pixel masks), which cannot be used directly as YOLO ''segmentation masks'' — convert them to polygons or boxes first (see the sketch below).
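If you end up with per-object binary masks (e.g., exported PNGs), they can be converted to YOLO segmentation polygons with OpenCV contours. A minimal sketch; the paths and class id are placeholders:
# mask_to_yolo_seg.py - turn a single-object binary mask into a YOLO segmentation line (illustrative)
import cv2
def mask_to_yolo_polygon(mask_path, class_id=0):
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    if mask is None:
        return None
    h, w = mask.shape[:2]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)
    # YOLO segmentation format: "class x1 y1 x2 y2 ..." with coordinates normalized to [0, 1]
    coords = " ".join(f"{x / w:.6f} {y / h:.6f}" for x, y in contour)
    return f"{class_id} {coords}"
print(mask_to_yolo_polygon("mask_example.png"))  # hypothetical mask file
For box-based models, a bounding rectangle of the mask (as in Part 4) is enough.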
==== 3.3 Stratified split (and group-by-source if possible) ====
Stratify by **labels present**; if you know the **source/scene**, group-split to avoid same-scene leakage (see the GroupShuffleSplit sketch after the script).
# scripts/03_train_test_split.py
from sklearn.model_selection import train_test_split
import os, glob, shutil
IMAGES = "data/clean"
LABELS = "data/labels/yolo/labels"
OUT = "data/split"
def label_count(lbl):
    # number of annotation lines in a YOLO label file (0 if missing/unreadable)
    try:
        return sum(1 for _ in open(lbl, "r", encoding="utf-8"))
    except OSError:
        return 0
imgs = sorted([p for p in glob.glob(f"{IMAGES}/*.*") if p.lower().endswith((".jpg",".jpeg",".png"))])
X, y = [], []
for ip in imgs:
lp = os.path.join(LABELS, os.path.splitext(os.path.basename(ip))[0] + ".txt")
X.append((ip, lp))
y.append(0 if not os.path.exists(lp) or label_count(lp)==0 else 1)
def dump(split, items):
for sub in ["images","labels"]:
os.makedirs(f"{OUT}/{split}/{sub}", exist_ok=True)
for ip, lp in items:
shutil.copy2(ip, f"{OUT}/{split}/images/{os.path.basename(ip)}")
dst = f"{OUT}/{split}/labels/{os.path.splitext(os.path.basename(ip))[0]}.txt"
if os.path.exists(lp):
shutil.copy2(lp, dst)
else:
open(dst,"w").close() # empty labels if no objects
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
dump("train", X_tr); dump("val", X_va); dump("test", X_te)
print("Split done.")
==== 3.4 Augmentations & Imbalance ====
Use label-preserving transforms to increase diversity. Target minority classes for class-balanced sampling.
# scripts/04_augment_yolo.py
import albumentations as A, cv2, os, glob
import numpy as np
AUG = A.Compose([
A.HorizontalFlip(p=0.5),
A.RandomBrightnessContrast(p=0.35),
A.ColorJitter(p=0.3),
A.MotionBlur(blur_limit=5, p=0.2),
    A.Affine(scale=(0.5,1.5), translate_percent=(-0.2,0.2), rotate=(-5,5), p=0.5),  # shift up to ±20%
], bbox_params=A.BboxParams(format="yolo", label_fields=["cls"]))
IN_IMG = "data/split/train/images"
IN_LBL = "data/split/train/labels"
OUT_IMG = "data/split/train/images_aug"
OUT_LBL = "data/split/train/labels_aug"
os.makedirs(OUT_IMG, exist_ok=True); os.makedirs(OUT_LBL, exist_ok=True)
def read_yolo(lbl_path, w, h):
    # parse a YOLO label file -> (normalized [xc, yc, w, h] boxes, class ids); w, h unused here
bxs, cls = [], []
if os.path.exists(lbl_path):
for line in open(lbl_path, "r").read().strip().splitlines():
if not line: continue
c, xc, yc, ww, hh = line.split()
bxs.append([float(xc),float(yc),float(ww),float(hh)])
cls.append(int(c))
return bxs, cls
for ip in glob.glob(f"{IN_IMG}/*.*"):
if not ip.lower().endswith((".jpg",".jpeg",".png")):
continue
base = os.path.splitext(os.path.basename(ip))[0]
lp = os.path.join(IN_LBL, base + ".txt")
    img = cv2.imread(ip)
    if img is None:
        continue
    h, w = img.shape[:2]
    bxs, cls = read_yolo(lp, w, h)
    # images without boxes are still augmented; they simply get an empty label file
    out = AUG(image=img, bboxes=bxs, cls=cls)
    aug_img, aug_bxs, aug_cls = out["image"], out["bboxes"], out["cls"]
cv2.imwrite(os.path.join(OUT_IMG, base + "_aug.jpg"), aug_img)
with open(os.path.join(OUT_LBL, base + "_aug.txt"), "w") as f:
for c,(xc,yc,ww,hh) in zip(aug_cls, aug_bxs):
f.write(f"{c} {xc:.6f} {yc:.6f} {ww:.6f} {hh:.6f}\n")
Strive for a ≤3× ratio between the most and least frequent classes in the **train** split (the sketch below gives a quick check). Use class-aware sampling when training.
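A quick check of the class distribution in the train labels (a sketch; class ids follow ''classes.txt''):
# class_balance_check.py - count YOLO class instances in the train split
import collections, glob
counts = collections.Counter()
for lp in glob.glob("data/split/train/labels/*.txt"):
    with open(lp, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[int(line.split()[0])] += 1
print(counts)
if counts:
    print("max/min ratio: %.1f" % (max(counts.values()) / max(1, min(counts.values()))))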
====== Part 4: Synthetic data (UE 4.27 + UnrealCV → YOLO) ======
We’ll render small synthetic bursts to cover rare poses/backgrounds. You’ll need **UnrealCV** enabled and a minimal scene.
**Domain randomization checklist:**
* Lighting (HDRI/time-of-day/intensity), materials (color/roughness), camera FoV/pose jitter.
* Distractors/backgrounds, slight sensor noise (JPEG, Gaussian; see the noise sketch after this list), gentle lens distortion.
* Scale/pose variety of target meshes.
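Sensor noise and compression artifacts can also be added as a post-processing pass over the rendered frames. A sketch with Albumentations (parameter names per the 1.4.x API; the output folder is just a suggestion):
# synth_noise_pass.py - add mild sensor noise / JPEG artifacts to rendered frames (sketch)
import albumentations as A
import cv2, glob, os
NOISE = A.Compose([
    A.GaussNoise(var_limit=(10.0, 40.0), p=0.5),
    A.ImageCompression(quality_lower=60, quality_upper=95, p=0.5),
    A.OpticalDistortion(distort_limit=0.05, p=0.3),
])
OUT_DIR = "data/synth/rgb_noisy"   # suggested output folder
os.makedirs(OUT_DIR, exist_ok=True)
for p in glob.glob("data/synth/rgb/*_comp.png"):
    img = cv2.imread(p)
    if img is None:
        continue
    cv2.imwrite(os.path.join(OUT_DIR, os.path.basename(p)), NOISE(image=img)["image"])
Keep the distortion gentle, since it shifts pixels slightly relative to labels generated from the original masks. The capture client below then handles the geometric side (camera pose, target jitter, mask-based labels) inside UE.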
# scripts/05_unreal_capture.py - UnrealCV capture client
# UnrealCV dataset capture
# Outputs to: data/synth/{rgb, seg, labels} with optional backgrounds in data/backgrounds
# - Picks a camera that actually moves (A/B diff test)
# - Resolves target actor by prefix
# - Colors target UNIQUE_RGB to isolate in /object_mask
# - Spherical camera sampling + look-at
# - YOLO labels from binary mask
from __future__ import annotations
import os, sys, time, math, random, re
import numpy as np
import cv2
from PIL import Image
# ======== USER CONFIG (minimal) ========
PORT = 9005
TARGET_PREFIX = "bone_actor" # e.g., 'bone_actor_20'
UNIQUE_RGB = (255, 0, 255) # color to paint target for mask extraction
NUM_IMAGES = 20
IMG_W, IMG_H = 1280, 960
FOV_DEG = 60
# Camera randomization (keeps target in view)
DIST_RANGE = (150, 600)
YAW_RANGE = (-180, 180)
PITCH_RANGE = (-25, 25)
ROLL_RANGE = (-10, 10)
# Target jitter & rotation
TARGET_BASE_LOC = (0, 0, 100)
TARGET_JITTER = (-5, 5)
RANDOMIZE_TARGET_ROT = True
TROT_PITCH = (-10, 10)
TROT_YAW = (0, 360)
TROT_ROLL = (-10, 10)
# What to save
SAVE_LIT = True # saves lit render to data/synth/rgb/{id}.png
SAVE_MASK = True # saves binary mask to data/synth/seg/{id}_bin.png
SAVE_COMP = True # composite (lit over random background) to data/synth/rgb/{id}_comp.png
SAVE_YOLO = True # YOLO txt to data/synth/labels/{id}.txt
YOLO_CLASS_ID = 0
# ======== PROJECT PATHS (fit course skeleton) ========
PROJ_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
DATA_ROOT = os.path.join(PROJ_ROOT, "data")
SYNTH_ROOT = os.path.join(DATA_ROOT, "synth")
RGB_DIR = os.path.join(SYNTH_ROOT, "rgb")
SEG_DIR = os.path.join(SYNTH_ROOT, "seg")
LABELS_DIR = os.path.join(SYNTH_ROOT, "labels")
PROBE_DIR = os.path.join(SYNTH_ROOT, "_probes")
BACKGROUND_DIR = os.path.join(DATA_ROOT, "backgrounds") # optional folder for composites
def _ensure_dirs():
for d in (RGB_DIR, SEG_DIR, LABELS_DIR, PROBE_DIR):
os.makedirs(d, exist_ok=True)
# backgrounds are optional
def _abs(*parts): return os.path.abspath(os.path.join(*parts))
def _list_backgrounds():
if not os.path.isdir(BACKGROUND_DIR): return []
exts = (".jpg", ".jpeg", ".png", ".bmp")
return [f for f in os.listdir(BACKGROUND_DIR) if f.lower().endswith(exts)]
# ===== UnrealCV =====
try:
import unrealcv
except ImportError:
print("Missing 'unrealcv' package. Install it in the UE Python environment or your venv.", file=sys.stderr)
sys.exit(1)
client = unrealcv.Client(('localhost', PORT))
# ===== Helpers =====
def vget_and_wait(cmd, out_path, timeout=3.0):
out_path = _abs(out_path)
client.request('%s %s' % (cmd, out_path))
t0 = time.time()
while time.time() - t0 < timeout:
if os.path.isfile(out_path) and os.path.getsize(out_path) > 0:
return True
time.sleep(0.05)
return False
def resolve_target_name(prefix):
objs = client.request('vget /objects') or ""
names = objs.split()
if not names: return None
pat = re.compile(r'^%s(?:_\d+)?$' % re.escape(prefix), re.IGNORECASE)
exact = [n for n in names if pat.match(n)]
if exact: return exact[0]
starts = [n for n in names if n.lower().startswith(prefix.lower())]
return starts[0] if starts else None
# ---------- Camera movement: pick a camera that REALLY moves ----------
def _save_probe(cam_token, xyz, pyr, path):
x,y,z = xyz; pitch,yaw,roll = pyr
client.request('vset /camera/%s/location %f %f %f' % (cam_token, x, y, z))
client.request('vset /camera/%s/rotation %f %f %f' % (cam_token, pitch, yaw, roll))
time.sleep(0.05) # settle
return vget_and_wait('vget /camera/%s/lit' % cam_token, path)
def _img_diff(a_path, b_path):
a = cv2.imread(a_path); b = cv2.imread(b_path)
if a is None or b is None: return 0.0
if a.shape != b.shape:
h = min(a.shape[0], b.shape[0]); w = min(a.shape[1], b.shape[1])
a = a[:h,:w]; b = b[:h,:w]
diff = cv2.absdiff(a, b)
return float(diff.mean())
def pick_movable_camera():
client.request('vset /cameras/spawn')
time.sleep(0.2)
tokens = (client.request('vget /cameras') or "").split()
nums = [t for t in tokens if t.isdigit()]
candidates = nums + [t for t in tokens if t not in nums] + [str(i) for i in range(8)]
seen = set()
candidates = [t for t in candidates if not (t in seen or seen.add(t))]
A_loc = ( 0.0, 0.0, 300.0); A_rot = ( -15.0, 0.0, 0.0)
B_loc = (500.0, 200.0, 150.0); B_rot = ( -10.0, 140.0, 0.0)
for cam in candidates:
pA = _abs(PROBE_DIR, "_probeA_%s.png" % cam)
pB = _abs(PROBE_DIR, "_probeB_%s.png" % cam)
okA = _save_probe(cam, A_loc, A_rot, pA)
okB = _save_probe(cam, B_loc, B_rot, pB)
if not (okA and okB):
for p in (pA,pB):
try: os.remove(p)
except: pass
continue
d = _img_diff(pA, pB)
for p in (pA,pB):
try: os.remove(p)
except: pass
if d > 1.0:
print("[INFO] Movable camera selected:", cam, "(diff=%.2f)" % d)
return cam
return None
# ---------- Mask / Composite / YOLO ----------
def isolate_color_mask(mask_rgb_path, rgb):
img_bgr = cv2.imread(mask_rgb_path)
if img_bgr is None: return None
img = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
r,g,b = rgb
lower = np.array([r,g,b], dtype=np.uint8)
upper = np.array([r,g,b], dtype=np.uint8)
return cv2.inRange(img, lower, upper)
def composite_with_bg(bg_files, lit_path, mask_bin_path, out_path):
    if not bg_files:
        # no backgrounds -> just copy the lit render
        Image.open(lit_path).save(out_path); return
    imgI = Image.open(lit_path).convert("RGB")
    # open a random background; give up after a few broken files
    bg = None
    for _ in range(10):
        try:
            from_path = _abs(BACKGROUND_DIR, random.choice(bg_files))
            bg = Image.open(from_path).convert("RGB")
            break
        except Exception:
            continue
    if bg is None:
        imgI.save(out_path); return
    bg = bg.resize(imgI.size)
    m = Image.open(mask_bin_path).convert("L")
    out = Image.composite(imgI, bg, m)
    out.save(out_path)
def yolo_from_mask(mask_bin, img_w, img_h, cls_id=0):
if mask_bin is None: return None
x, y, w, h = cv2.boundingRect(mask_bin)
if w == 0 or h == 0: return None
cx = (x + w/2.0) / float(img_w)
cy = (y + h/2.0) / float(img_h)
nw = w / float(img_w)
nh = h / float(img_h)
return np.array([[cls_id, cx, cy, nw, nh]], dtype=np.float32)
# ===== Main =====
def main():
_ensure_dirs()
backgrounds = _list_backgrounds()
# Connect
client.connect()
if not client.isconnected():
print('UnrealCV server is not running. Start PIE or packaged game with UnrealCV plugin.')
sys.exit(-1)
print(client.request('vget /unrealcv/status'))
# Pick a camera that actually moves
cam = pick_movable_camera()
if cam is None:
print("[ERROR] Could not find a movable camera. In UE PIE Output Log, run: vset /cameras/spawn")
client.disconnect(); sys.exit(1)
# Camera setup
client.request('vset /camera/%s/size %d %d' % (cam, IMG_W, IMG_H))
client.request('vset /camera/%s/fov %d' % (cam, FOV_DEG))
# Resolve target actor name from prefix
target = resolve_target_name(TARGET_PREFIX)
if not target:
print("[ERROR] Could not resolve target with prefix '%s'." % TARGET_PREFIX)
print("Sample objects:", (client.request('vget /objects') or "")[:500])
client.disconnect(); sys.exit(1)
print("[INFO] Target resolved:", target)
# Paint target for mask isolation
r,g,b = UNIQUE_RGB
client.request('vset /object/%s/color %d %d %d' % (target, r, g, b))
time.sleep(0.2)
for i in range(NUM_IMAGES):
print("IMAGE:", i)
# Randomize target
tx = TARGET_BASE_LOC[0] + random.uniform(*TARGET_JITTER)
ty = TARGET_BASE_LOC[1] + random.uniform(*TARGET_JITTER)
tz = TARGET_BASE_LOC[2] + random.uniform(*TARGET_JITTER)
client.request('vset /object/%s/location %f %f %f' % (target, tx, ty, tz))
if RANDOMIZE_TARGET_ROT:
pt = random.uniform(*TROT_PITCH)
yt = random.uniform(*TROT_YAW)
rt = random.uniform(*TROT_ROLL)
client.request('vset /object/%s/rotation %f %f %f' % (target, pt, yt, rt))
# Sample camera pose around target
dist = random.uniform(*DIST_RANGE)
yaw = random.uniform(*YAW_RANGE)
pitch = random.uniform(*PITCH_RANGE)
roll = random.uniform(*ROLL_RANGE)
cx = tx + dist * math.cos(math.radians(pitch)) * math.cos(math.radians(yaw))
cy = ty + dist * math.cos(math.radians(pitch)) * math.sin(math.radians(yaw))
cz = tz + dist * math.sin(math.radians(pitch))
dx, dy, dz = (tx - cx, ty - cy, tz - cz)
cam_yaw = math.degrees(math.atan2(dy, dx))
hyp = math.sqrt(dx*dx + dy*dy)
cam_pitch = math.degrees(math.atan2(dz, hyp))
client.request('vset /camera/%s/location %f %f %f' % (cam, cx, cy, cz))
client.request('vset /camera/%s/rotation %f %f %f' % (cam, cam_pitch, cam_yaw, roll))
# Paths inside data/synth
stem = f"{i:06d}"
img_lit = _abs(RGB_DIR, f"{stem}.png") # lit
img_mask_rgb = _abs(SEG_DIR, f"{stem}_mask.png")# object_mask (RGB)
img_mask_bin = _abs(SEG_DIR, f"{stem}_bin.png") # binary
img_comp = _abs(RGB_DIR, f"{stem}_comp.png") # composite
yolo_txt = _abs(LABELS_DIR, f"{stem}.txt") # YOLO
# Capture
if SAVE_LIT and not vget_and_wait('vget /camera/%s/lit' % cam, img_lit):
print("[WARN] lit not saved - skipping frame"); continue
if (SAVE_MASK or SAVE_COMP or SAVE_YOLO) and not vget_and_wait('vget /camera/%s/object_mask' % cam, img_mask_rgb):
print("[WARN] object_mask not saved - skipping frame"); continue
# Mask isolate
mask = isolate_color_mask(img_mask_rgb, UNIQUE_RGB)
if mask is None:
print("[WARN] failed to read mask image - skipping frame"); continue
mask_bin = np.where(mask > 0, 255, 0).astype(np.uint8)
        if SAVE_MASK or SAVE_COMP:  # the composite step reads the binary mask from disk
            cv2.imwrite(img_mask_bin, mask_bin)
# YOLO
if SAVE_YOLO:
yolo = yolo_from_mask(mask_bin, IMG_W, IMG_H, cls_id=YOLO_CLASS_ID)
if yolo is not None:
np.savetxt(yolo_txt, yolo, fmt="%.0f %.6f %.6f %.6f %.6f")
else:
open(yolo_txt, 'w').close()
# Composite
if SAVE_COMP and SAVE_LIT:
composite_with_bg(backgrounds, img_lit, img_mask_bin, img_comp)
client.disconnect()
print("Done. Outputs in:", SYNTH_ROOT)
if __name__ == "__main__":
main()
Result:
{{ :courses:becm33mle:tutorials:unrealcv_output.png?nolink&600 |}}
Keep synthetic and real data **distinguishable** in your project (e.g., ''source: real/synth'').
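One lightweight way is to carry a ''source'' column when you pool real and synthetic images for training. A sketch; the file name and columns are illustrative:
# tag_sources.py - record whether each training image is real or synthetic (illustrative)
import csv, glob, os
rows = [("filename", "source")]
rows += [(os.path.basename(p), "real") for p in glob.glob("data/split/train/images/*.*")]
rows += [(os.path.basename(p), "synth") for p in glob.glob("data/synth/rgb/*_comp.png")]
with open("data/train_sources.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
print("wrote", len(rows) - 1, "rows")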
====== Useful links ======
* Label Studio: https://labelstud.io/guide/
* Albumentations: https://albumentations.ai/docs/
* UnrealCV: http://unrealcv.org/
* YOLO formats (Ultralytics): https://docs.ultralytics.com/datasets/segment/ (see Detection/Seg variants)
====== HW03: 50-sample dataset (5p, due in 1 week) ======
Submit via BRUTE a folder ''data/hw_submission/'' containing:
* **50 labeled samples** (i.e., images + their label files)
* **a .txt file** with a very short description of the dataset and the data.