visar.log
Technical notes from building things

Building an Automated Hitomi Gallery Sync System

The Problem

Manual workflow for downloading from hitomi.la: copy artist name, go to the site, type it in the search bar, click the autocomplete suggestion, hit enter, copy the URL, paste into the downloader queue. Repeat for every artist and group. No way to check for new uploads without redoing the entire ritual.

With 77 tracked artists/groups, this was unsustainable.

Existing Infrastructure

Three-tier Node.js backend already running as systemd user services:

  • Downloader (port 29748) — Express + WebSocket queue manager wrapping gallery-dl (Python). Manages sequential downloads with marker files to hide incomplete galleries from the indexer.
  • Indexer (port 29750) — In-memory search engine over downloaded galleries. Scans ~/Pictures/gallery-dl/hitomi/ via a compiled Go binary that parses info.json metadata files.
  • Streamer (port 29749) — Reverse proxy + sprite generator. Serves the SvelteKit frontend, proxies to indexer/downloader, generates WebP thumbnail strips on demand.

gallery-dl stores a download archive in ~/Pictures/gallery-dl/hitomi.sqlite3 — a single table (archive) with entries like hitomi{gallery_id}_{image_num}. 1,922 galleries (373K image entries) at the start of this session.
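
Because sync.py (Step 4) only needs gallery IDs, that whole archive collapses into a small set. A minimal sketch, assuming gallery-dl's usual single entry column and the entry format described above:

# archive_ids.py: collapse the gallery-dl archive into a set of gallery IDs (sketch)
import sqlite3
from pathlib import Path

ARCHIVE = Path.home() / "Pictures/gallery-dl/hitomi.sqlite3"

def archived_gallery_ids(db_path: Path = ARCHIVE) -> set[int]:
    """Collapse per-image entries (hitomi{gallery_id}_{image_num}) into gallery IDs."""
    con = sqlite3.connect(db_path)
    try:
        return {
            int(entry.removeprefix("hitomi").split("_")[0])
            for (entry,) in con.execute("SELECT entry FROM archive")
            if entry.startswith("hitomi")
        }
    finally:
        con.close()

print(len(archived_gallery_ids()), "archived galleries")   # 1,922 at the time of writing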

Step 1: Understanding the Search — Dry-Run Tag Analysis

Before building anything, needed to understand how hitomi’s search works and whether filtering by tags was viable. Built a dry-run script that:

  1. Used hitomi’s nozomi binary indexes to resolve a search query to gallery IDs
  2. Fetched metadata for each gallery in parallel (10 threads)
  3. Collected all tags into a CSV report

Tested with language:japanese female:cheating -female:mother: 11,190 galleries, 1,143 unique tags. Opened in LibreOffice to browse. Conclusion: too many tags to filter effectively — better to just track specific artists/groups.

Key discovery: no server-side limit on negative filters. The search is entirely client-side — fetch nozomi index files (binary arrays of 4-byte big-endian gallery IDs), do set intersection/difference locally. Each negative term is just one more HTTP request for an index file.
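
A minimal sketch of that client-side mechanism, using only the decode format described above; the index URLs are supplied by the caller (the artist/group URLs in Step 3 are the confirmed examples):

# nozomi_search.py: client-side search primitives (sketch)
import struct
import urllib.request

def fetch_nozomi(url: str) -> set[int]:
    """Fetch one .nozomi index and decode it: a flat array of 4-byte big-endian IDs."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    return set(struct.unpack(f">{len(data) // 4}I", data))

def search(include: list[str], exclude: list[str]) -> set[int]:
    """Positive terms intersect, negative terms subtract; one request per term."""
    result = fetch_nozomi(include[0])
    for url in include[1:]:
        result &= fetch_nozomi(url)
    for url in exclude:
        result -= fetch_nozomi(url)
    return result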

Step 2: What gallery-dl Does on a Re-Check

When gallery-dl processes a search URL, it fetches metadata for every matching gallery — even already-downloaded ones — before checking the archive. Per gallery:

  1. Fetch ltn.gold-usergeneratedcontent.net/galleries/{id}.js — metadata
  2. Fetch gg.js — CDN routing table
  3. Resolve image URLs
  4. Check each image against the SQLite archive
  5. Skip if already downloaded

For an artist with 50 galleries where all are downloaded, that’s still 50+ HTTP requests just to discover nothing’s new. Multiply by 77 artists and a daily check becomes thousands of wasted requests.

Step 3: Key Insight — Language-Specific Nozomi Indexes

First attempt at the sync script used the search extractor’s approach: fetch language:japanese nozomi (523,786 IDs!), fetch artist:X nozomi, intersect client-side. This worked but was wasteful — the massive language index download dominated the runtime (5.1 seconds for just 2 entries).

Then realized: hitomi already has pre-filtered, language-specific indexes for every artist and group:

https://ltn.gold-usergeneratedcontent.net/artist/dokuneko noil-japanese.nozomi
https://ltn.gold-usergeneratedcontent.net/group/hokkyoku hotaru-japanese.nozomi

These are the same indexes used by hitomi’s tag pages (e.g., hitomi.la/artist/dokuneko_noil-japanese.html). One HTTP request per artist returns exactly the Japanese gallery IDs we need. No intersection required.

This dropped the check for 2 entries from 5.1s to 1.3s. For 77 entries: 45 seconds total.

Step 4: Building sync.py

The sync script (hitomi-backend/sync.py):

  1. Reads artists.txt — one entry per line (artist:name or group:name)
  2. Loads all archived gallery IDs from SQLite into a set (1,922 entries, instant)
  3. For each entry, fetches the language-specific nozomi index (1 HTTP request)
  4. Decodes binary: struct.unpack(f'>{count}I', data)
  5. Set difference against archived IDs
  6. With --queue: POSTs only the new gallery URLs to the downloader (a condensed sketch of this flow follows the artists.txt example below)

# artists.txt — defaults to japanese
artist:dokuneko_noil
group:hokkyoku_hotaru
language:korean artist:some_name   # override language
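
Pulling the pieces together, a condensed sketch of the preview (dry-run) flow. The real sync.py may differ; in particular, turning an entry name into an index URL by swapping underscores for spaces is inferred from the URL examples in Step 3, the --queue POST is omitted, and archived_gallery_ids is repeated from the earlier sketch so this file runs on its own.

# sync_sketch.py: preview-only version of steps 1-5 above (sketch)
import sqlite3
import struct
import urllib.request
from pathlib import Path
from urllib.parse import quote

LTN = "https://ltn.gold-usergeneratedcontent.net"
ARCHIVE = Path.home() / "Pictures/gallery-dl/hitomi.sqlite3"

def archived_gallery_ids() -> set[int]:
    # Collapse per-image archive entries (hitomi{gallery_id}_{image_num}) into gallery IDs.
    con = sqlite3.connect(ARCHIVE)
    try:
        return {int(e.removeprefix("hitomi").split("_")[0])
                for (e,) in con.execute("SELECT entry FROM archive")
                if e.startswith("hitomi")}
    finally:
        con.close()

def remote_ids(entry: str, language: str) -> set[int]:
    # "artist:dokuneko_noil" -> .../artist/dokuneko%20noil-japanese.nozomi
    # (underscores-to-spaces is an assumption based on the Step 3 URLs)
    kind, _, name = entry.partition(":")
    url = f"{LTN}/{kind}/{quote(name.replace('_', ' '))}-{language}.nozomi"
    data = urllib.request.urlopen(url).read()
    return set(struct.unpack(f">{len(data) // 4}I", data))

def main() -> None:
    archived = archived_gallery_ids()
    for raw in Path("artists.txt").read_text().splitlines():
        line = raw.split("#", 1)[0].strip()          # drop comments and blank lines
        if not line:
            continue
        language = "japanese"                        # default, as noted in artists.txt
        if line.startswith("language:"):             # e.g. "language:korean artist:some_name"
            override, line = line.split(None, 1)
            language = override.split(":", 1)[1]
        for gid in sorted(remote_ids(line, language) - archived):
            print(f"https://hitomi.la/galleries/{gid}.html")

if __name__ == "__main__":
    main()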

Performance Comparison

All numbers are for the full run of 77 entries (7,063 galleries total):

  • gallery-dl checking everything: ~7,063 HTTP requests (1 per gallery for metadata), many minutes
  • sync.py intersecting with the full language:japanese index: 79 requests (77 + 2 for the language index), ~3 minutes
  • sync.py with language-specific nozomi indexes: 77 requests (1 per entry), 45 seconds

Step 5: Validating 77 Artist/Group Names

Not all names are artists — some are groups (circles). Hitomi has separate indexes. A test script tried both artist/{name}-japanese.nozomi and group/{name}-japanese.nozomi for each name. Results:

  • 404 on one, data on the other → clear choice (72 names)
  • Both exist → pick whichever has more galleries (5 names: hokkyoku_hotaru, nabe, takaya, sasanoha_toro, gabugabu)
  • 77/77 resolved successfully

Output: a validated artists.txt with correct artist: or group: prefixes for each entry.
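
A sketch of that validation pass. The input file name (names.txt) is illustrative, and the underscore-to-space substitution in index names is the same inference as before; the classification rules are the ones listed above.

# validate_names.py: classify each name as artist: or group: by probing both indexes (sketch)
import urllib.error
import urllib.request
from urllib.parse import quote

LTN = "https://ltn.gold-usergeneratedcontent.net"

def gallery_count(kind: str, name: str) -> int:
    """Number of Japanese galleries behind kind/{name}, or 0 on a 404."""
    url = f"{LTN}/{kind}/{quote(name.replace('_', ' '))}-japanese.nozomi"
    try:
        with urllib.request.urlopen(url) as resp:
            return len(resp.read()) // 4             # 4 bytes per gallery ID
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return 0
        raise

with open("names.txt") as f, open("artists.txt", "w") as out:
    for name in (line.strip() for line in f if line.strip()):
        counts = {kind: gallery_count(kind, name) for kind in ("artist", "group")}
        if not any(counts.values()):
            print(f"!! {name}: not found as artist or group")
            continue
        kind = max(counts, key=counts.get)           # both exist -> more galleries wins
        out.write(f"{kind}:{name}\n")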

Step 6: Deployment Surprise — 413 Payload Too Large

First production run queued 5,423 new galleries. The POST /queue with all URLs failed:

ERROR posting to downloader: HTTP Error 413: Payload Too Large

Express’s body-parser defaults to 100KB. 5,423 URLs as newline-separated text is ~250KB. Fix: bodyParser.json({ limit: '5mb' }).

Rebuilt downloader, restarted service, re-ran sync. Queued successfully.

Step 7: Confirming Downloads Work

Verified the pipeline end-to-end:

$ curl -sk https://localhost:29748/status | python3 -c "..."
Running: True
Current: https://hitomi.la/galleries/2367319.html
Queue: 5003 remaining

Gallery 3733668 ("Giri no Haha") appeared in ~/Pictures/gallery-dl/hitomi/ with 35 images + thumbnails + info.json.

Are Logs a Bottleneck?

No. The downloader emits gallery-dl stdout/stderr via Socket.IO. With no browser clients connected (the normal case during automated sync), io.emit() to zero clients is a no-op. No serialization, no buffering, no disk I/O. The node process doesn’t even log download output to the systemd journal.

The only bottleneck is image downloads — pure network I/O. gallery-dl downloads images sequentially (one gallery at a time, images sequential within), with a fast SQLite archive check per image.

Step 8: Systemd Timer for Daily Automation

# ~/.config/systemd/user/hitomi-sync.timer
[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true

Persistent=true catches up on missed runs if the machine was off at 4 AM. The service is Type=oneshot — runs sync.py --queue and exits.
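
For completeness, a sketch of the matching service unit. The WorkingDirectory and ExecStart paths are assumptions based on the dry-run command below; adjust them to wherever hitomi-backend actually lives.

# ~/.config/systemd/user/hitomi-sync.service (sketch; paths are assumptions)
[Unit]
Description=Queue new hitomi galleries for tracked artists/groups

[Service]
Type=oneshot
WorkingDirectory=%h/hitomi-backend
ExecStart=%h/hitomi-backend/gallery-dl/.venv/bin/python3 sync.py --queue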

# Check next scheduled run
systemctl --user list-timers hitomi-sync.timer

# Manual trigger
systemctl --user start hitomi-sync.service

# View past run logs
journalctl --user -u hitomi-sync.service

# Dry-run (preview only, no downloads)
gallery-dl/.venv/bin/python3 sync.py

Architecture Summary

artists.txt (77 entries)
    │
    ▼
sync.py (daily at 04:00)
    │  1. Read entries
    │  2. Fetch nozomi index per entry (77 HTTP reqs, 45s)
    │  3. Diff against SQLite archive
    │  4. POST new gallery URLs
    ▼
Downloader (port 29748)
    │  Queue manager spawns gallery-dl per gallery
    │  Marker files hide in-progress from indexer
    ▼
~/Pictures/gallery-dl/hitomi/{id} {title}/
    ├── images + thumbnails
    ├── info.json
    └── archive entry in hitomi.sqlite3

Files Changed

  • hitomi-backend/sync.py — the sync script (no dependencies beyond stdlib)
  • hitomi-backend/artists.txt — 77 tracked artists/groups
  • hitomi-backend/downloader/src/main.ts — bumped body-parser limit to 5mb
  • hitomi-backend/downloader/src/config.ts — relative paths instead of hardcoded
  • ~/.config/systemd/user/hitomi-sync.service — oneshot service
  • ~/.config/systemd/user/hitomi-sync.timer — daily timer

Lessons

  1. Profile before optimizing. The initial assumption was “maybe logs are slow.” The actual bottleneck was gallery-dl fetching metadata for already-downloaded galleries.
  2. Use the server’s pre-computed indexes. Hitomi already has language-specific nozomi files per artist — no need to download the full language index and intersect client-side.
  3. Test names before trusting them. Artist vs group is a real distinction in hitomi’s data model. Automated validation caught all 77 correctly.
  4. Payload limits bite at scale. 100 URLs in a POST is fine. 5,000 hits Express’s default 100KB limit.