Building an Automated Hitomi Gallery Sync System
The Problem
Manual workflow for downloading from hitomi.la: copy artist name, go to the site, type it in the search bar, click the autocomplete suggestion, hit enter, copy the URL, paste into the downloader queue. Repeat for every artist and group. No way to check for new uploads without redoing the entire ritual.
With 77 tracked artists/groups, this was unsustainable.
Existing Infrastructure
Three-tier Node.js backend already running as systemd user services:
- Downloader (port 29748) — Express + WebSocket queue manager wrapping gallery-dl (Python). Manages sequential downloads with marker files to hide incomplete galleries from the indexer.
- Indexer (port 29750) — In-memory search engine over downloaded galleries. Scans `~/Pictures/gallery-dl/hitomi/` via a compiled Go binary that parses `info.json` metadata files.
- Streamer (port 29749) — Reverse proxy + sprite generator. Serves the SvelteKit frontend, proxies to indexer/downloader, generates WebP thumbnail strips on demand.
gallery-dl stores a download archive in ~/Pictures/gallery-dl/hitomi.sqlite3 — a single table (archive) with entries like hitomi{gallery_id}_{image_num}. 1,922 galleries (373K image entries) at the start of this session.
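Those archive entries encode the gallery ID directly, which the sync script later relies on. A minimal sketch of loading the distinct gallery IDs, assuming gallery-dl's usual single-column `archive` table:

```python
import re
import sqlite3
from pathlib import Path

ARCHIVE_DB = Path.home() / "Pictures/gallery-dl/hitomi.sqlite3"

def load_archived_ids(db_path: Path = ARCHIVE_DB) -> set[int]:
    """Collect every gallery ID that already has at least one archive entry."""
    con = sqlite3.connect(db_path)
    try:
        ids = set()
        for (entry,) in con.execute("SELECT entry FROM archive"):
            # Entries look like "hitomi{gallery_id}_{image_num}"
            m = re.match(r"hitomi(\d+)_", entry)
            if m:
                ids.add(int(m.group(1)))
        return ids
    finally:
        con.close()
```

373K image rows collapse to the 1,922 distinct gallery IDs.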
Step 1: Understanding the Search — Dry-Run Tag Analysis
Before building anything, needed to understand how hitomi’s search works and whether filtering by tags was viable. Built a dry-run script that:
- Used hitomi’s nozomi binary indexes to resolve a search query to gallery IDs
- Fetched metadata for each gallery in parallel (10 threads)
- Collected all tags into a CSV report
Tested with `language:japanese female:cheating -female:mother`: 11,190 galleries, 1,143 unique tags. Opened in LibreOffice to browse. Conclusion: too many tags to filter effectively — better to just track specific artists/groups.
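The metadata side of that dry-run is just a thread pool over the `galleries/{id}.js` endpoint described in the next step. A sketch — the `var galleryinfo = {...}` parsing and the helper names are assumptions, not the actual script:

```python
import csv
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

META_URL = "https://ltn.gold-usergeneratedcontent.net/galleries/{}.js"

def fetch_tags(gallery_id: int) -> list[str]:
    """Fetch one gallery's metadata and return its tag names."""
    with urlopen(META_URL.format(gallery_id), timeout=30) as resp:
        body = resp.read().decode("utf-8")
    # Assumption: the .js payload is a single "var galleryinfo = {...}" assignment.
    info = json.loads(body.partition("=")[2].strip().rstrip(";"))
    return [t["tag"] for t in info.get("tags") or []]

def tag_report(gallery_ids, out_path="tags.csv", workers=10):
    """Count tags across all galleries with a 10-thread pool and write a CSV report."""
    counts = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for tags in pool.map(fetch_tags, gallery_ids):
            counts.update(tags)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["tag", "galleries"])
        writer.writerows(counts.most_common())
```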
Key discovery: no server-side limit on negative filters. The search is entirely client-side — fetch nozomi index files (binary arrays of 4-byte big-endian gallery IDs), do set intersection/difference locally. Each negative term is just one more HTTP request for an index file.
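In code, resolving a query is one `.nozomi` fetch per term plus set algebra on the decoded IDs. A rough sketch assuming only the 4-byte big-endian format described above; the caller supplies whatever index URLs its terms map to:

```python
import struct
from urllib.request import urlopen

def fetch_nozomi(url: str) -> set[int]:
    """Download a .nozomi index and decode it into a set of gallery IDs."""
    with urlopen(url, timeout=30) as resp:
        data = resp.read()
    count = len(data) // 4  # each gallery ID is a 4-byte big-endian unsigned int
    return set(struct.unpack(f">{count}I", data))

def resolve(positive_urls, negative_urls=()):
    """Intersect the positive indexes, then subtract each negative one."""
    result: set[int] = set()
    for i, url in enumerate(positive_urls):
        ids = fetch_nozomi(url)
        result = ids if i == 0 else result & ids
    for url in negative_urls:
        result -= fetch_nozomi(url)  # each negative term is just one more request
    return result
```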
Step 2: The Bottleneck — Why gallery-dl Is Slow for Checking
When gallery-dl processes a search URL, it fetches metadata for every matching gallery — even already-downloaded ones — before checking the archive. Per gallery:
- Fetch `ltn.gold-usergeneratedcontent.net/galleries/{id}.js` — metadata
- Fetch `gg.js` — CDN routing table
- Resolve image URLs
- Check each image against the SQLite archive
- Skip if already downloaded
For an artist with 50 galleries where all are downloaded, that’s still 50+ HTTP requests just to discover nothing’s new. Multiply by 77 artists and a daily check becomes thousands of wasted requests.
Step 3: Key Insight — Language-Specific Nozomi Indexes
First attempt at the sync script used the search extractor’s approach: fetch language:japanese nozomi (523,786 IDs!), fetch artist:X nozomi, intersect client-side. This worked but was wasteful — the massive language index download dominated the runtime (5.1 seconds for just 2 entries).
Then realized: hitomi already has pre-filtered, language-specific indexes for every artist and group:
```
https://ltn.gold-usergeneratedcontent.net/artist/dokuneko_noil-japanese.nozomi
https://ltn.gold-usergeneratedcontent.net/group/hokkyoku_hotaru-japanese.nozomi
```
These are the same indexes used by hitomi’s tag pages (e.g., hitomi.la/artist/dokuneko_noil-japanese.html). One HTTP request per artist returns exactly the Japanese gallery IDs we need. No intersection required.
This dropped the check for 2 entries from 5.1s to 1.3s. For 77 entries: 45 seconds total.
Step 4: Building sync.py
The sync script (hitomi-backend/sync.py):
- Reads `artists.txt` — one entry per line (`artist:name` or `group:name`)
- Loads all archived gallery IDs from SQLite into a set (1,922 entries, instant)
- For each entry, fetches the language-specific nozomi index (1 HTTP request)
- Decodes binary: `struct.unpack(f'>{count}I', data)`
- Set difference against archived IDs
- With `--queue`: POSTs only new gallery URLs to the downloader
```
# artists.txt — defaults to japanese
artist:dokuneko_noil
group:hokkyoku_hotaru
language:korean artist:some_name   # override language
```
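A condensed sketch of that flow, reusing `load_archived_ids` and `fetch_nozomi` from the earlier snippets. The `/queue` body shape (`{"urls": [...]}`) is an assumption about the downloader's API, not its actual contract:

```python
import json
import ssl
from urllib.request import Request, urlopen

NOZOMI_URL = "https://ltn.gold-usergeneratedcontent.net/{kind}/{name}-{lang}.nozomi"
QUEUE_URL = "https://localhost:29748/queue"  # downloader endpoint; body shape assumed

def parse_entry(line: str):
    """'artist:foo', 'group:bar', or 'language:korean artist:foo' -> (kind, name, lang)."""
    parts = line.split()
    lang = "japanese"
    if parts[0].startswith("language:"):
        lang = parts.pop(0).split(":", 1)[1]
    kind, name = parts[0].split(":", 1)
    return kind, name, lang

def sync(entries, archived: set[int], queue: bool = False) -> list[str]:
    new_urls = []
    for raw in entries:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        kind, name, lang = parse_entry(line)
        ids = fetch_nozomi(NOZOMI_URL.format(kind=kind, name=name, lang=lang))
        new_urls += [f"https://hitomi.la/galleries/{gid}.html" for gid in sorted(ids - archived)]
    if queue and new_urls:
        ctx = ssl.create_default_context()   # the downloader serves HTTPS with a
        ctx.check_hostname = False           # local self-signed cert, hence the
        ctx.verify_mode = ssl.CERT_NONE      # `curl -sk` seen in Step 7
        req = Request(QUEUE_URL, data=json.dumps({"urls": new_urls}).encode(),
                      headers={"Content-Type": "application/json"})
        urlopen(req, context=ctx)
    return new_urls
```

Without `queue=True` it only returns the new URLs — the preview behaviour the dry-run command in Step 8 relies on.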
Performance Comparison
| Approach | HTTP requests for 77 entries (7,063 galleries) | Time |
|---|---|---|
| gallery-dl checking everything | ~7,063 (1 per gallery for metadata) | many minutes |
| sync.py with language:japanese intersect | 79 (77 + 2 for language index) | ~3 minutes |
| sync.py with language-specific nozomi | 77 (1 per entry) | 45 seconds |
Step 5: Validating 77 Artist/Group Names
Not all names are artists — some are groups (circles). Hitomi has separate indexes. A test script tried both artist/{name}-japanese.nozomi and group/{name}-japanese.nozomi for each name. Results:
- 404 on one, data on the other → clear choice (72 names)
- Both exist → pick whichever has more galleries (5 names: hokkyoku_hotaru, nabe, takaya, sasanoha_toro, gabugabu)
- 77/77 resolved successfully
Output: a validated artists.txt with correct artist: or group: prefixes for each entry.
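The probe itself is two requests per name, reusing `fetch_nozomi` from above; the tie-break matches the rule just described (more galleries wins):

```python
from urllib.error import HTTPError

PROBE_URL = "https://ltn.gold-usergeneratedcontent.net/{kind}/{name}-japanese.nozomi"

def classify(name: str):
    """Return ('artist' or 'group', gallery_count), or (None, 0) if neither index exists."""
    counts = {}
    for kind in ("artist", "group"):
        try:
            counts[kind] = len(fetch_nozomi(PROBE_URL.format(kind=kind, name=name)))
        except HTTPError as err:
            if err.code != 404:
                raise
    if not counts:
        return None, 0
    best = max(counts, key=counts.get)  # if both exist, keep whichever has more galleries
    return best, counts[best]
```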
Step 6: Deployment Surprise — 413 Payload Too Large
First production run queued 5,423 new galleries. The POST /queue with all URLs failed:
ERROR posting to downloader: HTTP Error 413: Payload Too Large
Express’s body-parser defaults to 100KB. 5,423 URLs as newline-separated text is ~250KB. Fix: `bodyParser.json({ limit: '5mb' })`.
Rebuilt downloader, restarted service, re-ran sync. Queued successfully.
Step 7: Confirming Downloads Work
Verified the pipeline end-to-end:
```
$ curl -sk https://localhost:29748/status | python3 -c "..."
Running: True
Current: https://hitomi.la/galleries/2367319.html
Queue: 5003 remaining
```
Gallery 3733668 Giri no Haha appeared in ~/Pictures/gallery-dl/hitomi/ with 35 images + thumbnails + info.json.
Are Logs a Bottleneck?
No. The downloader emits gallery-dl stdout/stderr via Socket.IO. With no browser clients connected (the normal case during automated sync), io.emit() to zero clients is a no-op. No serialization, no buffering, no disk I/O. The node process doesn’t even log download output to the systemd journal.
The only bottleneck is image downloads — pure network I/O. gallery-dl downloads images sequentially (one gallery at a time, images sequential within), with a fast SQLite archive check per image.
Step 8: Systemd Timer for Daily Automation
```ini
# ~/.config/systemd/user/hitomi-sync.timer
[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
```
`Persistent=true` catches up on missed runs if the machine was off at 4 AM. The service is `Type=oneshot` — it runs `sync.py --queue` and exits.
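For reference, a sketch of the matching service unit — the working directory and venv path are assumptions about where the backend lives:

```ini
# ~/.config/systemd/user/hitomi-sync.service (sketch — adjust paths)
[Unit]
Description=Queue new hitomi galleries for tracked artists/groups

[Service]
Type=oneshot
WorkingDirectory=%h/hitomi-backend
ExecStart=%h/hitomi-backend/gallery-dl/.venv/bin/python3 sync.py --queue
```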
```bash
# Check next scheduled run
systemctl --user list-timers hitomi-sync.timer

# Manual trigger
systemctl --user start hitomi-sync.service

# View past run logs
journalctl --user -u hitomi-sync.service

# Dry-run (preview only, no downloads)
gallery-dl/.venv/bin/python3 sync.py
```
Architecture Summary
```
artists.txt (77 entries)
  │
  ▼
sync.py (daily at 04:00)
  │ 1. Read entries
  │ 2. Fetch nozomi index per entry (77 HTTP reqs, 45s)
  │ 3. Diff against SQLite archive
  │ 4. POST new gallery URLs
  ▼
Downloader (port 29748)
  │ Queue manager spawns gallery-dl per gallery
  │ Marker files hide in-progress from indexer
  ▼
~/Pictures/gallery-dl/hitomi/{id} {title}/
  ├── images + thumbnails
  ├── info.json
  └── archive entry in hitomi.sqlite3
```
Files Changed
- `hitomi-backend/sync.py` — the sync script (no dependencies beyond stdlib)
- `hitomi-backend/artists.txt` — 77 tracked artists/groups
- `hitomi-backend/downloader/src/main.ts` — bumped body-parser limit to 5mb
- `hitomi-backend/downloader/src/config.ts` — relative paths instead of hardcoded
- `~/.config/systemd/user/hitomi-sync.service` — oneshot service
- `~/.config/systemd/user/hitomi-sync.timer` — daily timer
Lessons
- Profile before optimizing. The initial assumption was “maybe logs are slow.” The actual bottleneck was gallery-dl fetching metadata for already-downloaded galleries.
- Use the server’s pre-computed indexes. Hitomi already has language-specific nozomi files per artist — no need to download the full language index and intersect client-side.
- Test names before trusting them. Artist vs group is a real distinction in hitomi’s data model. Automated validation caught all 77 correctly.
- Payload limits bite at scale. 100 URLs in a POST is fine. 5,000 hits Express’s default 100KB limit.