Admin access required

Site Map

Architecture diagrams for the Baseline Labs platform.

Product Architecture

The recursive split pattern: core/ holds shared infrastructure, and each product can be deleted independently.

core/ — shared infrastructure api.baselinelabs.ai

shared/ db, config, errors, r2, gpu_pool

api/ auth, admin, log

templates/ shell, nav, footer, head

pages/ login, account, admin/*

scraper/ httpx + Playwright crawler, domain queue

gpu/ Vast.ai pool, vLLM inference workers

Provides: db, auth, config, templates, error logging

markupschema/ www.markupschema.com

api/ schema, schema_generator, scrape, inference, rolling, vastai, blog, docs

pages/ home, dashboard, schema-generator, rolling-schema, getting-started, pricing, blog, docs

admin/ gpu-monitor, scraper-monitor, schema-playground, rolling-monitor

Consumes: scraper, gpu, R2

geo/ baselinelabs.ai

api/ search, parse, reports, brand_reports, backlinks, mentions, analytics, keywords, screenshots, email

pages/ home, dashboard, getting-started, reports, templates, brand-reports, brand-templates, backlinks, analytics, ranking, analysis, mention-scan, resources, about-us, faq, contact

admin/ errors

Consumes: scraper, DataForSEO, GPT / Gemini / Perplexity

Extraction test: take every product/ directory at every depth, plus every shared/ directory above it. The other product and the shared infrastructure must still work.

Docker Services

Six containers on a single VPS. Caddy terminates TLS, and two app pools split fast and heavy traffic.

Internet

Ports 80, 443 (HTTPS via 3 domains) + 8080 (scraper direct IP)

caddy caddy:2-alpine

TLS termination, reverse proxy for 3 domains: www.markupschema.com (MS), baselinelabs.ai (GEO), api.baselinelabs.ai (Core). Routes scrape/inference to bulk, everything else to app. Static asset caching (1hr).

fast → app :8080 | heavy → app-bulk :8081

app :8080

uvicorn, 4 workers. Starts first: compiles pages, runs migrations. Serves all pages, auth, and fast API endpoints. Health check on /api/core/health.

app-bulk :8081

uvicorn, 2 workers. Waits for app healthy. Handles scrape submit/claim/complete, inference, and direct :8080 scraper traffic.

postgres postgres:16-alpine

Database: baseline. Schemas: shared, ms, geo. Volume: pgdata.

redis redis:7-alpine

128 MB, allkeys-lru. No persistence (appendonly no). Job queue + caching.

backup

pg_dump -Fc every hour. 7-day retention. Saves to ./data/.
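
The 7-day retention rule can be sketched as a pruning pass over the backup directory. This is a minimal sketch, not the backup container's actual script: the .dump file extension, the use of file modification time as the dump's age, and the function name expired_dumps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from pathlib import Path

RETENTION = timedelta(days=7)  # from the description above


def expired_dumps(backup_dir: Path, now: datetime) -> list:
    """Return dump files whose modification time is older than the retention window.

    Assumes dumps are written as *.dump files directly inside backup_dir.
    """
    cutoff = now - RETENTION
    return sorted(
        path
        for path in backup_dir.glob("*.dump")
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff
    )


def prune(backup_dir: Path, now: datetime) -> int:
    """Delete expired dumps and return how many files were removed."""
    removed = 0
    for path in expired_dumps(backup_dir, now):
        path.unlink()
        removed += 1
    return removed
```

Calling prune(Path("./data"), datetime.now()) after each hourly dump would keep roughly 168 dumps on disk at steady state (24 per day for 7 days).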

Caddy domain routing

markupschema.com: /api/ms/*, /api/core/* → app | scrape/inference → bulk | clean URLs rewrite to /ms/*

baselinelabs.ai: /api/geo/*, /api/core/* → app | /admin/* → /geo/admin/* | clean URLs rewrite to /geo/*

api.baselinelabs.ai: /api/core/*, /api/geo/*, /api/ms/* → app | core pages + all admin

:8080 direct IP: scraper/GPU workers → bulk :8081
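
The routing table above can be modelled as a small pure function, which is handy for reasoning about where a request lands. This is a sketch under stated assumptions, not the Caddyfile itself: the exact path prefixes for "scrape/inference" are not given in this document, so /api/ms/scrape/ and /api/ms/inference/ are guesses, the bulk rule is applied to every host for simplicity, and the names Route and route are hypothetical.

```python
from typing import NamedTuple


class Route(NamedTuple):
    upstream: str  # "app" (:8080) or "app-bulk" (:8081)
    path: str      # request path after any internal rewrite


# Assumed prefixes for heavy traffic; the document only says "scrape/inference".
BULK_PREFIXES = ("/api/ms/scrape/", "/api/ms/inference/")

# Host -> internal prefix used when rewriting clean page URLs.
PAGE_PREFIX = {
    "www.markupschema.com": "/ms",
    "baselinelabs.ai": "/geo",
}


def route(host: str, path: str, direct_ip: bool = False) -> Route:
    """Decide which app pool serves a request and what path it sees."""
    # Direct-IP :8080 traffic from scraper/GPU workers always goes to bulk.
    if direct_ip:
        return Route("app-bulk", path)
    # Heavy endpoints go to the bulk pool.
    if path.startswith(BULK_PREFIXES):
        return Route("app-bulk", path)
    # All other API calls are served by the fast pool, unrewritten.
    if path.startswith("/api/"):
        return Route("app", path)
    # Clean page URLs are rewritten per product; this also maps
    # baselinelabs.ai/admin/* to /geo/admin/*.
    prefix = PAGE_PREFIX.get(host)
    if prefix is None:
        return Route("app", path)  # api.baselinelabs.ai: core pages served as-is
    return Route("app", prefix + path)
```

For example, route("www.markupschema.com", "/pricing") yields the app pool with path /ms/pricing, while route("baselinelabs.ai", "/admin/errors") yields /geo/admin/errors.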

app depends on postgres + redis (service_started). app-bulk depends on app (service_healthy) + postgres + redis. caddy depends on app + app-bulk.
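
The dependency statement above implies a start order, which can be derived with a topological sort. This is an illustrative sketch using Python's standard graphlib module; it only models which service waits for which, and ignores the difference between the service_started and service_healthy conditions. The backup container is left out because its dependencies are not stated here.

```python
from graphlib import TopologicalSorter

# Service -> services it depends on, as listed above.
DEPENDS_ON = {
    "app": {"postgres", "redis"},
    "app-bulk": {"app", "postgres", "redis"},
    "caddy": {"app", "app-bulk"},
}

# static_order() yields each service only after all of its dependencies.
start_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(start_order)
```

Whatever order postgres and redis come up in, app always precedes app-bulk, and caddy always comes last.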

Network Topology

WireGuard mesh connecting mainframe, scraper, and GPU workers. Cloudflare R2 for stateless object storage.

Mainframe 10.10.0.1

Hetzner CX22

FastAPI (app :8080 + app-bulk :8081)
Caddy (80, 443, 8080)
PostgreSQL 16
Redis 7
Backup (pg_dump hourly)

Scraper 10.10.1.1

Hetzner CX53

httpx + Playwright
100 domain capacity
Polite delays (robots.txt)

GPU Workers 10.10.2.x

Vast.ai spot instances

Qwen3-VL-4B vLLM fp8
Chisel reverse tunnel
Watchdog auto-respawn

Cloudflare R2

Cleaned HTML from scraper
Screenshots
Schema JSON-LD output

scraper → mainframe :8080 direct IP (polls for jobs)
scraper → R2 (uploads cleaned HTML)
GPU → mainframe via Chisel tunnel 9090→8080
GPU ↔ R2 (downloads HTML, uploads schemas)
WireGuard mesh 10.10.0.0/24

All workers are stateless. R2 is the source of truth for crawled content and inference output. Chisel tunnel overhead ~1ms.

Data Flow — Schema Generation

End-to-end flow from URL submission to schema.org JSON-LD output.

User submits URL

User → app :8080 → PostgreSQL

Check schema cache in ms.schema_jobs. If cached and fresh, return immediately.

Cache miss: Insert into shared.scrape_queue (status=pending, trigger_inference=true), return 202 Accepted with job_id.

Scraper crawls page

Scraper (10.10.1.1) → app-bulk :8081 → PostgreSQL + R2

Scraper polls /api/ms/scrape/next-job via direct IP :8080. Claims job atomically (status=claimed). Fetches page with httpx + Playwright. Uploads cleaned HTML to R2. Reports complete — scrape_queue marked done, pushed to ms.inference_queue via Redis.
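
One way to get the atomic claim described above is a conditional UPDATE that only succeeds while the row is still pending. The sketch below uses an in-memory SQLite table so it runs anywhere; the production database is PostgreSQL, where SELECT ... FOR UPDATE SKIP LOCKED is a commonly used alternative. The function name claim_next_job and the reduced column set are assumptions, not the actual /api/ms/scrape/next-job handler.

```python
import sqlite3
from typing import Optional, Tuple


def claim_next_job(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Claim the highest-priority pending job, or return None if none is left."""
    while True:
        row = conn.execute(
            "SELECT id, url FROM scrape_queue WHERE status = 'pending' "
            "ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, url = row
        # The status check in WHERE makes the claim atomic: if another
        # worker claimed the row first, zero rows are updated.
        cur = conn.execute(
            "UPDATE scrape_queue SET status = 'claimed' "
            "WHERE id = ? AND status = 'pending'",
            (job_id,),
        )
        conn.commit()
        if cur.rowcount == 1:
            return job_id, url
        # Lost the race for this row; look for the next pending job.


# Demo with a stand-in for shared.scrape_queue (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrape_queue "
    "(id INTEGER PRIMARY KEY, url TEXT, status TEXT, priority INTEGER)"
)
conn.executemany(
    "INSERT INTO scrape_queue (url, status, priority) VALUES (?, 'pending', ?)",
    [("https://example.com/a", 0), ("https://example.com/b", 5)],
)
conn.commit()

first = claim_next_job(conn)   # higher priority job is claimed first
second = claim_next_job(conn)
third = claim_next_job(conn)   # queue is empty now
```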

GPU runs inference

GPU Worker (10.10.2.x) → Redis → R2 → app-bulk :8081 → PostgreSQL

GPU worker pops from ms.inference_queue. Downloads cleaned HTML from R2. Runs Qwen3-VL-4B vLLM fp8 inference (~1.5s small pages, ~17s large). Uploads schema JSON-LD to R2. Reports complete — logged to shared.inference_log, schema_jobs updated.

User polls for result

User → app :8080 → PostgreSQL

Check ms.schema_jobs status. When complete, return schema URL pointing to R2 result_path.

Domain-aware crawling with polite delays (min 1s, default 2s from robots.txt). Watchdog auto-respawns GPU workers on Vast.ai spot preemption.
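
The polite-delay rule can be sketched as a per-domain throttle. This is one reading of the note above, labelled as an assumption: a Crawl-delay from robots.txt is honoured when present but never allowed below 1s, and 2s is used when robots.txt gives none. The names effective_delay and DomainThrottle are hypothetical. In a real crawler the crawl_delay argument could come from urllib.robotparser.RobotFileParser.crawl_delay().

```python
import time
from typing import Optional
from urllib.parse import urlsplit

MIN_DELAY_S = 1.0      # floor from the note above
DEFAULT_DELAY_S = 2.0  # used when robots.txt sets no Crawl-delay


def effective_delay(crawl_delay: Optional[float]) -> float:
    """Delay between two requests to the same domain, in seconds."""
    if crawl_delay is None:
        return DEFAULT_DELAY_S
    return max(MIN_DELAY_S, float(crawl_delay))


class DomainThrottle:
    """Tracks, per domain, the earliest time the next fetch may start."""

    def __init__(self) -> None:
        self._next_allowed = {}  # domain -> monotonic timestamp

    def wait_time(self, url: str, crawl_delay: Optional[float] = None,
                  now: Optional[float] = None) -> float:
        """Return seconds to wait before fetching url, and reserve that slot."""
        now = time.monotonic() if now is None else now
        domain = urlsplit(url).hostname or ""
        ready_at = max(now, self._next_allowed.get(domain, now))
        # The following request to this domain must wait one more delay.
        self._next_allowed[domain] = ready_at + effective_delay(crawl_delay)
        return ready_at - now
```

Because slots are reserved per domain, requests to different domains never block each other, which is what lets a single scraper keep many domains in flight while staying polite to each one.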

Database Schema

One PostgreSQL database (baseline), three schemas. All code uses schema-qualified table names. 33 tables total (13 shared + 4 ms + 16 geo).

shared.* — 13 tables, cross-product

users

id PK
email UK
google_id UK
name, role, password_hash
is_admin, tier
avatar_path
created_at, last_login

api_keys

id PK
user_id FK→users
key_hash UK
name, rate_limit, active
created_at, last_used

scrape_queue

id PK
url, url_hash UK
status, domain, priority
html_path, html_hash
screenshot_path
source, trigger_inference
error, retry_count
metadata jsonb

inference_log

id PK
url, html_hash
result_path
inference_ms
error
created_at, completed_at

workers

id PK
worker_type, hostname, ip
status, concurrency
jobs_completed, jobs_failed
cpu_percent, memory_mb
gpu_type, gpu_util_percent
last_heartbeat
metadata jsonb

error_logs

id PK
service, level, product
message, stack_trace
request_path, request_method
user_id FK→users
resolved, resolved_at, resolved_by
metadata jsonb

request_logs

id PK
method, path, status_code
response_ms, product
user_id FK→users
ip, timestamp

usage_logs

id PK
api_key_id FK→api_keys
endpoint, status_code
response_ms, cache_hit
timestamp

scrape_passes

id PK
urls_at_start, domains_at_start
urls_completed, urls_harvested
started_at, completed_at

async_operations

id PK
operation_id, operation_type
user_id FK→users
save_id FK→geo_saves
status, progress, current_step
metadata jsonb

subscriptions

id PK
user_id FK→users
stripe_subscription_id
category, plan, status
current_period_start/end

sms_messages

id PK
from_number, to_number
body, direction
twilio_sid
created_at

oauth_codes

id PK
code UK
user_id FK→users
client_id, redirect_uri
scope, expires_at

ms.* — 4 tables, MarkupSchema only

inference_queue

id PK
url, html_hash, html_path
status, worker_id, priority
screenshot_path, result_path
domain, input_mode
error, retry_count
callback_data jsonb

schema_jobs

id PK
user_id FK→users
domain, status
pages_discovered, pages_scraped
pages_generated
urls jsonb
error

rolling_domains

id PK
user_id FK→users
domain UK w/ user
status, pages_per_hour
sitemap_url, last_sitemap_fetch
created_at

rolling_pages

id PK
domain_id FK→rolling_domains
url UK
last_html_hash, last_schema_at
schema_path, status
error

geo.* — 16 tables, GEO Studio only

geo_saves

id PK
user_id FK→users
name, primary_site
business_info jsonb
created_at, updated_at

geo_queries

id PK
user_id FK→users
save_id FK→saves
query_text, search_type
results jsonb
position

report_templates

id PK
user_id FK→users
save_id FK→saves
name, description
queries jsonb

reports

id PK
user_id FK→users
save_id FK→saves
template_id FK→templates
status, engines jsonb

report_query_status

id PK
report_id FK→reports
query_text, engine
status, position, error

brand_report_templates

id PK
user_id FK→users
save_id FK→saves
business_name, industry
products, competitors jsonb

backlinks

id PK
source_url, source_domain
target_url, target_domain
anchor_text, rel, position
first_seen, last_seen
times_seen

mention_profiles

id PK
user_id FK→users
save_id FK→saves
business_name, industry
products, competitors jsonb

mention_scans

id PK
user_id FK→users
profile_id FK→profiles
status, total_queries
completed_queries, mention_count
overall_sentiment_score

mention_queries

id PK
scan_id FK→scans
query_text, layer, intent
status, result_count

mention_raw_results

id PK
scan_id FK→scans
query_id FK→queries
url, title, snippet
rank_in_query, source_domain

mention_records

id PK
scan_id FK→scans
source_url, source_type
sentiment_score, sentiment_label
relevance_score

mention_summaries

id PK
scan_id FK→scans UK
overview jsonb
source_map jsonb
executive_summary jsonb
recommendations jsonb

rankings

id PK
query_id FK→queries
engine, position
url, title

ai_analytics

id PK
user_id FK→users
event_type, feature
metadata jsonb

email_queue

id PK
user_id FK→users
template, to_email
status, sent_at

shared.users → api_keys, usage_logs, error_logs, request_logs (all ON DELETE CASCADE)

shared.users → ms.schema_jobs (user's schema generation jobs)

shared.users → geo.saves, queries, keywords, reports, mention_profiles, mention_scans (all ON DELETE CASCADE)

geo.saves → queries, keywords, reports, mention_profiles, async_operations (all ON DELETE CASCADE)

geo.mention_profiles → scans → queries → raw_results, records (cascade chain)

geo.mention_scans → mention_summaries (1:1, ON DELETE CASCADE)

shared.* = cross-product (13) | ms.* = MarkupSchema only (4) | geo.* = GEO only (16) | All queries use schema-qualified names

Pages + APIs

Three domains, each serving one product context. Pages are config.json + content.html compiled into the shell. APIs are auto-discovered FastAPI routers. Caddy rewrites clean URLs to internal /ms/* and /geo/* paths.

api.baselinelabs.ai

Pages

PAGE /login

PAGE /account

PAGE /console/dashboard

PAGE /console/api-keys

PAGE /console/mcp

Admin pages

PAGE /admin/dashboard

PAGE /admin/errors

PAGE /admin/users

PAGE /admin/request-logs

PAGE /admin/reports

PAGE /admin/docs

PAGE /admin/log

PAGE /admin/metadata

PAGE /admin/sms

PAGE /admin/editor

PAGE /admin/tests

PAGE /admin/mainframe

PAGE /admin/site-map

APIs — /api/core/

API auth

API oauth

API admin

API billing

API sms

API editor

API log

API testing

API mcp

www.markupschema.com

Pages

PAGE /home

PAGE /dashboard

PAGE /schema-generator

PAGE /rolling-schema

PAGE /getting-started

PAGE /pricing

PAGE /blog

PAGE /docs

Admin pages

PAGE /admin/gpu-monitor

PAGE /admin/scraper-monitor

PAGE /admin/schema-playground

PAGE /admin/rolling-monitor

APIs — /api/ms/

API schema

API schema_generator

API scrape

API inference

API rolling

API vastai

API blog

API docs

baselinelabs.ai

Pages

PAGE /home

PAGE /dashboard

PAGE /getting-started

PAGE /reports

PAGE /reports/new

PAGE /reports/view

PAGE /templates

PAGE /templates/edit

PAGE /templates/overview

PAGE /brand-reports

PAGE /brand-reports/new

PAGE /brand-reports/view

PAGE /brand-templates

PAGE /brand-templates/edit

PAGE /backlinks

PAGE /analytics

PAGE /ranking

PAGE /analysis

PAGE /mention-scan

PAGE /resources

PAGE /about-us

PAGE /faq

PAGE /contact

Admin pages

PAGE /admin/errors

APIs — /api/geo/

API search

API parse

API reports

API brand_reports

API backlinks

API mentions

API analytics

API keywords

API screenshots

API email

server.py auto-discovers any .py with a router export. compile.py compiles pages. Caddy rewrites clean URLs: markupschema.com/X → /ms/X, baselinelabs.ai/X → /geo/X internally.
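
Router auto-discovery of this kind can be sketched with the standard importlib machinery. This is a minimal illustration, not the contents of server.py: the function name discover_routers, the flat directory layout, and the rule of skipping files that start with an underscore are assumptions.

```python
import importlib.util
from pathlib import Path


def discover_routers(api_dir: Path) -> dict:
    """Import every .py file in api_dir and collect module-level `router` objects.

    Returns a mapping of module name (file stem) to its router. Files without
    a `router` attribute are imported but ignored.
    """
    found = {}
    for path in sorted(api_dir.glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helpers (assumed convention)
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        router = getattr(module, "router", None)
        if router is not None:
            found[path.stem] = router
    return found
```

In a FastAPI application, each discovered router would then typically be mounted with app.include_router(router, prefix=...), with the prefix derived from the product directory (for example /api/ms or /api/geo). How server.py actually builds that prefix is not described here.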

Billing System

Three layers: Stripe subscriptions, tier limits (hard caps), and unified credits (usage-based). Admin/internal bypass all checks.

Stripe Subscriptions

core/api/billing.py → shared.subscriptions

One subscription per user per category. Three categories: ms_api (Schema API), ms_rolling (Rolling Schema), geo (GEO Studio). Checkout, plan changes, and cancellations via Stripe. Webhooks sync status to DB.

Tier Limits (Hard Caps)

core/shared/credits.py → check_tier_limit()

Per-category caps enforced at resource creation. Not credits — these gate features.

GEO: Free=1 domain, Essentials=3, Classic=10, Select=100
MS API: Free=100 req/day, Starter=1k, Pro=10k
MS Rolling: 2 domains, 100 pages/hour
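
The caps listed above can be expressed as a lookup table plus a single comparison at resource-creation time. This is a sketch only: the real check_tier_limit() signature is not shown in this document, so the function below is deliberately given a different, hypothetical name (within_tier_limit), and the lowercase plan keys are assumptions.

```python
# Hard caps per category and plan, taken from the list above.
# GEO and MS Rolling cap domains; MS API caps requests per day.
TIER_LIMITS = {
    "geo": {"free": 1, "essentials": 3, "classic": 10, "select": 100},
    "ms_api": {"free": 100, "starter": 1_000, "pro": 10_000},
    "ms_rolling": {"rolling": 2},
}


def within_tier_limit(category: str, plan: str, current_count: int) -> bool:
    """True if one more resource may be created under the plan's hard cap.

    Unknown categories or plans are treated as having a cap of zero.
    """
    cap = TIER_LIMITS.get(category, {}).get(plan, 0)
    return current_count < cap
```

A Free GEO user with no domains passes the check, and the same user with one domain already registered does not. Unlike credits, nothing is deducted here; the check only gates creation.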

Unified Credits (Usage-Based)

core/shared/credits.py → require_credits()

Single credit pool per user across all products. Deducted atomically before each operation. 402 response if insufficient.

Sources: Monthly subscription grant (invoice.paid webhook), top-up purchase, admin grant
Sinks: Search results (2-28 credits/result by engine), brand scans (50-200), reports, schema generation (100)

Enforcement Points

All auth paths → same user_id → same credit balance

Browser cookies, API keys, and MCP/OAuth tokens all resolve to the same user_id. Credit checks and tier limits apply identically regardless of auth method. Admin users (is_admin=true) and internal API keys bypass all checks — operations are logged but not deducted.

Credit Tables (shared.*)

credit_balances user_id PK, balance, updated_at

credit_ledger append-only audit: amount, source, category, operation, reference_id

subscriptions user_id + category UK, stripe IDs, status, period dates

One balance per user | cross-product | atomic deducts

MS Tiers + Credits

Free 100 req/day, 50 credits/mo

Starter 1k req/day, 500 credits/mo — €29

Pro 10k req/day, 2k credits/mo — €99

Rolling 2 domains, 100 pg/hr — €19

GEO Tiers + Credits

Free 1 domain, 50 credits/mo

Essentials 3 domains, 1k credits/mo — €19

Classic 10 domains, 5k credits/mo — €39

Select 100 domains, 50k credits/mo — €59

Webhook flow: invoice.paid → _grant_monthly_credits() → grant_credits() → credit_balances += monthly_credits.
Usage flow: API call → require_credits(amount) → atomic UPDATE ... WHERE balance >= amount → credit_ledger entry → 402 if insufficient.
Top-ups: separate Stripe product → same grant_credits() path → same balance pool.
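
The usage flow hinges on a conditional UPDATE that both checks and deducts the balance in one statement. The sketch below demonstrates that pattern against an in-memory SQLite database so it is runnable as-is; production uses PostgreSQL, and the real require_credits() may differ. The function name deduct_credits, the InsufficientCredits exception, and the reduced ledger columns are assumptions for illustration.

```python
import sqlite3


class InsufficientCredits(Exception):
    """Would map to an HTTP 402 response in the API layer."""


def deduct_credits(conn: sqlite3.Connection, user_id: int,
                   amount: int, operation: str) -> None:
    """Atomically deduct credits and append a ledger row, or raise."""
    with conn:  # one transaction: balance change and ledger row commit together
        cur = conn.execute(
            "UPDATE credit_balances SET balance = balance - ? "
            "WHERE user_id = ? AND balance >= ?",
            (amount, user_id, amount),
        )
        if cur.rowcount != 1:
            # No row matched: unknown user or balance below the amount.
            raise InsufficientCredits(f"user {user_id} cannot spend {amount}")
        conn.execute(
            "INSERT INTO credit_ledger (user_id, amount, source, operation) "
            "VALUES (?, ?, 'usage', ?)",
            (user_id, -amount, operation),
        )


# Demo with stand-ins for shared.credit_balances and shared.credit_ledger.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit_balances (user_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE credit_ledger (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "amount INTEGER, source TEXT, operation TEXT)"
)
conn.execute("INSERT INTO credit_balances VALUES (1, 150)")
conn.commit()

deduct_credits(conn, 1, 100, "schema_generation")  # 150 -> 50
try:
    deduct_credits(conn, 1, 100, "schema_generation")  # only 50 left
    second_ok = True
except InsufficientCredits:
    second_ok = False

balance = conn.execute(
    "SELECT balance FROM credit_balances WHERE user_id = 1"
).fetchone()[0]
```

Because the balance check lives in the WHERE clause, two concurrent requests cannot both spend the same credits: whichever UPDATE runs second sees the reduced balance and matches zero rows.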
