This is a technical deep-dive into how BirJob works under the hood. If you're a developer interested in web scraping at scale, data engineering, or just curious about what it takes to aggregate 91 job sites daily — this is for you.
Every day at 12:00 Baku time, a GitHub Actions workflow kicks off. Within 15–20 minutes, it visits 91 different job websites across Azerbaijan, extracts every listing it can find, deduplicates them, and upserts them into a PostgreSQL database. By 12:20, BirJob's users are browsing fresh data from 77+ live sources — without knowing or caring that the data came from dozens of different websites with completely different structures.
This is the story of how that system works, what broke along the way, and what I learned building it.
The Architecture
BirJob's scraper is a Python application that runs inside a Docker container on GitHub Actions. The core stack:
- aiohttp — async HTTP client for non-blocking requests across 91 sources simultaneously
- BeautifulSoup + lxml — HTML parsing for sites that serve static content
- Playwright — headless browser for JavaScript-rendered SPAs that don't have APIs
- pandas — DataFrame operations for cleaning, normalizing, and deduplicating job data
- psycopg2 + asyncpg — PostgreSQL drivers for the upsert pipeline
- GitHub Actions — free CI/CD that runs the whole thing on a cron schedule
The design philosophy is simple: every scraper is independent, every scraper extends a base class, and the system should survive individual failures without affecting the rest.
The Base Scraper Pattern
Every scraper in BirJob extends BaseScraper. This class provides three things that every scraper needs: async HTTP fetching with retries, database operations, and error handling.
Async Fetching with Retry Logic
The fetch_url_async method is the workhorse. It handles:
- Exponential backoff with jitter — when a site returns 429 (rate limited) or 503 (server overloaded), the scraper waits exponentially longer between retries: 3 seconds, then 9, then 27 — plus a random 2–5 second jitter to avoid thundering herd problems
- User-Agent rotation — 10 different browser User-Agent strings, randomly selected per request. Some sites block requests that look like bots
- Encoding detection — not every Azerbaijani website serves UTF-8. The chardet library auto-detects encoding for sites that serve Windows-1252, ISO-8859-1, or other encodings
- Timeout escalation — 45 seconds locally, 60 seconds in CI (GitHub Actions has more latency but also more patience)
- Connection pooling — 100 total TCP connections, 15 per host, with DNS caching and keepalive
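The backoff-with-jitter loop above can be sketched roughly like this. This is an illustrative sketch, not BirJob's actual code: the function name, the fetch callable, and the delay parameters are all assumptions; the real method wraps aiohttp and also inspects status codes like 429 and 503.

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, retries=3, base_delay=3.0, jitter=(2.0, 5.0)):
    """Retry a fetch callable with exponential backoff plus random jitter.

    Delays grow as base_delay * 3**attempt (3s, 9s, 27s by default),
    with 2-5 seconds of jitter added to avoid thundering-herd retries.
    """
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller's error handling take over
            delay = base_delay * (3 ** attempt) + random.uniform(*jitter)
            await asyncio.sleep(delay)

async def demo():
    calls = {"n": 0}

    async def flaky(url):
        # Simulated server that fails twice, then responds
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated 503")
        return "<html>ok</html>"

    # Tiny delays so the demo finishes instantly
    return await fetch_with_retry(flaky, "https://example.com",
                                  base_delay=0.001, jitter=(0, 0.001))

print(asyncio.run(demo()))  # <html>ok</html>
```

The jitter matters as much as the exponent: if 91 scrapers all retry on the same schedule, the retries themselves become a burst.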
Every HTTP request goes through this method. No scraper makes raw requests. This means every scraper automatically gets retries, rotation, and encoding handling for free.
The Error Handler Decorator
Every scrape method is wrapped with @scraper_error_handler. If a scraper throws any exception — network timeout, parsing error, site structure change — the decorator catches it, logs the error, and returns an empty DataFrame with error metadata attached. The system continues. One broken scraper never takes down the other 90.
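A minimal sketch of what such a decorator can look like. The column names and the use of DataFrame.attrs for error metadata are assumptions for illustration, not BirJob's actual implementation:

```python
import functools
import pandas as pd

def scraper_error_handler(scrape_fn):
    """Catch-everything decorator: on any exception, log it and return an
    empty DataFrame tagged with error metadata instead of crashing."""
    @functools.wraps(scrape_fn)
    def wrapper(*args, **kwargs):
        try:
            return scrape_fn(*args, **kwargs)
        except Exception as exc:
            print(f"[{scrape_fn.__name__}] failed: {exc}")
            empty = pd.DataFrame(columns=["company", "vacancy", "apply_link"])
            empty.attrs["error"] = str(exc)  # metadata travels with the frame
            return empty
    return wrapper

@scraper_error_handler
def scrape_broken_site():
    raise TimeoutError("site did not respond")

df = scrape_broken_site()
print(len(df), df.attrs["error"])  # 0 site did not respond
```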
91 Scrapers, 91 Different Problems
Here's the thing about scraping 91 websites: no two sites work the same way. Some have clean REST APIs. Some serve static HTML. Some are Next.js SPAs that render everything in JavaScript. Some use GraphQL. Some change their structure every few months. Here are four real examples from BirJob:
1. The Clean API (Azercell)
Azercell's career page uses EasyHire, which exposes a JSON API. The scraper hits /job/search?json=true&page=N, paginates through results, and extracts structured data. This is the dream scenario — reliable, fast, and unlikely to break.
But even clean APIs have gotchas. In February 2026, I discovered that the API uses id, not _id, for job identifiers. A one-character difference, and every apply link was wrong. Scrapers break in subtle ways.
2. The GraphQL SPA (Boss.az)
Boss.az is built with Next.js and Apollo GraphQL. There's no static HTML to parse — the page is a shell that loads data via GraphQL queries. The scraper sends POST requests to /graphql with query strings, but here's the challenge: the GraphQL schema isn't documented, and it changes without notice.
The solution? Try three different query formats and use whichever one returns data. It's not elegant, but it's resilient. When Boss.az inevitably changes their schema again, there's a chance one of the fallback queries still works.
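The fallback idea can be sketched as a loop over candidate query formats. The query strings, field names, and the post callable below are all hypothetical; the point is the control flow, not the schema:

```python
def first_working_query(post, candidate_queries):
    """Try each GraphQL query format in order and return the first response
    that actually contains data. `post` is any callable that sends the query
    and returns the decoded JSON response (or raises on failure)."""
    for query in candidate_queries:
        try:
            result = post(query)
        except Exception:
            continue  # network error or HTTP 400 on a stale schema
        if result and result.get("data"):
            return result
    return None  # every known format failed: the schema changed again

# Stub transport simulating a schema where only the second format still works
def fake_post(query):
    if "vacanciesV2" in query:
        return {"data": {"vacanciesV2": [{"title": "Backend Developer"}]}}
    return {"errors": [{"message": "Cannot query field"}]}

queries = [
    "{ vacancies { title } }",    # old schema (hypothetical)
    "{ vacanciesV2 { title } }",  # current schema (hypothetical)
]
print(first_working_query(fake_post, queries))
```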
3. The Shape-Shifting Page (KPMG)
KPMG's careers page has been redesigned at least three times since I started scraping it. Each redesign changes the HTML structure completely. The current scraper has a three-tier fallback strategy:
- Try parsing <h6> elements with nested links (current structure)
- If that fails, try li.cmp-text-list__item elements (previous structure)
- If that fails, try the old table layout (original structure)
One of these three strategies will probably work for any given redesign. This approach has saved me from emergency scraper fixes multiple times.
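A simplified sketch of the three-tier fallback. The selectors follow the article, but the markup samples and return shape are invented for illustration:

```python
from bs4 import BeautifulSoup

def parse_jobs(html):
    """Try the current structure first, then each previous one in turn."""
    soup = BeautifulSoup(html, "html.parser")

    # Tier 1: current structure - <h6> elements with nested links
    jobs = [h6.a.get_text(strip=True) for h6 in soup.select("h6") if h6.a]
    if jobs:
        return "h6", jobs

    # Tier 2: previous structure - styled list items
    jobs = [li.get_text(strip=True) for li in soup.select("li.cmp-text-list__item")]
    if jobs:
        return "list", jobs

    # Tier 3: original structure - table cells
    jobs = [td.get_text(strip=True) for td in soup.select("table td")]
    return ("table", jobs) if jobs else (None, [])

current = '<h6><a href="/j/1">Audit Intern</a></h6>'
legacy = '<ul><li class="cmp-text-list__item">Tax Consultant</li></ul>'
print(parse_jobs(current))  # ('h6', ['Audit Intern'])
print(parse_jobs(legacy))   # ('list', ['Tax Consultant'])
```

Each tier only fires when the previous one finds nothing, so a redesign that resurrects an older layout is handled automatically.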
4. The CSS Hash Problem (TABIB)
Azerbaijan's healthcare portal (TABIB) is a Next.js app that uses CSS modules. CSS module class names include hashes that change every deployment — .vacancy_abc123 becomes .vacancy_def456 tomorrow. Selecting elements by class name is useless.
The fix: ignore CSS entirely and extract data from Next.js's __NEXT_DATA__ JSON object, which is embedded in every page. It's the same data the page uses to render itself, but in structured JSON format. Stable across deployments.
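Extracting __NEXT_DATA__ needs nothing more than a regex and json.loads. The script tag format is standard Next.js; the payload shape (props.pageProps.vacancies) is an assumption for illustration:

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html):
    """Pull the __NEXT_DATA__ JSON blob out of a Next.js page."""
    match = NEXT_DATA_RE.search(html)
    return json.loads(match.group(1)) if match else None

page = (
    '<div class="vacancy_abc123">...</div>'  # hashed class: useless for selectors
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"vacancies": [{"title": "Nurse"}]}}}'
    "</script>"
)
data = extract_next_data(page)
print(data["props"]["pageProps"]["vacancies"][0]["title"])  # Nurse
```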
The Orchestration Layer
The ScraperManager coordinates all 91 scrapers. It handles concurrency, timing, health checks, and failure tracking.
Concurrency Control
Running 91 scrapers simultaneously would be fast — and would also get you IP-banned from half the internet. The system uses an asyncio.Semaphore to limit concurrent scrapers:
- Local development: 10 concurrent scrapers
- GitHub Actions: 2 concurrent scrapers
Why only 2 in CI? GitHub Actions runs on shared IP ranges. If 10 scrapers hit different sites simultaneously from the same IP, some sites interpret it as a bot attack and block the IP. Reducing concurrency to 2, with random 3–8 second delays between starts, keeps things under the radar.
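The semaphore pattern is a few lines of asyncio. This demo shortens the 3-8 second stagger to milliseconds and tracks peak concurrency to show the limit holding; the function names are illustrative:

```python
import asyncio
import random

async def run_scrapers(scrapers, limit=2):
    """Run scraper coroutines with at most `limit` in flight, plus a small
    random stagger before each start (shortened here from the real 3-8s)."""
    sem = asyncio.Semaphore(limit)

    async def guarded(scrape):
        async with sem:
            await asyncio.sleep(random.uniform(0.001, 0.005))  # staggered start
            return await scrape()

    return await asyncio.gather(*(guarded(s) for s in scrapers))

# Demo: verify that concurrency never exceeds the limit
state = {"now": 0, "peak": 0}

def make_scraper(name):
    async def scrape():
        state["now"] += 1
        state["peak"] = max(state["peak"], state["now"])
        await asyncio.sleep(0.01)  # pretend to fetch and parse
        state["now"] -= 1
        return name
    return scrape

results = asyncio.run(run_scrapers([make_scraper(f"site{i}") for i in range(6)], limit=2))
print(sorted(results), "peak concurrency:", state["peak"])
```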
Health Checks
Before scraping a site, the manager can check if it's even responsive. If a site has been consistently down, it gets skipped entirely — saving 60 seconds of timeout per dead site. With 15+ disabled scrapers, this saves minutes per run.
Database-Driven Configuration
Disabled scrapers are stored in a scraper_config table in the database, not hardcoded. This means I can disable a broken scraper by updating a database row, without touching code or redeploying. When the site comes back online, re-enable it the same way.
The Deduplication Pipeline
When you aggregate from 91 sources, duplicates are inevitable. The same job appears on multiple job boards. The same company posts variations of the same role. BirJob handles this in four layers:
Layer 1: Within Each Scraper
Before a scraper returns its DataFrame, duplicate apply_link values are dropped. This catches the case where a paginated API returns the same job on multiple pages.
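In pandas this layer is one call. The column names here are assumptions for illustration:

```python
import pandas as pd

# A paginated API returned the same job on pages 1 and 2
df = pd.DataFrame({
    "company": ["Azercell", "Azercell", "Kapital Bank"],
    "vacancy": ["iOS Developer", "iOS Developer", "Data Analyst"],
    "apply_link": ["https://example.com/j/1", "https://example.com/j/1",
                   "https://example.com/j/2"],
})

# Layer 1: drop duplicate apply_link values before the frame leaves the scraper
deduped = df.drop_duplicates(subset="apply_link", keep="first").reset_index(drop=True)
print(len(deduped))  # 2
```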
Layer 2: Title Normalization
A normalize_title function strips noise from job titles before comparison:
- "Frontend Developer (React)" → "frontend developer"
- "Senior Python Developer - Bakı" → "senior python developer"
- "JAVA DEVELOPER!!!" → "java developer"
Parenthetical content, location suffixes, excessive punctuation, and extra whitespace are all removed. Then a compound key of normalized_company::normalized_title catches near-duplicates from different sources.
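A sketch of the normalization, using the examples from the list above. The exact rules in BirJob may differ; this is one plausible implementation:

```python
import re

def normalize_title(title):
    """Strip noise from a job title before comparison."""
    t = title.lower()
    t = re.sub(r"\([^)]*\)", " ", t)     # drop parenthetical content
    t = re.sub(r"\s[-–—]\s.*$", " ", t)  # drop " - location" style suffixes
    t = re.sub(r"[!?.,:;]+", " ", t)     # strip excessive punctuation
    t = re.sub(r"\s+", " ", t).strip()   # collapse whitespace
    return t

for raw in ["Frontend Developer (React)",
            "Senior Python Developer - Bakı",
            "JAVA DEVELOPER!!!"]:
    print(normalize_title(raw))

# The compound key then catches near-duplicates across sources:
key = f"{normalize_title('Azercell Telecom')}::{normalize_title('iOS Developer (Senior)')}"
print(key)  # azercell telecom::ios developer
```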
Layer 3: Database UPSERT
The database has a unique constraint on apply_link. When a job is inserted and the link already exists, it becomes an UPDATE instead — refreshing the last_seen_at timestamp and keeping is_active = true. This is the authoritative deduplication layer.
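The UPSERT pattern maps directly onto PostgreSQL's INSERT ... ON CONFLICT. The table and column names below are a hypothetical schema reconstructed from the article, not BirJob's actual DDL:

```python
# Hypothetical schema: the real table and column names may differ.
UPSERT_SQL = """
INSERT INTO jobs (company, vacancy, apply_link, source, last_seen_at, is_active)
VALUES (%(company)s, %(vacancy)s, %(apply_link)s, %(source)s, now(), true)
ON CONFLICT (apply_link) DO UPDATE
SET last_seen_at = now(),
    is_active    = true;
"""

# With psycopg2 this would run as, e.g.:
#   cursor.execute(UPSERT_SQL, {"company": "Azercell", "vacancy": "iOS Developer",
#                               "apply_link": "https://...", "source": "azercell"})
print("ON CONFLICT (apply_link)" in UPSERT_SQL)  # True
```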
Layer 4: Source-Level Deactivation
After each run, jobs from a given source that weren't seen in the current run get marked as is_active = false. If Azercell had 50 jobs yesterday but only shows 45 today, the 5 missing ones get deactivated. They're not deleted — they're historically preserved but hidden from search results.
Running on GitHub Actions (For Free)
BirJob's scraper runs entirely on GitHub Actions' free tier. The workflow:
- Trigger: Cron schedule — 0 8 * * * UTC (12:00 Baku time), every day
- Build: Docker image with all Python dependencies, including Playwright browsers
- Execute: Run all scrapers with concurrency limits
- Persist: UPSERT results into PostgreSQL (hosted externally)
- Notify: Send a Telegram message with the run summary
The entire pipeline — build, scrape 91 sites, process data, write to DB, send notification — completes in 15–20 minutes. Well within GitHub Actions' free tier limits.
The Telegram Notification
After every run, a Telegram bot sends me a report:
- How many scrapers succeeded and how many jobs each found
- Which scrapers failed and why (timeout, blocked, 404, parsing error)
- Database stats: new jobs added, existing jobs updated, stale jobs deactivated
- Total run duration
- User engagement metrics: total users, new subscribers, top search keywords
This daily digest is how I know if something's broken before users notice. If a major source suddenly returns 0 jobs, I see it in the Telegram report and can investigate.
What Breaks and Why
In two years of running this system, here's every category of failure I've encountered:
Site Redesigns
The most common failure. A company redesigns their careers page, changing CSS classes, HTML structure, or URL patterns. The scraper's selectors no longer match anything and it returns 0 jobs. Fix: update the selectors. Prevention: use multiple selector strategies (like the KPMG approach) or extract from data layers (__NEXT_DATA__, embedded JSON) instead of HTML structure.
IP Blocking
GitHub Actions runs on well-known IP ranges. Some sites (Djinni, Boss.az, HRCBaku) block these IPs outright. The scraper gets connection resets or 403 errors. There's no good fix for this except using a proxy — which adds cost and complexity. For now, these scrapers are disabled in CI and run manually when needed.
Cloudflare Protection
Several sites (Workly.az, Jobing.az) sit behind Cloudflare's bot detection. Even with realistic User-Agent headers, Cloudflare's JavaScript challenge blocks automated requests. Playwright can sometimes bypass this, but Cloudflare's detection is continuously improving. This is an arms race I'm not trying to win.
API Deprecation
ProJobs killed their API entirely — core.projobs.az/v1/vacancies started returning 404 one day with no announcement. The scraper now tries multiple API endpoint candidates, but when none work, it's disabled until a new API surface is discovered.
Encoding Nightmares
Some Azerbaijani websites serve content in Windows-1252 or ISO-8859-9 (Turkish encoding) with incorrect Content-Type headers claiming UTF-8. The result: garbled text with sequences like Ã¼ where ü should be. The chardet library handles most cases, but occasionally requires manual encoding overrides per source.
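The failure mode is easy to reproduce with the standard library alone: decode UTF-8 bytes with the wrong codec and the mojibake appears. (In production, chardet's detect() on the raw bytes suggests which codec to actually use.)

```python
# UTF-8 encodes "ü" as the two bytes 0xC3 0xBC; decoded as Windows-1252
# those bytes render as two characters instead of one.
raw = "ü".encode("utf-8")
garbled = raw.decode("cp1252")
print(garbled)  # Ã¼

# chardet (third-party) would guess the real encoding from the bytes:
#   chardet.detect(raw)["encoding"]  -> a codec name to decode with
```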
The Numbers
As of March 2026:
- 93 scraper files in the codebase
- 77 live sources actively returning data
- 16 disabled scrapers (dead sites, IP blocks, Cloudflare)
- ~9,400 total jobs in the database
- ~8,600 active jobs visible to users
- ~700 new job postings detected every weekday
- 15–20 minutes per full scraper run
- $0/month in compute costs (GitHub Actions free tier)
Lessons Learned
1. apply_link as the primary key was the best early decision. URLs are stable, unique, and meaningful. Using auto-incrementing IDs or composite keys (company + title) would have caused constant deduplication failures as sites change their titles and formatting.
2. Async is essential at this scale. Synchronous requests to 91 sites would take hours. With aiohttp and controlled concurrency, it takes minutes. The async paradigm also naturally handles timeouts — a single slow site doesn't block the entire pipeline.
3. Expect everything to break. The most robust pattern I've found is: try multiple strategies, catch everything, return empty data instead of crashing. A scraper that returns 0 jobs is a minor inconvenience. A scraper that crashes the pipeline is a disaster.
4. Monitor obsessively. The Telegram notification after every run is the single most valuable feature of the system. Seeing the daily report — X scrapers succeeded, Y failed, Z new jobs — gives me immediate visibility into system health without checking dashboards.
5. GitHub Actions is underrated for this use case. Free, reliable, runs Docker, supports cron schedules, and provides secrets management. For a scraper that runs once or twice daily and completes in under 20 minutes, it's perfect. No servers to maintain, no bills to pay.
6. Sites don't owe you stability. Every scraper is borrowing structure from someone else's website. When that structure changes — and it will — it's not a bug, it's the nature of the work. Build for resilience, not correctness. Your scraper will be wrong sometimes. The goal is to fail gracefully and recover quickly.
What's Next
The system works, but there's always more to build:
- Proxy rotation for IP-blocked sources — would recover 5–6 disabled scrapers
- LLM-powered extraction — instead of writing custom selectors for each site, use a language model to extract job data from arbitrary HTML. Early experiments are promising
- Real-time monitoring dashboard — beyond Telegram notifications, a web dashboard showing scraper health, historical trends, and failure patterns
- Candidate matching — using the job data and candidate profiles to recommend relevant jobs to specific users
The scraper is the foundation. Everything BirJob does — search, filtering, analytics, email alerts — depends on fresh, accurate, deduplicated job data flowing in daily. Getting that pipeline right was the hardest part of building the product. Everything else is presentation.
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator that scrapes 91 sites daily. If you're a developer interested in web scraping, data pipelines, or building products for small markets, let's connect. Support the platform at birjob.com/support.
