Why I Stopped Trusting Google Analytics for Product Decisions
When I launched BirJob.com — a job aggregator for Azerbaijan that pulls listings from more than 80 sources across the local market — I set up Google Analytics like everyone does. GA4, the tracking snippet, the whole thing. And for a while I convinced myself that sessions, bounce rates, and page views were telling me something useful.
They were not.
Google Analytics can tell you that 300 people visited your homepage on Tuesday. It cannot tell you that 47 of those people searched for "Kapital Bank" and got zero results, which means you are missing one of the most searched employers in Azerbaijan. It cannot tell you that people on Yandex Browser — a segment completely invisible in most western analytics tools — represent a notable portion of your Azerbaijani audience. It cannot correlate what a specific anonymous visitor searched for with their device, their referrer, and whether they found what they were looking for.
Search is the primary action on BirJob. The search box is what users interact with within the first few seconds of landing. If I want to understand whether the product is working, I need to understand search. Not traffic. Search.
So I built my own search logging system. This article is about what I built, why I built it that way, the frustrating proxy chain bug that gave me completely wrong geolocation data for weeks, and what the data actually revealed about job searching behavior in Azerbaijan.
Designing the search_logs Table
BirJob runs on Next.js deployed to Vercel, with a PostgreSQL database managed through Prisma. The search logging is entirely server-side — a fire-and-forget write that happens inside the /api/jobs route handler after the query executes.
Here is the full schema as it lives in the codebase today:
model search_log {
  id            BigInt   @id @default(autoincrement())
  user_id       Int?
  session_id    String?  @db.VarChar(64)
  query         String   @db.VarChar(500)
  results_count Int      @default(0)
  source_filter String?  @db.VarChar(100)
  country       String?  @db.VarChar(100)
  city          String?  @db.VarChar(100)
  region        String?  @db.VarChar(100)
  device        String?  @db.VarChar(20)
  ip            String?  @db.VarChar(45)
  user_agent    String?  @db.VarChar(500)
  browser       String?  @db.VarChar(100)
  os            String?  @db.VarChar(100)
  referrer      String?  @db.VarChar(500)
  created_at    DateTime @default(now())

  user user? @relation(fields: [user_id], references: [id], onDelete: SetNull)

  @@index([query])
  @@index([results_count])
  @@index([created_at])
  @@index([user_id])
  @@index([ip])
  @@map("search_logs")
  @@schema("website")
}
Why each field exists
query — The raw search string the user typed, trimmed and capped at 500 characters. This is the most valuable field in the table. Every other field provides context around it.
results_count — How many deduplicated job listings matched this query. This single integer is the difference between a successful search and a failed one. A search with zero results is a direct signal that the platform is missing something. The index on this column exists specifically so I can query WHERE results_count = 0 without scanning the whole table.
user_id — If the user is logged in, we store their ID. This lets me understand what authenticated users are looking for versus anonymous visitors. Logged-in users on BirJob are typically more engaged — they have email alerts set up, they may have uploaded a CV — so their search behavior is worth separating out.
session_id — A cookie value (birjob_sid) that is set on first visit and persists across page loads. This lets me follow a single anonymous visitor's search journey without requiring authentication. If someone searches "mühasib" and then immediately searches "mühasib Bakı", I can see that as a single session refining their intent.
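For illustration, pulling that cookie out of a raw Cookie header is only a few lines; this parseSessionId helper is a hypothetical stand-in for the server-side logic, not the actual production code:

```typescript
// Hypothetical helper: extract the birjob_sid session cookie from a raw
// Cookie header value. Returns null when the cookie is absent or empty.
function parseSessionId(cookieHeader: string | null): string | null {
  if (!cookieHeader) return null;
  for (const part of cookieHeader.split(';')) {
    // A cookie value may itself contain '=', so rejoin the remainder.
    const [name, ...rest] = part.trim().split('=');
    if (name === 'birjob_sid') {
      const value = rest.join('=').trim();
      return value || null;
    }
  }
  return null;
}
```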
device — Classified from User-Agent into four buckets: mobile, tablet, desktop, bot. Bots get logged but are easy to filter. The mobile vs desktop split in Azerbaijan turns out to be extremely skewed toward mobile, which has direct implications for UI priorities.
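A simplified sketch of that four-bucket classification (the production regexes are more thorough; this captures the ordering that matters: bots first, then tablets, then phones):

```typescript
type Device = 'mobile' | 'tablet' | 'desktop' | 'bot';

// Simplified sketch of the User-Agent classifier. Order matters:
// check bots first, then tablets (iPad, or Android without "Mobile"),
// then phones; anything left is treated as desktop.
function classifyDevice(userAgent: string | null): Device {
  const ua = (userAgent ?? '').toLowerCase();
  if (/bot|crawler|spider|crawling/.test(ua)) return 'bot';
  if (/ipad|tablet|android(?!.*mobile)/.test(ua)) return 'tablet';
  if (/mobile|iphone|android/.test(ua)) return 'mobile';
  return 'desktop';
}
```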
ip — The real visitor IP after unwrapping the Cloudflare/Vercel proxy chain (more on this below). Used for deduplication and rate analysis, not for individual tracking.
user_agent — The raw User-Agent string, capped at 500 characters. Stored in full because the parsed browser and OS fields might miss edge cases, and having the raw string lets me reparse later.
browser and os — Parsed at write time from the User-Agent. The parsing logic runs server-side so the analytics queries do not have to do string operations at query time.
country, city, region — Geolocation from HTTP headers. Not from an IP lookup API — from Cloudflare's own headers, which are injected before the request reaches Vercel. More on this below too.
referrer — The normalized referring domain. If someone searched Google for "iş elanları Bakı" and landed on BirJob, that referrer is stored as google.com. Internal navigation (referrer is the same domain) is stored as null.
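A sketch of that normalization, assuming the platform's own host is birjob.com; normalizeReferrer is an illustrative name:

```typescript
// Sketch of referrer normalization: keep only the referring hostname,
// strip a leading "www.", and store internal navigation as null.
function normalizeReferrer(
  referrer: string | null,
  ownHost = 'birjob.com',
): string | null {
  if (!referrer) return null;
  try {
    const host = new URL(referrer).hostname.replace(/^www\./, '');
    return host === ownHost ? null : host;
  } catch {
    return null; // malformed Referer header
  }
}
```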
source_filter — BirJob shows jobs from many sources (LinkedIn, Rabota.az, Career.az, company websites, etc.). Users can filter by source. If a search has a source filter applied, it is stored here. This is separate from the query so I can understand filter usage independently.
The index strategy
There are five indexes on this table. The reasoning is simple:
- query — For the top keywords query, which does GROUP BY query ORDER BY count DESC
- results_count — For zero-result searches (WHERE results_count = 0), which is probably the most frequently run analytical query
- created_at — For the time-range filter that every query uses (WHERE created_at >= $since)
- user_id — For correlating searches with registered users
- ip — For the unique IP count query and for spotting scraper bots that hammer the search box
The table uses BigInt for the primary key rather than Int. With every search on page 1 logged, this table grows fast. An Int primary key caps at about 2.1 billion rows, which sounds like a lot until you consider that a platform with real traffic could hit that in a few years. BigInt is the right choice from the start.
The IP Extraction Nightmare
This is the part that cost me real time and resulted in weeks of incorrect geolocation data before I caught it.
BirJob's production infrastructure looks like this: DNS is managed by Cloudflare, which proxies all traffic through its edge network before forwarding requests to Vercel, which then routes them to the Next.js serverless functions. So the actual network path for a request from a user in Baku looks like this:
User in Baku
→ Cloudflare edge (likely Frankfurt or Amsterdam)
→ Vercel edge network
→ Next.js serverless function
The problem is what ends up in the HTTP headers by the time the request reaches the serverless function.
Why x-forwarded-for shows Amsterdam instead of Baku
The x-forwarded-for header is supposed to contain the original client IP. In a naive single-proxy setup, it does. But with Cloudflare sitting in front of Vercel, you get something like this:
x-forwarded-for: 185.56.80.43, 198.41.212.67
x-real-ip: 198.41.212.67
The first IP (185.56.80.43) is the real user. The second IP (198.41.212.67) is a Cloudflare edge node in Amsterdam. Vercel, receiving this request, sees the connecting IP as the Cloudflare edge node in Amsterdam, and it sets x-real-ip to that Amsterdam IP.
Vercel also provides its own geo headers: x-vercel-ip-country, x-vercel-ip-city, and x-vercel-ip-country-region. These are derived from the connecting IP, which in our setup is Cloudflare's edge — not the user. So if I naively read x-vercel-ip-country, I get NL (Netherlands) for every visitor, regardless of where they actually are.
For weeks, the search_logs table was full of country = "Netherlands". I thought the data was just sparse. It was not sparse — it was completely wrong.
The correct priority chain
Cloudflare injects its own set of headers before forwarding requests to the origin, and these are based on the real visitor IP as seen by Cloudflare's edge:
- cf-connecting-ip — Cloudflare sets this to the genuine visitor IP, always. This is the most reliable header for IP extraction.
- true-client-ip — Available on Cloudflare Enterprise and via Managed Transforms. Same purpose as cf-connecting-ip.
- cf-ipcountry — Two-letter ISO country code based on the visitor's real IP, resolved by Cloudflare.
- cf-ipcity — City name, available when Cloudflare's "Add visitor location headers" Managed Transform is enabled.
- cf-region / cf-region-code — Region name and code.
The final implementation in src/lib/logEvent.ts uses these headers with correct fallback priority:
function extractIp(headers: { get(name: string): string | null }): string | null {
  // 1. Cloudflare always sets this to the real visitor IP
  const cfIp = headers.get('cf-connecting-ip');
  if (cfIp) return cfIp.trim();

  // 2. Cloudflare Enterprise / Managed Transform "True-Client-IP"
  const trueClient = headers.get('true-client-ip');
  if (trueClient) return trueClient.trim();

  // 3. x-forwarded-for — first entry is the real client IP
  //    (Cloudflare adds it before Vercel appends its own)
  const xff = headers.get('x-forwarded-for');
  if (xff) {
    const first = xff.split(',')[0].trim();
    if (first) return first;
  }

  // 4. Last resort — may be Cloudflare edge IP, not visitor IP
  const realIp = headers.get('x-real-ip');
  if (realIp) return realIp.trim();

  return null;
}
function extractGeo(headers: { get(name: string): string | null }) {
  // Cloudflare geo headers first — Vercel's resolve from the
  // connecting IP (Cloudflare edge), not the real user
  const country =
    headers.get('cf-ipcountry') ||
    headers.get('x-vercel-ip-country') ||
    null;

  const rawCity =
    headers.get('cf-ipcity') ||
    headers.get('x-vercel-ip-city') ||
    null;
  const city = rawCity ? decodeURIComponent(rawCity) : null;

  const region =
    headers.get('cf-region') ||
    headers.get('cf-region-code') ||
    headers.get('x-vercel-ip-country-region') ||
    null;

  return { country, city, region };
}
How long it took to find the bug
About three weeks. The tell was when I finally looked at the country breakdown in the admin panel and saw that roughly 95% of searches were attributed to Netherlands, with a handful to Russia and the United States. BirJob is an Azerbaijan-specific platform. The entire content is in Azerbaijani and Russian. There is no reason anyone in the Netherlands would be searching for jobs here in volume.
The fix was a one-line priority change: check cf-ipcountry before x-vercel-ip-country. After deploying, the data immediately started showing Azerbaijan as the dominant country — which is correct.
The lesson: when you sit behind multiple proxy layers, never assume the standard headers contain what you think they contain. Always print the full headers during development and verify against a known location.
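In practice that verification can be a throwaway helper that snapshots every incoming header; the commented route handler shows roughly how it would be mounted in Next.js, but the exact shape is an assumption:

```typescript
// Throwaway debug helper: snapshot every incoming header so you can
// verify what actually survives the Cloudflare → Vercel proxy chain.
// Uses the web-standard Headers API available in route handlers.
function dumpHeaders(headers: Headers): Record<string, string> {
  const out: Record<string, string> = {};
  headers.forEach((value, name) => {
    out[name] = value;
  });
  return out;
}

// In a temporary Next.js route handler this would be roughly:
// export async function GET(req: Request) {
//   return Response.json(dumpHeaders(req.headers));
// }
```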
Note: the city header cf-ipcity may be URL-encoded. Baku can arrive as Bak%C4%B1 (the Azerbaijani spelling "Bakı" uses a dotless i). The decodeURIComponent call in the geo extractor handles this.
What the Data Revealed
Once the geolocation bug was fixed and a few weeks of clean data had accumulated, the search logs started telling a story that no amount of Google Analytics sessions would have surfaced.
Zero-result searches as a product roadmap
This is the single most valuable view in the entire admin dashboard. Zero-result searches are searches where a user typed something, submitted it, and the database returned nothing. Every one of those is a failed interaction. And when you group and count them, you get a prioritized list of what to build next.
The SQL that powers this view is straightforward:
SELECT query, COUNT(*) AS search_count
FROM website.search_logs
WHERE results_count = 0
AND created_at >= NOW() - INTERVAL '30 days'
GROUP BY query
ORDER BY search_count DESC
LIMIT 20;
What comes back is a direct signal of which companies or job types are missing from the scraper coverage. Real examples from BirJob's zero-result panel:
- "Kapital Bank" — One of the largest banks in Azerbaijan. Their careers page requires JavaScript rendering that the original scraper did not handle. Multiple searches per week from different users, all returning zero. That went straight to the top of the scraper backlog.
- "SOCAR" — The State Oil Company of the Azerbaijan Republic. Their jobs page is behind an enterprise ATS. Users kept searching for it expecting to see listings aggregated. Added a note in the source filter UI explaining that SOCAR listings link out to their own portal.
- "Pasha Bank" — Another major financial institution. Their website has listings but the scraper was hitting a 403 because GitHub Actions IPs are blocked by their CDN — the same problem we have with a number of sources.
- "remote" / "uzaqdan" — People searching for remote work. BirJob does not currently have a "remote" tag or filter. This tells me users expect it. It is now on the roadmap as a structured data field.
- "mühasib" (accountant) — This one is interesting because it does return results, but earlier it did not. The issue was a normalization gap: the scraper for a particular source stored titles in English, but Azerbaijani users search in Azerbaijani. Understanding that gap led to considering a synonym/translation layer for search.
The admin panel displays this as a simple ranked list with a red count badge. It looks mundane but it is arguably the most decision-driving view in the entire system.
Peak search times in Azerbaijan
The daily volume chart in the admin dashboard is a bar chart of search count per day. Looking at the intra-day distribution (via the raw logs with EXTRACT(HOUR FROM created_at)), a clear pattern emerges for Azerbaijan Standard Time (UTC+4):
- Search volume picks up sharply around 9 AM and peaks between 10 AM and 12 PM. This is the mid-morning window when people at work are checking job listings.
- There is a secondary peak around 8–10 PM. This is the evening job-hunting session from home, typically on mobile.
- Weekday traffic is heavier than weekend traffic, but the weekend evening peak is proportionally larger than the weekday evening peak. People have more time to browse on Saturday and Sunday evenings.
- Monday mornings consistently show the highest search volume of the week. The "Monday morning job search" is real and quantifiable.
This has practical implications: the scraper runs daily, and knowing that most users arrive on Monday morning means it makes sense to ensure the scraper has completed its run and updated the database before the Monday morning peak, not after.
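Since created_at is stored in UTC and Azerbaijan sits at a fixed UTC+4 with no daylight saving, the intra-day bucketing can be sketched in a couple of pure functions (illustrative names, not the production code):

```typescript
// Convert a UTC timestamp to the local hour in Azerbaijan.
// AZT is a fixed UTC+4 offset, so no timezone library is needed.
function hourInBaku(utc: Date): number {
  return (utc.getUTCHours() + 4) % 24;
}

// Bucket raw log timestamps into a 24-slot histogram for the
// intra-day search distribution chart.
function hourlyHistogram(timestamps: Date[]): number[] {
  const buckets = new Array(24).fill(0);
  for (const t of timestamps) buckets[hourInBaku(t)] += 1;
  return buckets;
}
```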
Most searched keywords
The top keywords view groups searches by query text, counts occurrences, and shows the average results count per query. This gives you two dimensions: frequency (how often is this searched) and satisfaction (do those searches actually return results).
The most searched categories on BirJob are predictable once you understand the Azerbaijani job market context:
- Finance roles: mühasib (accountant), maliyyə (finance), bank, kredit
- IT roles: proqramçı (programmer), developer, IT, 1C
- Sales: satış (sales), menecer (manager), agent
- Administrative: katib (secretary), inzibati, ofis
- Company name searches: specific bank names, telecom operators, oil companies
The company-name searches are particularly useful. They reveal brand affinity — these users are not searching for a job category, they want to work at a specific company. If that company is not in the scraper coverage, that is an immediate acquisition failure.
Device breakdown
The device split is striking if you have developed primarily on desktop. On BirJob, mobile accounts for a significant majority of searches — consistently over 65% in the data. Desktop is second, and tablet is a rounding error.
Azerbaijan has very high smartphone penetration relative to desktop ownership, particularly outside Baku. The job market is not concentrated in white-collar offices — there are large numbers of people looking for retail, construction, hospitality, and driver jobs. These users are searching on their phone, often on mobile data, often in the afternoon or evening.
This mobile-first reality drove several UI decisions: the search box must be prominent on first load without scrolling, the job card layout prioritizes scannable information over dense detail, and load performance on 4G connections (not fiber) is treated as a first-class concern.
Browser distribution in Azerbaijan
This is where the data diverges most sharply from what western-centric analytics reports would lead you to expect.
Chrome dominates, as it does everywhere. But the second-most-used browser on BirJob by a margin that surprised me is Yandex Browser — a Chromium-based browser developed by the Russian tech company Yandex, which has significant install base across post-Soviet states including Azerbaijan. If you only look at western analytics reports, Yandex Browser does not appear. But if your platform serves Azerbaijani users, it is present enough that you need to test on it.
Samsung Internet Browser — the default browser on Samsung Android devices — also appears in the data with meaningful volume. Samsung is a dominant smartphone brand in Azerbaijan, particularly in the mid-range segment.
Safari rounds out the picture for iPhone users. Firefox is a small but non-zero presence. IE is essentially gone.
The browser breakdown matters for CSS rendering, JavaScript compatibility, and especially for testing PWA behavior and beforeinstallprompt events, which behave differently across these browsers.
Building the Admin Analytics Dashboard
The admin panel is a single-page React component at src/app/admin/page.tsx, with the search analytics section extracted into a dedicated component at src/components/admin/SearchAnalyticsPanel.tsx. The data is served from a single API route at /api/admin/search-analytics.
The API route design
Rather than building multiple separate endpoints for each visualization, a single API call fetches everything the analytics panel needs. This is a deliberate tradeoff: the response payload is larger, but there is only one database round-trip per page load (which actually runs multiple parallel queries internally).
const [
  recentSearches,
  totalCount,
  topKeywords,
  topBrowsers,
  topOS,
  topCountries,
  topCities,
  dailySearches,
  zeroResultSearches,
  uniqueIPs,
] = await Promise.all([
  // All queries run concurrently via Promise.all
  prisma.search_log.findMany({ /* paginated detail log */ }),
  prisma.search_log.count({ where }),
  prisma.search_log.groupBy({ by: ['query'], /* top keywords */ }),
  prisma.search_log.groupBy({ by: ['browser'], /* browser breakdown */ }),
  prisma.search_log.groupBy({ by: ['os'], /* OS breakdown */ }),
  prisma.search_log.groupBy({ by: ['country'], /* country breakdown */ }),
  prisma.search_log.groupBy({ by: ['city'], /* city breakdown */ }),
  prisma.$queryRaw`
    SELECT DATE(created_at)::text AS date, COUNT(*) AS count
    FROM website.search_logs
    WHERE created_at >= ${since}
    GROUP BY DATE(created_at)
    ORDER BY date ASC
  `,
  prisma.search_log.groupBy({
    by: ['query'],
    where: { created_at: { gte: since }, results_count: 0 },
    _count: { id: true },
    orderBy: { _count: { id: 'desc' } },
    take: 20,
  }),
  prisma.search_log.findMany({
    where: { created_at: { gte: since }, ip: { not: null } },
    select: { ip: true },
    distinct: ['ip'],
  }).then(r => r.length),
]);
The daily search volume query uses raw SQL because Prisma's ORM layer does not have a GROUP BY DATE(column) abstraction. The ::text cast in PostgreSQL is needed because Prisma returns Date objects for timestamp columns, but we want plain date strings for charting.
One gotcha: Prisma returns bigint for COUNT(*) in raw queries (it maps to PostgreSQL's bigint). JavaScript's JSON.stringify cannot serialize native BigInt. The fix is to explicitly convert: Number(r.count) before returning the response.
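The conversion can live in one small mapper just before the response is built; the row shape here mirrors the daily-volume query above, but the function name is illustrative:

```typescript
// Row shape returned by the raw daily-volume query: Prisma maps
// PostgreSQL's bigint COUNT(*) to a native BigInt.
type DailyRow = { date: string; count: bigint };

// JSON.stringify throws on native BigInt, so convert each count to
// a plain number before returning the API response.
function toSerializable(rows: DailyRow[]): { date: string; count: number }[] {
  return rows.map((r) => ({ date: r.date, count: Number(r.count) }));
}
```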
The zero-result searches panel
In the frontend, the zero-result panel has a description that makes its purpose explicit: "Bu sorğular sıfır nəticə qaytardı — yeni mənbə üçün işarə" (These queries returned zero results — an indicator for new sources). It is placed next to the top keywords panel so the contrast is visible: here is what people are finding, here is what they are not finding.
Each entry shows the query text and the count of times it was searched with zero results. Clicking through to the detail log lets me see the individual searches, including the timestamp, so I can tell whether a zero-result query is consistently failing or if it was only a problem during a specific time window (for example, if a scraper broke for 48 hours).
The detail log with expandable rows
The "Ətraflı Loglar" (Detail Logs) tab shows individual search entries in a paginated table. Each row shows: query text, results count (red if zero), IP address, location (city / region / country), browser and OS, device type, and timestamp.
Clicking a row expands it to show the raw User-Agent string, referrer, session ID (first 16 characters), and the user account if the search was made while logged in. This level of detail is useful when investigating anomalies — for example, if a query appears to be a bot pattern, the full User-Agent and IP combination confirms it.
The query filter at the top of the detail view lets me search within the logs. If I want to understand every search involving "developer" in the last 30 days — including what results count they got, whether they were on mobile, and where they came from — I can get that in seconds.
The time range selector (7, 30, 90, 365 days) adjusts all panels simultaneously. Looking at 365 days is useful for spotting seasonal trends. Looking at 7 days is useful when I have just shipped a change and want to see immediate impact on search behavior.
Privacy Considerations
Logging every search query with IP, browser, and location is a meaningful data collection decision. I want to be honest about what BirJob collects and why.
What we collect and why
The search log records: what you searched for, whether you found results, what device and browser you used, roughly where you are (country/city level), and a session identifier to group your searches together. There is no tracking pixel, no cross-site tracking, no behavioral profiling beyond the search log.
The purpose is operational: to make the platform better. The zero-result searches directly drive which scrapers get added or fixed. The device distribution directly informs UI decisions. The browser breakdown informs compatibility testing priorities.
Azerbaijan's personal data law
Azerbaijan has the Fərdi məlumatlar haqqında (Law on Personal Data), which governs the collection and processing of personal information. IP addresses are considered personal data under this framework. The law requires that data processing have a legitimate purpose, that users be informed about what is collected, and that data not be retained longer than necessary.
BirJob's privacy policy discloses the search log collection and its purpose. The legitimate basis is the platform's operational improvement interest. Users who do not want their searches logged can use the platform without creating an account; the user_id field will be null.
IP anonymization
The IP address is stored in full in the database, but it is only accessible through the admin panel, which requires authentication. For future consideration: truncating the last octet of IPv4 addresses (storing 185.56.80.x instead of the full address) is a standard anonymization technique that preserves geographic utility while reducing re-identifiability.
The current approach is to store the full IP but treat it as internal operational data, not as user-facing information. No IP address is ever exposed in the public API.
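If the truncation approach gets adopted, a sketch might look like this (anonymizeIp is a hypothetical helper; the IPv6 branch keeps roughly the first four groups and does not handle compressed :: notation):

```typescript
// Truncate an IP before storage: drop the last octet for IPv4,
// keep only the first four groups (roughly the /64 prefix) for IPv6.
// Anything unrecognized is returned unchanged.
function anonymizeIp(ip: string): string {
  if (ip.includes('.')) {
    const parts = ip.split('.');
    if (parts.length === 4) return parts.slice(0, 3).join('.') + '.x';
    return ip;
  }
  if (ip.includes(':')) {
    // Sketch only: assumes a fully expanded address, not "::" shorthand.
    return ip.split(':').slice(0, 4).join(':') + '::';
  }
  return ip;
}
```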
Data retention
Search logs are not currently auto-expired. This is on the technical debt list. A cron job that deletes entries older than 12 months would be straightforward to implement and would cap the table size at a reasonable level while preserving the data needed for trend analysis. A year of history is more than sufficient for every analytical use case I have found.
The web_event_logs table, which logs page views and other events, follows the same pattern. Both tables use created_at indexes specifically to make time-range deletes efficient.
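The retention job itself would be small. A sketch of the cutoff computation, with the Prisma delete it would drive left as a comment because it needs a live database:

```typescript
// Compute the retention cutoff: anything created before this date
// is eligible for deletion. Defaults to the 12-month window.
function retentionCutoff(now: Date, months = 12): Date {
  const cutoff = new Date(now.getTime());
  // setUTCMonth handles year rollover for us; note that end-of-month
  // dates (e.g. Jan 31) can normalize forward a day or two.
  cutoff.setUTCMonth(cutoff.getUTCMonth() - months);
  return cutoff;
}

// The nightly cron job would then be roughly:
// await prisma.search_log.deleteMany({
//   where: { created_at: { lt: retentionCutoff(new Date()) } },
// });
```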
Transparency
Being a solo developer running a platform in a local market where trust is established differently than in western tech markets means I cannot afford to be vague about data practices. If a user asks me directly whether I log their searches, the answer is: yes, along with your approximate location and device type, and here is what I use it for. That straightforwardness matters more to me than appearing minimalistic about data collection.
What I Would Do Differently
A few things I got wrong or would approach differently with the benefit of hindsight:
Test the proxy chain before shipping. The geolocation bug was entirely avoidable. Before deploying any IP or geo-based feature in production, I should have made a test request from a known location and logged every header to a temporary endpoint. The thirty-second check I skipped cost three weeks of bad data.
Log a click-through signal. The current implementation only logs searches on page 1 to avoid duplicate entries for pagination, which is the right call. But I would add a separate field recording whether results were actually interacted with — a "click-through" signal. Currently I can tell if a search returned results, but I cannot tell if those results were any good. The user may have gotten 50 results and clicked on none of them. That is a quality signal I am missing.
Normalize queries before logging. "mühasib" and "Mühasib" and "MÜHASİB" are stored as three different queries. Lowercase normalization at log time would make the aggregations cleaner. The current fix is to use case-insensitive grouping in the SQL queries, but storing the normalized form would be cleaner.
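A sketch of that log-time normalization: plain toLowerCase mishandles the Azerbaijani dotted İ (Unicode's default mapping turns it into i plus a combining dot), so the locale-aware variant is the safer choice here:

```typescript
// Normalize a search query before logging: trim, collapse internal
// whitespace, and lowercase with Azerbaijani casing rules, where
// İ lowercases to i and I lowercases to dotless ı.
function normalizeQuery(raw: string): string {
  return raw.trim().replace(/\s+/g, ' ').toLocaleLowerCase('az');
}
```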
Add a feedback mechanism for zero-result searches. Right now I see that someone searched for something and got zero results, but I have no way to know if they gave up or came back the next day and found what they needed. A simple "Did you find what you were looking for? [Yes / No]" prompt shown after a zero-result search would close that loop. The data would be valuable. The implementation is two fields in the search_log table and a client-side follow-up call.
Set up automated alerting on zero-result spikes. Currently I check the zero-result panel manually. If a scraper breaks and a major employer's listings disappear from the index overnight, the zero-result count for company name searches would spike immediately. An automated check — running nightly and alerting me if zero-result searches for any single query exceed a threshold — would catch breakages faster than my manual review cadence.
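The nightly check itself reduces to a threshold filter over the grouped zero-result counts (the threshold of 10 and the findSpikes name are placeholders, not values from the real system):

```typescript
// One grouped row from the zero-result GROUP BY query.
type ZeroResultRow = { query: string; count: number };

// Return the queries whose zero-result count crossed the alert
// threshold, highest first, ready to be emailed or posted to a channel.
function findSpikes(rows: ZeroResultRow[], threshold = 10): ZeroResultRow[] {
  return rows
    .filter((r) => r.count >= threshold)
    .sort((a, b) => b.count - a.count);
}
```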
Consider full-text search indexing earlier. The current search implementation is a PostgreSQL ILIKE query: title ILIKE '%keyword%' OR company ILIKE '%keyword%'. This works for an exact-substring match but misses synonyms, morphological variations (Azerbaijani is an agglutinative language — the root word changes significantly with suffixes), and typos. The search log data shows exactly what users are typing. That corpus is the perfect training set for understanding what synonyms and alternate forms need to be handled.
Conclusion: Search Logs as the Cheapest User Research Tool
I have run user interviews, watched session recordings, and stared at funnel charts. All of those have their place. But for a job aggregator platform, nothing has come close to the search log in terms of actionable insight per unit of time invested.
The economics are compelling. The implementation cost was maybe four hours of work: schema migration, server-side logging function, API endpoint, and admin panel component. The ongoing cost is a few hundred kilobytes of database writes per day. In return, I get a continuously updated, quantitative view of exactly what my users want and exactly where the platform is failing them.
Zero-result searches are the most honest form of product feedback that exists. Nobody is filling out a form or leaving a review. They typed what they wanted, the platform did not have it, and that fact is recorded in a row in the database. No interpretation required. No sample bias. No social desirability effect. Just a raw signal: this person wanted this thing and did not find it here.
For a solo developer without a UX research team, a user testing budget, or even a proper customer support channel, that kind of direct signal is invaluable. The scraper roadmap for the next three months is essentially a printout of the top zero-result queries. The UI improvement list for mobile is informed by the device breakdown. The compatibility testing checklist includes Yandex Browser and Samsung Internet because the data says to.
If you are building any kind of search-centric platform and you are not logging searches, start today. The table design does not need to be perfect. Log the query, log the results count, log the timestamp. You can add more fields incrementally. But get the data flowing, because the retrospective view — being able to look back and say "for the past 90 days, these are the things users wanted and did not find" — is worth more than almost any other metric you are probably tracking.
The search box is where users tell you what they need. The least you can do is listen.
