Injuria – Legal Intelligence Platform Deep Dive
How I built a high-scale AI + scraping platform that processes 500,000+ pages/day for competitive intelligence in the legal industry.
1. High-Scale Web Scraping Infrastructure
What I Built
Given an input of tens of thousands of domains, the goal was to create a living infrastructure that continuously self-updates: scraping, processing, and enriching data without human intervention.
I architected a fleet of headless browsers (Puppeteer + Node.js) with rotating proxy infrastructure, multi-cluster orchestration, and intelligent URL prioritization. The system can process 500k+ pages/day on local hardware, bypass anti-bot systems with stealth profiles, and block non-essential assets for maximum throughput.
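As a rough illustration, here is a minimal sketch of what one worker in that fleet could look like, assuming puppeteer-extra with the stealth plugin; the proxy argument, timeout, and blocked asset types are illustrative rather than the exact production values:

```ts
// Minimal sketch of one scraper worker: stealth profile, proxy, and asset blocking.
// Assumes puppeteer-extra + puppeteer-extra-plugin-stealth; values are illustrative.
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

puppeteer.use(StealthPlugin());

const BLOCKED = new Set(['image', 'stylesheet', 'font', 'media']);

export async function fetchPage(url: string, proxy: string): Promise<string> {
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`, '--no-sandbox'],
  });
  try {
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    // Drop non-essential assets so each page costs a fraction of the bandwidth.
    page.on('request', (req) =>
      BLOCKED.has(req.resourceType()) ? req.abort() : req.continue()
    );
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30_000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```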
The pipeline also included comprehensive retry logic with exponential backoff and dead-letter recovery (sketched below), ensuring that failures were isolated and retried without stalling the rest of the run.
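A rough sketch of that backoff-plus-dead-letter pattern; the attempt counts, delays, and dead-letter shape are illustrative, not the exact implementation:

```ts
// Sketch of the retry wrapper: exponential backoff with a dead-letter list
// for URLs that exhaust their attempts. Limits and jitter are illustrative.
const deadLetter: { url: string; error: string }[] = [];

export async function withRetry<T>(
  url: string,
  task: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1_000
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt === maxAttempts) {
        // Park the failure instead of stalling the pipeline; a separate job replays these.
        deadLetter.push({ url, error: String(err) });
        return undefined;
      }
      // 1s, 2s, 4s, ... plus jitter so retries don't synchronize across workers.
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```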
Key Learning: Scaling scraping isn't "just add threads" — it's orchestration, prioritization, and targeted retries that keep throughput high without ballooning resource costs.
Blocker & Solution
Blocker: Anti-bot measures intermittently killed entire browser clusters.
Solution: Rotating stealth profiles, tuned concurrency, health checks that auto-restart bad nodes, and proxy pool cycling.
2. AI-Powered ETL Pipeline
What I Built
A multi-agent AI orchestration layer for entity extraction, classification, and summarization.
- Precursor Agents: Since some domains had hundreds of pages, I built an AI "page triage" agent whose only job was to determine which pages to send to downstream summarization agents, massively reducing token costs and avoiding wasted LLM calls.
- JSON Mode: Learned to leverage OpenAI's JSON mode + schema validation to enforce structural integrity before DB insert (see the sketch after this list).
- Model Diversity: Used OpenAI GPT-4, Google Gemini, and AWS Bedrock (Claude & LLaMA) with model-specific rate limiting so jobs weren't "half-baked" mid-run.
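A rough sketch of the JSON-mode plus validation step, assuming the official openai Node SDK and zod; the model name and the attorney schema shown here are illustrative:

```ts
// Sketch of the structured-extraction step: JSON mode plus schema validation
// before anything touches the database. The zod schema is illustrative.
import OpenAI from 'openai';
import { z } from 'zod';

const Attorney = z.object({
  name: z.string(),
  practiceAreas: z.array(z.string()),
  office: z.string().nullable(),
});

const client = new OpenAI();

export async function extractAttorney(markdown: string) {
  const res = await client.chat.completions.create({
    model: 'gpt-4-turbo', // illustrative; any JSON-mode-capable model works here
    response_format: { type: 'json_object' }, // forces syntactically valid JSON
    messages: [
      {
        role: 'system',
        content: 'Extract the attorney profile as JSON with keys name, practiceAreas, office.',
      },
      { role: 'user', content: markdown },
    ],
  });

  const parsed = Attorney.safeParse(JSON.parse(res.choices[0].message.content ?? '{}'));
  if (!parsed.success) {
    // Malformed output is routed back through a correction prompt or a fallback model.
    throw new Error(`Schema validation failed: ${parsed.error.message}`);
  }
  return parsed.data; // safe to insert into Postgres
}
```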
After scraping, HTML was converted to markdown for cleaner text analysis.
This all flowed into a Postgres-based structured store, with section-level diffing to skip unchanged data and avoid unnecessary reprocessing.
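The diffing idea boils down to hashing each logical section and comparing against the stored hash before spending tokens. A sketch, with illustrative table and column names and an assumed unique constraint on (page_id, section):

```ts
// Sketch of section-level diffing: hash each section of a page and skip LLM
// reprocessing when the hash is unchanged. Schema names are illustrative.
import { createHash } from 'node:crypto';
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from PG* env vars

const sha256 = (text: string) => createHash('sha256').update(text).digest('hex');

export async function needsReprocessing(pageId: number, section: string, body: string) {
  const hash = sha256(body.trim());
  const { rows } = await pool.query(
    'SELECT content_hash FROM page_sections WHERE page_id = $1 AND section = $2',
    [pageId, section]
  );
  if (rows[0]?.content_hash === hash) return false; // unchanged, skip the LLM call

  // Assumes a unique (page_id, section) constraint backing the upsert.
  await pool.query(
    `INSERT INTO page_sections (page_id, section, content_hash)
     VALUES ($1, $2, $3)
     ON CONFLICT (page_id, section) DO UPDATE SET content_hash = EXCLUDED.content_hash`,
    [pageId, section, hash]
  );
  return true;
}
```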
Key Learning: First, multi-agent orchestration isn't just cool; it's the only sane way to process at scale when you're paying for tokens. Second, with the power of LLMs, it's easier than ever to analyze unstructured content at scale if you have the pipeline built right.
Blocker & Solution
Blocker: Inconsistent LLM outputs and malformed JSON.
Solution: Schema validators, auto-correction prompts, model fallbacks, and diff-based reprocessing.
3. Complex Data Architecture
What I Built
A 60-table Postgres system with JSONB, vector embeddings, and aggressive indexing, essential once the DB grew into the hundreds of GBs.
Core tables included:
entities, attorneys, locations, case_results, practice_areas, awards, and testimonials.
Historical change tracking + vector search enabled semantic queries and competitive intelligence analysis over time.
Older, infrequently accessed data was automatically moved to colder storage to keep queries fast and costs low.
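For flavor, here is roughly what a semantic query against that store could look like, assuming pgvector and an OpenAI embedding model; the table, column, and model names are illustrative:

```ts
// Sketch of a semantic query over entity embeddings, assuming pgvector.
// Everything is parameterized; the vector is bound, never string-concatenated.
import OpenAI from 'openai';
import { Pool } from 'pg';

const openai = new OpenAI();
const pool = new Pool();

export async function semanticSearch(question: string, limit = 10) {
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small', // illustrative model choice
    input: question,
  });
  const vector = JSON.stringify(embedding.data[0].embedding); // pgvector accepts '[0.1,0.2,...]'

  const { rows } = await pool.query(
    `SELECT id, name, summary
       FROM entities
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [vector, limit]
  );
  return rows;
}
```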
Key Learning: Indexing tables isn't optional at scale; it's the difference between millisecond queries and spiraling "did I break something?" trains of thought.
Blocker & Solution
Blocker: Early schema design struggled with multi-office/multi-attorney relationships.
Solution: Introduced junction tables + relationship constraints and refactored ingestion.
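A sketch of the kind of junction table that resolved the multi-office problem, run as a simple node-postgres migration; it assumes existing attorneys and locations tables with integer id primary keys, and all names shown are illustrative:

```ts
// Sketch of the junction-table fix for many-to-many attorney/office relationships.
// Table, column, and index names are illustrative.
import { Pool } from 'pg';

const pool = new Pool();

export async function migrate() {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS attorney_locations (
      attorney_id INTEGER NOT NULL REFERENCES attorneys(id) ON DELETE CASCADE,
      location_id INTEGER NOT NULL REFERENCES locations(id) ON DELETE CASCADE,
      role        TEXT,
      PRIMARY KEY (attorney_id, location_id)
    );
    CREATE INDEX IF NOT EXISTS idx_attorney_locations_location
      ON attorney_locations (location_id);
  `);
}
```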
4. Real-Time Monitoring & Observability
What I Built
Socket.io-based real-time dashboards for job progress, system health, and error alerts. Event-loop utilization metrics, semantic error logging, and retry metadata tracking enabled rapid diagnosis and scaling decisions.
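A minimal sketch of that monitoring surface: a Socket.io server that streams event-loop utilization and per-job progress. The event names, port, and sampling interval are illustrative:

```ts
// Sketch of the real-time monitoring channel: system health plus job progress.
import { createServer } from 'node:http';
import { performance } from 'node:perf_hooks';
import { Server } from 'socket.io';

const httpServer = createServer();
const io = new Server(httpServer, { cors: { origin: '*' } });

let lastElu = performance.eventLoopUtilization();

setInterval(() => {
  // Delta since the last sample; utilization approaches 1.0 when the loop is saturated.
  const elu = performance.eventLoopUtilization(lastElu);
  lastElu = performance.eventLoopUtilization();
  io.emit('system:health', { eventLoopUtilization: elu.utilization, at: Date.now() });
}, 5_000);

// Scraper/ETL workers call this to push per-job status to every connected dashboard.
export function reportJobProgress(jobId: string, done: number, total: number) {
  io.emit('job:progress', { jobId, done, total, pct: Math.round((done / total) * 100) });
}

httpServer.listen(4000);
```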
Key Learning: In a distributed, high-throughput system, real-time visibility is not a luxury; it's what keeps me sane. In a dev environment with only a handful of domains, spot-checking data is sufficient; with tens of millions of records, that's impossible. I spent a maddening amount of time on observability before building out the ETL pipeline.
Blocker & Solution
Blocker: Failures would silently ripple until data quality degraded.
Solution: Centralized logging, alerting, and per-job status tracking. I accidentally ended up building my own version of BullMQ.
5. API-First Architecture
What I Built
REST APIs powering entity management, analytics, and AI job control, plus vector + full-text search for discovery. All queries were parameterized to prevent SQL injection. APIs were refactored for composability to serve both internal dashboards and conversational AI tools.
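Here is roughly what one of those composable, parameterized endpoints could look like, sketched with Express and node-postgres; the route shape and table/column names are illustrative:

```ts
// Sketch of one composable REST endpoint: every value is bound as a parameter,
// nothing is interpolated into the SQL string. Names are illustrative.
import express from 'express';
import { Pool } from 'pg';

const app = express();
const pool = new Pool();

app.get('/api/entities/:id/case-results', async (req, res) => {
  const limit = Math.min(Number(req.query.limit) || 25, 100);
  try {
    const { rows } = await pool.query(
      `SELECT cr.id, cr.title, cr.amount, cr.resolved_at
         FROM case_results cr
        WHERE cr.entity_id = $1
        ORDER BY cr.resolved_at DESC
        LIMIT $2`,
      [req.params.id, limit]
    );
    res.json({ data: rows, count: rows.length });
  } catch {
    res.status(500).json({ error: 'query_failed' });
  }
});

app.listen(3001);
```

The same handler serves both the internal dashboards and the conversational AI tools, which is what the composability refactor was for.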
Key Learning: Designing APIs for both machine and human consumers forces clarity and prevents dead-end endpoints.
Blocker & Solution
Blocker: Initial APIs were too rigid to support emerging use cases.
Solution: Modularized endpoints and added vector search support.
6. Agentic AI & Safe Autonomy
What I Built
Like most devs, I was initially hesitant to give an AI agent direct database access. Instead of raw write permissions, I created a "safe lane": the agent could only read from predefined, parameterized queries, ensuring it never touched sensitive internals or ran arbitrary SQL.
These agents used the retrieved data in conjunction with RAG pipelines to generate highly contextual, domain-specific insights without me manually stitching queries together. It turned my system into something closer to a vertical semantic intelligence engine than a traditional scraper + summarizer stack.
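A sketch of that safe lane: the agent can only name one of a catalog of predefined, read-only queries and supply parameters, never raw SQL. The query names and SQL below are illustrative:

```ts
// Sketch of the "safe lane": a whitelist of read-only, parameterized queries.
// The agent is exposed to runSafeQuery() as a tool and nothing else.
import { Pool } from 'pg';

const pool = new Pool();

const SAFE_QUERIES = {
  topPracticeAreas: {
    sql: `SELECT pa.name, COUNT(*) AS firms
            FROM practice_areas pa
            JOIN entity_practice_areas epa ON epa.practice_area_id = pa.id
           GROUP BY pa.name ORDER BY firms DESC LIMIT $1`,
    params: ['limit'] as const,
  },
  caseResultsForEntity: {
    sql: `SELECT title, amount, resolved_at
            FROM case_results WHERE entity_id = $1
           ORDER BY resolved_at DESC LIMIT $2`,
    params: ['entityId', 'limit'] as const,
  },
};

type SafeQueryName = keyof typeof SAFE_QUERIES;

export async function runSafeQuery(name: SafeQueryName, args: Record<string, unknown>) {
  const q = SAFE_QUERIES[name];
  if (!q) throw new Error(`Unknown query: ${name}`); // reject anything off the whitelist
  const values = q.params.map((p) => args[p]);
  const { rows } = await pool.query(q.sql, values);
  return rows;
}
```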
Key Learning: Agentic AI IS the future, but the real innovation lies in the abstraction layer that verifies and mediates between the agent and your data. As the founder of Anthropic (the company behind Claude) has said, it's on us to define the guardrails, and this project taught me how to do exactly that while keeping performance high and risk low.
Blocker & Solution
Blocker: Fear of the AI issuing harmful or costly DB operations.
Solution: Constrained access to a controlled set of safe, pre-written queries + strict JSON output validation before any downstream step touched the database.
7. Front-End UI & Component-Based Design
What I Built
While the backend is technically dense, I also built the entire front-end interface, a responsive Next.js + React application with real-time data visualizations, semantic search, and detailed entity pages tied directly into the scraping + AI pipelines.
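On the client side, the real-time pieces mostly reduce to small hooks like this sketch, which assumes the Socket.io channel from the monitoring section; the event name and payload shape are illustrative:

```ts
// Sketch of a client hook that feeds the real-time dashboards with job progress.
import { useEffect, useState } from 'react';
import { io } from 'socket.io-client';

type JobProgress = { jobId: string; done: number; total: number; pct: number };

export function useJobProgress(socketUrl: string) {
  const [jobs, setJobs] = useState<Record<string, JobProgress>>({});

  useEffect(() => {
    const socket = io(socketUrl);
    // Merge each progress event into a per-job map that components can render from.
    socket.on('job:progress', (update: JobProgress) =>
      setJobs((prev) => ({ ...prev, [update.jobId]: update }))
    );
    return () => {
      socket.disconnect();
    };
  }, [socketUrl]);

  return jobs;
}
```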
I'm comfortable with React's component-driven architecture and with leveraging both design systems and utility-first styling. Material UI was my go-to, but I've increasingly moved toward Tailwind CSS for flexibility and speed.
I'm pragmatic about AI-assisted UI development: the risks are minimal, so I've leaned on it to rapidly iterate and produce beautiful, functional interfaces.
Key Learning: Using Claude Code for front-end work is not beneath me, and I'm able to ship faster than ever before.
8. AWS Deployment & DevOps Glue
What I Built
The entire platform runs on AWS with an EC2 Ubuntu base. I use an Application Load Balancer to handle incoming traffic, reverse-proxying requests through Nginx to separate app/API processes. AWS WAF filters unwanted traffic, while CodeDeploy handles zero-downtime deployments.
- Postgres lives in RDS
- SSL is handled via AWS Certificate Manager
- CloudFront acts as the CDN for static assets
- S3 stores media and long-term data artifacts
The architecture is built for scale but also designed to be maintainable: everything from automated deploy hooks to environment-specific configs is baked in, so scaling is more about spinning up infrastructure than rewriting code.
Key Learning: There's a lot more to DevOps than just "put it on a server." The glue setup—networking, IAM permissions, SSL, caching layers, CI/CD, and security hardening—takes as much thought as the code itself. Building the app was one challenge; making it deploy cleanly and run securely at scale was another.
Blocker & Solution
I've been deploying apps via AWS for a few years now, so there wasn't really a blocker per se. The main friction was reminding myself how to wire everything together after the gap between each new app I launched, which meant re-learning certain things, accounting for package versioning, and so on.
9. Local Hardware Optimization (Pre-Cloud)
Before migrating to AWS, I tuned my local workstation to handle 500k+ pages/day: allocating RAM, managing CPU thermal limits, and throttling concurrency to prevent hardware failure. I learned firsthand how much easier large-scale workloads become once orchestration is in the cloud.
Closing Note
Injuria is the most technically ambitious system I've built: a self-updating, multi-agent, AI-driven intelligence platform that blends large-scale scraping, robust ETL pipelines, and complex relational modeling.
It's modular, cloud-ready, resilient enough to scale into millions of records, and built with the mindset that data quality is as important as data volume.