For most of its history, the Indian court record was a document. You could hold it, scan it, print it, email it. You could not query it, join it, or stream it. That is the quiet transformation eCourts has delivered over two decades. India’s judicial output is becoming a dataset, and the difference is more than technical. A document is read by humans, one at a time. A dataset is read by machines, at any scale, for any purpose. This post explains why that shift matters, what still stands in the way, and why the private aggregation layer exists.
Document versus dataset, in plain terms
A PDF of an order is a document. The same order, with fields for court, case number, parties, judge, date, disposition, and full searchable text, is a row in a dataset. Twenty such rows give you a table. Twenty crore such rows, updated daily across 29,600+ courts, give you the foundation of an entire industry.
The question every legaltech founder eventually asks is this: what percentage of the Indian court record is actually dataset-grade today? The honest answer is that the metadata layer is close to fully dataset-grade. Every case on the eCourts portal has structured fields for court, type, number, year, parties, filing date, status, next hearing, and so on. The content layer, which is the actual text of orders and judgments, is partially dataset-grade. Supreme Court and High Court orders are mostly text-extractable PDFs. District court orders vary widely. Historical records from before 2010 are largely scanned images.
Why the shift took two decades
Turning India’s court record into a dataset is a harder problem than it looks. Consider the moving pieces.
- Over 29,600 courts across 37 states and union territories, each with local workflows.
- 22 scheduled languages plus English, with case files drafted in regional scripts.
- Decades of legacy paper that predates any digital workflow.
- Dozens of court types, from the Supreme Court to taluka courts to tribunals like the NCLT and NCLAT.
- Confidentiality rules that require redaction in certain matter categories.
- Infrastructure constraints in rural taluka courts that only recently got reliable connectivity.
Given that list, the fact that the public eCourts portal now hosts 26+ crore case records and the National Judicial Data Grid publishes daily numbers is an extraordinary achievement. No other democracy of India’s scale has a comparable system. We covered the full story in our deep dive on the eCourts project.
What dataset-grade unlocks
Once court data is a dataset instead of a document, use cases that were previously impractical become not just possible but economically attractive. Some examples:
| Use case | Why it needed a dataset, not a document |
|---|---|
| Automated litigation tracking for 500+ matters | Requires daily delta queries, not manual checks. Only possible with structured, queryable data. |
| Background verification on counterparties | Needs pattern search on party names, PAN, and addresses across jurisdictions. Not feasible document by document. |
| Credit and risk scoring that includes litigation exposure | Needs to join court data with KYC and credit systems. Requires APIs. |
| Judge analytics and bench research | Needs aggregation of years of orders per judge, clustered by matter type and outcome. |
| AI legal assistants and research agents | Needs real-time API access and semantic search over full text. |
| Compliance and fraud monitoring for regulators | Needs cross-industry joins with financial data, always-on coverage, structured outcomes. |
Every one of these use cases existed before. They were served by manual research teams, expensive law firm retainers, or scanning vendors with spreadsheets. Dataset-grade data does not create new needs. It makes existing needs solvable at 10x to 100x lower cost and 1000x higher scale.
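The first row of the table, automated tracking via daily delta queries, reduces to a simple operation once the data is structured: diff today's snapshot against yesterday's and surface only what changed. A toy sketch, with hard-coded snapshots standing in for what would in practice come from an aggregator API:

```python
# Yesterday's and today's snapshots of two tracked matters,
# keyed by a case identifier. Toy data for illustration.
yesterday = {
    "DLHC-2024-1234": {"status": "Pending", "next_hearing": "2026-05-02"},
    "BOMHC-2023-88": {"status": "Pending", "next_hearing": "2026-04-10"},
}
today = {
    "DLHC-2024-1234": {"status": "Pending", "next_hearing": "2026-06-15"},
    "BOMHC-2023-88": {"status": "Disposed", "next_hearing": None},
}

def daily_delta(old: dict, new: dict) -> dict:
    """Return only the matters whose tracked fields changed since yesterday."""
    return {cnr: new[cnr] for cnr in new if new[cnr] != old.get(cnr)}

changes = daily_delta(yesterday, today)
```

Run against a document interface, the same check means logging in and re-reading 500 case pages every morning. Run against a dataset, it is one query.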
Where the friction still sits
The public portal is a document interface, not a dataset interface. If you need to track 200 cases across eight High Courts, you can log into the portal, search each one individually, and note the next date. That is manually tractable for a single matter. It is not tractable for a credit team at a mid-size NBFC that needs to scan for litigation on 8,000 prospective borrowers per month.
The gap between document interface and dataset interface is exactly the gap that private aggregators fill. What we do, in practical terms, is five things.
- Coverage completeness. We crawl every court, every day, and reconcile the result into a single schema.
- Structured entity resolution. We link parties across cases, so the same company or individual shows up once, not as 40 near-matches.
- Full-text search across the corpus. We extract text from orders and judgments and make it searchable by clause, citation, and keyword.
- APIs and webhooks. We serve the data as developer-grade endpoints with SDKs, authentication, and change notifications.
- Enterprise-grade SLAs. We run redundancy, monitoring, and on-call so the data is available when the user’s workflow needs it.
None of this replaces what the courts or the DoJ do. It sits above the foundation and turns the public record into a commercial-grade dataset that product teams can build on.
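Of the five steps, entity resolution is the one that most often surprises people with its difficulty. A heavily simplified sketch of the idea, using string similarity to cluster party-name variants; real pipelines use normalisation plus trained matchers, and the 0.85 threshold here is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def same_party(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two party names as the same entity if their normalised
    forms are similar enough. Threshold is illustrative."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

variants = [
    "ABC Industries Pvt. Ltd.",
    "ABC Industries Private Ltd",
    "XYZ Finance Ltd",
]
canonical = variants[0]
matches = [v for v in variants if same_party(canonical, v)]
```

This is why the same company shows up once in our results rather than as 40 near-matches: variants collapse to one canonical entity before anything else happens.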
The AI layer needs dataset-grade input
Large language models are pattern machines. Feed them a PDF and they can summarise it. Feed them a dataset and they can reason across it, answer questions, identify precedent, flag risk, draft briefs that reflect the current posture of a matter. The difference between a single-document LLM workflow and a dataset-backed LLM workflow is the difference between a research assistant and a research platform.
This is why we built the eCourts MCP. The Model Context Protocol is a clean way for AI agents to query structured data sources in real time. By exposing our dataset through MCP, an AI assistant can ask questions like “how many cases are pending in Delhi High Court where Company X is a respondent” or “pull the last five orders in this matter” and get a clean, structured answer back, not a scraped PDF.
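Under the hood, an MCP tool call is a JSON-RPC request. The sketch below shows the general shape such a request takes; the tool name `search_cases` and its arguments are illustrative assumptions, not the actual eCourts MCP tool surface.

```python
import json

# Shape of an MCP "tools/call" request an AI agent might send.
# Tool name and argument fields are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_cases",
        "arguments": {
            "court": "Delhi High Court",
            "respondent": "Company X",
            "status": "Pending",
        },
    },
}
payload = json.dumps(request)
```

The agent gets structured rows back, not a scraped PDF, which is the entire point: the model reasons over fields, not formatting.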
What this means for eCourtsIndia
Our job is to take the dataset-ification of Indian court data to its logical conclusion. We run continuous coverage across 37 states and union territories. We serve 26.8 crore case records, 29 lakh advocate profiles, and a growing stream of orders and judgments through both a REST API and the eCourts MCP. As Phase III matures and more legacy records move from scanned paper to searchable text, the dataset gets deeper, and the applications that can be built on it get more interesting.
If you are building a product that touches Indian litigation data, start with a dataset-grade source. Explore coverage and API access at eCourtsIndia.com. If you want to plug court data directly into your AI agent stack, our MCP is a 10-minute integration.
Related reading
- eCourts Phase III: What ₹7,210 Crore Will Build
- Mapping India’s Court Data Stack
- Inside eCourts: How India Digitised 29,600 Courts
Sources
- eCourts Services Portal, ecourts.gov.in
- National Judicial Data Grid, njdg.ecourts.gov.in
- Department of Justice annual reports 2022-23 and 2023-24
- eCourtsIndia.com internal coverage dataset, April 2026