Files & Knowledge

ShingleAI lets you upload files — contracts, invoices, product sheets, runbooks — and makes them searchable by your agents. This page covers what happens between “I dropped a PDF in” and “my agent quoted the right paragraph back to me.”

What is a File?

A file is any artefact you’ve uploaded to your organization. Each file:

Lives in R2 (Cloudflare’s object store).
Belongs to an organization, optionally inside a folder.
Has metadata — filename, content type, size, tags, description.
Has an indexing status that tracks where it is in the retrieval pipeline.

Today, the indexing pipeline handles the following formats end-to-end: PDF, PNG, JPEG, and WebP (all via OCR), plus Markdown, which is read inline. Other formats can still be uploaded and shared via public links; they simply aren’t indexed for retrieval.
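As a rough illustration of the file record described above, here is a minimal Python sketch. The class name, field names, and the MIME types used to recognise indexable formats are assumptions made for this example; they are not taken from ShingleAI's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed MIME types for the formats this page lists as indexable.
# The real service may identify formats differently.
INDEXABLE_CONTENT_TYPES = {
    "application/pdf",
    "image/png",
    "image/jpeg",
    "image/webp",
    "text/markdown",
}


@dataclass
class FileRecord:
    """Hypothetical shape of a file's database record."""
    filename: str
    content_type: str
    size_bytes: int
    organization_id: str
    folder_id: Optional[str] = None          # files may sit outside any folder
    tags: list[str] = field(default_factory=list)
    description: str = ""
    indexing_status: str = "pending"         # position in the retrieval pipeline


def is_indexable(record: FileRecord) -> bool:
    """True when the file's format is one the pipeline can index."""
    return record.content_type in INDEXABLE_CONTENT_TYPES
```

Under these assumptions, a PDF would be indexable, while a spreadsheet could still be stored and shared but would never become searchable.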

Folders

Files can be grouped into folders. Folders are useful for two reasons:

Organisation. Same as folders anywhere — group what belongs together.
Agent access control. Each file (and each folder) carries an agentAccessEnabled flag. Set the flag on a folder to “off” and your agents will not see anything in it during retrieval, even if a query would otherwise match.

This matters more than it sounds: it’s how you separate, say, your public sales collateral from internal HR documents when both are in the same organization.
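The way the file-level and folder-level flags combine can be sketched as below. This is an illustrative model only: the `File` and `Folder` classes and the `visible_to_agents` helper are hypothetical names, and the real check runs inside the retrieval query rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folder:
    name: str
    agent_access_enabled: bool = True


@dataclass
class File:
    filename: str
    agent_access_enabled: bool = True
    folder: Optional[Folder] = None


def visible_to_agents(file: File) -> bool:
    """A file is retrievable only if neither it nor its folder is switched off."""
    if not file.agent_access_enabled:
        return False
    if file.folder is not None and not file.folder.agent_access_enabled:
        return False
    return True


# Example: an HR folder that agents must never read from.
hr = Folder("HR", agent_access_enabled=False)
sales_sheet = File("pricing.pdf")
salary_bands = File("salary-bands.pdf", folder=hr)
```

Here `sales_sheet` stays visible, while `salary_bands` is hidden because its folder is switched off, even though the file's own flag is still on.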

The Indexing Pipeline

When a file is uploaded, an asynchronous workflow takes it from raw bytes to retrievable knowledge:

Upload. The file is written to R2 and a database record is created.
Parse. PDFs and images are run through Mistral OCR, which produces structured markdown per page. Markdown files skip OCR and are read inline.
Chunk. The structural markdown chunker splits the document along its natural seams: it walks H1 → H2 → H3 → H4 → H5 → H6 → blank-line paragraphs → sentences → words, in that order, picking the deepest split that keeps chunks within size bounds. Targets are 800 tokens per chunk, with a 1,200-token cap and a ~100-token overlap with the previous chunk so context isn’t lost at boundaries. Code fences are kept intact.
Embed. Each chunk is sent to OpenAI’s text-embedding-3-small (1536-dimensional vectors), in batches.
Index. Chunks land in PostgreSQL with pgvector for vector search and a tsvector column for keyword search.
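To make the chunking step more concrete, here is a simplified sketch of a structural splitter. It is not ShingleAI's chunker: it approximates tokens by whitespace-separated words, does not protect code fences, and the function names are invented for this example. It tries the coarsest separator first (H1) and only descends to finer ones (down to single words) for pieces that are still over the 800-token target, then adds a ~100-token overlap.

```python
import re

TARGET_TOKENS = 800
MAX_TOKENS = 1200
OVERLAP_TOKENS = 100

# Separators from coarsest to finest: H1..H6, paragraphs, sentences, words.
SPLITTERS = [rf"(?m)^(?={'#' * n} )" for n in range(1, 7)] + [
    r"\n\s*\n",         # blank-line paragraphs
    r"(?<=[.!?])\s+",   # sentence boundaries
    r"\s+",             # words
]


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def chunk(text: str, level: int = 0) -> list[str]:
    """Split text into chunks of at most TARGET_TOKENS words."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= TARGET_TOKENS or level >= len(SPLITTERS):
        return [text]
    pieces = [p.strip() for p in re.split(SPLITTERS[level], text) if p.strip()]
    joiner = " " if level >= len(SPLITTERS) - 2 else "\n\n"
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{joiner}{piece}" if current else piece
        if count_tokens(candidate) <= TARGET_TOKENS:
            current = candidate          # keep packing pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= TARGET_TOKENS:
            current = piece
        else:
            # Piece is still too big: retry with the next, finer separator.
            chunks.extend(chunk(piece, level + 1))
    if current:
        chunks.append(current)
    return chunks


def add_overlap(chunks: list[str], overlap: int = OVERLAP_TOKENS) -> list[str]:
    """Prefix each chunk with the last `overlap` words of the previous one."""
    result = []
    for i, body in enumerate(chunks):
        if i == 0:
            result.append(body)
        else:
            tail = " ".join(chunks[i - 1].split()[-overlap:])
            result.append(f"{tail}\n\n{body}")
    return result
```

Because chunks are packed up to the target and the overlap is added afterwards, every chunk in this sketch stays below the 1,200-token cap.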

A file’s indexingStatus walks through pending → indexing → indexed. If a file isn’t indexable (unsupported format, too large, parser failure) the status ends in skipped, excluded, or failed — visible on the file’s detail view so you know it’s not searchable.
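The status values named above can be modelled as a small enumeration. This is a sketch for reasoning about the states, not the service's own type: the enum and helper names are assumptions, and it deliberately does not encode which transitions lead to `skipped`, `excluded`, or `failed`, since this page does not specify them.

```python
from enum import Enum


class IndexingStatus(str, Enum):
    PENDING = "pending"
    INDEXING = "indexing"
    INDEXED = "indexed"
    SKIPPED = "skipped"
    EXCLUDED = "excluded"
    FAILED = "failed"


# Statuses in which the pipeline has stopped working on the file.
SETTLED = {
    IndexingStatus.INDEXED,
    IndexingStatus.SKIPPED,
    IndexingStatus.EXCLUDED,
    IndexingStatus.FAILED,
}


def is_settled(status: IndexingStatus) -> bool:
    """True once the pipeline has finished with the file, successfully or not."""
    return status in SETTLED


def is_searchable(status: IndexingStatus) -> bool:
    """Only fully indexed files can be returned by retrieval."""
    return status is IndexingStatus.INDEXED
```

A client polling for completion would wait until `is_settled` returns true, then check `is_searchable` to decide whether the file can be found by agents.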

How Agents Retrieve Files (RAG)

When an agent needs information from your files, it doesn’t just do a vector search. Pure embedding search is good at recall on fuzzy queries but bad at exact-term matches; pure keyword search is the opposite. ShingleAI uses hybrid retrieval — both, then fused. Two retrievers run in parallel:

EmbeddingRetriever: embeds the query, runs an approximate-nearest-neighbour search in pgvector, and returns the top 50 chunks by cosine similarity.
BM25Retriever: runs a ts_rank_cd keyword search against the chunk text and returns the top 50 chunks. Despite the retriever’s name, ts_rank_cd is PostgreSQL’s cover-density ranking rather than a true BM25 score.

Their two ranked lists are then fused with Reciprocal Rank Fusion (the standard k=60 formula): a chunk’s final score sums 1 / (60 + rank) across both rankings. The top 10 fused chunks go back to the agent.

Two important guarantees, enforced in SQL on every retrieval:

Only chunks from the agent’s own organization are returned.
Files (and folders) with agentAccessEnabled = false, or with indexingStatus ≠ indexed, or that have been soft-deleted, are silently excluded.

Your agent never sees what you’ve told it not to see.
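The Reciprocal Rank Fusion step described above can be sketched in a few lines. This is a minimal illustration rather than ShingleAI's code: the function name is hypothetical, ranks are assumed to start at 1, and ties are broken by chunk id purely to keep the output deterministic.

```python
from collections import defaultdict

RRF_K = 60   # the k constant from the formula above
TOP_N = 10   # how many fused chunks go back to the agent


def reciprocal_rank_fusion(rankings, k=RRF_K, top_n=TOP_N):
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk scores 1 / (k + rank) in every list it appears in;
    chunks missing from a list contribute nothing for that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; chunk id as an (assumed) tie-breaker.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Example with two short rankings standing in for the two retrievers.
embedding_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([embedding_hits, keyword_hits])
```

In this example, chunks "a" and "c" appear in both lists and therefore outrank "b" and "d", each of which was found by only one retriever. That is the practical effect of fusion: agreement between the two retrievers is rewarded over a high position in just one.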

What it Costs

Two consumables on your account meter file usage:

storage — the total bytes you have stored in R2.
text-embedding-3-small — embedding calls made during indexing.

Both count against your tier’s allowance and are charged at the relevant overage rate beyond that. See the Credits page for how the meters and tiers fit together.

Related Topics

Upload Files. How to add files in the web app.
RAG Indexing. Re-index files, exclude folders, troubleshoot.
Agents. How agents use files at runtime.
Credits & Billing. What storage and embeddings cost.
