ShingleAI lets you upload files (contracts, invoices, product sheets, runbooks) and makes them searchable by your agents. This page covers what happens between “I dropped a PDF in” and “my agent quoted the right paragraph back to me.”
What is a File?
A file is any artefact you’ve uploaded to your organization. Each file:
- Lives in R2 (Cloudflare’s object store).
- Belongs to an organization, optionally inside a folder.
- Has metadata — filename, content type, size, tags, description.
- Has an indexing status that tracks where it is in the retrieval pipeline.
Folders
Files can be grouped into folders. Folders are useful for two reasons:
- Organisation. Same as folders anywhere: group what belongs together.
- Agent access control. Each file (and each folder) carries an agentAccessEnabled flag. Set the flag on a folder to “off” and your agents will not see anything in it during retrieval, even if a query would otherwise match.
The Indexing Pipeline
When a file is uploaded, an asynchronous workflow takes it from raw bytes to retrievable knowledge:
- Upload. The file is written to R2 and a database record is created.
- Parse. PDFs and images are run through Mistral OCR, which produces structured markdown per page. Markdown files skip OCR and are read inline.
- Chunk. The structural markdown chunker splits the document along its natural seams: it walks H1 → H2 → H3 → H4 → H5 → H6 → blank-line paragraphs → sentences → words, in that order, picking the deepest split that keeps chunks within size bounds. Targets are 800 tokens per chunk, with a 1,200-token cap and a ~100-token overlap with the previous chunk so context isn’t lost at boundaries. Code fences are kept intact.
- Embed. Each chunk is sent to OpenAI’s text-embedding-3-small (1536-dimensional vectors), in batches.
- Index. Chunks land in PostgreSQL with pgvector for vector search and a tsvector column for keyword search.
indexingStatus walks through pending → indexing → indexed. If a file isn’t indexable (unsupported format, too large, parser failure) the status ends in skipped, excluded, or failed — visible on the file’s detail view so you know it’s not searchable.
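The Chunk step above can be sketched in a few lines of Python. This is a simplified illustration, not ShingleAI’s actual chunker: the function names are hypothetical, whitespace-separated words stand in for real tokens, code-fence protection is omitted, and the strategy shown (split at the coarsest level first, descend only for oversized pieces, then pack up to the target) is one plausible reading of the description.

```python
import re

# Split levels, coarsest first: H1..H6 headings, blank-line paragraphs, sentences, words.
LEVELS = ["(?m)^(?=" + "#" * n + " )" for n in range(1, 7)]
LEVELS += [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def split_to_fit(text: str, limit: int, level: int = 0) -> list[str]:
    """Recursively split text, moving to a finer level only for oversized pieces."""
    if count_tokens(text) <= limit or level >= len(LEVELS):
        return [text.strip()]
    pieces = []
    for part in re.split(LEVELS[level], text):
        if part.strip():
            pieces.extend(split_to_fit(part, limit, level + 1))
    return pieces


def chunk(text: str, target: int = 800, cap: int = 1200, overlap: int = 100) -> list[str]:
    """Pack pieces up to `target` tokens, then prepend `overlap` tokens of context."""
    assert target + overlap <= cap, "overlap must fit under the cap"
    bodies, current = [], []
    for piece in split_to_fit(text, target):
        # Start a new chunk when adding this piece would exceed the target size.
        if current and count_tokens(" ".join(current + [piece])) > target:
            bodies.append(" ".join(current))
            current = []
        current.append(piece)
    if current:
        bodies.append(" ".join(current))

    chunks = []
    for i, body in enumerate(bodies):
        # Carry the tail of the previous chunk forward so context survives the boundary.
        tail = bodies[i - 1].split()[-overlap:] if i > 0 and overlap > 0 else []
        chunks.append(" ".join(tail + [body]))
    return chunks
```

With the defaults, a packed chunk holds at most 800 tokens plus 100 tokens of overlap, which stays under the 1,200-token cap. A real implementation would count tokens with the embedding model’s tokenizer rather than by whitespace.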
How Agents Retrieve Files (RAG)
When an agent needs information from your files, it doesn’t just do a vector search. Pure embedding search is good at recall on fuzzy queries but bad at exact-term matches; pure keyword search is the opposite. ShingleAI uses hybrid retrieval: both, then fused. Two retrievers run in parallel:
- EmbeddingRetriever. Embeds the query, runs an approximate-nearest-neighbour search in pgvector, and returns the top 50 chunks by cosine similarity.
- BM25Retriever. Runs a ts_rank_cd keyword search against the chunk text and returns the top 50 chunks.
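The two result lists are then fused: each chunk is scored by summing 1 / (60 + rank) across both rankings, the formula commonly known as reciprocal rank fusion. The top 10 fused chunks go back to the agent.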
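Summing 1 / (60 + rank) over both rankings is the scoring rule usually called reciprocal rank fusion. The sketch below shows the arithmetic on plain lists of chunk IDs. The function name is hypothetical, and it assumes ranks start at 1 and that a chunk missing from one list simply contributes nothing for that list.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked lists of chunk IDs by summing 1 / (k + rank) per chunk."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Assumption: ranks are 1-based, so the best hit scores 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


embedding_hits = ["a", "b", "c"]  # best match first
keyword_hits = ["c", "a", "d"]
print(rrf_fuse([embedding_hits, keyword_hits]))  # ['a', 'c', 'b', 'd']
```

Chunks that appear in both lists ("a" and "c") outrank chunks that appear in only one, even when the single-list hit is ranked higher in its own list. That is the property that lets the fused result balance fuzzy and exact-term matches.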
Two important guarantees, enforced in SQL on every retrieval:
- Only chunks from the agent’s own organization are returned.
- Files (and folders) with agentAccessEnabled = false, or with indexingStatus ≠ indexed, or that have been soft-deleted, are silently excluded.
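To make these guarantees concrete, here is a minimal sketch of what such a filter could look like. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and every table and column name is an assumption for illustration. The real schema is not documented here, and soft-deleted folders are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE folders (id INTEGER PRIMARY KEY, agent_access_enabled INTEGER NOT NULL);
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    org_id TEXT NOT NULL,
    folder_id INTEGER REFERENCES folders(id),
    agent_access_enabled INTEGER NOT NULL,
    indexing_status TEXT NOT NULL,
    deleted_at TEXT
);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, file_id INTEGER NOT NULL REFERENCES files(id));

INSERT INTO folders VALUES (1, 1), (2, 0);
INSERT INTO files VALUES
    (1, 'org_a', 1,    1, 'indexed', NULL),          -- visible to org_a
    (2, 'org_a', 2,    1, 'indexed', NULL),          -- folder access disabled
    (3, 'org_a', NULL, 0, 'indexed', NULL),          -- file access disabled
    (4, 'org_a', NULL, 1, 'pending', NULL),          -- not indexed yet
    (5, 'org_a', NULL, 1, 'indexed', '2024-01-01'),  -- soft-deleted
    (6, 'org_b', NULL, 1, 'indexed', NULL);          -- another organization
INSERT INTO chunks VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6);
""")

# Hypothetical guard applied to every retrieval, whichever retriever runs.
VISIBLE_CHUNKS = """
SELECT c.id
FROM chunks c
JOIN files f ON f.id = c.file_id
LEFT JOIN folders d ON d.id = f.folder_id
WHERE f.org_id = ?
  AND f.agent_access_enabled = 1
  AND COALESCE(d.agent_access_enabled, 1) = 1  -- files outside any folder pass
  AND f.indexing_status = 'indexed'
  AND f.deleted_at IS NULL
ORDER BY c.id
"""


def visible_chunk_ids(db: sqlite3.Connection, org_id: str) -> list[int]:
    """Return the chunk IDs an agent of `org_id` is allowed to retrieve."""
    return [row[0] for row in db.execute(VISIBLE_CHUNKS, (org_id,))]


print(visible_chunk_ids(conn, "org_a"))  # [1]
print(visible_chunk_ids(conn, "org_b"))  # [6]
```

Putting the conditions in the WHERE clause, rather than filtering after retrieval, means an excluded chunk never enters the candidate list at all and so cannot displace a permitted chunk from the top results.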
What it Costs
Two consumables on your account meter file usage:
- storage: the total bytes you have stored in R2.
- text-embedding-3-small: embedding calls made during indexing.
Related Topics
- Upload Files: How to add files in the web app
- RAG Indexing: Re-index files, exclude folders, troubleshoot
- Agents: How agents use files at runtime
- Credits & Billing: What storage and embeddings cost