Document insights
Every document has a detail page showing how it was ingested, whether it was uploaded as a file, crawled from a URL, or synced from another source. Open it from the documents list by clicking a document's name.
What you'll find
The detail page has four useful surfaces:
About
The document's metadata: file format, size, when it was added, source URL (for crawled webpages and YouTube videos), and the date of the last sync (for Google Drive folders). For URL-backed sources, the link is clickable so you can open the original page in a new tab.
Used in
A list of every chatbot this document is attached to. Useful when you're auditing what content a chatbot can actually answer from, or before deleting a document.
Preview
For supported formats (PDF and most text documents), the original file is rendered inline so you can read the source without leaving the dashboard. For other formats, you can download the file directly.
Chunks
This is the part that matters for answer quality. After upload, documents are split into smaller passages ("chunks") that the chatbot retrieves from when answering questions. The Chunks section shows you each chunk's text, the approximate token count, and the source page (for paginated documents like PDFs).
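To make the idea of chunks concrete, here is a minimal sketch of how a document might be split into passages with an approximate token count. It is an illustration only: DocuChat's actual chunking strategy is not described here, and the names `Chunk`, `approx_token_count`, and `split_into_chunks`, the paragraph-based splitting, and the 4-characters-per-token heuristic are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    approx_tokens: int


def approx_token_count(text: str) -> int:
    # Assumed heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def split_into_chunks(text: str, max_tokens: int = 200) -> list[Chunk]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk is closed as soon as adding the next paragraph would push it
    past max_tokens. A single paragraph larger than max_tokens is kept
    whole as its own oversized chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []

    def flush() -> None:
        joined = "\n\n".join(current)
        chunks.append(Chunk(joined, approx_token_count(joined)))

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and approx_token_count(candidate) > max_tokens:
            flush()
            current = [para]
        else:
            current.append(para)

    if current:
        flush()
    return chunks
```

Under these assumptions, a smaller `max_tokens` yields more, shorter chunks that are easier to match precisely but carry less surrounding context, which is the trade-off to keep in mind when reading the Chunks section.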
Why this helps
If a chatbot is giving wrong or thin answers, the cause is almost always one of:
- The right content was never chunked. For example, the document was scanned but OCR was skipped, or a table-heavy section came out empty.
- The right chunk exists but is buried. Retrieval isn't surfacing it because of phrasing, terminology, or chunk size.
- The chunk is correct but lacks context. The chunk text on its own doesn't carry enough of the surrounding section to be useful.
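The Chunks section lets you tell these apart in seconds. Search the page for the phrase you expected to be retrieved. If you can't find it, it isn't in the index, and the problem is in parsing. If you find it, read the chunk on its own: when it makes sense in isolation, the issue is in retrieval or prompting; when it depends on surrounding text that isn't there, the chunk lacks context.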
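The same check can be sketched in code for cases where you have copied chunk text out of the detail page and want to test several phrases at once. This is a hypothetical helper, not a DocuChat API: the `Chunk` fields mirror what the Chunks section displays, and the `min_tokens` threshold for a "thin" chunk is an arbitrary assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    approx_tokens: int
    page: Optional[int] = None  # only set for paginated sources such as PDFs


def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so line breaks inside a chunk
    # do not hide an otherwise exact match.
    return re.sub(r"\s+", " ", s).strip().lower()


def find_phrase(chunks: list[Chunk], phrase: str) -> list[Chunk]:
    """Return every chunk whose text contains the phrase."""
    needle = normalize(phrase)
    return [c for c in chunks if needle in normalize(c.text)]


def diagnose(chunks: list[Chunk], phrase: str, min_tokens: int = 30) -> str:
    """Map a phrase search onto the three failure causes listed above."""
    hits = find_phrase(chunks, phrase)
    if not hits:
        # The content never made it into a chunk: a parsing problem.
        return "not_indexed"
    if all(c.approx_tokens < min_tokens for c in hits):
        # Found, but only in very short chunks (assumed threshold).
        return "thin_context"
    # Found in a reasonably sized chunk: look at retrieval or prompting.
    return "indexed"
```

A result of `not_indexed` suggests re-uploading or changing the format, `thin_context` suggests the chunk does not carry enough of its section, and `indexed` points away from parsing and toward retrieval or prompting.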
What gets parsed
DocuChat handles a wide range of input formats:
- PDFs: text-based PDFs are parsed natively, scanned (image-only) PDFs are processed with OCR, and multi-column documents preserve their reading order.
- Office documents: DOCX, PPTX, and XLSX are parsed with layout-aware extraction that pulls structured data out of tables.
- Webpages and crawls: rendered to text with the original heading structure preserved.
- YouTube videos: transcripts are pulled when available.
- Plain text formats: TXT, Markdown, CSV, and similar formats are parsed directly.
If a document didn't process correctly, the detail page surfaces the failure reason so you can decide whether to re-upload, change format, or split the file.