Upload information
Supported Information Sources
Cognitive Solutions supports multiple types of information sources, allowing the platform to process and utilize a variety of document formats:
- Text Documents (e.g. PDFs). Commonly used for manuals and formal documents.
- Transcriptions (e.g. .srt, .vtt). Used for video transcriptions.
- Markdown (.md). Lightweight markup language for structured text with formatting elements.
Organizing Documents
Documents can be organized hierarchically, enabling flexible management of the content. It is important to set up a structure that fits the assistant's purpose. For example:
- Create sections for different topics (e.g., "Product Manuals," "Policies") by accessing Information > Topics > New.
- Subsections define smaller subsets of information that may be relevant to specific queries or contexts. To create a subsection, select a parent node when adding a new topic; the topic is then created under that parent. Because this is a tree-like structure, you can nest as many subsections as you need.
Uploading and Processing Documents
The platform supports the automated ingestion of documents and allows for manual adjustments when needed. The general process includes:
- Automatic Conversion to Markdown Format:
When documents are uploaded (e.g., PDFs), they are automatically converted into a Markdown format for better parsing and indexing. This ensures that content is ready for extraction by the assistant.
- Ingestion configuration:
When uploading a document, you can select an existing Ingestion Configuration, which defines a set of predefined parameters for document processing. This lets you reuse previously saved configurations, making the upload process faster and more consistent by avoiding the need to re-enter or recall the optimal settings for each case.
Even after selecting an existing configuration, you can still modify individual fields to adjust specific parameters before completing the upload.
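The reuse-then-override behavior described above can be sketched as a simple dictionary merge. The configuration field names below are illustrative assumptions, not the platform's actual schema:

```python
# Sketch: reuse a saved ingestion configuration, then override single fields.
# Field names are illustrative assumptions, not the platform's real schema.

def build_upload_config(saved_config, **overrides):
    """Start from a saved configuration and apply per-upload overrides."""
    return {**saved_config, **overrides}

saved = {
    "parsing_method": "Manual + OpenAI",
    "chunk_size": 1000,
    "chunk_overlap": 150,
    "chunking_mode": "LangChain",
}

# Reuse the saved settings, but shrink chunks for this one upload.
config = build_upload_config(saved, chunk_size=500, chunk_overlap=75)
```

Fields you do not override keep their saved values, which mirrors how the upload form pre-fills from the selected configuration.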
- Preprocessing:
PDF parsing method: Users can configure how PDFs are processed and transformed into Markdown before chunking. This step determines how the system extracts text, images, and structure from the uploaded document. Depending on the document’s characteristics (e.g., scanned vs. digital, structured vs. free-form), different processing methods may yield better results.
The available options combine language model processing (OpenAI), programmatic extraction (Manual), and Optical Character Recognition (OCR) for scanned documents:
- Manual – Uses a deterministic parser to extract selectable text and embedded images, exporting images to a media folder and generating page-level Markdown. Preserves reading order and is fast/consistent on digital PDFs. Best for: born-digital PDFs with clear text layers and standard layouts.
- OpenAI – Renders each page to an image and asks an LLM to reconstruct the page as Markdown (headings, lists, tables). The processor keeps previous-page context to maintain heading continuity and structure; if a page fails, it is skipped and logged so the rest of the document still succeeds. Best for: complex layouts, mixed content, or PDFs where semantic structure matters more than raw text fidelity.
- Manual + OpenAI – First performs Manual extraction, then feeds the extracted text to the LLM as guidance to refine structure, fix formatting, and align sections. Best for: digital PDFs where you want the stability of deterministic parsing plus AI clean-up.
- OCR – For scanned PDFs (no selectable text), runs optical character recognition to extract text and produce Markdown. Layout fidelity is limited to what OCR can infer. Best for: scans, photos, or faxes.
- OCR + OpenAI – Runs OCR to get the text, then uses the LLM to structure and normalize the output (headings, lists, tables), improving consistency and readability. Best for: scanned PDFs that also need semantic clean-up.
The option Additional instructions for processing applies only when processing PDFs with OpenAI methods (e.g., OpenAI or OCR + OpenAI). This option lets you add short, custom instructions to the prompt that guide how each page’s Markdown is generated before chunking. Examples:
- “Treat bold lines as ### subheadings.”
- “Convert tables into Markdown tables and keep numeric alignment.”
- “Ignore page footers and repeated headers.”
These instructions are useful for refining structure and improving Markdown quality in complex or inconsistent layouts.
Notes and behavior:
- Output is Markdown per page; embedded images are exported alongside the text.
- Pages that can’t be processed by the AI step are reported (not processed pages), and the pipeline continues for the rest.
- Choosing the right method depends on PDF type (digital vs. scanned), layout complexity, and whether you prioritize speed & consistency or semantic structure.
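The per-page flow described above (previous-page context for continuity, skip-and-log on failure) might look like this in outline. The `llm_to_markdown` helper is a hypothetical stand-in for the actual model call:

```python
# Sketch of the OpenAI parsing loop: each page is sent to an LLM with the
# previous page's Markdown as context; pages that fail are logged and
# skipped so the rest of the document still succeeds.
# `llm_to_markdown` is a hypothetical stand-in for the real model call.

def process_pdf(pages, llm_to_markdown, extra_instructions=""):
    markdown_pages = []
    failed_pages = []
    previous_md = ""  # context carried forward for heading continuity
    for page_number, page in enumerate(pages, start=1):
        try:
            md = llm_to_markdown(page, context=previous_md,
                                 instructions=extra_instructions)
            markdown_pages.append(md)
            previous_md = md
        except Exception:
            failed_pages.append(page_number)  # reported; pipeline continues
    return markdown_pages, failed_pages

# Demo with a fake model call: page 2 fails and is skipped.
def fake_llm(page, context="", instructions=""):
    if page == "bad-page":
        raise ValueError("model could not parse page")
    return f"## Page content\n{page}"

ok, failed = process_pdf(["intro", "bad-page", "outro"], fake_llm)
```

The returned `failed` list corresponds to the "not processed pages" report mentioned above.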
- Transcription documents:
Upon upload, each transcription segment is converted into a separate block. The system extracts the associated timestamps from the file and stores them as block metadata, preserving the temporal structure of the original transcript.
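For transcription files, the segment-to-block conversion described above can be sketched with a small .srt parser. The block dictionary shape is an illustrative assumption, not the platform's internal format:

```python
import re

# Sketch: split an .srt transcript into blocks, keeping each segment's
# timestamps as block metadata. The block dict shape is illustrative.
def srt_to_blocks(srt_text):
    blocks = []
    for segment in re.split(r"\n\s*\n", srt_text.strip()):
        lines = segment.splitlines()
        if len(lines) < 3:
            continue  # malformed segment: index, timing, and text required
        start, end = [t.strip() for t in lines[1].split("-->")]
        blocks.append({
            "text": " ".join(lines[2:]),
            "metadata": {"start": start, "end": end},
        })
    return blocks

sample = """1
00:00:01,000 --> 00:00:04,000
Welcome to the onboarding video.

2
00:00:04,500 --> 00:00:08,000
First, open the administration portal."""

blocks = srt_to_blocks(sample)
```

Each block keeps its own start/end timestamps, so the temporal order of the transcript survives chunk-level retrieval.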
- Chunking:
The system divides content into semantic blocks (chunks) based on structure and context. For example, a long paragraph may be broken into smaller chunks based on sentences, sections, or topics. While the platform automatically processes and splits documents, you can manually adjust the chunking configuration if needed. This includes:
- Chunk separators: Define the order of delimiters the system uses to split text naturally. Common examples: \n\n (paragraphs), \n (lines), section headers (#, ##, ###), or even a space (" "). Include separators that match your document format (for instance, “## ” for Markdown section titles).
- Chunk size: the maximum length of each chunk in tokens (OpenAI tokens). Choose a size that aligns with a meaningful unit (e.g., a section or subsection). We recommend 800–1,200 tokens for general semantic search; 300–600 for precise tasks. You can estimate token counts here: https://platform.openai.com/tokenizer
- Chunk overlap: Specifies how many tokens are shared between consecutive chunks to avoid splitting concepts mid-sentence. Useful for maintaining context across boundaries when sections reference each other. We suggest 10–20% of the chunk size.
- Chunking mode: When processing text, you can choose between different chunking methods that define how the document is split into smaller sections (chunks) for embedding or retrieval.
- LangChain – Uses the RecursiveCharacterTextSplitter, which divides text by tokens and separators for efficient size control.
- Agentic Splitter – Uses an LLM to split the text semantically and generate contextual metadata, such as guiding questions, for each chunk.
Choose the method that best fits your use case: LangChain for performance and consistency, or Agentic Splitter for richer semantic context.
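The separator, size, and overlap settings interact roughly as in this simplified splitter. It is a pure-Python approximation in the spirit of LangChain's RecursiveCharacterTextSplitter, counting characters rather than OpenAI tokens:

```python
# Simplified recursive splitter: try separators in order, fall back to a
# hard cut, then add overlap between consecutive chunks. Sizes are counted
# in characters here; the platform counts OpenAI tokens.

def split_text(text, separators=("\n\n", "\n", " "), chunk_size=200, overlap=30):
    def recurse(txt, seps):
        if len(txt) <= chunk_size:
            return [txt]
        if not seps:
            # no separator left: hard cut at chunk_size
            return [txt[i:i + chunk_size] for i in range(0, len(txt), chunk_size)]
        sep, rest = seps[0], seps[1:]
        pieces = [p for p in txt.split(sep) if p]
        chunks, current = [], ""
        for piece in pieces:
            if len(piece) > chunk_size:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.extend(recurse(piece, rest))  # try finer separators
            elif len(current) + len(sep) + len(piece) <= chunk_size:
                current = piece if not current else current + sep + piece
            else:
                if current:
                    chunks.append(current)
                current = piece
        if current:
            chunks.append(current)
        return chunks

    chunks = recurse(text, tuple(separators))
    # prepend the tail of the previous chunk to preserve cross-boundary context
    return [chunks[0]] + [chunks[i - 1][-overlap:] + chunks[i]
                          for i in range(1, len(chunks))]

# Demo: five paragraphs separated by blank lines, each ~160 characters.
doc = "\n\n".join(f"Paragraph {i}. " + "word " * 30 for i in range(5))
chunks = split_text(doc, chunk_size=200, overlap=30)
```

Note how the 30-character overlap makes each chunk (after the first) start with the tail of its predecessor, which is what the overlap setting above achieves at token level.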
Loading and Managing Content
- Upload Process:
- Documents can be uploaded through the administration portal.
- During upload, documents are automatically processed and converted into a structured format for easy retrieval.
- Real-time Updates:
- New documents or updates to existing ones can be ingested without interrupting the assistant’s operation.
- The system provides real-time document ingestion, ensuring that the assistant always has access to the most current information.
- Preview and Validation:
Before finalizing the upload, you can preview the documents to ensure proper chunking and formatting. The platform offers a preview feature where you can verify the block structure and make necessary modifications.
Bulk upload
The Bulk Upload feature allows you to upload multiple documents at once using the same configuration applied to all of them. This process works the same way as a single document upload, but the ingestion settings (such as parsing method, chunking mode, and other options) will be uniformly applied across every file in the batch.
To attach metadata to each document, upload an Excel or JSON file specifying the metadata keys and values. You can download a template to ensure the correct format — each row corresponds to a document, and each column represents a metadata key-value pair.
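A metadata file for a bulk upload can be validated before submitting, as sketched below. The JSON shape (one object per document, keyed by filename) is an illustrative assumption; use the downloadable template for the real format:

```python
import json

# Sketch: validate a bulk-upload metadata file before uploading. The JSON
# shape (one object per document, keyed by filename) is an assumption for
# illustration; the platform's template defines the real format.
def load_bulk_metadata(json_text, filenames):
    metadata = json.loads(json_text)
    missing = [name for name in filenames if name not in metadata]
    if missing:
        raise ValueError(f"No metadata entry for: {missing}")
    return metadata

metadata_file = """{
  "manual.pdf": {"department": "support", "version": "2.1"},
  "policy.pdf": {"department": "legal", "version": "1.0"}
}"""

meta = load_bulk_metadata(metadata_file, ["manual.pdf", "policy.pdf"])
```

Checking that every file in the batch has a metadata entry up front avoids partially annotated uploads.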
Troubleshooting Content Uploads
If the assistant is not providing accurate responses due to content issues, check for the following common problems:
- Missing Information: Ensure the relevant data is included in the uploaded documents.
- Chunking Errors: Adjust the chunking size or reconfigure separators.
- Content Formatting: If the document is not correctly interpreted, adjust the preprocessing steps or manually adjust chunks.
See more about troubleshooting in the troubleshooting section.