Data

BotDojo keeps your project data close to the flows and agents that use it. Every project owns a private file structure where documents, uploads, generated assets, and session artifacts live. That structure is the foundation for building semantic indexes that Retrieval Augmented Generation (RAG) nodes can query at run time.

Use this section to understand how documents, indexes, and metadata work together. Then dive deeper into the companion guides on creating an index, managing an index, chunking strategies, metadata, integrations, web scraping, and the Data API.

How BotDojo Organizes Data

Each project maintains a document store organized as a folder tree. Documents can arrive from manual uploads, flow outputs, or automated Data Loaders. Once a document is saved, BotDojo tracks its version, metadata, and any indexes that reference it.

  • Document folders mirror the logical structure you choose (for example, /knowledge-base/policies/2024/).
  • Documents keep the original file bytes, typed metadata, and a generated path-based identifier.
  • Indexes bundle one or more folders (plus optional filters), creating a semantic search surface that your flows can query.
  • Index documents and parts record exactly how a document was chunked and embedded so results can be filtered or refreshed without reprocessing everything.
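
To make the relationships in this list concrete, here is a minimal sketch of the data model in Python. The class and field names (Document, IndexPart, Index, covers) are illustrative assumptions, not BotDojo's actual schema; the sketch only shows how a path-based identifier lets an index decide which documents fall under its folders.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a stored document."""
    path: str                      # path-based identifier
    content: bytes                 # original file bytes
    meta: dict = field(default_factory=dict)
    version: int = 1


@dataclass
class IndexPart:
    """Hypothetical stand-in for one chunk of an indexed document."""
    document_path: str
    sequence: int
    text: str
    meta: dict = field(default_factory=dict)


@dataclass
class Index:
    """Hypothetical stand-in for an index that bundles one or more folders."""
    name: str
    folders: list[str]
    parts: list[IndexPart] = field(default_factory=list)

    def covers(self, doc: Document) -> bool:
        # A document belongs to the index if its path sits under any indexed folder.
        return any(doc.path.startswith(folder) for folder in self.folders)


doc = Document(path="/knowledge-base/policies/2024/leave.md", content=b"Leave policy text")
other = Document(path="/uploads/notes.txt", content=b"Scratch notes")
index = Index(name="policies", folders=["/knowledge-base/policies/"])
```

In this sketch, index.covers(doc) is true because the document path starts with an indexed folder, while index.covers(other) is false.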

Data Loaders and Synchronization

Data Loaders connect BotDojo to external systems—think knowledge bases, ticketing tools, or bespoke APIs. Each run pulls records from the source, transforms them into BotDojo documents, and drops them into the document store.

Key behaviors:

  • Loaders keep source identifiers, timestamps, and any custom fields in document metadata so repeated runs can detect changes.
  • You decide where synchronized files land by assigning a destination folder when configuring the loader.
  • Indexes that target a loader-managed folder automatically pick up new or updated documents during the next indexing pass.
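
The change detection described above can be sketched as a comparison between two loader runs. This is an illustration under assumptions, not BotDojo's implementation: plan_sync is a hypothetical helper, each run is modeled as a mapping from source identifier to last-modified timestamp, and the identifiers and timestamps are made up.

```python
def plan_sync(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two runs keyed by source identifier.

    Values are last-modified timestamps (ISO 8601 strings); a differing
    timestamp marks the record as updated.
    """
    added = sorted(k for k in current if k not in previous)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    removed = sorted(k for k in previous if k not in current)
    return {"added": added, "updated": updated, "removed": removed}


# Hypothetical records from two consecutive loader runs.
previous_run = {"kb-1": "2024-01-01T00:00:00Z", "kb-2": "2024-01-05T00:00:00Z"}
current_run = {
    "kb-1": "2024-01-01T00:00:00Z",
    "kb-2": "2024-02-01T00:00:00Z",
    "kb-3": "2024-02-02T00:00:00Z",
}
plan = plan_sync(previous_run, current_run)
```

Here kb-3 is reported as added and kb-2 as updated, while kb-1 is left alone. Only the added and updated records would need to be rewritten in the destination folder and picked up by the next indexing pass.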

Built-in connectors cover common systems out of the box:

  • Google Drive, OneDrive, and Box mirror folders (including shared drives) and retain sharing links in document metadata.
  • Zendesk Help Center converts articles—per locale—into Markdown so support flows surface the latest knowledge-base answers.
  • Salesforce executes SOQL queries to snapshot CRM data and optionally prune records that disappear between runs.
  • Playwright Web Scraper renders dynamic sites in a headless browser, captures Markdown plus binary assets, and supports advanced crawl rules.

You can mix these connectors with manual uploads or API-driven documents in the same project folder.

Explore connectors in Integrations, the Web Scraping guide, and the Data API reference for end-to-end setup details.

Documents and Their Metadata

When a document is added to the store, BotDojo captures both system metadata and anything supplied by the loader or uploader:

  • System fields include filename, logical path, folder, MIME type, file size, and created/modified timestamps.
  • Loader fields often add source URLs, record identifiers, tags, or custom attributes relevant to the downstream workflow.

To power reliable citations, include a meta.reference_url for the canonical link and either meta.reference_title or meta.title for the display name. Retrieval nodes pass those fields through to the Agent node so chat transcripts and the Show Sources panel can present clickable, well-labeled references.
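
As a rough illustration of how those fields could be resolved into a citation, the sketch below reads reference_url, reference_title, and title from a metadata dictionary. The citation_for helper, the precedence of reference_title over title, the fallback to the URL when no title is present, and the example URL are all assumptions made for the example; BotDojo's own resolution logic may differ.

```python
from typing import Optional


def citation_for(meta: dict) -> Optional[dict]:
    """Build a display citation from document metadata, or None if no link exists."""
    url = meta.get("reference_url")
    if not url:
        return None
    # Assumed precedence: reference_title, then title, then the URL itself.
    title = meta.get("reference_title") or meta.get("title") or url
    return {"title": title, "url": url}


with_title = citation_for({
    "reference_url": "https://example.com/policies/leave",
    "title": "Leave Policy",
})
without_url = citation_for({"title": "Untitled draft"})
```

In this sketch a document without reference_url yields no citation at all, which is why supplying the canonical link matters for clickable references.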

All metadata is stored alongside the document as a meta JSON object and copied into every index part created from it. That means you can later filter search results by attributes such as department, language, or record_type without reprocessing the source file. For hands-on tips about editing metadata or using Markdown front matter, see MetaData and Markdown.

Indexes, Parts, and Embeddings

Creating an index turns the documents in a folder (optionally narrowed by a filter) into a searchable semantic surface:

  1. BotDojo selects the documents that match the index folder and filter rules.
  2. Each document is chunked into ordered index parts using the configured text splitter or custom chunker flow. A part inherits document metadata and adds part-specific details like sequence number and reference URL.
  3. The embedding model assigned to the index converts each part into a high-dimensional vector and stores it in the index’s vector store.
  4. The index keeps track of which document versions are live so it can re-embed only the pieces that changed.
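
The chunking and version-tracking steps above can be sketched in a few lines of Python. Everything here is a simplified assumption: split_text is a fixed-size splitter standing in for the configured text splitter, documents and the index are plain dictionaries, and the embedding step is omitted so the sketch stays focused on metadata inheritance and incremental updates.

```python
def split_text(text: str, size: int) -> list[str]:
    """Naive fixed-size splitter (a stand-in for a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_parts(doc: dict, size: int = 40) -> list[dict]:
    """Chunk a document into ordered parts that inherit its metadata."""
    parts = []
    for seq, chunk in enumerate(split_text(doc["text"], size)):
        part_meta = dict(doc["meta"])   # copy the document metadata
        part_meta["sequence"] = seq     # add a part-specific detail
        parts.append({"path": doc["path"], "text": chunk, "meta": part_meta})
    return parts


def update_index(index: dict, docs: list[dict]) -> list[str]:
    """Rebuild parts only for documents whose version changed.

    Returns the paths that were reprocessed.
    """
    reprocessed = []
    for doc in docs:
        if index["versions"].get(doc["path"]) == doc["version"]:
            continue  # this version is already live in the index
        index["parts"] = [p for p in index["parts"] if p["path"] != doc["path"]]
        index["parts"].extend(build_parts(doc))
        index["versions"][doc["path"]] = doc["version"]
        reprocessed.append(doc["path"])
    return reprocessed


index = {"versions": {}, "parts": []}
docs = [
    {"path": "/kb/a.md", "version": 1, "text": "A" * 100, "meta": {"department": "HR"}},
    {"path": "/kb/b.md", "version": 1, "text": "B" * 30, "meta": {"department": "IT"}},
]
first_pass = update_index(index, docs)    # both documents are new
docs[1]["version"] = 2                    # simulate an edit to b.md
second_pass = update_index(index, docs)   # only b.md is reprocessed
```

The second pass touches only /kb/b.md, which mirrors the idea of re-embedding only the pieces that changed.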

You can create multiple indexes over the same folder structure to experiment with different embedding models, chunking strategies, or metadata filters. The Creating a Vector Index guide walks through that workflow step by step.

Custom Chunking Flows

Selecting Custom chunking routes each document through a flow designed with the chunker schemas. The chunker flow receives:

  • The document text and baseline chunks created by the default splitter.
  • Document metadata (including loader fields and front matter).
  • The index and storage context needed to fetch related assets.

The flow can then:

  • Merge, reorder, or rewrite chunks before indexing.
  • Enrich chunk-level metadata (for example, adding meta.section_title or tagging content with regulatory categories).
  • Call other actions—summaries, classifiers, or translators—so the final chunks better match your retrieval needs.

When the flow returns, BotDojo saves the emitted chunks as index parts and embeds them using the model configured on the index. Custom chunking is ideal when you need domain-specific structure, multilingual segmentation, or context-aware summaries. Learn more in Chunk Strategies, which covers built-in options and the catalog of reusable chunker flows.
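
To give a feel for what a custom chunker might do, here is a small Python sketch that merges short baseline chunks into their successors and tags each result with section_title. It is not a BotDojo chunker flow: the custom_chunker function, its inputs, the min_chars threshold, and the Markdown-heading heuristic are assumptions chosen for illustration.

```python
def custom_chunker(baseline_chunks: list[str], doc_meta: dict, min_chars: int = 30) -> list[dict]:
    """Merge undersized chunks forward and attach a section title to each result."""
    merged: list[str] = []
    for chunk in baseline_chunks:
        if merged and len(merged[-1]) < min_chars:
            # The previous chunk is too short to stand alone, so glue this one onto it.
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)

    results = []
    section = None
    for text in merged:
        first_line = text.splitlines()[0] if text else ""
        if first_line.startswith("#"):
            # Treat a leading Markdown heading as the current section title.
            section = first_line.lstrip("#").strip()
        meta = dict(doc_meta)  # start from the document metadata
        if section:
            meta["section_title"] = section
        results.append({"text": text, "meta": meta})
    return results


baseline = [
    "# Refunds",
    "Refunds are issued within 14 days of a return request.",
    "# Shipping",
    "Orders ship in two business days.",
]
chunks = custom_chunker(baseline, {"language": "en"})
```

With this input the four baseline chunks collapse into two, one per heading, and each carries both the inherited language field and its own section_title.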

How Embeddings Work (at a glance)

An embedding model turns text into a list of numbers (a vector). Vectors that encode similar ideas lie close together in the vector space. When a flow queries an index, BotDojo embeds the query text, searches for the nearest vectors, and returns the matching document parts with their metadata and relevance scores. Those parts are then injected into the flow’s prompt as retrieval context.
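
A toy nearest-vector search shows the idea. The sketch uses cosine similarity, one common closeness measure; this page does not say which metric BotDojo uses. The three-dimensional vectors are hand-written stand-ins for real embeddings, which typically have hundreds or thousands of dimensions, and the cosine and nearest helpers are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec: list[float], parts: list[dict], k: int = 2) -> list[tuple[float, dict]]:
    """Return the k parts most similar to the query, best first, with scores."""
    scored = [(cosine(query_vec, part["vector"]), part) for part in parts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written toy vectors standing in for embedded index parts.
parts = [
    {"text": "Refund policy", "vector": [0.9, 0.1, 0.0]},
    {"text": "Shipping times", "vector": [0.1, 0.9, 0.0]},
    {"text": "Office locations", "vector": [0.0, 0.1, 0.9]},
]
query_vector = [0.8, 0.2, 0.0]  # pretend this is the embedded query text
results = nearest(query_vector, parts, k=2)
```

The query vector points in nearly the same direction as the "Refund policy" part, so that part ranks first, followed by "Shipping times".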

Managing Indexes

After an index is created, the Index view lets you:

  • Update Index to (re)embed new or changed documents.
  • Download Index to export embeddings.
  • Test Index with sample queries and inspect individual results.
  • Edit Index metadata, tags, or auto-update settings.
  • Browse indexed files to see how each document was chunked, which is especially useful when validating custom chunkers.

See Managing Your Index for screenshots and UI tips.

Metadata for Filtering and Governance

  • Document metadata persists from source through indexing, letting you filter indexes at build time (for example, only include documents where meta.region = "EMEA").
  • Index metadata (the index record’s meta, tags, and folder settings) captures how the collection should be treated—such as labeling the dataset, noting data owners, or pinning retention policies.
  • Index part metadata is searchable at query time: flow nodes can provide a metadata filter string so only parts that match fields like meta.product = "Atlas" are returned.
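
As a sketch of query-time filtering, the Python below handles only the single equality form shown in these examples (meta.field = "value"). BotDojo's real filter syntax is not documented on this page and may support more operators; parse_filter and filter_parts are hypothetical helpers, and the sample parts are made up.

```python
import re

# Matches exactly one equality clause such as: meta.product = "Atlas"
_FILTER = re.compile(r'^\s*meta\.(\w+)\s*=\s*"([^"]*)"\s*$')


def parse_filter(expr: str) -> tuple[str, str]:
    """Split an equality filter into (field, value); reject anything else."""
    match = _FILTER.match(expr)
    if match is None:
        raise ValueError(f"unsupported filter: {expr!r}")
    return match.group(1), match.group(2)


def filter_parts(parts: list[dict], expr: str) -> list[dict]:
    """Keep only parts whose metadata field equals the requested value."""
    field_name, value = parse_filter(expr)
    return [p for p in parts if p["meta"].get(field_name) == value]


indexed_parts = [
    {"text": "Atlas setup guide", "meta": {"product": "Atlas", "region": "EMEA"}},
    {"text": "Orion release notes", "meta": {"product": "Orion", "region": "EMEA"}},
    {"text": "Atlas pricing", "meta": {"product": "Atlas", "region": "APAC"}},
]
atlas_only = filter_parts(indexed_parts, 'meta.product = "Atlas"')
```

Because metadata is copied into every part, the same check works whether the filter is applied while building an index or while answering a query.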

This layered metadata model makes it easy to route the right content to the right agents while keeping audit trails intact.

Session Files for Flows

Every flow session, including ad-hoc chat sessions, receives its own writable folder under the project’s document store. Nodes can drop generated files there, and end users can upload supporting assets mid-conversation. Those session files behave like any other document—you can index them, share links, or reuse them in later steps.

Putting It Together

  1. Organize data in project folders or let a Data Loader keep them updated.
  2. Create indexes that target those folders and choose the embedding model and chunking strategy (simple or custom) that match your use case.
  3. Manage indexes to update embeddings, test quality, and refine metadata filters as your domain evolves.
  4. Query the index from flows or agents, optionally filtering by metadata, to ground responses with authoritative documents.

With this pipeline in place, BotDojo handles the heavy lifting—synchronizing data, tracking document history, maintaining embeddings, and serving fast semantic lookups for your flows.