Datasets

Datasets give your project a lightweight relational table that lives alongside flows, documents, and evaluations. Each dataset stores rows of JSON objects validated against a schema, so you can define column names, enforce types, and reuse the data anywhere a flow needs structured inputs.

Why use a dataset?

  • Run consistent batches of agent executions by supplying question/answer or scenario parameters as rows.
  • Track regression suites for flows and compare outcomes across evaluations to see how changes impact quality.
  • Capture reusable tabular references (inventory lists, configuration tables, routing rules) that agents can query or update during a conversation.
  • Snapshot document folders into a searchable table so metadata stays accessible without re-indexing files.

Creating and loading data

Open Data → Datasets → Create Dataset to launch the wizard. You can populate rows in three ways:

Upload structured files

  • Accepts CSV, JSON, or XLSX uploads. For Excel workbooks you choose the worksheet before import.
  • The first data rows are profiled to infer column names, detect types (string, number, boolean, JSON), and build the dataset schema automatically.
  • During import BotDojo converts values into typed JSON so filters, batch mappings, and evaluations can rely on consistent types.
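The type-detection step can be pictured with a small sketch. This is illustrative only: BotDojo's real profiler runs server-side during import, and this toy version only mirrors the string/number/boolean/JSON categories described above.

```python
import json

def infer_type(value: str) -> str:
    """Guess a column type from one raw CSV string value (illustrative)."""
    v = value.strip()
    if v.lower() in ("true", "false"):
        return "boolean"
    try:
        float(v)
        return "number"
    except ValueError:
        pass
    try:
        parsed = json.loads(v)
        if isinstance(parsed, (dict, list)):
            return "json"
    except json.JSONDecodeError:
        pass
    return "string"

def profile_columns(header, sample_rows):
    """Profile the first rows to build {column: type}.

    Falls back to 'string' when sampled values disagree, so mixed
    columns still import cleanly.
    """
    types = {}
    for i, name in enumerate(header):
        seen = {infer_type(row[i]) for row in sample_rows}
        types[name] = seen.pop() if len(seen) == 1 else "string"
    return types
```

For example, a header of `["prompt", "score", "flag"]` with sampled values like `"1.5"` and `"true"` would profile to string, number, and boolean columns respectively.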

Build from a document folder

  • Point the wizard at any project folder to capture file metadata into a dataset.
  • Each row includes file ID, name, logical path, MIME type, size, and timestamps—ideal for auditing document sets or joining back to evaluation results.
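One captured row might look like the sketch below. The key names here are assumptions chosen to match the fields listed above; the exact keys BotDojo emits may differ.

```python
# Hypothetical shape of one metadata row captured from a document
# folder. Key names and the ID format are illustrative, not the
# actual BotDojo output.
row = {
    "file_id": "doc_123",
    "name": "pricing.pdf",
    "path": "/contracts/2024/pricing.pdf",
    "mime_type": "application/pdf",
    "size_bytes": 48213,
    "created_at": "2024-05-01T12:00:00Z",
    "updated_at": "2024-05-02T08:30:00Z",
}

def index_by_file_id(rows):
    """Index rows by file ID, e.g. to join metadata back to
    evaluation results keyed on the same ID."""
    return {r["file_id"]: r for r in rows}
```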

Programmatic creation

  • Flows and agents can create datasets at runtime through the Dataset MCP node and Dataset Item MCP node. Use these tools to seed tables from code, append new test cases, or persist state gathered during an interaction.
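Conceptually, an agent drives these nodes through Model Context Protocol tool calls. The payloads below are a rough sketch of that shape; the tool names and argument fields are hypothetical, not the actual Dataset MCP interface.

```python
# Hypothetical MCP tool-call payloads for seeding a table and
# appending a row. Tool names ("dataset_create", "dataset_item_insert")
# and argument keys are illustrative placeholders.
create_dataset_call = {
    "tool": "dataset_create",
    "arguments": {
        "name": "regression-cases",
        "schema": {
            "type": "object",
            "properties": {
                "prompt": {"type": "string"},
                "expected_answer": {"type": "string"},
            },
        },
    },
}

append_row_call = {
    "tool": "dataset_item_insert",
    "arguments": {
        "dataset": "regression-cases",
        "row": {
            "prompt": "What is the refund window?",
            "expected_answer": "30 days",
        },
    },
}
```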

Dataset imports run as background jobs. While a job is in progress the dataset appears with a pending status; it switches to completed once every row is processed.

Columns, schema, and editing

  • Each dataset carries a JSON Schema that defines its columns. The schema drives table rendering, validation, and type-aware filtering.
  • Use Edit Dataset to rename the table, add a description, or modify the schema with the inline schema editor. You can introduce new columns, adjust types, or remove unused fields—handy when refining regression inputs.
  • Archiving a dataset removes it from active lists without deleting historical batch or evaluation records that reference it.
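To make the schema-driven validation concrete, here is a minimal sketch of checking a row against a JSON-Schema-like column definition. BotDojo's actual validator supports full JSON Schema; this version checks only required columns and top-level types.

```python
# Minimal row validator, assuming a JSON-Schema-style dict with
# "required" and "properties". Illustrative, not the real validator.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool,
            "object": dict, "array": list}

def validate_row(row, schema):
    """Return a list of violation messages (empty list means valid)."""
    errors = []
    props = schema.get("properties", {})
    for col in schema.get("required", []):
        if col not in row:
            errors.append(f"missing required column: {col}")
    for col, value in row.items():
        expected = props.get(col, {}).get("type")
        if expected is None:
            continue
        # bool is a subclass of int in Python, so guard "number" explicitly
        if expected == "number" and isinstance(value, bool):
            errors.append(f"{col}: expected number")
        elif not isinstance(value, TYPE_MAP[expected]):
            errors.append(f"{col}: expected {expected}")
    return errors
```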

Working with rows

  • The dataset view renders a full-featured grid where you can sort, filter, and page through rows using the detected column types.
  • Click SQL Query to run read-only SELECT statements against a temporary view of the dataset. The server rewrites queries to expose each column safely and caps results at 1,000 rows, making it easy to inspect joins and apply ad-hoc filters before launching a batch.
  • Download the table (CSV, JSON, or XLSX) for offline analysis, or archive it when the dataset is no longer needed.
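The SQL Query behavior can be sketched with SQLite: expose the rows as a table, allow only SELECT statements, and cap results at 1,000 rows. The server's actual query rewriting is not reproduced here, and the two-column table is an assumed example.

```python
import sqlite3

ROW_CAP = 1000  # matches the documented 1,000-row result cap

def query_dataset(rows, sql):
    """Run a read-only SELECT over rows of (prompt, score) tuples.

    Sketch only: the real feature rewrites queries server-side against
    a temporary view of the dataset.
    """
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("read-only: SELECT statements only")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset (prompt TEXT, score REAL)")
    conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)
    return conn.execute(sql).fetchmany(ROW_CAP)
```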

Feeding batches and evaluations

  1. Create a dataset with the fields your flow expects—for example, prompt, expected_answer, customer_tier, or any guard-rail signals.
  2. When configuring a batch, map dataset columns to the flow’s input schema. Optional columns can hold ground-truth answers or scoring weights.
  3. Run the batch to execute the flow against every row. Results land in evaluations where you can compare variants, slice by dataset filters, and spot regressions release-over-release.
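The column-to-input mapping in step 2 amounts to renaming dataset columns into the flow's input names. The mapping UI handles this in BotDojo; this sketch just shows the shape of the operation, with hypothetical column and input names.

```python
def map_rows_to_inputs(rows, mapping):
    """Project dataset rows onto a flow's input schema.

    mapping: {flow_input_name: dataset_column_name}. Columns not in
    the mapping (e.g. ground-truth answers kept for scoring) are
    simply left out of the flow inputs.
    """
    return [{inp: row[col] for inp, col in mapping.items()} for row in rows]
```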

Because datasets stay versioned inside the project, you can rerun the same table after tweaks to prompts, tools, or model settings, ensuring changes are measured against consistent fixtures.

Using datasets inside agents

Attach the Dataset MCP nodes to an Agent when you want the model to treat a dataset as dynamic memory. Agents can:

  • Search tables for matching rows before answering.
  • Insert or update rows as conversations progress (for example, logging hand-offs or accumulating Q&A pairs).
  • Combine dataset lookups with other Model Context Protocol tools to ground responses in curated, structured data.
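A lookup like the first bullet might reduce to a simple column search. This is an illustrative stand-in; the real Dataset MCP search runs server-side with its own matching semantics.

```python
def search_rows(rows, column, needle):
    """Case-insensitive substring match over one column (sketch of a
    'find matching rows before answering' lookup)."""
    return [r for r in rows
            if needle.lower() in str(r.get(column, "")).lower()]
```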

This makes datasets a flexible bridge between unstructured document retrieval and deterministic business rules.

Lifecycle tips

  • Give tables descriptive names and tags so batches and evaluations reference the correct dataset explicitly.
  • Keep raw uploads in version control or cloud storage; you can re-import them after schema changes or clone datasets across projects.
  • Archive obsolete datasets to reduce clutter—existing batch histories retain their links even after archival.

Datasets round out the Data section by offering dependable, typed inputs that complement document indexes. Use them whenever your flows benefit from predictable columns, reproducible tests, or agent-accessible tables.