Meta Data and Markdown
In addition to extracting and encoding text in a Vector Index, BotDojo stores metadata on each Document and on each Document Part (the chunks produced when a Document is split by the Index chunking strategy).
This metadata can be used to pre-filter results when querying a Vector Index, or in flows to generate URLs for citations, among other uses.
Meta Data
Here is some of the metadata that is stored by default:
- path: The full path of the document, e.g., "/mydata/Ford Model T.txt"
- folder: The folder path of the document, e.g., "/mydata/"
- filename: The filename, e.g., "Ford Model T.txt"
- size_bytes: Size of the document in bytes, e.g., 37397
- content_type: MIME media type, e.g., "text/plain"
- created_date: Date the file was created, e.g., "2024-04-14T21:41:30.710Z"
- modified_date: Date the file was modified, e.g., "2024-04-14T21:41:30.710Z"
- reference_url: Link to the source content, e.g., "https://docs.botdojo.com/docs/learn/concepts/vector/markdown-meta"
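As a rough illustration of how these fields might be used in a flow, the sketch below builds a citation link from a metadata record. The record is a plain Python dict filled with the example values listed above. The `format_citation` helper and its fallback to `path` are assumptions made for illustration, not part of BotDojo's API.

```python
from typing import Any, Dict

# Hypothetical metadata record, using the example values from the list above.
doc_meta: Dict[str, Any] = {
    "path": "/mydata/Ford Model T.txt",
    "folder": "/mydata/",
    "filename": "Ford Model T.txt",
    "size_bytes": 37397,
    "content_type": "text/plain",
    "created_date": "2024-04-14T21:41:30.710Z",
    "modified_date": "2024-04-14T21:41:30.710Z",
    "reference_url": "https://docs.botdojo.com/docs/learn/concepts/vector/markdown-meta",
}


def format_citation(meta: Dict[str, Any]) -> str:
    """Build a Markdown-style citation link from a metadata record.

    Illustrative helper (not a BotDojo function): link to reference_url
    when it is present, otherwise fall back to the document path.
    """
    label = meta.get("filename") or meta.get("path", "unknown source")
    target = meta.get("reference_url") or meta.get("path", "")
    return f"[{label}]({target})"


print(format_citation(doc_meta))
```

Falling back to `path` keeps the citation usable for documents that have no `reference_url`, although a real flow might prefer to omit the link in that case.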
You can extend the metadata by:
- Selecting the Document under Data/Documents and clicking Edit MetaData
- Updating via the API
- Uploading Markdown files that include Front Matter
Markdown and Front Matter
Markdown files are an excellent choice for working with Large Language Models (LLMs) due to their simplicity, readability, and versatility.
BotDojo supports Front Matter in Markdown files: a block of YAML or JSON at the beginning of the file, separated from the main content by triple-dashed lines (---). BotDojo extracts structured metadata from this block.
Here's an example of a Markdown file with Front Matter:
---
title: "Introduction to Vector Indexes"
author: "John Doe"
date: "2024-04-15"
tags: ["vector indexes", "embeddings", "semantic search"]
---
# Introduction to Vector Indexes
Vector indexes are a powerful tool for enabling semantic search and retrieval of information using embeddings. By representing text as high-dimensional vectors, vector indexes allow you to find similar documents based on their semantic meaning rather than exact keyword matches.
...
In this example, the Front Matter block contains metadata such as the title, author, date, and tags associated with the document. BotDojo automatically extracts this metadata when indexing the Markdown files, making it readily available for filtering, querying, and displaying additional context alongside the search results.
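To make the extraction step concrete, here is a minimal sketch of how a Front Matter block can be separated from the body of a Markdown file. It is not BotDojo's parser. The `split_front_matter` function is a hypothetical helper that uses only the Python standard library and handles only flat `key: value` lines whose values are valid JSON (quoted strings, numbers, arrays), which happens to cover the example above. General YAML would need a proper YAML parser.

```python
import json
from typing import Any, Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, Any], str]:
    """Split a Markdown document into (metadata, body).

    Simplified sketch: only flat `key: value` lines with JSON-compatible
    values are supported. This is an assumption for illustration, not
    how BotDojo parses Front Matter.
    """
    lines = text.splitlines()
    # No opening delimiter: the whole file is body.
    if not lines or lines[0].strip() != "---":
        return {}, text

    # Locate the closing delimiter after the first line.
    end = None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            end = i
            break
    if end is None:
        return {}, text  # Unterminated block: treat everything as content.

    meta: Dict[str, Any] = {}
    for line in lines[1:end]:
        if ":" not in line:
            continue  # Skip blank or malformed lines.
        key, _, raw = line.partition(":")
        meta[key.strip()] = json.loads(raw.strip())

    body = "\n".join(lines[end + 1:])
    return meta, body


sample = """---
title: "Introduction to Vector Indexes"
author: "John Doe"
date: "2024-04-15"
tags: ["vector indexes", "embeddings", "semantic search"]
---
# Introduction to Vector Indexes
"""

meta, body = split_front_matter(sample)
print(meta["tags"])
print(body)
```

The metadata dict and the body come back separately, so the body can be chunked and embedded while the metadata is attached to each resulting part.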
The benefits of using Markdown files with Front Matter in BotDojo include:
- Simplified Data Management: By keeping the content and metadata together in a single file, you can easily update and maintain your documents without the need for separate metadata databases or complex synchronization processes.
- Enhanced Search and Filtering: The extracted metadata can be used to pre-filter search results based on specific criteria, such as tags, authors, or date ranges. This allows for more targeted and relevant search experiences within your AI applications.
- Improved Context and Citations: The metadata can be used to generate URLs for citations, provide additional context about the document's origin, and display relevant information alongside the search results, enhancing the user experience and credibility of the retrieved information.
- Seamless Integration with LLMs: Markdown's simple and readable format makes it easy for LLMs to process and understand the content structure. The plain-text nature of Markdown files allows for efficient indexing and embedding generation, enabling powerful semantic search capabilities.
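The pre-filtering by tags, authors, or date ranges mentioned in the list above can be sketched as a plain in-memory filter. This is only an illustration of the idea: the `prefilter` function, its parameters, and the second sample document are hypothetical, and BotDojo's actual query interface may look quite different.

```python
from datetime import date
from typing import Any, Dict, List, Optional


def prefilter(
    docs: List[Dict[str, Any]],
    tag: Optional[str] = None,
    author: Optional[str] = None,
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
) -> List[Dict[str, Any]]:
    """Keep only documents whose metadata matches every given criterion."""
    kept = []
    for doc in docs:
        if tag is not None and tag not in doc.get("tags", []):
            continue
        if author is not None and doc.get("author") != author:
            continue
        if date_from is not None or date_to is not None:
            raw = doc.get("date")
            if raw is None:
                continue  # No date to compare: exclude from date-range queries.
            doc_date = date.fromisoformat(raw)
            if date_from is not None and doc_date < date_from:
                continue
            if date_to is not None and doc_date > date_to:
                continue
        kept.append(doc)
    return kept


# Metadata as it might look after Front Matter extraction.
# The second entry is invented purely to show a non-matching document.
docs = [
    {"title": "Introduction to Vector Indexes", "author": "John Doe",
     "date": "2024-04-15", "tags": ["vector indexes", "embeddings", "semantic search"]},
    {"title": "Hypothetical second document", "author": "Jane Roe",
     "date": "2023-11-02", "tags": ["embeddings"]},
]

candidates = prefilter(docs, tag="embeddings", date_from=date(2024, 1, 1))
print([d["title"] for d in candidates])
```

Filtering on metadata first narrows the candidate set before any similarity ranking, which is what makes the search more targeted.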
By leveraging Markdown files with Front Matter, BotDojo simplifies the process of managing and indexing structured content for use with LLMs. This approach ensures that your data and metadata remain synced and easily accessible, enabling you to build more effective and user-friendly AI applications.