o7 is an advanced problem-solving agent designed for researchers who want to generate, process, and validate custom Q&A fine-tuning datasets. Built on the AgentForge framework, o7 uses a cognitive chain-of-thought approach to produce well-reasoned, step-by-step analysis. It was originally inspired by the Dignity project, but retains only the essential cognitive loop (dropping the Discord integration and other overhead) to keep things streamlined.
o7 supports chain-of-thought reasoning to tackle complex queries, script-based Q&A generation, and flexible configuration through AgentForge’s prompts and multi-agent architecture. You can easily generate datasets from a list of topics, process them with o7’s cognitive loop, validate the model’s answers, and even create a fine-tuning dataset from validated responses.
The system borrows the essential architecture from Dignity (formerly Trinity), which relies on multiple agents (e.g., thought, theory, cot, reflect, and generate) working in concert to produce robust reasoning. o7 removes the Discord-specific integrations found in Dignity, focusing squarely on Q&A dataset generation. You can still customize each sub-agent in `custom_agents/o7Agent.py` and update the prompts in `.agentforge/prompts` to shape the tone, style, or logic of o7.
After cloning or downloading the repository, make sure you have AgentForge installed:
```bash
pip install agentforge
```
If you’re integrating with external LLM providers (e.g., Anthropic or HuggingFace), set environment variables (like `ANTHROPIC_API_KEY`) according to AgentForge’s documentation. This ensures o7 has what it needs to interact with your chosen models.
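As a quick sanity check before kicking off the pipeline, a snippet like the one below can confirm the key is actually visible to Python. The variable name `ANTHROPIC_API_KEY` matches the example above; the check itself is just an illustrative convenience, not part of o7.

```python
import os
import sys

# Hypothetical convenience check: verify the provider key is set before
# running any of the o7 scripts. ANTHROPIC_API_KEY matches the example above;
# swap in whichever variable your chosen provider expects.
REQUIRED_VARS = ["ANTHROPIC_API_KEY"]

missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All provider credentials found.")
```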
At a high level, you’ll provide a list of categories or topics (`categories.txt`), then generate Q&A pairs for each category. After that, you can aggregate all those Q&As, feed them to o7 for chain-of-thought reasoning, and finally validate o7’s answers. The main scripts to know about are:
- `qa_generation.py` in `qa_gen/` reads each topic from `categories.txt` and creates initial Q&A pairs in Markdown files in `qa_gen/qas`.
- `aggregate_qas.py` then combines those Markdown Q&A files into a single JSON (`qa_gen/qas.json`) for easier processing.
- `process_qas_by_category.py` reads that JSON, feeds each question to o7, and records the chain-of-thought reasoning and final answer in JSON files under `process_qas/o7responses`.
- `aggregate_o7responses.py` (in `process_qas/`) takes all per-category JSON files in `process_qas/o7responses` and merges them into one file, `o7responses.json`.
- `convert_o7responses_to_md.py` (also in `process_qas/`) creates Markdown files from each category’s JSON, making it simple to read or share o7’s answers.
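Once `aggregate_qas.py` has run, a quick way to confirm the aggregated file looks sane is to load it and count entries per category. The exact schema of `qa_gen/qas.json` isn’t spelled out here, so the sketch below assumes a mapping from category name to a list of Q&A entries; adjust it to the real structure if it differs.

```python
import json
from pathlib import Path

# Illustrative check of the aggregated Q&A file. The exact schema of
# qa_gen/qas.json isn't documented here, so this assumes a mapping of
# category name -> list of Q&A entries; adapt as needed.
qas = json.loads(Path("qa_gen/qas.json").read_text(encoding="utf-8"))

for category, pairs in qas.items():
    print(f"{category}: {len(pairs)} Q&A pairs")
```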
This sequence starts with generating raw Q&A, continues with answering them via o7, and finally aggregates everything for quick navigation. A typical pipeline might look like this:
```bash
# Step 1: Generate QA pairs from categories.
python qa_gen/qa_generation.py

# Step 2: Aggregate those QAs into a single JSON.
python qa_gen/aggregate_qas.py

# Step 3: Process them with o7's cognitive loop.
python process_qas/process_qas_by_category.py

# (Optional) Merge individual response files into one JSON.
python process_qas/aggregate_o7responses.py

# (Optional) Convert that merged JSON to Markdown files.
python process_qas/convert_o7responses_to_md.py
```
In addition to generating and processing Q&A data, o7 can validate its own answers against known “gold” answers and even help you create a fine-tuning dataset.
You’ll find the validator logic in the `validator/` folder. There are a few key steps:
- `run_validator.py` looks at the “gold” Q&A in `qa_gen/qas.json`, compares them to the corresponding answers in `process_qas/o7responses.json`, and calls a Validator Agent to produce an assessment and score for each response. Results are saved as per-category JSON files in `validator_outputs`.
- `aggregate_validator_outputs.py` merges all those per-category validator outputs into a single JSON file (`validator_outputs.json`).
- `convert_validator_to_md.py` creates Markdown summaries of the validator outputs, which is handy if you want a quick glance at how well o7 performed and why.
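If you want a rough numeric summary of how o7 did, you can fold the validator scores into per-category averages. The sketch below assumes the merged file sits at `validator_outputs.json` relative to where you run it and that each entry carries a numeric `score` field; both are assumptions about the layout rather than a documented schema.

```python
import json
from pathlib import Path

# Illustrative summary of validator results. The path and the per-entry
# "score" field are assumptions about the merged output's layout.
results = json.loads(Path("validator_outputs.json").read_text(encoding="utf-8"))

for category, entries in results.items():
    scores = [e["score"] for e in entries if isinstance(e, dict) and "score" in e]
    if scores:
        mean = sum(scores) / len(scores)
        print(f"{category}: mean score {mean:.2f} across {len(scores)} responses")
```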
Once you’ve validated o7’s answers, you can generate a JSONL file for fine-tuning language models by running `create_finetuning_dataset.py`. This script reads each validator output, filters out low-scoring responses, and writes the remaining high-quality Q&A pairs to a JSONL file (`finetuning_dataset.jsonl`). You can then use this dataset to fine-tune your model.
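Before handing the JSONL off to a fine-tuning job, it can be worth a quick check that the file parsed cleanly. The snippet below only relies on the JSONL convention of one JSON object per line; the keys inside each record depend on what `create_finetuning_dataset.py` actually emits, so it simply prints whichever keys it finds.

```python
import json
from pathlib import Path

# Quick look at the fine-tuning dataset. JSONL means one JSON object per
# line; the record keys depend on create_finetuning_dataset.py, so this
# just reports whatever the first record contains.
records = [
    json.loads(line)
    for line in Path("finetuning_dataset.jsonl").read_text(encoding="utf-8").splitlines()
    if line.strip()
]

print(f"{len(records)} training examples")
if records:
    print("Example record keys:", sorted(records[0].keys()))
```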
Here’s a brief idea of how the validation workflow might look:
```bash
# Step 1: Validate o7's responses with a Validator Agent.
python validator/run_validator.py

# Step 2: Aggregate all validation results into one JSON.
python validator/aggregate_validator_outputs.py

# Step 3: (Optional) Convert those validation results to Markdown.
python validator/convert_validator_to_md.py

# Step 4: Create a fine-tuning dataset from the validator outputs.
python validator/create_finetuning_dataset.py
```
The scripts let you see how o7’s answers compare to ground truth, adjust your chain-of-thought reasoning prompts accordingly, and build a refined dataset for training or fine-tuning external models.
A typical end-to-end workflow might look like this (a short driver sketch that chains these steps appears after the list):

- Edit `categories.txt` and add any new topics or questions you’d like to explore.
- Run `qa_generation.py` to produce initial Markdown files in `qa_gen/qas`.
- Run `aggregate_qas.py` to combine those files into `qa_gen/qas.json`.
- Process the entire set via `process_qas_by_category.py`, which calls o7’s chain-of-thought logic and writes answers to `process_qas/o7responses/<category>.json`.
- (Optional) Merge individual category responses into `o7responses.json` using `aggregate_o7responses.py` and convert them to Markdown with `convert_o7responses_to_md.py`.
- Validate everything with `run_validator.py`, which produces `validator_outputs/<category>.json`.
- Aggregate validation results into a single file using `aggregate_validator_outputs.py`, then optionally generate Markdown summaries with `convert_validator_to_md.py`.
- Finally, run `create_finetuning_dataset.py` to build a JSONL file that filters out low-quality answers and keeps only responses above a configurable score threshold.
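To run that whole sequence from a single entry point, a minimal driver along these lines can help. It is only a sketch: the script paths are taken from the steps above, but the driver itself is not part of the repository and assumes it is launched from the project root.

```python
import subprocess
import sys

# Hypothetical convenience driver that runs the workflow above in order.
# The script paths come from this README; comment out the optional steps
# if you only need the core pipeline.
PIPELINE = [
    "qa_gen/qa_generation.py",
    "qa_gen/aggregate_qas.py",
    "process_qas/process_qas_by_category.py",
    "process_qas/aggregate_o7responses.py",      # optional
    "process_qas/convert_o7responses_to_md.py",  # optional
    "validator/run_validator.py",
    "validator/aggregate_validator_outputs.py",
    "validator/convert_validator_to_md.py",      # optional
    "validator/create_finetuning_dataset.py",
]

for script in PIPELINE:
    print(f"--- running {script} ---")
    # check=True stops the driver as soon as any step fails.
    subprocess.run([sys.executable, script], check=True)
```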
We welcome any thoughts on improving o7’s reasoning capabilities, expanding the validation features, or making the Q&A generation more powerful. If you have bug reports, enhancement ideas, or code contributions, feel free to open an issue or pull request on GitHub. You can also reach out directly if you have any questions or need guidance.
We hope o7 provides a streamlined, flexible system for generating and refining Q&A datasets with step-by-step reasoning. Happy exploring—and hacking on—this cognitive agent!