# Quick Start
Get up and running with LMM-Vibes in minutes.
## Installation
```bash
# Basic installation
pip install lmmvibes

# With optional dependencies for better performance
pip install "lmmvibes[full]"  # includes sentence-transformers, wandb
```
Requirements:

- Python 3.8+
- OpenAI API key (for property extraction and embeddings)
Set up your OpenAI API key:

```bash
export OPENAI_API_KEY="your-api-key-here"
```
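If you prefer setting the key from Python (e.g., in a notebook), the standard-library equivalent is:

```python
import os

# Equivalent to the shell export above; set this before calling explain()
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
```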
## Basic Usage
### 1. Import LMM-Vibes
```python
import pandas as pd
from lmmvibes import explain
```
### 2. Prepare Your Data
LMM-Vibes analyzes model conversations to extract behavioral properties. Your data can be either single-model responses or side-by-side comparisons of two models.

**Single-model format** (for analyzing individual model behavior):
```python
df = pd.DataFrame({
    "prompt": ["What is machine learning?", "Explain quantum computing", "Write a poem about AI"],
    "model": ["gpt-4", "gpt-4", "gpt-4"],
    "model_response": ["Machine learning involves...", "QC leverages quantum...", "Silicon dreams awaken..."],
    "score": [{"accuracy": 1, "helpfulness": 4.2}, {"accuracy": 0, "helpfulness": 3.8}, {"accuracy": 1, "helpfulness": 4.5}],
})
```
**Side-by-side comparison format** (for comparing two models):
```python
df = pd.DataFrame({
    "prompt": ["What is machine learning?", "Explain quantum computing", "Write a poem about AI"],
    "model_a": ["gpt-4", "gpt-4", "gpt-4"],
    "model_b": ["claude-3", "claude-3", "claude-3"],
    "model_a_response": ["ML is a subset of AI...", "Quantum computing uses...", "In circuits of light..."],
    "model_b_response": ["Machine learning involves...", "QC leverages quantum...", "Silicon dreams awaken..."],
    "score": [{"winner": "gpt-4", "helpfulness": 4.2}, {"winner": "gpt-4", "helpfulness": 3.8}, {"winner": "claude-3", "helpfulness": 4.5}],
})
```
The `score` column is optional but helps provide context for behavioral analysis.
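If your conversations live in a JSONL file rather than in code, plain pandas gets you into either format; the filename here is a placeholder:

```python
import pandas as pd

# Hypothetical input file: one JSON object per line, with the column
# names shown above (prompt, model, model_response, score, ...)
df = pd.read_json("my_eval_logs.jsonl", lines=True)
```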
### 3. Analyze Model Behavior
The `explain()` function runs a four-stage pipeline: Property Extraction → Post-processing → Clustering → Metrics & Analysis.
```python
# For single-model analysis
clustered_df, model_stats = explain(
    df,
    method="single_model",
    min_cluster_size=10,        # Minimum size for behavior clusters
    output_dir="results/test",  # Save all results here
)

# For side-by-side comparison
clustered_df, model_stats = explain(
    df,
    method="side_by_side",
    min_cluster_size=10,
    output_dir="results/test",
)
```
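Both calls return the same pair of objects: `clustered_df`, a DataFrame of extracted properties with their cluster assignments, and `model_stats`, a per-model mapping of behavioral metrics. Step 4 below shows how to inspect each.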
What happens during analysis:

- **Property Extraction**: GPT-4 analyzes each response and extracts behavioral properties like "Provides step-by-step reasoning" or "Uses creative examples"
- **Clustering**: Similar properties are grouped together using embeddings (e.g., "explains clearly" and "shows work" → "Reasoning Transparency" cluster); a toy sketch of this step follows this list
- **Metrics Calculation**: Computes which models excel at which behavioral patterns, with quality scores and statistical significance
- **Results Saving**: All intermediate and final results are saved to your output directory
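To make the clustering step concrete, here is a minimal embed-then-cluster sketch. This is not LMM-Vibes' internal code (the library's clusterer is HDBSCAN-based, per Pipeline Configuration below); the embedding model and the agglomerative clusterer are stand-ins chosen for a tiny example:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

properties = [
    "Provides step-by-step reasoning",
    "Shows its work when solving problems",
    "Uses creative examples",
]

# Embed each property description, then group nearby embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(properties)

labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1] -- the two reasoning properties land together
```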
### 4. Explore Results
```python
# View the extracted behavioral properties and their clusters
print("Sample behavioral properties found:")
print(clustered_df[['property_description', 'property_description_fine_cluster_label']].head())

# Check which behaviors each model excels at
for model, stats in model_stats.items():
    print(f"\n{model} excels at:")
    # Show the top behavioral clusters for this model
    for behavior in stats["fine"][:3]:  # top 3 behaviors
        print(f"  • {behavior.property_description} (score: {behavior.score:.2f})")
```
### 5. Interactive Visualization
Launch the Gradio web interface to explore your results interactively:
```bash
# View clusters, examples, and metrics in a web interface
python -m lmmvibes.vis_gradio.launcher --share
```
This opens a web interface where you can:
- Browse behavioral clusters and their examples
- Compare model performance across different behaviors
- Explore the extracted properties and metrics
- Visualize model rankings and statistical analysis
## Advanced Usage
### Custom Prompts
```python
custom_prompt = """
Analyze this conversation and identify the key behavioral difference.
Focus on: reasoning style, factual accuracy, helpfulness.
"""

clustered_df, model_stats = explain(
    df,
    method="side_by_side",
    system_prompt=custom_prompt,
    model_name="gpt-4o",
    temperature=0.1,
)
```
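Setting `temperature=0.1` keeps property extraction close to deterministic, which makes repeated runs over the same data easier to compare.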
### Pipeline Configuration
```python
from lmmvibes.pipeline import PipelineBuilder
from lmmvibes.extractors import OpenAIExtractor
from lmmvibes.clusterers import HDBSCANClusterer

pipeline = (
    PipelineBuilder("Custom Pipeline")
    .extract_properties(OpenAIExtractor(model="gpt-4o-mini"))
    .cluster_properties(HDBSCANClusterer(min_cluster_size=15))
    .configure(use_wandb=True)
    .build()
)

# Use the custom pipeline
clustered_df, model_stats = explain(
    df,
    custom_pipeline=pipeline,
)
```
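The builder lets you swap individual stages independently; this example pairs a cheaper extraction model (`gpt-4o-mini`) with a larger minimum cluster size and enables Weights & Biases logging.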
## Data Formats
### Expected Columns
**Side-by-side:**

- `question_id` - Unique identifier for each question
- `model_a`, `model_b` - Model names
- `model_a_response`, `model_b_response` - Model responses
- `winner` (optional) - `"model_a"`, `"model_b"`, or `"tie"`
**Single model:**

- `question_id` - Unique identifier
- `model` - Model name
- `model_response` - Model response
- `score` (optional) - Numeric quality score
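As a quick sanity check on the side-by-side schema above, a minimal valid frame looks like this (all values are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "question_id": ["q1"],
    "model_a": ["gpt-4"],
    "model_b": ["claude-3"],
    "model_a_response": ["..."],
    "model_b_response": ["..."],
    "winner": ["model_a"],  # optional: "model_a", "model_b", or "tie"
})
```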
### Output Files
When you specify `output_dir`, LMM-Vibes saves:

- `clustered_results.jsonl` - Complete results with clusters
- `model_stats.json` - Model performance statistics
- `full_dataset.json` - Complete dataset for reuse
- `summary.txt` - Human-readable summary
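The JSON and JSONL outputs reload with pandas and the standard library; the exact fields inside each record depend on your run:

```python
import json
import pandas as pd

# Reload a previous run from the output_dir used earlier
clustered_df = pd.read_json("results/test/clustered_results.jsonl", lines=True)
with open("results/test/model_stats.json") as f:
    model_stats = json.load(f)
```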
## Next Steps
- Learn about the `explain()` and `label()` functions
- Understand the output files
- Explore the API Reference