Welcome to StringSight¶
Extract, cluster, and analyze behavioral properties from model traces

StringSight helps you understand how different generative models behave by automatically extracting behavioral properties from their responses, grouping similar behaviors together, and quantifying how important these behaviors are.
What is StringSight?¶
StringSight is a comprehensive analysis framework for evaluating and comparing Large Language Model (LLM) responses. Instead of just measuring accuracy or overall quality scores, StringSight:
- Extracts behavioral properties - Uses LLMs to identify specific behavioral traits in model responses (e.g., "provides step-by-step reasoning", "uses technical jargon", "includes creative examples")
- Clusters similar behaviors - Groups related properties together to identify common patterns (e.g., "Reasoning Transparency", "Communication Style")
- Computes cluster statistics - Quantifies each cluster to answer:
  - Which behaviors are most prominent?
  - Which behaviors appear in some models more than others?
  - Which behaviors correlate with the metrics you provide?
- Provides insights - Explains why your model is failing, compares the behaviors of different models or methods, and finds instances of reward hacking.
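The exact statistics StringSight reports are not spelled out on this page. As a rough illustration of the question "which behaviors appear in some models more than others?", the sketch below computes, for each model, the fraction of its extracted properties that fall in each cluster. The `rows` data and the `cluster_proportions` helper are hypothetical and are not part of the StringSight API.

```python
from collections import Counter, defaultdict

# Hypothetical extracted properties: one (model, cluster label) pair per property.
rows = [
    ("gpt-4", "Reasoning Transparency"),
    ("gpt-4", "Reasoning Transparency"),
    ("gpt-4", "Communication Style"),
    ("claude-3", "Reasoning Transparency"),
    ("claude-3", "Communication Style"),
    ("claude-3", "Communication Style"),
]


def cluster_proportions(rows):
    """Return {model: {cluster: fraction of that model's properties}}."""
    counts = defaultdict(Counter)
    for model, cluster in rows:
        counts[model][cluster] += 1
    return {
        model: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for model, ctr in counts.items()
    }


props = cluster_proportions(rows)
for model, dist in props.items():
    for cluster, frac in sorted(dist.items()):
        print(f"{model:10s} {cluster:25s} {frac:.2f}")
```

A cluster whose fraction differs sharply between two models is a candidate for a distinguishing behavior; whether the gap is meaningful depends on sample size.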
Key Features¶
StringSight tells you what the heck is going on with your traces with minimal to no prompt tuning on your part. Upload your traces and automatically discover interesting behaviors through the following pipeline:
- Automatic property extraction - LLM-powered analysis identifies behavioral patterns without manual coding
- Clustering - Groups similar behaviors into meaningful clusters
- Statistical analysis - Runs significance tests and computes confidence intervals and quality scores
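This page does not say how StringSight derives its confidence intervals. One common approach, shown here purely as an assumption, is a percentile bootstrap over per-trace indicators of whether a behavior is present. The `bootstrap_ci` helper and the `flags` data are hypothetical.

```python
import random


def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 flags."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(flags)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical data: 1 if a trace exhibits the behavior, 0 otherwise.
flags = [1] * 30 + [0] * 70
point = sum(flags) / len(flags)
lo, hi = bootstrap_ci(flags)
print(f"proportion={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The bootstrap makes no distributional assumption about the flags, which is convenient when the same idea is applied to non-binary quality scores.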
Easily visualize and analyze your traces in our UI:
- Trace visualization: No money or compute required. Upload your data to view and search through your traces.
- Run automatic behavior extraction and explore the insights dashboard:
  - Common failure modes
  - Model comparison
  - Instances of misaligned metrics
We also support:
- Side-by-side analysis - Compare methods with side-by-side comparisons (find behaviors that differ across traces) or extract behaviors per trace
- Multimodal support - Handles text, image, or interleaved text-and-image conversations
- Fixed-taxonomy labeling - If you already have a predefined list of behaviors, use LLM-as-judge labeling against those behavioral axes
Quick Example¶
import pandas as pd
from stringsight import explain

# Your data with model responses: one row per (prompt, model) pair.
# Three rows are shown; add further rows the same way.
df = pd.DataFrame({
    "prompt": ["What is machine learning?", "Explain quantum computing", "What is machine learning?"],
    "model": ["gpt-4", "gpt-4", "claude-3"],
    "model_response": [
        [{"role": "user", "content": "What is machine learning?"},
         {"role": "assistant", "content": "Machine learning involves..."}],
        [{"role": "user", "content": "Explain quantum computing"},
         {"role": "assistant", "content": "Quantum computing uses..."}],
        [{"role": "user", "content": "What is machine learning?"},
         {"role": "assistant", "content": "Machine learning involves..."}],
    ],
    "score": [
        {"accuracy": 1, "helpfulness": 4.2},
        {"accuracy": 0, "helpfulness": 3.8},
        {"accuracy": 0, "helpfulness": 3.8},
    ],
})

# Extract and cluster behavioral properties
clustered_df, model_stats = explain(
    df,
    method="single_model",
    output_dir="results/"
)

# Compare 2 models side-by-side
clustered_df, model_stats = explain(
    df,
    method="side_by_side",
    model_a="gpt-4",
    model_b="claude-3",
    output_dir="results/"
)
# View results by uploading results folder to the UI (either stringsight.com or locally)
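If your raw data stores prompts and responses as plain strings, the conversation format used in the example above can be built with a small helper. This is a sketch only: `to_conversation` and the `records` list are hypothetical and are not part of StringSight.

```python
def to_conversation(prompt, response):
    """Wrap a prompt/response pair as a list of chat messages."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


# Hypothetical raw records, one per (prompt, model) pair.
records = [
    {"prompt": "What is machine learning?", "model": "gpt-4",
     "response": "Machine learning involves..."},
    {"prompt": "What is machine learning?", "model": "claude-3",
     "response": "Machine learning involves..."},
]

# Build rows with the same column names as the example above.
rows = [
    {
        "prompt": r["prompt"],
        "model": r["model"],
        "model_response": to_conversation(r["prompt"], r["response"]),
    }
    for r in records
]
print(rows[0]["model_response"])
```

Passing `rows` to `pd.DataFrame(rows)` then gives a frame with the `prompt`, `model`, and `model_response` columns.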
Use Cases¶
🏆 Model Evaluation & Comparison¶
Compare multiple models to understand their behavioral strengths and weaknesses. Identify which models excel at specific tasks (reasoning, creativity, factual accuracy, etc.).
🔬 Research & Analysis¶
Analyze how model behavior changes across:
- Different prompting strategies
- Model versions/checkpoints
- Fine-tuning approaches
- Temperature settings
How It Works¶
StringSight uses a 3-stage pipeline:
- Property Extraction - An LLM analyzes each response and extracts behavioral properties
- Clustering - Groups similar properties using embeddings and HDBSCAN
- Metrics & Analysis - Calculates per-model statistics, quality scores, and significance tests
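To make the clustering stage concrete, here is a deliberately simplified, dependency-free sketch. It does not use a real embedding model or HDBSCAN: bag-of-words counts stand in for embeddings, and a greedy cosine-similarity threshold stands in for density-based clustering. All function names and the threshold value are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(texts, threshold=0.5):
    """Assign each text to the first cluster whose seed is similar enough,
    otherwise start a new cluster. This is NOT HDBSCAN, only an illustration."""
    seeds, labels = [], []
    for text in texts:
        vec = embed(text)
        for i, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(i)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


properties = [
    "provides step-by-step reasoning",
    "shows step-by-step reasoning clearly",
    "uses technical jargon",
    "uses heavy technical jargon",
]
labels = greedy_cluster(properties)
print(list(zip(properties, labels)))
```

In this toy input the two reasoning-related properties share one label and the two jargon-related properties share another. A density-based method such as HDBSCAN additionally chooses the number of clusters from the data and can mark outliers as noise, which this greedy scheme cannot do.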
Installation¶
# Install StringSight
pip install stringsight
# Set API keys
export OPENAI_API_KEY="your-api-key-here"
export ANTHROPIC_API_KEY="your-anthropic-key" # optional
export GOOGLE_API_KEY="your-google-key" # optional
# Launch web interface
stringsight launch
See the Installation Guide for detailed setup instructions.
Deployment Options¶
Local (Simple):
- stringsight launch - Run in foreground
- stringsight launch --daemon - Run in background (persistent)
Docker (Production):
- docker compose up -d - Full stack with PostgreSQL, Redis, MinIO, and Celery workers
See the Deployment Guide for production setup.
Next Steps¶
- Quick Start - Get up and running in 5 minutes
- User Guide - Learn how to use StringSight effectively
- Advanced Usage - Custom pipelines and performance tuning
Support¶
- Documentation: You're reading it!
- Issues: GitHub Issues
- Source Code: GitHub Repository
License¶
StringSight is released under the MIT License.