Output Files¶
Understanding the files generated by StringSight analysis.
When you run explain() or label() with an output_dir, StringSight saves comprehensive results across multiple file formats. This guide explains each output file and how to use them.
Core Output Files¶
1. clustered_results.parquet¶
Primary results file with all analysis data
This is the main output file containing your original data enriched with extracted properties and cluster assignments.
import pandas as pd
# Load complete results
df = pd.read_parquet("results/clustered_results.parquet")
# Key columns added by StringSight:
print(df.columns)
# ['question_id', 'model', 'model_response', # Original data
# 'property_description', 'property_evidence', # Extracted properties
# 'property_description_cluster_id', # Cluster assignments
# 'property_description_cluster_label'] # Human-readable cluster names
# Example: Find all responses in a specific cluster
cluster_data = df[df['property_description_cluster_label'] == 'Detailed Technical Explanations']
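Because everything lives in one DataFrame, quick aggregations are one-liners. As a minimal sketch (using only the columns listed above), here is a cross-tabulation of cluster labels by model:
import pandas as pd
df = pd.read_parquet("results/clustered_results.parquet")
# Count how often each model exhibits each behavioral cluster
cluster_by_model = pd.crosstab(df['property_description_cluster_label'], df['model'])
print(cluster_by_model.head(10))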
Use this file for:
- Interactive analysis and visualization
- Building custom dashboards
- Statistical analysis of results
- Feeding into downstream ML pipelines
2. Metrics DataFrames (JSONL)¶
Model- and cluster-level metrics as DataFrames
These files are optimized for frontend and analysis workflows:
- model_cluster_scores_df.jsonl — Per model-cluster metrics
- cluster_scores_df.jsonl — Per cluster aggregated metrics
- model_scores_df.jsonl — Per model aggregated metrics
import pandas as pd
model_cluster = pd.read_json("results/model_cluster_scores_df.jsonl", lines=True)
cluster_scores = pd.read_json("results/cluster_scores_df.jsonl", lines=True)
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
# Example: top clusters for a given model
gpt4 = model_cluster[model_cluster["model"] == "gpt-4"]
print(gpt4.sort_values("proportion", ascending=False).head(10)[["cluster", "proportion"]])
Use these files for:
- Model leaderboards and rankings
- Performance comparisons
- Quality assessment reports
- Automated model selection
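For example, a simple leaderboard can be built from model_scores_df.jsonl. Note that the metric column name ("score" below) is a placeholder: inspect the DataFrame's columns for the metrics your run actually produced.
import pandas as pd
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
# Inspect the available metric columns first; names depend on the metrics configuration
print(model_scores.columns.tolist())
# Rank models by a metric column ("score" is a placeholder name)
leaderboard = model_scores.sort_values("score", ascending=False)
print(leaderboard[["model", "score"]].to_string(index=False))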
3. full_dataset.json¶
Complete dataset for reanalysis and caching
Contains the entire PropertyDataset object with all conversations, properties, clusters, and metadata.
from stringsight.core.data_objects import PropertyDataset
# Load complete dataset
dataset = PropertyDataset.load("results/full_dataset.json")
# Access all components:
print(f"Conversations: {len(dataset.conversations)}")
print(f"Properties: {len(dataset.properties)}")
print(f"Clusters: {len(dataset.clusters)}")
print(f"Models: {dataset.all_models}")
# Rerun metrics with different parameters
from stringsight import compute_metrics_only
clustered_df, new_stats = compute_metrics_only(
    "results/full_dataset.json",
    method="single_model",
    output_dir="results_updated/"
)
Use this file for:
- Recomputing metrics without re-extracting properties
- Debugging and troubleshooting
- Building analysis pipelines
- Sharing complete analysis state
Additional Output Files¶
Processing Stage Files¶
Property Extraction:
- raw_properties.jsonl - Raw LLM responses before parsing
- extraction_stats.json - API call statistics and timing
- extraction_samples.jsonl - Sample inputs/outputs for debugging
JSON Parsing:
- parsed_properties.jsonl - Successfully parsed property objects
- parsing_stats.json - Parsing success/failure statistics
- parsing_failures.jsonl - Failed parsing attempts for debugging
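If parsing failures occur, parsing_failures.jsonl can be inspected directly. A minimal sketch (the exact fields per record depend on your run):
import json
# Print the first five failed parsing attempts
with open("results/parsing_failures.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break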
Validation:
- validated_properties.jsonl - Properties that passed validation
- validation_stats.json - Validation statistics
Clustering:
- embeddings.parquet - Property embeddings (if include_embeddings=True)
- clustered_results_lightweight.jsonl - Results without embeddings
- summary_table.jsonl - Cluster summary statistics
Metrics:
- model_cluster_scores_df.jsonl - Per model-cluster performance (DataFrame JSONL)
- cluster_scores_df.jsonl - Aggregate cluster metrics (DataFrame JSONL)
- model_scores_df.jsonl - Aggregate model metrics (DataFrame JSONL)
Summary Files¶
summary.txt - Human-readable analysis summary
StringSight Results Summary
==================================================
Total conversations: 1,234
Total properties: 4,567
Models analyzed: 8
Fine clusters: 23
Coarse clusters: 8
Model Rankings (by average quality score):
1. gpt-4: 0.847
2. claude-3: 0.832
3. gemini-pro: 0.801
...
Working with Output Files¶
Loading Results for Analysis¶
import pandas as pd
# Quick analysis workflow
df = pd.read_parquet("results/clustered_results.parquet")
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
# Analyze cluster distributions
cluster_counts = df['property_description_cluster_label'].value_counts()
print("Top behavioral patterns:")
print(cluster_counts.head(10))
# Compare models within specific clusters
for cluster in cluster_counts.head(5).index:
    cluster_data = df[df['property_description_cluster_label'] == cluster]
    model_dist = cluster_data['model'].value_counts()
    print(f"\n{cluster}:")
    print(model_dist)
Rerunning Analysis¶
from stringsight import compute_metrics_only
# Recompute metrics with different parameters
clustered_df, model_stats = compute_metrics_only(
    input_path="results/full_dataset.json",
    method="single_model",
    metrics_kwargs={
        'compute_confidence_intervals': True,
        'bootstrap_samples': 1000
    },
    output_dir="results_with_ci/"
)
Building Custom Visualizations¶
# Interactive visualization with plotly
import plotly.express as px
import pandas as pd
# Load results
df = pd.read_parquet("results/clustered_results.parquet")
# Build interactive filters
selected_models = ["gpt-4", "claude-3"]  # Filter by model
selected_clusters = df['property_description_cluster_label'].value_counts().head(10).index  # Top 10 clusters by frequency
# Filter and display
filtered_df = df[
    (df['model'].isin(selected_models)) &
    (df['property_description_cluster_label'].isin(selected_clusters))
]
# Aggregate counts, then plot cluster frequency per model
counts = filtered_df.groupby(['model', 'property_description_cluster_label']).size().reset_index(name='count')
fig = px.bar(counts, x='property_description_cluster_label', y='count',
             color='model', barmode='group',
             title='Model Behavior Comparison')
fig.show()
File Format Details¶
Parquet vs JSON vs JSONL¶
Parquet (.parquet)
- Binary format, fastest loading
- Preserves data types
- Best for analysis and large datasets
- Use: pd.read_parquet()
JSON (.json)
- Human-readable structure
- Good for configuration and metadata
- Use: json.load()
JSONL (.jsonl)
- Newline-delimited JSON
- Streamable for large datasets
- Each line is a JSON object
- Use: pd.read_json(..., lines=True)
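In practice, the three formats load like this (a short sketch using file names from this guide):
import json
import pandas as pd
# Parquet: typed and columnar, fastest for analysis
df = pd.read_parquet("results/clustered_results.parquet")
# JSON: a single structured object
with open("results/extraction_stats.json") as f:
    stats = json.load(f)
# JSONL: one JSON object per line; streamable in chunks for large files
scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
for chunk in pd.read_json("results/model_cluster_scores_df.jsonl", lines=True, chunksize=10_000):
    pass  # process each chunk incrementally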
Best Practices¶
1. File Organization¶
results/
├── clustered_results.parquet        # Primary analysis file
├── model_cluster_scores_df.jsonl    # Per model-cluster metrics (DF JSONL)
├── cluster_scores_df.jsonl          # Per cluster metrics (DF JSONL)
├── model_scores_df.jsonl            # Per model metrics (DF JSONL)
├── full_dataset.json                # Complete state
├── summary.txt                      # Human summary
├── embeddings.parquet               # Embeddings (optional)
└── stage_outputs/                   # Detailed processing files
    ├── parsed_properties.jsonl
    ├── validation_stats.json
    └── ...
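A quick existence check against this layout can catch incomplete runs early. A minimal sketch (adjust the list to your configuration):
from pathlib import Path
expected = [
    "clustered_results.parquet",
    "model_cluster_scores_df.jsonl",
    "cluster_scores_df.jsonl",
    "model_scores_df.jsonl",
    "full_dataset.json",
    "summary.txt",
]
results = Path("results")
missing = [name for name in expected if not (results / name).exists()]
print("Missing files:", missing or "none")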
2. Version Control¶
- Include summary.txt and the metrics JSONL files in version control
- Use .gitignore for large binary files like embeddings
- Tag important analysis runs
3. Reproducibility¶
- Save the exact command/parameters used
- Keep full_dataset.json for reanalysis
- Document any post-processing steps
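One lightweight way to capture run parameters is to write them next to the outputs. A sketch (the keys mirror whatever arguments you actually passed to explain()):
import json
# Record the exact parameters used for this run alongside its outputs
run_params = {
    "method": "single_model",
    "task_description": "Evaluate customer support responses ...",
    "output_dir": "results/customer_support",
}
with open("results/customer_support/run_params.json", "w") as f:
    json.dump(run_params, f, indent=2)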
Next Steps¶
- Use the quickstart guide to generate these files
- Learn about explain() and label() functions
- Explore visualization options
Task Descriptions¶
Task descriptions let you steer property extraction toward a specific domain or evaluation goal. When provided, StringSight formats a task-aware system prompt (for both single_model and side_by_side variants) using templates from stringsight/prompts/extractor_prompts.py.
Example:
clustered_df, model_stats = explain(
    df,
    method="single_model",
    task_description=(
        "Evaluate customer support responses for empathy, clarity, "
        "resolution accuracy, and policy adherence."
    ),
    output_dir="results/customer_support",
)
Default Task Description (WebDev Arena helper)¶
When running scripts/run_webdev_arena.py, the following default task description is used unless overridden with --task_description (or disabled with --no_task_description):
Each model is given a user prompt to generate a web development project.
When looking for interesting properties of responses, consider the following (note these are not exhaustive):
1. **Code Quality**: Correctness, best practices, security vulnerabilities, and adherence to modern web standards
2. **Completeness**: Whether the implementation fully addresses the user's requirements and includes necessary dependencies
3. **User Experience**: UI/UX quality, accessibility, responsiveness, and visual appeal
4. **Maintainability**: Code organization, documentation, comments, and readability
5. **Functionality**: Whether the code would actually work as intended, proper error handling, and edge case coverage
6. **Performance**: Efficient implementations, loading times, and resource usage
7. **Stylistic Choices**: The model's choices in terms of language, formatting, layout, and style
8. **User interpretation**: If given vague instructions, what design choices does the model make to fulfill the user's requirements?
9. **Safety**: Whether the model's response contains vulnerabilities, or whether it generates content that another model would consider unsafe or harmful.
For full prompt templates, see stringsight/prompts/extractor_prompts.py.