
Configuration Guide

Complete guide to configuring StringSight's analysis pipeline for optimal results.

Clustering Parameters

min_cluster_size

What it does: Minimum number of properties required to form a cluster.

How to choose:

| Dataset Size | Recommended min_cluster_size | Rationale |
|---|---|---|
| < 100 conversations | 5-10 | Small datasets need smaller clusters to find patterns |
| 100-1,000 conversations | 10-20 | Balanced granularity |
| 1,000-10,000 conversations | 20-50 | Larger clusters filter noise and find robust patterns |
| > 10,000 conversations | 50-100 | Very large datasets need substantial clusters |

General rules:

  • Start with dataset_size / 50 as a baseline (see the sketch below).
  • Smaller values (5-10) = more granular, specific patterns (risk: noise/overfitting).
  • Larger values (50-100) = broader, more robust patterns (risk: missing nuances).
  • If you get too many clusters: increase min_cluster_size.
  • If you get too few clusters: decrease min_cluster_size.
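
A minimal sketch of the dataset_size / 50 baseline; the clamp bounds below are illustrative choices, not StringSight defaults:

from stringsight import explain

# Rough baseline: about one cluster per 50 conversations,
# clamped to an illustrative 5-100 range (not a StringSight default).
dataset_size = len(df)
min_cluster_size = max(5, min(100, dataset_size // 50))

explain(df, min_cluster_size=min_cluster_size)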

Quick tips

  • If clusters often repeat the same property, increase min_cluster_size.
  • By dataset size (samples):
    • < 100: 3-4
    • 100–1,000: 5-7
    • > 1,000: 15-30

Examples:

from stringsight import explain

# Small exploratory dataset (< 100 conversations)
explain(df, min_cluster_size=3)

# Medium production dataset (1,000 conversations)
explain(df, min_cluster_size=7)  # Default

# Large research dataset (10,000+ conversations)
explain(df, min_cluster_size=25)

embedding_model

What it does: Converts property descriptions to vectors for clustering.

Options:

| Model | Cost | Speed | Quality | Best For |
|---|---|---|---|---|
| "text-embedding-3-small" | $0.02/1M tokens | Fast | Very Good | Cheaper OpenAI option, good balance |
| "text-embedding-3-large" | $0.13/1M tokens | Medium | Excellent | Default - production quality analysis |
| "all-MiniLM-L6-v2" | Free | Very Fast | Good | Development, large datasets |
| "all-mpnet-base-v2" | Free | Medium | Very Good | Cost-conscious production |

# OpenAI embeddings (requires API key, costs $)
explain(df, embedding_model="text-embedding-3-large")  # Default

# Local embeddings (free, no API calls)
explain(df, embedding_model="all-MiniLM-L6-v2")

assign_outliers

What it does: Assigns properties that don't fit any cluster to their nearest cluster.

When to use:

  • ✅ You want every property in a cluster (no noise/outliers)
  • ✅ Dashboards/visualizations (avoids an "Outlier" cluster)
  • ✅ Downstream analysis requires full coverage

When to skip:

  • ❌ You want to identify truly unique/anomalous behaviors
  • ❌ Quality matters more than coverage
  • ❌ Small datasets (outliers are informative)

# Assign all properties to clusters
explain(df, assign_outliers=True)

# Keep outliers separate
explain(df, assign_outliers=False)

Extraction Parameters

model_name

What it does: LLM used to extract behavioral properties from responses.

Options:

| Model | Cost/Quality | When to Use |
|---|---|---|
| "gpt-4.1" | $$$ / Excellent | Production, research papers, high-stakes decisions |
| "gpt-4.1-mini" | $$ / Very Good | Default - balanced cost/quality; development, iteration, large-scale experiments |
| "gpt-4.1-nano" | ¢ / Decent | Massive datasets, proof-of-concepts |

# High quality extraction
explain(df, model_name="gpt-4.1")

# Cost-effective extraction
explain(df, model_name="gpt-4.1-mini")

temperature

What it does: Controls randomness in property extraction.

Values:

  • 0.0-0.3 = deterministic, focused extraction
  • 0.5-0.7 = Default - balanced creativity
  • 0.8-1.0 = more creative, diverse properties

# Consistent, focused properties
explain(df, temperature=0.2)

# Diverse, creative properties
explain(df, temperature=0.9)

max_workers

What it does: Number of parallel API calls for extraction.

Guidelines:

  • Default: 16
  • Increase (32-64) if you have high API rate limits
  • Decrease (4-8) if you hit rate limits or want to conserve resources
  • Use 1 for debugging (sequential processing)

# Fast parallel extraction
explain(df, max_workers=32)

# Conservative rate limiting
explain(df, max_workers=8)
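
If you are unsure where to set it, a rough sizing heuristic based on your rate limit looks like this; the limit and latency values are placeholders, not recommendations:

# Rough heuristic: keep total requests per minute under your provider's limit.
# Both numbers are placeholders; use your actual rate limit and observed latency.
requests_per_minute_limit = 500    # assumed provider rate limit
avg_seconds_per_call = 4           # assumed extraction latency per call

calls_per_worker_per_minute = 60 / avg_seconds_per_call
max_workers = max(1, int(requests_per_minute_limit / calls_per_worker_per_minute))

explain(df, max_workers=min(max_workers, 64))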

Model Selection Strategy

Budget-Conscious Configuration

For cost-effective analysis without sacrificing too much quality:

explain(
    df,
    model_name="gpt-4.1-mini",              # Cheap extraction
    embedding_model="all-MiniLM-L6-v2",    # Free embeddings
    min_cluster_size=50,                    # Fewer, larger clusters
    use_wandb=False                         # Turn off W&B (default True)
)

Estimated cost: ~$5-10 per 1,000 conversations

Production-Quality Configuration

For high-quality, reproducible results:

explain(
    df,
    model_name="gpt-4.1",                      # Best extraction
    embedding_model="text-embedding-3-large",   # Best embeddings
    min_cluster_size=30,                         # Balanced granularity
    use_wandb=True,                              # Track experiments (default True)
    wandb_project="production-analysis"
)

Estimated cost: ~$50-75 per 1,000 conversations

Development/Iteration Configuration

For fast experimentation:

explain(
    df,
    model_name="gpt-4.1-mini",            # Fast extraction
    embedding_model="all-MiniLM-L6-v2",   # Fast embeddings
    min_cluster_size=20,                   # Quick clustering
    max_workers=32,                        # Maximize parallelism
    use_wandb=False                        # Skip tracking (default is True)
)

Estimated time: ~5-10 minutes per 1,000 conversations

Note: W&B logging is enabled by default. In the CLI (scripts/run_full_pipeline.py), pass --disable_wandb to turn it off.

Advanced Parameters

Dimensionality Reduction

Control the dimensionality reduction applied to embeddings before clustering (PCA, adaptive, or none):

from stringsight.clusterers import HDBSCANClusterer

clusterer = HDBSCANClusterer(
    disable_dim_reduction=True,              # Skip dimensionality reduction entirely
    dim_reduction_method="pca",              # Otherwise: "pca", "adaptive", or "none"
)

HDBSCAN Tuning

Fine-tune clustering algorithm:

clusterer = HDBSCANClusterer(
    min_cluster_size=30,
    min_samples=5,                           # Minimum samples in neighborhood
    cluster_selection_epsilon=0.0,           # Distance threshold
)
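
min_samples and cluster_selection_epsilon are standard HDBSCAN parameters; the standalone sketch below uses the hdbscan package directly (independent of StringSight) to show their effect on toy embeddings:

import numpy as np
import hdbscan

# Toy 2-D "embeddings": two tight blobs plus uniform background noise.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(200, 2)),
    rng.normal(loc=1.0, scale=0.05, size=(200, 2)),
    rng.uniform(low=-1.0, high=2.0, size=(50, 2)),
])

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=30,
    min_samples=5,                   # larger values label more points as noise
    cluster_selection_epsilon=0.0,   # >0 merges clusters closer than this distance
)
labels = clusterer.fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Found {n_clusters} clusters, {list(labels).count(-1)} noise points")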

Stratified Clustering

Cluster separately per group (e.g., per topic, per task):

explain(df, groupby_column="topic")  # Cluster within each topic
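
Conceptually, groupby_column runs clustering within each group; a rough manual equivalent, useful if you want different settings per group, would be:

# Manual per-group clustering, roughly what groupby_column automates.
# Looping yourself lets you vary settings by group (e.g., smaller clusters for rare topics).
for topic, topic_df in df.groupby("topic"):
    explain(topic_df, min_cluster_size=10)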

Common Configuration Issues

"Too many small clusters"

Problem: Hundreds of tiny, noisy clusters

Solution:

# Increase minimum cluster size
explain(df, min_cluster_size=50)  # was: 10

# Or assign outliers
explain(df, assign_outliers=True)

"Only 2-3 clusters"

Problem: Not enough granularity

Solution:

# Decrease minimum cluster size
explain(df, min_cluster_size=10)  # was: 50

# Use better embeddings
explain(df, embedding_model="text-embedding-3-large")

# Raise temperature for more diverse properties
explain(df, temperature=0.8)

"Clustering too slow"

Problem: Takes hours to cluster

Solution:

# Use local embeddings
explain(df, embedding_model="all-MiniLM-L6-v2")

# Increase cluster size
explain(df, min_cluster_size=100)

"Running out of memory"

Problem: OOM errors during clustering

Solution:

# Disable embeddings in output
explain(df, include_embeddings=False)

# Skip dimensionality reduction
explain(df, disable_dim_reduction=True)

# Process in batches (manually split data)
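
For the manual batching route, a simple sketch; the batch size is arbitrary, and note that each batch is clustered independently, so cross-batch patterns may be missed:

# Split the dataframe into fixed-size batches to bound memory per run.
# The 2,000-row batch size is arbitrary; tune it to your machine.
batch_size = 2000
for start in range(0, len(df), batch_size):
    batch = df.iloc[start:start + batch_size]
    explain(batch, min_cluster_size=20)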

Quick Reference

By Dataset Size

# < 100 conversations
explain(df, min_cluster_size=3)

# 100-1,000 conversations
explain(df, min_cluster_size=7)

# 1,000-10,000 conversations
explain(df, min_cluster_size=25)  # Default for larger datasets

# > 10,000 conversations
explain(df, min_cluster_size=30)

By Use Case

# Research paper (quality matters most)
explain(df, model_name="gpt-4.1", embedding_model="text-embedding-3-large")

# Production dashboard (speed + quality balance)
explain(df, model_name="gpt-4.1-mini", embedding_model="text-embedding-3-large")

# Exploration/development (speed matters most)
explain(df, model_name="gpt-4.1-mini", embedding_model="all-MiniLM-L6-v2")

Next Steps