Frequently Asked Questions

Common questions about data analysis, statistics, and troubleshooting

Missing Variables

Why doesn't my column appear in the analysis tabs?

Your column might be excluded for several reasons:

  • Too many unique values (>95% unique) - Likely an ID column, excluded from categorical analysis
  • Constant values - All rows have the same value (e.g., EmployeeCount = 1 for all rows)
  • All NULL values - No data to analyze
  • Wrong data type detected - Text that should be numeric (check the Data Quality tab)
  • Date columns - Currently shown in Time Series tab only, not in regular analysis

Why don't some variables appear in chi-square or ANOVA?

  • Chi-Square: Requires 2+ categorical variables with reasonable cardinality (not too many categories)
  • ANOVA: Requires at least one categorical and one numeric variable
  • Correlations: Requires 2+ numeric variables

What makes a column "categorical" vs "numeric"?

Numeric: Contains only numbers (integers or decimals). Used for calculations like averages, correlations.

Categorical: Contains text, or numbers used as categories (e.g., zip codes, phone numbers, product IDs).

💡 The system auto-detects based on data type and patterns. If a numeric column is treated as categorical, it may contain non-numeric values.

Derived Variables

What are derived variables?

Derived variables are custom columns you create using formulas based on existing columns. They are calculated on-the-fly and participate in all analysis features exactly like real columns.

  • Numeric: Arithmetic expressions — e.g. revenue / quantity
  • Categorical: IF/THEN/ELSE logic — e.g. IF revenue > 500 THEN "High" ELSE "Low"
  • Window functions: Calculations across partitions — e.g. running totals, rankings

How do I write IF/THEN formulas?

Use the following syntax for conditional derived variables:

IF condition THEN "value"

IF condition THEN "value" ELSE "other"

IF condition THEN "value" ELSEIF condition2 THEN "value2" ELSE NULL

💡 Use IS NOT NULL and IS NULL to handle missing values. Use ELSE NULL to produce null values for rows that don't meet any condition.
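The IF/THEN/ELSEIF syntax above maps naturally onto vectorized conditionals. As an illustration only (the app's actual formula engine is not shown here), a three-branch formula like IF revenue > 500 THEN "High" ELSEIF revenue > 200 THEN "Medium" ELSE NULL behaves like numpy's np.select, using a hypothetical revenue column:

```python
import numpy as np
import pandas as pd

# Hypothetical data; "revenue" stands in for one of your real columns.
df = pd.DataFrame({"revenue": [120.0, 750.0, None, 300.0]})

# Conditions are checked in order, like IF ... ELSEIF ...; rows matching
# no condition (including NULL revenue) fall through to the default (NULL).
conditions = [df["revenue"] > 500, df["revenue"] > 200]
choices = ["High", "Medium"]
df["revenue_band"] = np.select(conditions, choices, default=None)

print(df)
```

Note that a NULL (NaN) revenue fails every comparison, so it lands in the ELSE branch rather than being classified, which is why the tip above recommends explicit IS NULL handling.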

What is the ⚡ badge on a variable card?

The ⚡ badge indicates the derived variable uses a window function (e.g. RANK() OVER (...), SUM(col) OVER (PARTITION BY ...)). Window functions compute values relative to a partition of rows rather than a single row value.

Window function variables use the cached 100K sample for most features — correlations, ANOVA, chi-square, segmentation, and histograms all run at zero extra database cost. Full-table aggregation queries (reading only the relevant columns) run for Pareto, Groupings, Time series, and numeric scatter plots — exactly the same as any other variable. Results from the sample are suitable for exploratory analysis and pattern discovery.

Can I use a derived variable inside another derived variable formula?

Not currently. Derived variable names do not appear in the formula builder column dropdowns. To achieve a composed formula, write the full expression in a single formula. For example, instead of referencing a derived variable Tax in another formula, use the original expression revenue * 0.1 directly.

Do derived variables query my database?

It depends on the feature:

  • Profile Now (initial): One query to compute statistics and add the variable to the analysis cache
  • Segmentation, ANOVA, correlations: Uses the cached 100K sample — zero additional queries
  • Pareto analysis: One full-table aggregation query
  • Compare Variables (scatter): One query for full-dataset correlation accuracy
  • File uploads / Google Sheets: No database queries — all computed in memory

Statistical Analysis

What does "statistically significant" mean?

A relationship is "statistically significant" when it's unlikely to be due to random chance (p-value < 0.05).

Important: "Significant" doesn't mean "important" - it just means a pattern exists in your data. You still need to determine if it's meaningful for your business or research.
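To see the distinction concretely, here is a plain scipy sketch (illustrative synthetic data, not tied to the app) where a modest correlation comes out highly significant simply because the sample is large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)  # weak but genuine relationship

# pearsonr returns the correlation coefficient and its p-value
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.5f}")
```

A p-value below 0.05 here only says the pattern is unlikely to be chance; the size of r is what tells you whether the relationship is strong enough to matter.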

What do the significance stars mean (★★★)?

  • ★★★ = p < 0.001 (very strong evidence of a relationship)
  • ★★ = p < 0.01 (strong evidence)
  • ★ = p < 0.05 (moderate evidence)

More stars = more confident that the pattern is real, not random.

Why do some correlations show "N/A"?

Correlations can't be calculated when:

  • Insufficient data overlap between variables (too many missing values)
  • One or both variables have zero variance (all values are the same)
  • Too few observations after removing NULL values

What's the difference between correlation and causation?

Correlation: Two variables move together. Example: Ice cream sales and drowning rates both increase in summer.

Causation: One variable directly causes changes in another. Example: Smoking causes lung cancer.

⚠️ Important: This tool shows correlations only. You must use domain knowledge and experimental design to determine causation.

Outlier Detection

How are outliers detected?

We use the IQR (Interquartile Range) method:

IQR = Q3 - Q1

Lower bound = Q1 - 1.5 × IQR

Upper bound = Q3 + 1.5 × IQR

Values below the lower bound or above the upper bound are flagged as outliers.
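The three formulas above can be checked in a few lines of numpy (illustrative values, with one point planted as an obvious outlier):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 14, 11, 95])  # 95 is the planted outlier

# Quartiles and the 1.5 × IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [95]
```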

Should I remove outliers?

It depends on your analysis goals:

  • Keep them if: They represent legitimate extreme values (e.g., high-value customers, rare events)
  • Remove them if: They're data entry errors or measurement mistakes
  • Investigate: Check the Data Quality tab for context before deciding

Why do some columns show no outliers?

  • Values are tightly clustered with low spread (no extreme points to flag)
  • Sample size is too small for outliers to be detected
  • Data has been pre-cleaned or normalized

Segmentation Analysis

What does "Above Baseline" mean?

A segment is "Above Baseline" when its value is at least 10% higher than the overall baseline (a relative difference, not percentage points).

Example: If baseline attrition rate is 16%:

  • Above Baseline: Segment with ≥17.6% attrition (16% × 1.10 = 17.6%)
  • Near Baseline: Segment with 14.4% - 17.6% attrition
  • Below Baseline: Segment with ≤14.4% attrition (16% × 0.90 = 14.4%)

What's the "Rule of 3"?

A simple framework for understanding segmentation:

  1. Baseline: Overall rate for the target variable (e.g., 16% overall attrition)
  2. Segment: Rate for a specific subgroup (e.g., 24% for Sales + Overtime workers)
  3. Lift: Percentage difference from baseline (e.g., +50% or +8 percentage points)
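The Rule of 3 is easy to reproduce by hand. A minimal pandas sketch using hypothetical attrition data (1 = left, 0 = stayed):

```python
import pandas as pd

# Hypothetical data: department and whether the employee attrited
df = pd.DataFrame({
    "dept":     ["Sales", "Sales", "Sales", "Sales",
                 "R&D", "R&D", "R&D", "R&D", "R&D", "R&D"],
    "attrited": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

baseline = df["attrited"].mean()                 # 1. overall rate
segment = df.groupby("dept")["attrited"].mean()  # 2. rate per subgroup
lift_pct = (segment / baseline - 1) * 100        # 3. relative lift vs baseline

print(f"baseline = {baseline:.0%}")
print(lift_pct.round(1))
```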

How are segments displayed?

The segmentation results always show top insights — the two highest-lift and two lowest-lift segments — regardless of how many total combinations exist. Below that:

  • When combinations are manageable (≤500 cells): A lift heatmap is shown with colour-coded cells indicating how each segment compares to the baseline
  • When too many combinations exist: The heatmap is hidden and a warning is shown — the full results with all calculated metrics are still available via CSV export
  • CSV Export: Always available regardless of cardinality — contains all segment combinations with mean, lift, and count

Statistical calculations (ANOVA, baseline, lift) always use the complete data regardless of what is displayed on screen.

What do the colors and highlights mean?

Highlight Indicators:

  • Red highlight: Segments with fewer than 5 observations OR significantly below baseline (low statistical reliability or underperformance)
  • Green highlight: Segments significantly above baseline (high performance)
  • No highlight: Segments near baseline or with sufficient sample size performing within normal range
  • Important: All segments are shown - highlighting indicates either reliability concerns or meaningful deviation from baseline

Baseline Deviation Thresholds:

  • Significantly Above: ≥10% higher than overall average (green highlight)
  • Near Baseline: Within ±10% of overall average (no highlight)
  • Significantly Below: ≥10% lower than overall average (red highlight)
  • Note: All lifts are displayed - highlights emphasize segments that meaningfully differ from baseline

What are the cardinality limits for segmentation?

Segmentation variables have the following display limits:

Segmentation Variable Limits (Frontend Display Only):

  • Single variable analysis: No explicit limit (analyzes up to 100,000 rows)
  • Two categorical variables: Maximum 20 categories each displayed in matrix (prevents matrices larger than 20×20=400 cells)
  • One categorical + one numeric: No limit on categorical (numeric is automatically binned)

Note: These are frontend display limits only — they control how many categories are shown in the matrix. For database connections, statistical analysis uses the full 100K sample regardless of display limits. For file uploads and Google Sheets, the full dataset is always used. Exported results have no display limits.

💡 Tip: If you have high-cardinality variables (e.g., 50 states), group them into broader categories (e.g., 5 regions) before analysis for better visualization and insights.

Are there any costs associated with database connections?

Yes. When you connect to a database (AWS Redshift, Snowflake, Google BigQuery, etc.), running queries incurs compute charges from your database provider.

Cost Factors:

  • Query complexity: The EDA profiler runs multiple analytical queries (aggregations, statistics, correlations)
  • Data volume: Larger tables require more compute resources
  • Database pricing model: Each provider has different pricing (per-query, per-second compute, etc.)
  • Analysis depth: Profiling, segmentation, and outlier detection each run separate queries

⚠️ Cost Management: Monitor your database provider's billing dashboard. For large tables, consider profiling a sample first, or use views/subsets to reduce query scope.

Note: File uploads (CSV/Excel) have no query costs as they're processed in-memory.

Compare Variables

What does Compare Variables do?

Compare Variables performs a deep bivariate analysis between any two variables. It automatically detects the variable types and runs the most appropriate analysis:

  • Numeric × Numeric: Scatter plot with correlation, best-fit line, and p-value
  • Categorical × Numeric: Segmentation lift chart with ANOVA statistics
  • Categorical × Categorical: Cross-tabulation heatmap with chi-square

Does Compare Variables query my database?

For most combinations it uses the cached 100K sample — zero additional database cost. The exception is numeric × numeric scatter plots, which run a full-table aggregation query (reading only the two relevant columns) to compute the exact correlation coefficient and best-fit line. This ensures the scatter result reflects the true full-dataset relationship rather than a sample approximation.

How are Compare Variables results displayed?

The display depends on the variable types being compared:

  • Numeric × Numeric: Scatter plot showing up to 10,000 data points (a random sample of middle points plus the top and bottom 100 extreme values for each variable). The correlation coefficient and best-fit line are calculated from the full dataset. A CSV download is available for the displayed points.
  • Categorical × Numeric: Bar chart showing segment averages with colour-coded lift indicators, plus ANOVA statistics. When there are more than 50 categories, only the top 50 are shown in the chart and table. A CSV download contains all categories.
  • Categorical × Categorical: Cross-tabulation heatmap. When either variable has more than 20 unique values, a warning is shown as the relationship may not be meaningful at high cardinality. A CSV download is available for all combinations.

Groupings

What does the Groupings tab show?

The Groupings tab is a multi-way frequency table. Select one or more categorical variables and optional numeric variables to see counts, percentages, and summary statistics for every combination of category values.

It supports mean, median, min, max, standard deviation, and sum for numeric variables within each group.
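In pandas terms, the Groupings table is a single grouped aggregation. A sketch with hypothetical region/channel/revenue columns:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "channel": ["Web", "Store", "Web", "Web", "Store"],
    "revenue": [100.0, 150.0, 200.0, 300.0, 250.0],
})

# One row per category combination, with the supported summary statistics
summary = (
    df.groupby(["region", "channel"])["revenue"]
      .agg(["count", "mean", "median", "min", "max", "std", "sum"])
      .reset_index()
)
print(summary)
```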

Does Groupings query my database?

Yes — Groupings always runs an aggregation query that processes every row since it needs to compute exact counts and statistics. Only the selected columns are read. The query is a single GROUP BY aggregation and results are exact, not sampled.

Pareto Analysis

What is Pareto analysis?

Pareto analysis identifies which categories contribute the most to a metric — based on the principle that roughly 80% of outcomes come from 20% of causes.

For example: which 20% of customers generate 80% of revenue? Which product categories drive most of the returns?
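The underlying calculation is a ranked cumulative share: sort categories by the metric, then accumulate. A minimal pandas sketch with hypothetical customer revenue:

```python
import pandas as pd

sales = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "revenue":  [500.0, 250.0, 120.0, 80.0, 50.0],
})

# Rank by contribution, then compute each row's cumulative share of the total
ranked = sales.sort_values("revenue", ascending=False).reset_index(drop=True)
ranked["cum_share"] = ranked["revenue"].cumsum() / ranked["revenue"].sum()
print(ranked)
```

Here customer A alone accounts for half the total, so the cumulative share column reveals the concentration the Pareto chart visualizes.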

What are the display modes?

  • Grouped (Percentile Buckets): Categories ranked by metric value and grouped into equal-count buckets (e.g. Top 10%, 10-20%, etc.). Best for large numbers of categories.
  • Individual (Top 50): Each category shown as its own bar, labeled by name. Best for small numbers of categories where you want to see specific names.

💡 If you request more groups than you have categories, the chart automatically switches to individual mode.

Can I use a numeric column as the category?

Yes — numeric columns (such as customer IDs stored as integers) can be selected as the Pareto category. The app automatically casts them to text for grouping. This is useful for identifying which specific customers or products drive the most value.

Does Pareto query my database?

Yes — one aggregation query that processes every row: SELECT category, SUM(metric), COUNT(*) FROM table GROUP BY category ORDER BY total DESC. Only the category and metric columns are read — not the entire table. The result is cached, so switching between Grouped and Individual modes, changing the number of buckets, or exporting does not trigger additional queries.

💡 On columnar databases like BigQuery and Snowflake, you are only billed for the two columns scanned, not every column in the table.

Time Series

What does the Time Series tab analyze?

Time Series groups a numeric variable by date and shows trends over time, including:

  • Daily averages with trend line
  • Day-of-week seasonality patterns
  • Month-of-year seasonality patterns
  • Autocorrelation (if available)

How do I run a time series analysis?

  1. Select a date/time column from the first dropdown
  2. Select a numeric measurement variable from the second dropdown
  3. Click the Analyze button to run the analysis

💡 The analysis does not run automatically when you change dropdowns — you must click Analyze. This prevents unintended database queries when browsing column options. The chart label always shows which variables are currently displayed.

Does Time Series query my database?

Yes — each unique date × metric pair triggers one aggregation query: SELECT DATE(date_col), AVG(metric) FROM table GROUP BY date ORDER BY date. Results are cached per pair — re-selecting a previously analyzed combination costs nothing. Switching back to a prior pair is instant.

Relationship Analysis

What's Cramér's V?

Cramér's V measures the strength of association between two categorical variables.

  • Range: 0 (no association) to 1 (perfect association)
  • Interpretation:
    • 0.0 - 0.1: Negligible
    • 0.1 - 0.2: Weak
    • 0.2 - 0.3: Moderate
    • 0.3+: Strong
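Cramér's V is derived from the chi-square statistic: V = sqrt(chi² / (n × (k − 1))), where n is the total count and k is the smaller of the table's two dimensions. A scipy sketch with an illustrative cross-tab:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of observed counts
observed = np.array([[30, 10, 20],
                     [10, 30, 20]])

chi2, p, dof, _ = chi2_contingency(observed)
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"V = {cramers_v:.3f} (p = {p:.5f})")
```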

What's eta-squared (η²)?

Eta-squared measures effect size for ANOVA (how much a categorical variable explains variation in a numeric variable).

  • Interpretation: Proportion of variance explained
    • 0.01 - 0.06: Small effect
    • 0.06 - 0.14: Medium effect
    • 0.14+: Large effect
  • Example: η² = 0.18 means the categorical variable explains 18% of the variance in the numeric variable
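Eta-squared is simply the between-group sum of squares divided by the total sum of squares. A small worked example with hypothetical groups:

```python
import numpy as np

# Hypothetical numeric values for three categories
groups = [np.array([10.0, 12.0, 11.0]),
          np.array([20.0, 22.0, 21.0]),
          np.array([15.0, 14.0, 16.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation, and the portion explained by group membership
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

eta_squared = ss_between / ss_total
print(f"eta^2 = {eta_squared:.3f}")
```

Here the group means differ sharply relative to the spread within each group, so almost all the variance is explained (a large effect by the scale above).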

Why are only some relationships shown?

We filter for quality to show only meaningful relationships:

  • Only statistically significant (p < 0.05)
  • Only moderate or stronger effect sizes (avoids noise)
  • Ranked by strength - strongest relationships shown first

Variable Clustering

What is variable clustering?

Variable clustering automatically groups variables that are highly correlated with each other, helping you identify:

  • Redundant variables measuring similar concepts
  • Natural groupings of related features
  • Candidates for dimensionality reduction
  • Potential multicollinearity issues in modeling

How does clustering work?

The system uses hierarchical clustering with the following approach:

  • Distance Metric: Uses 1 - |correlation| so highly correlated variables are "close"
  • Linkage Method: Average linkage for balanced clusters
  • Optimal Clusters: Automatically determined using silhouette score (tests 2-10 clusters)
  • Pattern Detection: Identifies common patterns like "Size & Area", "Age & Time", "Financial Metrics"
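The same recipe can be sketched with scipy (synthetic data containing two correlated pairs of variables; this illustrates the approach, not the app's exact code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Four variables: columns 0-1 track one factor, columns 2-3 track another
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=200),
])

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)       # highly correlated variables are "close"
np.fill_diagonal(dist, 0.0)   # clear floating-point noise on the diagonal

# Average linkage on the condensed distance matrix, then cut into 2 clusters
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]: each correlated pair lands in its own cluster
```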

What information is shown in the clustering table?

  • Cluster ID: Unique identifier for each cluster
  • Variables: ALL variables in the cluster (no truncation)
  • Count: Number of variables in the cluster
  • Avg Correlation: Average correlation between variables in the cluster
  • Pattern: Detected pattern based on variable names (e.g., "Financial Metrics", "Geographic Info")

💡 Clusters with average correlation >0.7 are highlighted as highly correlated groups

When should I use variable clustering?

Use clustering to:

  • Identify redundant variables before modeling
  • Understand conceptual groupings in your data
  • Decide which variables to keep when you have many similar features
  • Guide feature engineering and selection strategies

Example: If you have 5 variables all measuring customer satisfaction (survey Q1-Q5), clustering will group them together with high average correlation, suggesting you might only need to keep 1-2 representative variables.

Where can I see variable clustering?

  • PDF Report: Included automatically after the multicollinearity section
  • Variable Clustering Tab: Interactive exploration of variable clusters

💡 The clustering algorithm only runs on numeric variables with sufficient data

Data Quality

What's considered "high missing data"?

Columns with >50% of values missing (NULL/blank) are flagged as potentially problematic. High missing data can:

  • Bias statistical analyses
  • Reduce sample size for certain tests
  • Indicate data collection issues

What's a "constant column"?

A column where >95% of rows have the exact same value.

Example: Country = "USA" for all 10,000 rows

These columns typically provide no useful information for analysis and can often be removed.
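The check is straightforward to express in pandas (illustrative data; the 95% threshold matches the rule above):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA"] * 98 + ["CAN", "MEX"],  # 98% identical -> near-constant
    "revenue": range(100),                     # all distinct -> kept
})

# Flag any column whose most frequent value covers more than 95% of rows
flagged = [
    col for col in df.columns
    if df[col].value_counts(normalize=True, dropna=False).iloc[0] > 0.95
]
print(flagged)  # ['country']
```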

Why is my percentage column flagged?

Columns with "percent" in the name are checked for values >100. This flag appears when:

  • Values stored as decimals instead of percentages (0.85 vs 85)
  • Incorrect unit conversions
  • Data entry errors

PDF Reports

What's included in the PDF report?

  • Dataset overview (rows, columns, memory usage)
  • Complete column inventory with missing data indicators
  • Comprehensive data quality assessment
  • Outlier detection and summary for numeric columns
  • Multicollinearity analysis (correlated predictors)
  • Variable clustering table (groups of related variables with pattern detection)
  • Target variable analysis with distribution chart
  • Significant bivariate relationships with visualizations

How do I generate a PDF report?

  1. Click the "Generate PDF Report" button at the top of the profile page
  2. Optionally select a target variable for focused analysis
  3. Click "Generate" - the PDF will download automatically

💡 Generation takes 10-30 seconds depending on dataset size

Can I customize the PDF report?

Currently, the report format is standardized to ensure consistency and quality. Custom sections and filtering options are planned for future releases.

Performance & Data Sampling

How long does profiling take?

Profiling time depends on the number of columns, the number of rows, and — for databases — connection speed and database load. File uploads and Google Sheets typically profile faster than database connections since data is already in memory. Most tables complete within a few minutes.

📊 How We Handle Large Datasets

The approach differs depending on your data source:

File Uploads & Google Sheets (Full Dataset):

Since the complete file is already loaded into memory, all statistical calculations run on the full dataset — no sampling. Correlations, ANOVA, segmentation, histograms, and all other features use every row.

Database Connections (100K Sample Cache):

A single random 100K-row sample is fetched once during profiling and cached. Most analysis features — correlations, ANOVA, chi-square, segmentation, histograms, derived variable analysis — run on this cached sample at zero additional database cost.

  • Correlation matrix: Cached sample (statistically reliable at 100K rows)
  • ANOVA: Cached sample
  • Chi-square: Cached sample
  • Segmentation (categorical × metric): Cached sample
  • Histograms: Cached sample

Features That Run Full-Table Aggregation Queries:

These features process every row but only read the relevant columns — on columnar databases like BigQuery and Snowflake you are only billed for the columns selected.

  • Pareto analysis: Requires exact aggregations across all rows
  • Groupings tab: Requires exact counts per category combination
  • Time series: Requires date-grouped aggregation across all rows
  • Compare Variables (scatter): Processes all rows for exact correlation accuracy
  • Outlier row detail (View Details): Fetches matching rows from the full dataset

Why this approach? The 100K sample cache eliminates redundant queries for exploratory analysis while keeping costs low for cloud databases. Full-table queries are reserved for features that require exact aggregations where sampling would give incorrect results.

What's the maximum dataset size?

  • Files: 100MB upload limit
  • Databases: No hard limit, but may timeout on very large tables (>10M rows)
  • Recommendation: Sample large tables if >10M rows for faster results

Why is my analysis taking a long time?

  • Large number of columns (>50)
  • High cardinality categorical variables (many unique values)
  • Many unique combinations for segmentation
  • Database connection latency

Database Query Costs

⚠️ Important: You are responsible for your database costs

When you connect a cloud database (BigQuery, Snowflake, Redshift, Databricks), queries run under your credentials and are billed to your account by your database provider. This app does not charge you for database usage, but your provider does.

Exactly what queries are run when I profile a table?

During initial profiling the following queries are executed once:

  1. Row count: A metadata lookup (instant, free on BigQuery/Snowflake/Redshift) or COUNT(*) fallback
  2. Column statistics: One batched query per table computing COUNT, COUNT DISTINCT, AVG, MIN, MAX, STDDEV for all columns simultaneously
  3. 100K random sample: SELECT * FROM table WHERE RAND() < threshold LIMIT 100000 — this is the most significant query; all subsequent analysis reuses this sample
  4. Top values for categorical columns: One query per categorical column
  5. Outlier detection: Uses the cached sample — no additional query

After profiling completes, the 100K sample is cached in memory. Most interactive features (correlations, ANOVA, segmentation, histograms) use this cache and cost nothing additional.

Which features trigger new database queries after profiling?

  • Correlations, ANOVA, Chi-square: No — uses cache
  • Segmentation analysis: No — uses cache
  • Histogram re-binning: No — uses cache
  • Derived variable (Profile Now): Yes, once — statistics query on full table
  • Compare Variables (scatter): Yes, per pair — full-table correlation query
  • Pareto analysis: Yes, once per category × metric — GROUP BY aggregation (cached after first run)
  • Groupings tab: Yes, per selection — GROUP BY aggregation
  • Time series: Yes, per date × metric pair — date-grouped aggregation (cached after first run)
  • Outlier details (View Details): Yes, per column — SELECT * WHERE value outside bounds
  • PDF report generation: Mostly no — uses the cached profile and sample for almost everything; one query per numeric column only if you selected an id/date variable for the outlier detail section (fetches the actual outlier rows)

Tips for minimising database costs

  • Profile once — avoid re-profiling the same table repeatedly; the cached results persist for your session
  • Use Pareto and Groupings deliberately — these always run full-table queries
  • On BigQuery, the random sample query uses a probabilistic row filter (WHERE RAND() < threshold), which requires scanning all rows to apply the filter — so you are billed for the full table scan during profiling. However, subsequent features use the cached sample and cost nothing additional
  • File uploads and Google Sheets have no database query costs at all

Data Privacy & Security

Is my data stored permanently?

No. All data — uploaded files, Google Sheets content, and database samples — exists only in server memory for the duration of your session. Nothing is written to disk or persisted beyond your session. When your session ends or the server restarts, all data is cleared.

What happens to my database credentials?

Database credentials (host, username, password, tokens) are:

  • Stored only in your encrypted browser session for the duration of your visit
  • Used solely to establish the database connection you requested
  • Never logged to disk
  • Cleared when your session ends

Can the app modify my database?

No. The app is strictly read-only. It only executes SELECT queries. No INSERT, UPDATE, DELETE, or DROP statements are ever issued. For additional peace of mind, connect using a read-only database user.

Connection & Upload Issues

Why did my database connection fail?

Common causes:

  • Incorrect credentials (username, password, database name)
  • Wrong host or port number
  • Firewall blocking the connection
  • Database not running or accessible
  • SSL/TLS requirements not met
  • VPN required but not connected

Why is my Google Sheets connection failing?

  • The sheet must be shared with the Google service account or accessible via the OAuth token
  • Ensure the Google Sheets URL is correct and the sheet is not restricted to specific users
  • OAuth tokens expire — try reconnecting if you get an authentication error
  • The first row must contain column headers

Why is my file upload failing?

Check these common issues:

  • File too large (>100MB limit)
  • Unsupported format (must be CSV, Excel .xlsx/.xls, or TSV)
  • File corrupted or empty
  • Special characters in column names
  • Encoding issues (try UTF-8)

Charts not rendering?

Troubleshooting steps:

  1. Refresh the page (Ctrl+F5 or Cmd+Shift+R)
  2. Check browser console for JavaScript errors (F12)
  3. Try a different browser (Chrome or Firefox recommended)
  4. Clear browser cache and cookies
  5. Disable browser extensions that might interfere

Still have questions?

If you can't find the answer you're looking for, please check our documentation or reach out for support.