Tableau Prep Analysis Suite

DIY Data Source Analysis

I recently learned how to use the Tableau Metadata API as the result of a DataDev Quest challenge (link). This API allows you to query your data sources on Tableau Server and learn about your workbooks, data sources and data elements without having to sift through a large export. It is extremely powerful and surprisingly simple to use. I was handed a scenario in the workplace that seemed like a perfect opportunity to showcase the Metadata API.

🌍The Real-World Problem

We’ve all been there: inheriting a collection of Tableau Prep flows with no clear picture of which database fields are critical and which are just taking up space. In enterprise environments with hundreds of database columns across dozens of tables—whether Oracle, SQL Server, PostgreSQL, or MySQL. -It isn’t just inconvenient, it’s a real business problem.

The ideal solution would be Tableau’s Metadata API, but many organizations don’t have Server or Cloud deployments that make this accessible. With a Tableau Creator license, I found myself with the inability to use the Metadata API. I figured there must be another way. I started to research how I might be able to meet my needs programmatically. Rather than export all data sources and try and trace relationships between data elements, I came up with a solution by reverse-engineering what’s actually inside Tableau Prep files.

🔍What’s Really Inside a .tfl File?

Here’s something that might surprise you: Tableau Prep files are essentially ZIP archives with fancy file extensions. Crack one open (literally—rename it to .zip), and you’ll find JSON files, XML configurations, and a wealth of metadata about your transformations.

# What you'll find when you crack open a Prep file
{
  "nodes": {
    "node123": {
      "nodeType": "loadsql",
      "name": "Load Data",
      "relation": {"table": "myserver.schema.CUSTOMERS"},
      "fields": [
        {"name": "CUSTOMER_ID", "type": "integer"},
        {"name": "CUSTOMER_NAME", "type": "string"}
      ]
    }
  },
  "connections": {
    "conn456": {
      "connectionType": "oracle",
      "connectionAttributes": {
        "server": "myserver",
        "schema": "analytics"
      }
    }
  }
}

# What you'll find when you crack open a Prep file
{
  "nodes": {
    "node123": {
      "nodeType": "loadsql",
      "name": "Load Data",
      "relation": {"table": "myserver.schema.CUSTOMERS"},
      "fields": [
        {"name": "CUSTOMER_ID", "type": "integer"},
        {"name": "CUSTOMER_NAME", "type": "string"}
      ]
    }
  },
  "connections": {
    "conn456": {
      "connectionType": "oracle",
      "connectionAttributes": {
        "server": "myserver",
        "schema": "analytics"
      }
    }
  }
}

The challenge is that this information is scattered across multiple files in different formats. Getting a comprehensive picture of field usage requires systematic parsing and relationship mapping —which is exactly what I was able to build. While I do have working knowledge of Python, I found myself needing an LLM to help get me past some of my stuck points. Once I had a workable script, I wasn’t satisfied. I wondered if I could push the boundaries of the LLM (Claude) a bit further. The result was an acceleration of my own learning and a robust user-friendly solution.

🎯No Command Line-Expertise Required

Instead of dealing with complex command-line syntax, users are guided through a number of menu options.

Enter your choice (1-5):
1. All databases (comprehensive analysis)  
2. Oracle only
3. SQL Server only
4. PostgreSQL only
5. MySQL only

Enter your choice (1-5):
1. All databases (comprehensive analysis)  
2. Oracle only
3. SQL Server only
4. PostgreSQL only
5. MySQL only

This helps make the toolkit accessible to business analysts and data stewards who need the insights but may not be comfortable with command-line tools. Users are guided by clear prompts to help arrive at results quickly.

⚖️Tiered Analysis Based on Need

Basic Version: Perfect for quick wins and large-scale analysis:

Processes typical flows in under 30 seconds
Field tracking across all source types
Extracts core field relationships and calculations
Provides essential usage metrics for decision-making
Handles batch processing of dozens of flows efficiently
Ideal for initial review and regular monitoring

Enhanced Version: For when you need the complete story:

Dependency tracking and relationship mapping
Captures comprehensive transformation chains (filters, joins, aggregations)
Detailed reasoning: “Field X is used in a join between tables A and B”
Scores based on complexity and pattern analysis
Tracks complex relationships across transformation steps
Great for migration planning or documentation

Users can decide the depth of analysis based on need. The choice depends on whether you need a quick health check or comprehensive documentation for compliance and migration planning.

🖥️Not Just for Database Sources

While my focus was on Oracle and Tableau Server environments, I expanded each script to support all major database types in addition to flat files. It automatically detects and analyzes:

Databases: Oracle, SQL Server, PostgreSQL, MySQL, Snowflake, BigQuery, and more
Published Sources: Tableau Server data sources
File Sources: CSV, Excel, JSON, and text files
Mixed Environments: Automatically detects flows that combine multiple source types

def get_database_type_from_source(data_source):
    """The toolkit automatically detects your database type"""
    if data_source.startswith('Oracle:'):
        return 'Oracle'
    elif data_source.startswith('SQL Server:'):
        return 'SQL Server'
    elif data_source.startswith('PostgreSQL:'):
        return 'PostgreSQL'
    elif data_source.startswith('MySQL:'):
        return 'MySQL'
    elif data_source.startswith('File:'):
        return 'File Source'
    else:
        return 'Database'  # Generic SQL database

def get_database_type_from_source(data_source):
    """The toolkit automatically detects your database type"""
    if data_source.startswith('Oracle:'):
        return 'Oracle'
    elif data_source.startswith('SQL Server:'):
        return 'SQL Server'
    elif data_source.startswith('PostgreSQL:'):
        return 'PostgreSQL'
    elif data_source.startswith('MySQL:'):
        return 'MySQL'
    elif data_source.startswith('File:'):
        return 'File Source'
    else:
        return 'Database'  # Generic SQL database

🔄Guided Workflow

The set up creates a natural progression from quick check to thorough analysis:

Phase 1: Quick Assessment

python tableau_prep_analyzer_basic.py
# Choose: All databases (1) for comprehensive view
# Result: 6 detailed reports per flow in minutes

python tableau_prep_analyzer_basic.py
# Choose: All databases (1) for comprehensive view
# Result: 6 detailed reports per flow in minutes

Phase 2: Deep Dive (When Necessary)

python tableau_prep_analyzer_enhanced.py  
# Choose: Deep analysis (2) for dependency tracking
# Result: 8+ reports including complexity and relationship analysis

python tableau_prep_analyzer_enhanced.py  
# Choose: Deep analysis (2) for dependency tracking
# Result: 8+ reports including complexity and relationship analysis

Phase 3: Aggregated Results

python database_field_aggregator_enhanced.py
# Choose: Specific database type or comprehensive view
# Result: Cross-flow insights and usage patterns

python database_field_aggregator_enhanced.py
# Choose: Specific database type or comprehensive view
# Result: Cross-flow insights and usage patterns

This three-phase approach provides quick answers as well comprehensive documentation when necessary. Users begin with a broad analysis and drill down when needed.

✔️Practical Results That Matter

Let’s look at what these scripts actually produce (.csv format). This example assumes an Oracle database source, but again the scripts allow for multiple data sources. Instead of guessing which fields matter, you get definitive answers:

Field Name,Data Source,Used,Locations,Usage Reasons
RECORD_NUMBER,Oracle: myserver.analytics.MAIN_DATA,Yes,"entity_flow, summary_flow","entity_flow: Direct output | summary_flow: Join condition"
LEGACY_DEPT_CODE,Oracle: myserver.department.OLD_STRUCTURE,No,reference_flow,"Never referenced in any transformation"
CALCULATED_AMOUNT,Oracle: myserver.department.TRANSACTION_DATA,Yes,processing_flow,"processing_flow: Calculation: Total_Charges (in output)"

Field Name,Data Source,Used,Locations,Usage Reasons
RECORD_NUMBER,Oracle: myserver.analytics.MAIN_DATA,Yes,"entity_flow, summary_flow","entity_flow: Direct output | summary_flow: Join condition"
LEGACY_DEPT_CODE,Oracle: myserver.department.OLD_STRUCTURE,No,reference_flow,"Never referenced in any transformation"
CALCULATED_AMOUNT,Oracle: myserver.department.TRANSACTION_DATA,Yes,processing_flow,"processing_flow: Calculation: Total_Charges (in output)"

The script produces results that identify data source/database elements and help answer questions such as:

Which columns can be safely retired across all platforms?
What is the downstream impact of changing a particular field in a mixed environment?
Which Excel files are still meeting important needs?
Are calculated fields created the same across data sources?

This level of detail enables informed decisions about database cleanup, performance optimization, and migration planning.

💻How to Get Started

Getting started is pretty straightforward:

After downloading the necessary scripts and identifying your source prep flows, run a given script from you command line or preferred location and follow the prompts.

# Quick analysis across all your flows
python tableau_prep_analyzer_basic.py
# Follow the guided prompts:
# 1. Enter your flows directory
# 2. Choose output location (or use auto-generated)  
# 3. Select database filter (1 for all sources)
# 4. Decide whether to clear existing output

# Enhanced analysis with deep insights
python tableau_prep_analyzer_enhanced.py
# Additional prompts guide you through:
# - Analysis depth selection (standard vs deep)
# - Dependency tracking options
# - Pattern analysis preferences

# Quick analysis across all your flows
python tableau_prep_analyzer_basic.py
# Follow the guided prompts:
# 1. Enter your flows directory
# 2. Choose output location (or use auto-generated)  
# 3. Select database filter (1 for all sources)
# 4. Decide whether to clear existing output

# Enhanced analysis with deep insights
python tableau_prep_analyzer_enhanced.py
# Additional prompts guide you through:
# - Analysis depth selection (standard vs deep)
# - Dependency tracking options
# - Pattern analysis preferences

Here’s what you can expect:

# The interactive prompts guide you through:
# Step 1: Basic analysis for overview
print("Starting with basic analysis...")
# Run tableau_prep_analyzer_basic.py
# Choose: All databases (1), auto-generated output

# Step 2: Enhanced analysis for critical flows  
print("Deep dive on complex flows...")
# Run tableau_prep_analyzer_enhanced.py
# Choose: Deep analysis (2), dependency tracking

# Step 3: Enterprise aggregation
print("Generating cross-flow insights...")
# Run database_field_aggregator_enhanced.py
# Choose: Specific database filter or comprehensive view

# The interactive prompts guide you through:
# Step 1: Basic analysis for overview
print("Starting with basic analysis...")
# Run tableau_prep_analyzer_basic.py
# Choose: All databases (1), auto-generated output

# Step 2: Enhanced analysis for critical flows  
print("Deep dive on complex flows...")
# Run tableau_prep_analyzer_enhanced.py
# Choose: Deep analysis (2), dependency tracking

# Step 3: Enterprise aggregation
print("Generating cross-flow insights...")
# Run database_field_aggregator_enhanced.py
# Choose: Specific database filter or comprehensive view

Within minutes, you have detailed CSV reports that show you exactly how each field flows through your transformations, with the ability to filter by specific database types or analyze your entire multi-database environment.

⏱️Performance Expectations

Here’s what you can expect:

Individual Prep Flow Analysis

Simple flows (< 10 nodes): 1-3 seconds
Medium complexity (10-50 nodes): 5-20 seconds
Complex flows (> 50 nodes): 30-90 seconds

Batch Processing

Multiple flows: 3-8 seconds per flow
Prompts allow user to follow steps in the process

The enhanced analysis takes longer but provides significantly more insight. For most use cases, the duration pays off when you consider the manual effort saved.

📈Helpful in Business Environments

Both options can be helpful in multiple scenarios:

Migration Projects: Help understand field dependencies before database upgrades or system transitions.

Compliance Documentation: Automated outputs can be used for documentation and help with regulatory requirements, saving weeks of manual effort especially when using multiple database platforms.

Performance Optimization: Database teams can identify unused fields for cleanup across Oracle, SQL Server, and PostgreSQL environments, resulting in potential performance improvements.

Multi-Database Analysis: For those with mixed database environments, you can get a comprehensive view of field usage patterns across database platforms.

Avoid “Brain Drain”: When team members move to different projects or are no longer with the organization, the detailed reports can help provide comprehensive documentation of calculated field logic, regardless of the database or source.

For teams attempting analyses across multiple database types, here’s a script that automates the entire workflow:

#!/bin/bash
# Enterprise batch processing across database types
for database_type in "Oracle" "SQL Server" "PostgreSQL"; do
    echo "Analyzing $database_type fields..."
    python database_field_aggregator_enhanced.py
    # Interactive prompts guide database-specific analysis
done
echo "Analysis complete for all database types!"

#!/bin/bash
# Enterprise batch processing across database types
for database_type in "Oracle" "SQL Server" "PostgreSQL"; do
    echo "Analyzing $database_type fields..."
    python database_field_aggregator_enhanced.py
    # Interactive prompts guide database-specific analysis
done
echo "Analysis complete for all database types!"

💡My Big Takeaways

Building this tool gave me a better understanding of Tableau Prep’s internal structure:

Consistency: The format of the internal Prep files looked to be consistent across multiple Prep versions. There’s hope that the scripts provided can work with the next Tableau Prep update.
Completeness: All the calculated fields and transformation logic deduced from the Tableau Prep internal files looked to be comprehensive.
Accessibility: The JSON/XML structure allows the user to read the underlying structure, making troubleshooting easier.

⚠️Potential Limitations

This approach isn’t perfect. It depends on Tableau’s internal file structure, which could theoretically change in future versions. Processing time can increase depending upon the complexity of your prep flows, and the results still require the user to be knowledgeable about the source data.

🚀Value Added

For those with complex Tableau Prep environments, this toolkit could be valuable with data governance and documentation. Whether you’re planning a migration, looking to make your flows more efficient, or just trying to understand what your flows are actually doing with your Oracle, SQL Server, PostgreSQL, and MySQL data, having the ability to trace your data transformations is incredibly valuable.

*Ready to analyze your own Prep flows? The complete toolkit with documentation and examples is aimed at helping you get started within minutes of downloading. Check out the kit on my Github profile (link).

DATA[dev].diary

Tableau Prep Analysis Suite

DIY Data Source Analysis

Like this: