K-Core Decomposition for Ultramarathon Data

The Sparsity Problem

Ultramarathon race results present a fascinating modeling challenge. With thousands of races and hundreds of thousands of runners, you might think we'd have plenty of data—but the reality is far more complex. Most runners complete only a handful of races, and most races see only a fraction of the broader running population. This creates an extremely sparse dataset where direct comparisons between runners or courses are often impossible.

Imagine trying to compare a runner who only does Western States with someone who only does Leadville. Without shared race participation, we have no common reference point. This sparsity becomes a critical problem when we want to build statistical models that estimate runner abilities or course difficulties.

The solution? Rather than wrestling with the full sparse dataset, we can identify a densely-connected subset where runners and courses have enough overlap for meaningful inference. This is where k-core decomposition comes in—a graph-theoretic technique that iteratively prunes nodes until every remaining node meets minimum connectivity requirements.

In this notebook, we'll:

Explore the space of possible k-core configurations
Visualize the tradeoffs between subset size and density
Select and examine a specific k-core for downstream modeling

Setup

We begin by loading race results from UltraSignup, a comprehensive database of ultramarathon events. The preprocessing pipeline filters to running events (excluding cycling, swimming, etc.), extracts distance information, and identifies races that report DNFs (Did Not Finish)—a key signal for our later analysis.

We focus on marathon-distance and longer events (≥26.2 miles) since these are the races where DNF patterns become most meaningful and where the ultramarathon community truly begins.

import pandas as pd
import numpy as np
import plotly.graph_objects as go
import igraph as ig

from utils.data_processing import load_results, process_results, filter_races_with_dnfs
from utils.kcore_subsetting import evaluate_all_kcore_candidates, subset_kcore_data

# preprocess data

results = load_results()
results = process_results(results)
results = filter_races_with_dnfs(results)
results = results[results['distance_miles'] >= 26.2]

Exploring the K-Core Parameter Space

What is a (α, β)-Core?

Think of ultramarathon data as a bipartite graph: one set of nodes represents runners, another represents courses (race-distance combinations), and edges connect runners to courses they've completed. An (α, β)-core is the largest subgraph where:

Every runner has completed at least α different courses
Every course has at least β different runners

The parameters α and β control the density-size tradeoff:

Higher α means runners must have more race history → fewer runners qualify
Higher β means courses need more participants → fewer courses qualify

Finding the right balance is crucial. Too loose, and we retain too much sparsity for effective modeling. Too strict, and we throw away valuable data.

Breadth-First Exploration

Rather than testing every possible (α, β) combination, we use an intelligent breadth-first search. Starting from minimal requirements, we incrementally tighten constraints and track how the subset evolves—measuring not just size, but also completeness metrics that indicate how representative the k-core remains of each runner's and course's full history.

# search k-core space

candidates_df = evaluate_all_kcore_candidates(
    results=results,
    max_entities=None,  # explore full parameter space
    min_alpha=2,  # courses per runner
    min_beta=30,  # runners per course
    beta_step=2,  
    verbose=True,
    use_cache=True  
)

✅ Loading cached k-core candidates from candidates_maxentNone_alpha2_beta30_step2.pkl

Visualizing the Parameter Landscape

The 3D visualization below reveals the fundamental tradeoffs in k-core selection. Each point on the surface represents a different (α, β) configuration:

X-axis (Course Completeness): On average, what percentage of each course's total runners are retained in the k-core?
Y-axis (Runner Completeness): On average, what percentage of each runner's race history is retained?
Z-axis: Configurable—toggle between course count, total entities, or total results

Key patterns to observe:

The surface drops off sharply as we move toward higher completeness values—achieving both high course and runner completeness simultaneously requires sacrificing dataset size
The "ridge" of the surface shows the Pareto frontier: configurations where you can't improve one metric without sacrificing another
Hover over points to see the exact (α, β) values and detailed statistics

Use the toggle buttons to explore different perspectives on the data.

def plot_3d_kcore_space(candidates_df: pd.DataFrame, full_results: pd.DataFrame = None):
    """
    Create interactive 3D wireframe visualization of k-core parameter space.
    
    Displays parameter space as a grid wireframe, connecting discrete (α, β) points
    from breadth-first search exploration.
    
    Args:
        candidates_df: All k-core candidates with metrics
        full_results: Full dataset for calculating baseline density reference planes
    """
    df = candidates_df.copy()
    
    # Calculate total entities
    df['n_entities'] = df['n_courses'] + df['n_participants']
    
    # Get unique sorted alpha and beta values
    alpha_values = sorted(df['alpha'].unique())
    beta_values = sorted(df['beta'].unique())
    
    # Create grid lookup with all z metrics
    grid_data = {}
    for _, row in df.iterrows():
        key = (row['alpha'], row['beta'])
        grid_data[key] = {
            'x': row['avg_course_completeness'],
            'y': row['avg_runner_completeness'],
            'z_courses': row['n_courses'],
            'z_entities': row['n_entities'],
            'z_total_results': row['total_results'],
            'n_courses': row['n_courses'],
            'n_entities': row['n_entities'],
            'total_results': row['total_results'],
            'hover': (
                f"α={int(row['alpha'])}, β={int(row['beta'])}<br>"
                f"Courses: {int(row['n_courses']):,}<br>"
                f"Runners: {int(row['n_participants']):,}<br>"
                f"Total Entities: {int(row['n_entities']):,}<br>"
                f"Total Results: {int(row['total_results']):,}<br>"
                f"Overall Density: {row['density']:.1f}<br>"
                f"Runner completeness: {row['avg_runner_completeness']:.1f}%<br>"
                f"Course completeness: {row['avg_course_completeness']:.1f}%"
            )
        }
    
    # Determine color scale ranges for each metric
    min_courses = df['n_courses'].min()
    max_courses = df['n_courses'].max()
    min_entities = df['n_entities'].min()
    max_entities = df['n_entities'].max()
    min_results = df['total_results'].min()
    max_results = df['total_results'].max()
    
    fig = go.Figure()
    
    # Helper function to create traces for a given z metric and color metric
    def create_traces(z_key, color_key):
        traces = []
        
        # Get color metric min/max
        if color_key == 'n_courses':
            color_min, color_max = min_courses, max_courses
        elif color_key == 'n_entities':
            color_min, color_max = min_entities, max_entities
        else:  # total_results
            color_min, color_max = min_results, max_results
        
        # Draw horizontal grid lines (constant alpha, varying beta)
        for alpha in alpha_values:
            x_line, y_line, z_line, color_line, hover_line = [], [], [], [], []
            for beta in beta_values:
                if (alpha, beta) in grid_data:
                    point = grid_data[(alpha, beta)]
                    x_line.append(point['x'])
                    y_line.append(point['y'])
                    z_line.append(point[z_key])
                    color_line.append(point[color_key])
                    hover_line.append(point['hover'])
            
            if len(x_line) > 1:  # Only draw if we have multiple points
                # Use average color for the line segment (very light to very dark blue)
                avg_color = sum(color_line) / len(color_line)
                normalized_color = (avg_color - color_min) / (color_max - color_min) if color_max > color_min else 0
                
                # Very light blue (224, 242, 255) to very dark blue (0, 30, 80)
                r = int(224 + normalized_color * (0 - 224))
                g = int(242 + normalized_color * (30 - 242))
                b = int(255 + normalized_color * (80 - 255))
                
                traces.append(go.Scatter3d(
                    x=x_line,
                    y=y_line,
                    z=z_line,
                    mode='lines',
                    line=dict(
                        color=f'rgb({r}, {g}, {b})',
                        width=2
                    ),
                    text=hover_line,
                    hoverinfo='text',
                    showlegend=False,
                    visible=False  # Will be set to True for initial z metric
                ))
        
        # Draw vertical grid lines (constant beta, varying alpha)
        for beta in beta_values:
            x_line, y_line, z_line, color_line, hover_line = [], [], [], [], []
            for alpha in alpha_values:
                if (alpha, beta) in grid_data:
                    point = grid_data[(alpha, beta)]
                    x_line.append(point['x'])
                    y_line.append(point['y'])
                    z_line.append(point[z_key])
                    color_line.append(point[color_key])
                    hover_line.append(point['hover'])
            
            if len(x_line) > 1:  # Only draw if we have multiple points
                # Use average color for the line segment (very light to very dark blue)
                avg_color = sum(color_line) / len(color_line)
                normalized_color = (avg_color - color_min) / (color_max - color_min) if color_max > color_min else 0
                
                # Very light blue (224, 242, 255) to very dark blue (0, 30, 80)
                r = int(224 + normalized_color * (0 - 224))
                g = int(242 + normalized_color * (30 - 242))
                b = int(255 + normalized_color * (80 - 255))
                
                traces.append(go.Scatter3d(
                    x=x_line,
                    y=y_line,
                    z=z_line,
                    mode='lines',
                    line=dict(
                        color=f'rgb({r}, {g}, {b})',
                        width=2
                    ),
                    text=hover_line,
                    hoverinfo='text',
                    showlegend=False,
                    visible=False  # Will be set to True for initial z metric
                ))
        
        return traces
    
    # Create traces for all 9 combinations (3 z-metrics × 3 color-metrics)
    trace_sets = {}
    for z_key in ['z_courses', 'z_entities', 'z_total_results']:
        for color_key in ['n_courses', 'n_entities', 'total_results']:
            trace_sets[(z_key, color_key)] = create_traces(z_key, color_key)
    
    # Add all traces (start with courses z-axis, total_results coloring)
    for (z_key, color_key), traces in trace_sets.items():
        for trace in traces:
            if z_key == 'z_courses' and color_key == 'total_results':
                trace.visible = True
            fig.add_trace(trace)
    
    # Add a single invisible trace for the colorbar
    fig.add_trace(go.Scatter3d(
        x=[df['avg_course_completeness'].iloc[0]],
        y=[df['avg_runner_completeness'].iloc[0]],
        z=[df['n_courses'].iloc[0]],
        mode='markers',
        marker=dict(
            size=0.1,
            color=[min_results],
            colorscale=[[0, 'rgb(224, 242, 255)'], [1, 'rgb(0, 30, 80)']],
            showscale=True,
            colorbar=dict(
                title="Total Results",
                x=1.1,
                tickformat=','
            ),
            cmin=min_results,
            cmax=max_results
        ),
        showlegend=False,
        hoverinfo='skip',
        visible=True
    ))
    
    # Calculate number of traces per combination
    n_traces_per_combo = len(trace_sets[('z_courses', 'n_courses')])
    
    # Create visibility arrays for toggling
    def make_visible_array(z_metric, color_metric):
        """Create visibility array for all traces."""
        visible = []
        for (z_key, color_key) in trace_sets.keys():
            z_match = (z_key == f'z_{z_metric}')
            color_match = (color_key == color_metric)
            visible.extend([z_match and color_match] * n_traces_per_combo)
        # Colorbar trace
        visible.append(True)
        return visible
    
    # Get current color metric name for colorbar updates
    def get_color_title(color_metric):
        if color_metric == 'n_courses':
            return "Course Count"
        elif color_metric == 'n_entities':
            return "Total Entities"
        else:
            return "Total Results"
    
    def get_color_range(color_metric):
        if color_metric == 'n_courses':
            return [min_courses, max_courses]
        elif color_metric == 'n_entities':
            return [min_entities, max_entities]
        else:
            return [min_results, max_results]
    
    # Update layout
    fig.update_layout(
        title=dict(
            text='Interactive 3D K-Core Parameter Space Wireframe',
            x=0.5,
            xanchor='center',
            font=dict(size=18)
        ),
        scene=dict(
            xaxis=dict(
                title='Course Completeness', 
                backgroundcolor="rgb(230, 230,230)", 
                gridcolor="white",
                ticksuffix='%'
            ),
            yaxis=dict(
                title='Runner Completeness', 
                backgroundcolor="rgb(230, 230,230)", 
                gridcolor="white",
                ticksuffix='%'
            ),
            zaxis=dict(
                title='Course Count', 
                backgroundcolor="rgb(230, 230,230)", 
                gridcolor="white",
                separatethousands=True
            ),
            camera=dict(
                eye=dict(x=1.5, y=-1.5, z=1.3)  # Runner completeness on right, course on left
            )
        ),
        width=1000,
        height=850,
        hovermode='closest',
        updatemenus=[
            # Z-axis toggle buttons
            dict(
                type="buttons",
                direction="left",
                buttons=[
                    dict(
                        label="Z-axis: Courses",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'total_results')},
                            {
                                "scene.zaxis.title": "Course Count",
                                "scene.zaxis.range": [0, df['n_courses'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                    dict(
                        label="Z-axis: Total Entities",
                        method="update",
                        args=[
                            {"visible": make_visible_array('entities', 'total_results')},
                            {
                                "scene.zaxis.title": "Total Entities (Courses + Runners)",
                                "scene.zaxis.range": [0, df['n_entities'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                    dict(
                        label="Z-axis: Total Results",
                        method="update",
                        args=[
                            {"visible": make_visible_array('total_results', 'total_results')},
                            {
                                "scene.zaxis.title": "Total Results",
                                "scene.zaxis.range": [0, df['total_results'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                ],
                x=0.5,
                xanchor="center",
                y=1.08,
                yanchor="top",
                bgcolor="rgba(255, 255, 255, 0.8)",
                bordercolor="rgba(100, 100, 100, 0.5)",
                borderwidth=1
            ),
            # Color mapping toggle buttons
            dict(
                type="buttons",
                direction="left",
                buttons=[
                    dict(
                        label="Color: Courses",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'n_courses')},
                            {
                                "marker.colorbar.title": "Course Count",
                                "marker.cmin": min_courses,
                                "marker.cmax": max_courses
                            }
                        ]
                    ),
                    dict(
                        label="Color: Total Entities",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'n_entities')},
                            {
                                "marker.colorbar.title": "Total Entities",
                                "marker.cmin": min_entities,
                                "marker.cmax": max_entities
                            }
                        ]
                    ),
                    dict(
                        label="Color: Total Results",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'total_results')},
                            {
                                "marker.colorbar.title": "Total Results",
                                "marker.cmin": min_results,
                                "marker.cmax": max_results
                            }
                        ]
                    ),
                ],
                x=0.5,
                xanchor="center",
                y=1.03,
                yanchor="top",
                bgcolor="rgba(255, 255, 255, 0.8)",
                bordercolor="rgba(100, 100, 100, 0.5)",
                borderwidth=1
            )
        ]
    )
    
    fig.show()

# Visualize the parameter space with interactive 3D plot
plot_3d_kcore_space(candidates_df=candidates_df, full_results=results)

Loading Plotly visualization...

Examining the (3, 840) K-Core

Based on our exploration, we select the (α=3, β=840) configuration for downstream modeling. This choice balances several considerations:

α=3: Runners must have completed at least 3 different course types. This ensures we have enough repeated observations per runner to estimate individual ability, while not being so restrictive that we lose casual participants.
β=840: Courses must have at least 840 unique runners. This focuses on well-established events with substantial participation—races where we can reliably estimate course difficulty and where DNF reporting is likely to be consistent.

The resulting subset represents the "core" ultramarathon community: dedicated runners who return to the sport repeatedly, competing in popular events that form the backbone of the ultra calendar.

Visualizing the Bipartite Structure

With our k-core selected, we can now visualize the runner-course relationships. These visualizations reveal the community structure hidden within the data—clusters of runners who share similar race portfolios, courses that attract overlapping populations, and the overall connectivity patterns that make comparative inference possible.

Building the Subset

We extract all results matching our k-core criteria and compute basic network statistics. The degree of a runner is the number of unique courses they've completed; the degree of a course is its number of unique finishers. These degree distributions tell us about the heterogeneity within our subset—are there a few "super-connectors" or is participation relatively uniform?

# Create (3, 840) k-core subset
alpha, beta = 3, 840

# Ensure fresh copy without any previous course_id columns
results_clean = results.drop(columns=['course_id'], errors='ignore')
kcore_subset = subset_kcore_data(results_clean, alpha=alpha, beta=beta)
kcore_data = kcore_subset[kcore_subset['in_kcore']].copy()

# Get degree statistics
runner_degrees = kcore_data.groupby('participant_id').size()
course_degrees = kcore_data.groupby('event_distance_id').size()

# Use ALL runners in k-core (no sampling)
kcore_sample = kcore_data.copy()

# Count edges (participation frequency per runner-course pair)
edge_counts = kcore_sample.groupby(['participant_id', 'event_distance_id']).size().reset_index(name='edge_weight')

print(f"K-core (α={alpha}, β={beta}) statistics:")
print(f"  Total courses: {kcore_data['event_distance_id'].nunique()}")
print(f"  Total runners: {kcore_data['participant_id'].nunique()}")
print(f"  Total results: {len(kcore_data):,}")
print(f"  Unique edges: {len(edge_counts)}")
print(f"  Max edge weight: {edge_counts['edge_weight'].max()}")

K-core (α=3, β=840) statistics:
  Total courses: 367
  Total runners: 3061
  Total results: 31,490
  Unique edges: 31490
  Max edge weight: 1

Adjacency Matrix Heatmap

The adjacency matrix provides a bird's-eye view of runner-course relationships. Each row is a runner, each column is a course, and cells are colored by percentile finish time (darker = faster relative to that course's field).

We apply hierarchical clustering to both rows and columns, reordering them to reveal latent structure. Look for:

Block patterns: Groups of runners who share similar race portfolios (they cluster together vertically) and tend to perform similarly across those races
Color gradients: Runners who consistently finish in similar percentiles across different courses (horizontal color consistency indicates stable relative ability)
Sparse regions: The gaps remind us that even within our dense k-core, not every runner-course combination is observed

This matrix shows only the top 100 runners by degree for readability, but the patterns extend throughout the full dataset.

import plotly.express as px
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Sample top 100 runners for adjacency matrix visualization
top_runners = runner_degrees.nlargest(100).index
kcore_matrix_sample = kcore_sample[kcore_sample['participant_id'].isin(top_runners)]

# Create adjacency matrix with percentile finish times
matrix_data = kcore_matrix_sample.groupby(['participant_id', 'event_distance_id'])['time_ms'].mean().reset_index()

# Calculate percentile for each course
percentile_data = []
for event_id in matrix_data['event_distance_id'].unique():
    # Get all finish times for this course
    course_times = kcore_matrix_sample[kcore_matrix_sample['event_distance_id'] == event_id]['time_ms']
    
    # Get participant times for this course
    participant_times = matrix_data[matrix_data['event_distance_id'] == event_id].copy()
    
    # Calculate percentile rank (0-100)
    participant_times['percentile'] = participant_times['time_ms'].rank(pct=True) * 100
    
    percentile_data.append(participant_times)

# Combine all percentiles
percentile_df = pd.concat(percentile_data, ignore_index=True)

# Create matrix with percentiles
matrix = percentile_df.pivot(index='participant_id', columns='event_distance_id', values='percentile')

# Fill NaN values with median for clustering (represents missing races)
matrix_filled = matrix.fillna(matrix.median().median())

# Perform hierarchical clustering on rows (runners)
row_linkage = linkage(pdist(matrix_filled, metric='euclidean'), method='ward')
row_dendrogram = dendrogram(row_linkage, no_plot=True)
row_order = row_dendrogram['leaves']

# Perform hierarchical clustering on columns (courses)
col_linkage = linkage(pdist(matrix_filled.T, metric='euclidean'), method='ward')
col_dendrogram = dendrogram(col_linkage, no_plot=True)
col_order = col_dendrogram['leaves']

# Reorder matrix using clustering results
matrix_clustered = matrix.iloc[row_order, col_order]

# Get runner names for hover text
runner_names = kcore_matrix_sample[['participant_id', 'firstname', 'lastname']].drop_duplicates('participant_id').set_index('participant_id')
runner_names['full_name'] = runner_names['firstname'] + ' ' + runner_names['lastname']

# Create custom hover text with runner names
matrix_clustered_with_names = matrix_clustered.copy()
matrix_clustered_with_names.index = [
    runner_names.loc[pid, 'full_name'] if pid in runner_names.index else str(pid) 
    for pid in matrix_clustered.index
]

# Get race names for column labels - keep event_distance_id to ensure uniqueness
race_info = kcore_matrix_sample[['event_distance_id', 'name']].drop_duplicates('event_distance_id').set_index('event_distance_id')
# Create unique labels by combining name with event_distance_id
matrix_clustered_with_names.columns = [
    f"{race_info.loc[col, 'name'][:25]}... (ID:{col})" if len(race_info.loc[col, 'name']) > 25 
    else f"{race_info.loc[col, 'name']} (ID:{col})" 
    for col in matrix_clustered_with_names.columns
]

fig = px.imshow(
    matrix_clustered_with_names,
    aspect='auto',
    color_continuous_scale='Viridis',
    labels=dict(color="Percentile", x="Course", y="Runner"),
    title=f"Runner × Course Adjacency Matrix (α={alpha}, β={beta}, Top 100 Runners)<br><sub>Color = Percentile Finish Time (higher = slower) | Hierarchically Clustered</sub>"
)

fig.update_layout(
    width=1200,
    height=800,
    xaxis=dict(tickangle=-45, tickfont=dict(size=8), showgrid=False, showline=False, showticklabels=False, title=''),
    yaxis=dict(tickfont=dict(size=6), showgrid=False, showline=False, showticklabels=False, title=''),
    plot_bgcolor='white',
    paper_bgcolor='white'

)

fig.show()

Loading Plotly visualization...

Force-Directed Network

The final visualization presents the k-core as a force-directed network graph. This physics-inspired layout treats edges as springs and nodes as mutually-repelling particles, letting the graph "relax" into a natural arrangement where connected nodes cluster together.

Super-Node Compression

With thousands of runners, a naive visualization would be overwhelming. We address this by creating super-nodes: runners with identical race participation patterns are merged into a single node. This is semantically meaningful—runners who've done exactly the same set of races are essentially interchangeable from a network topology perspective.

The resulting graph reveals:

Central hubs: Courses that attract runners from across the community (high connectivity)
Peripheral clusters: Regional or niche race circuits with dedicated but isolated participant pools
Bridge nodes: Runners or courses that connect otherwise-separate sub-communities

Interactive features: Hover over any node to highlight its connections. Course nodes appear in coral; runner groups in the Viridis colorscale (colored by degree).

# Create super-nodes: group runners with identical race participation patterns
# This reduces graph complexity while preserving structure

# Build race participation signatures for each runner
runner_signatures = {}
for runner_id in kcore_sample['participant_id'].unique():
    races = frozenset(kcore_sample[kcore_sample['participant_id'] == runner_id]['event_distance_id'])
    runner_signatures[runner_id] = races

# Invert to create super-nodes: signature -> set of runners
unique_signatures = {}
for runner_id, sig in runner_signatures.items():
    if sig not in unique_signatures:
        unique_signatures[sig] = set()
    unique_signatures[sig].add(runner_id)

# Get unique courses
unique_courses = sorted(kcore_sample['event_distance_id'].unique())

# Get runner and race names for hover text
runner_names = kcore_sample[['participant_id', 'firstname', 'lastname']].drop_duplicates('participant_id')
runner_names = {row['participant_id']: f"{row['firstname']} {row['lastname']}" for _, row in runner_names.iterrows()}

race_names = kcore_sample[['event_distance_id', 'name']].drop_duplicates('event_distance_id')
race_names = {row['event_distance_id']: row['name'] for _, row in race_names.iterrows()}

print(f"Created {len(unique_signatures):,} super-nodes from {len(runner_signatures):,} runners")
print(f"Average group size: {len(runner_signatures) / len(unique_signatures):.1f} runners per super-node")
print(f"Largest group: {max(len(runners) for runners in unique_signatures.values())} runners")
print(f"Total courses: {len(unique_courses)}")

Created 2,952 super-nodes from 3,061 runners
Average group size: 1.0 runners per super-node
Largest group: 9 runners
Total courses: 367

import igraph as ig
import plotly.graph_objects as go
import numpy as np
import time
import json

start_time = time.time()

# Create bipartite graph from super-nodes
edges = []
edge_weights = {}

# Build runner index to super-node mapping
runner_to_super = {}
for super_idx, (sig, runners) in enumerate(unique_signatures.items()):
    for runner_id in runners:
        runner_to_super[runner_id] = super_idx

# Use dict for O(1) course lookups instead of list.index()
course_to_idx = {course_id: idx for idx, course_id in enumerate(unique_courses)}

# Build edges from edge counts
for _, row in edge_counts.iterrows():
    runner_id = row['participant_id']
    course_id = row['event_distance_id']
    
    if runner_id in runner_to_super:
        super_idx = runner_to_super[runner_id]
        # CRITICAL: Offset course index by number of super-nodes for bipartite graph
        course_idx = len(unique_signatures) + course_to_idx[course_id]
        edge_key = (super_idx, course_idx)
        edge_weights[edge_key] = edge_weights.get(edge_key, 0) + row['edge_weight']

# Create edges list
edges = list(edge_weights.keys())

# Create bipartite graph
layout_start = time.time()
g = ig.Graph.Bipartite(
    [0] * len(unique_signatures) + [1] * len(unique_courses),
    edges
)

# Compute layouts
layouts = {}

# Fruchterman-Reingold with maximum overlap prevention
# The 'grid' parameter is the key to preventing overlap
fr_start = time.time()
base_layout = g.layout_fruchterman_reingold(
    niter=10000,      # Many iterations for convergence
    grid='auto'       # Spatial grid prevents overlap - this is the critical parameter
)

layouts['Fruchterman-Reingold'] = base_layout

# Get node degrees
degrees = g.degree()

# Prepare visualization data
runner_names_list = []
for sig in unique_signatures.keys():
    runners = unique_signatures[sig]
    if len(runners) == 1:
        runner_id = list(runners)[0]
        runner_names_list.append(runner_names.get(runner_id, f"Runner {runner_id}"))
    else:
        runner_names_list.append(f"{len(runners)} runners")

course_names_list = [race_names.get(c, f"Course {c}") for c in unique_courses]

# Create figure with optimized dual-trace edge rendering
fig = go.Figure()

def add_optimized_layout(fig, layout):
    """Add layout with optimized edge rendering using dual traces."""
    pos_x = [coord[0] for coord in layout.coords]
    pos_y = [coord[1] for coord in layout.coords]
    
    # Build all edge coordinates and connectivity mapping
    edge_x = []
    edge_y = []
    node_to_segments = {}  # Maps node index -> list of segment indices
    node_neighbors = {}    # Maps node index -> list of connected node indices
    
    segment_idx = 0
    for edge_idx, edge in enumerate(g.es):
        source, target = edge.tuple
        
        # Track which segments connect to each node
        if source not in node_to_segments:
            node_to_segments[source] = []
            node_neighbors[source] = []
        if target not in node_to_segments:
            node_to_segments[target] = []
            node_neighbors[target] = []
        
        node_to_segments[source].append(segment_idx)
        node_to_segments[target].append(segment_idx)
        node_neighbors[source].append(target)
        node_neighbors[target].append(source)
        
        # Add edge coordinates (x1, x2, None for line break)
        edge_x.extend([pos_x[source], pos_x[target], None])
        edge_y.extend([pos_y[source], pos_y[target], None])
        
        segment_idx += 1
    
    # Add base edges trace (always visible, low opacity)
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=edge_x,
        y=edge_y,
        mode='lines',
        line=dict(width=0.5, color='rgba(200,200,200,0.3)'),
        hoverinfo='skip',  # Prevent hover on edges
        name='base_edges',
        showlegend=False
    ))
    
    # Add highlight edges trace (initially empty, updated by JavaScript)
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[],
        y=[],
        mode='lines',
        line=dict(width=2, color='rgba(255,50,50,0.9)'),
        hoverinfo='skip',
        name='highlight_edges',
        showlegend=False
    ))
    
    # Prepare node data
    super_node_indices = list(range(len(unique_signatures)))
    course_indices = list(range(len(unique_signatures), len(unique_signatures) + len(unique_courses)))
    
    # Calculate super-node sizes based on group size (even smaller)
    super_node_sizes = []
    super_node_degrees = []
    super_node_hover_text = []
    
    for idx, (sig, runners) in enumerate(unique_signatures.items()):
        group_size = len(runners)
        # Reduced base size and scaling factor for smaller nodes
        super_node_sizes.append(3 + np.log1p(group_size) * 2)
        super_node_degrees.append(degrees[idx])
        
        if group_size == 1:
            runner_id = list(runners)[0]
            name = runner_names.get(runner_id, f"Runner {runner_id}")
            super_node_hover_text.append(f"{name}<br>Degree: {degrees[idx]}")
        else:
            sample_names = [runner_names.get(r, f"Runner {r}") for r in list(runners)[:3]]
            hover = f"Group of {group_size} runners<br>" + "<br>".join(sample_names)
            if group_size > 3:
                hover += f"<br>... and {group_size - 3} more"
            hover += f"<br>Degree: {degrees[idx]}"
            super_node_hover_text.append(hover)
    
    # Course hover text
    course_hover_text = [
        f"{course_names_list[i]}<br>Degree: {degrees[len(unique_signatures) + i]}"
        for i in range(len(unique_courses))
    ]
    
    # Calculate course sizes with even more aggressive sub-linear scaling
    course_sizes = [2 + np.log1p(degrees[i]) * 1.5 for i in course_indices]
    
    # Add super-node trace
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[pos_x[i] for i in super_node_indices],
        y=[pos_y[i] for i in super_node_indices],
        mode='markers',
        marker=dict(
            size=super_node_sizes,
            color=super_node_degrees,
            colorscale='Viridis',
            showscale=False,
            symbol='circle',
            opacity=0.6,
            line=dict(width=0)
        ),
        text=super_node_hover_text,
        hoverinfo='text',
        name='Runner Groups',
        showlegend=False,
        customdata=[[i] for i in super_node_indices]
    ))
    
    # Add course node trace with sub-linear size scaling
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[pos_x[i] for i in course_indices],
        y=[pos_y[i] for i in course_indices],
        mode='markers',
        marker=dict(
            size=course_sizes,
            color='coral',
            symbol='circle',
            opacity=0.6,
            line=dict(width=0)
        ),
        text=course_hover_text,
        hoverinfo='text',
        name='Courses',
        showlegend=False,
        customdata=[[i] for i in course_indices]
    ))
    
    # Add highlight runner trace (initially empty, will show highlighted runners on top)
    fig.add_trace(go.Scattergl(
        x=[],
        y=[],
        mode='markers',
        marker=dict(
            size=[],
            color=[],
            colorscale='Viridis',
            symbol='circle',
            opacity=1.0,
            line=dict(width=2, color='white')
        ),
        text=[],
        hoverinfo='text',
        name='highlight_runners',
        showlegend=False
    ))
    
    # Add highlight course trace (initially empty, will show highlighted courses on top)
    fig.add_trace(go.Scattergl(
        x=[],
        y=[],
        mode='markers',
        marker=dict(
            size=[],
            color='coral',
            symbol='circle',
            opacity=1.0,
            line=dict(width=2, color='white')
        ),
        text=[],
        hoverinfo='text',
        name='highlight_courses',
        showlegend=False
    ))
    
    # Return connectivity mapping, edge coordinates, neighbor mapping, and all positions
    return node_to_segments, edge_x, edge_y, node_neighbors, pos_x, pos_y, super_node_sizes, super_node_degrees, super_node_hover_text, course_hover_text, course_sizes

# Add visualization
node_to_segs, base_edge_x, base_edge_y, node_neighbors, pos_x, pos_y, runner_sizes, runner_degrees, runner_hover, course_hover, course_sizes = add_optimized_layout(fig, layouts['Fruchterman-Reingold'])
connectivity_data = {
    'node_to_segments': {str(k): v for k, v in node_to_segs.items()},
    'node_neighbors': {str(k): v for k, v in node_neighbors.items()},
    'base_edge_x': base_edge_x,
    'base_edge_y': base_edge_y,
    'pos_x': pos_x,
    'pos_y': pos_y,
    'runner_sizes': runner_sizes,
    'runner_degrees': runner_degrees,
    'runner_hover': runner_hover,
    'course_hover': course_hover,
    'course_sizes': course_sizes
}

# Update layout
fig.update_layout(
    title=f"Bipartite Network: Runner Groups × Courses (α={alpha}, β={beta})<br><sub>10k iterations with grid='auto' for overlap prevention + smaller nodes</sub>",
    showlegend=False,
    hovermode='closest',
    width=1200,
    height=700,
    plot_bgcolor='white',
    uirevision='constant'
)

# Remove axes
fig.update_xaxes(showgrid=False, zeroline=False, showticklabels=False)
fig.update_yaxes(showgrid=False, zeroline=False, showticklabels=False)

# Generate HTML with Plotly.update() for notebook compatibility
import plotly.io as pio
from IPython.display import HTML, display

html_str = pio.to_html(fig, include_plotlyjs='cdn', full_html=False)

# Enhanced: Bring connected nodes to top with borders
optimized_js = f"""
<div id="connectivity-data" style="display:none;">{json.dumps(connectivity_data)}</div>
<script>
(function() {{
    function initOptimizedHover() {{
        var allPlotlyDivs = document.querySelectorAll('.plotly-graph-div');
        var gd = allPlotlyDivs[allPlotlyDivs.length - 1];
        
        if (!gd || !gd.data) {{
            setTimeout(initOptimizedHover, 100);
            return;
        }}
        
        // Load connectivity data
        var connectivityData = JSON.parse(document.getElementById('connectivity-data').textContent);
        var nodeToSegments = connectivityData.node_to_segments;
        var nodeNeighbors = connectivityData.node_neighbors;
        var cachedBaseX = connectivityData.base_edge_x;
        var cachedBaseY = connectivityData.base_edge_y;
        var posX = connectivityData.pos_x;
        var posY = connectivityData.pos_y;
        var runnerSizes = connectivityData.runner_sizes;
        var runnerDegrees = connectivityData.runner_degrees;
        var runnerHover = connectivityData.runner_hover;
        var courseHover = connectivityData.course_hover;
        var courseSizes = connectivityData.course_sizes;
        
        // Find trace indices
        var baseTraceIdx = -1;
        var highlightTraceIdx = -1;
        var runnerTraceIdx = -1;
        var courseTraceIdx = -1;
        var highlightRunnerTraceIdx = -1;
        var highlightCourseTraceIdx = -1;
        
        gd.data.forEach(function(trace, i) {{
            if (trace.name === 'base_edges') baseTraceIdx = i;
            else if (trace.name === 'highlight_edges') highlightTraceIdx = i;
            else if (trace.name === 'Runner Groups') runnerTraceIdx = i;
            else if (trace.name === 'Courses') courseTraceIdx = i;
            else if (trace.name === 'highlight_runners') highlightRunnerTraceIdx = i;
            else if (trace.name === 'highlight_courses') highlightCourseTraceIdx = i;
        }});
        
        if (baseTraceIdx === -1 || highlightTraceIdx === -1) return;
        
        // Store original node data
        var originalRunnerData = {{
            marker: {{
                opacity: gd.data[runnerTraceIdx].marker.opacity
            }}
        }};
        
        var originalCourseData = {{
            marker: {{
                opacity: gd.data[courseTraceIdx].marker.opacity
            }}
        }};
        
        // Recursive handler attachment function
        function attachHoverHandlers(gd, highlightTraceIdx) {{
            gd.removeAllListeners('plotly_hover');
            gd.removeAllListeners('plotly_unhover');
            
            // Hover handler
            gd.on('plotly_hover', function(data) {{
                if (!data.points || data.points.length === 0) return;
                
                var point = data.points[0];
                if (point.data.mode !== 'markers' || !point.data.customdata) return;
                
                var nodeIdx = point.data.customdata[point.pointIndex][0];
                var connectedSegments = nodeToSegments[nodeIdx.toString()] || [];
                var neighbors = nodeNeighbors[nodeIdx.toString()] || [];
                
                if (connectedSegments.length === 0) return;
                
                // Build highlight coordinates
                var highlightX = [];
                var highlightY = [];
                for (var i = 0; i < connectedSegments.length; i++) {{
                    var segIdx = connectedSegments[i];
                    var startIdx = segIdx * 3;
                    highlightX.push(cachedBaseX[startIdx], cachedBaseX[startIdx + 1], null);
                    highlightY.push(cachedBaseY[startIdx], cachedBaseY[startIdx + 1], null);
                }}
                
                // Dim original nodes
                var runnerOpacity = new Array(gd.data[runnerTraceIdx].x.length).fill(0.2);
                var courseOpacity = new Array(gd.data[courseTraceIdx].x.length).fill(0.2);
                
                // Build highlighted node data (these will appear on top with borders)
                var highlightRunnerX = [];
                var highlightRunnerY = [];
                var highlightRunnerSizes = [];
                var highlightRunnerColors = [];
                var highlightRunnerText = [];
                
                var highlightCourseX = [];
                var highlightCourseY = [];
                var highlightCourseSizes = [];
                var highlightCourseText = [];
                
                neighbors.forEach(function(neighborIdx) {{
                    if (neighborIdx < gd.data[runnerTraceIdx].x.length) {{
                        // It's a runner - add to highlight trace
                        highlightRunnerX.push(posX[neighborIdx]);
                        highlightRunnerY.push(posY[neighborIdx]);
                        highlightRunnerSizes.push(runnerSizes[neighborIdx]);
                        highlightRunnerColors.push(runnerDegrees[neighborIdx]);
                        highlightRunnerText.push(runnerHover[neighborIdx]);
                    }} else {{
                        // It's a course - add to highlight trace
                        var courseArrayIdx = neighborIdx - gd.data[runnerTraceIdx].x.length;
                        highlightCourseX.push(posX[neighborIdx]);
                        highlightCourseY.push(posY[neighborIdx]);
                        highlightCourseSizes.push(courseSizes[courseArrayIdx]);
                        highlightCourseText.push(courseHover[courseArrayIdx]);
                    }}
                }});
                
                // Update all traces
                Plotly.update(gd, 
                    {{
                        x: [highlightX],
                        y: [highlightY]
                    }}, 
                    {{}},
                    [highlightTraceIdx]
                ).then(function() {{
                    return Plotly.restyle(gd, {{'marker.opacity': runnerOpacity}}, [runnerTraceIdx]);
                }}).then(function() {{
                    return Plotly.restyle(gd, {{'marker.opacity': courseOpacity}}, [courseTraceIdx]);
                }}).then(function() {{
                    return Plotly.restyle(gd, {{
                        x: [highlightRunnerX],
                        y: [highlightRunnerY],
                        'marker.size': [highlightRunnerSizes],
                        'marker.color': [highlightRunnerColors],
                        text: [highlightRunnerText]
                    }}, [highlightRunnerTraceIdx]);
                }}).then(function() {{
                    return Plotly.restyle(gd, {{
                        x: [highlightCourseX],
                        y: [highlightCourseY],
                        'marker.size': [highlightCourseSizes],
                        text: [highlightCourseText]
                    }}, [highlightCourseTraceIdx]);
                }}).then(function() {{
                    attachHoverHandlers(gd, highlightTraceIdx);
                }});
            }});
            
            // Unhover handler
            gd.on('plotly_unhover', function() {{
                // Reset everything
                Plotly.update(gd,
                    {{
                        x: [[]],
                        y: [[]]
                    }},
                    {{}},
                    [highlightTraceIdx]
                ).then(function() {{
                    return Plotly.restyle(gd, {{
                        'marker.opacity': originalRunnerData.marker.opacity
                    }}, [runnerTraceIdx]);
                }}).then(function() {{
                    return Plotly.restyle(gd, {{
                        'marker.opacity': originalCourseData.marker.opacity
                    }}, [courseTraceIdx]);
                }}).then(function() {{
                    return Plotly.restyle(gd, {{
                        x: [[]],
                        y: [[]],
                        'marker.size': [[]],
                        'marker.color': [[]],
                        text: [[]]
                    }}, [highlightRunnerTraceIdx]);
                }}).then(function() {{
                    return Plotly.restyle(gd, {{
                        x: [[]],
                        y: [[]],
                        'marker.size': [[]],
                        text: [[]]
                    }}, [highlightCourseTraceIdx]);
                }}).then(function() {{
                    attachHoverHandlers(gd, highlightTraceIdx);
                }});
            }});
        }}
        
        // Initial handler attachment
        attachHoverHandlers(gd, highlightTraceIdx);
    }}
    
    if (document.readyState === 'loading') {{
        document.addEventListener('DOMContentLoaded', initOptimizedHover);
    }} else {{
        initOptimizedHover();
    }}
}})();
</script>
"""

# Combine and display
full_html = f"""
<div>
{html_str}
{optimized_js}
</div>
"""

display(HTML(full_html))

Summary & Next Steps

We've successfully identified a densely-connected subset of the ultramarathon universe using k-core decomposition. The (3, 840)-core balances data retention with the connectivity needed for comparative inference, giving us a foundation for the modeling work ahead.

Key takeaways:

The full ultramarathon dataset is extremely sparse—most runners and courses share few direct connections
K-core decomposition provides a principled way to extract a dense, well-connected subset
The selected subset represents the "core" ultramarathon community: repeat participants at established events
Visualizations reveal meaningful structure: runner clusters, hub courses, and community topology

What's next? In the following notebooks, we'll use this k-core subset to:

Model finish times: Estimate runner abilities and course difficulties using Bayesian hierarchical models
Analyze DNF patterns: Understand which factors predict whether a runner finishes vs. drops out
Identify reporting artifacts: Distinguish true DNFs from data collection issues

The dense connectivity of our k-core ensures that runner and course effects can be properly identified—the foundation for all downstream inference.

Contents

Controls