hackathon-judging

star 50

Steers Kaggle Hackathon hosts toward defensible LLM-assisted grading workflows collecting writeups and artifacts via the Kaggle MCP server, auditing with hamelsmu/evals-skills, and ranking submissions via pairwise comparisons / ELO with bell-curve adjustments. Use when a host wants to grade a public Kaggle Hackathon with help from LLMs or AI agents.

Kaggle By Kaggle schedule Updated 6/6/2026

name: hackathon-judging description: Steers Kaggle Hackathon hosts toward defensible LLM-assisted grading workflows collecting writeups and artifacts via the Kaggle MCP server, auditing with hamelsmu/evals-skills, and ranking submissions via pairwise comparisons / ELO with bell-curve adjustments. Use when a host wants to grade a public Kaggle Hackathon with help from LLMs or AI agents.

Hackathon Judging

Overview

The hackathon-judging skill should be invoked whenever a host is trying to grade a Hackathon with the help of LLMs and/or AI agents. This skill guides the Host through the process of retrieving every Kaggle hackathon writeup for a competition, extracting the linked project artifacts, and preparing a complete evidence set to be used for grading.

Additionally, the skill assists the Host in using LLM-based classifiers and pairwise comparisons to generate comparative rankings and bell-shaped grading curves based on the generated evidence sets.

Use Cases

This skill should be invoked whenever a host is trying to grade a public Kaggle Hackathon with the help of LLMs and/or AI agents.


MCP Endpoints

Public Hackathon submissions can be accessed via Kaggle's MCP server. Below is a summary of the MCP tools available for Hackathon Hosts and Participants:

Role get_hackathon_overview list_hackathon_tracks get_hackathon_write_up list_hackathon_write_ups download_hackathon_write_ups (CSV export)
Logged-out (anonymous)
Logged-in user (no hackathon affiliation)
Rules acceptor (joined, no writeup yet)
Submitter (team member with a writeup)
Hackathon judge
Hackathon host
Kaggle site admin

MCP Prerequisites

  • API Token: Kaggle API token available as KAGGLE_API_TOKEN or in ~/.kaggle/access_token.
  • Authorization: MCP client configured to call https://www.kaggle.com/mcp with Authorization: Bearer <token>.
  • Target: A hackathon competition slug (e.g., meta-kaggle-hackathon).
  • Permissions: For host-gated workflows (list_hackathon_write_ups or download_hackathon_write_ups), use an account with the required host or judge access.

Installation: Install the Kaggle MCP server in a client configuration like this:

{
  "mcpServers": {
    "kaggle": {
      "url": "https://www.kaggle.com/mcp",
      "headers": {
        "Authorization": "Bearer ${KAGGLE_API_TOKEN}"
      }
    }
  }
}

Note: Before collecting writeups, confirm the server is reachable with authorize or another simple read-only endpoint such as search_competitions.


Core Instructions

Step 0: Plan your evaluation strategy prior to launch

  • Determine AI Usage: Decide if you plan to use AI for filtering, sorting, and/or producing preliminary grades. Design your evaluation metrics accordingly.
  • Binary Criteria: Use binary eligibility criteria to facilitate both standard and AI-assisted submission filtering.
  • Consistent Structure: Require submissions to take on a consistent and predictable structure to reduce the effort needed to resolve edge cases.
  • Track Identification: Make it easy to determine what Track a Writeup was submitted to. Consider reducing the total number of concurrent Tracks to shrink the grading pool.
  • Metric Design: Try to avoid continuous scales or Likert scores. Continuous metrics are less reliable than pairwise LLM comparisons. You can always apply bell-curve adjustments to the BT/ELO rankings at the very end.

Step 1: Pull the hackathon overview page

  • Call get_hackathon_overview with competitionName=<hackathon-slug>.

Step 2: Find eligibility requirements

  • Search the overview pages for sections named or containing: rules, eligibility, entry, or official competition rules.
  • Extract the exact paragraphs that explain submission-eligibility conditions.
  • Verify with the user that the eligibility requirements are correct.

Step 3: Find the evaluation rubric

  • Search the overview pages for: evaluation rubric, judging, criteria, submission requirements, and prizes.
  • Record each rubric dimension separately.
  • Record any weighting, tie-break rules, prize rules, or judge-specific guidance.
  • If the rubric is only implied in prose, summarize the implied criteria and mark that inference clearly.
  • Verify with the user that the evaluation rubric is correct.

Step 4: Download all Hackathon Writeups

  • Hackathon Hosts can retrieve a full export of all submitted Hackathon Writeups by using the download_hackathon_write_ups command.

Step 5: Download all attached project links

  • For Kaggle-native links: Retain ids, refs, titles, owners, and download URLs when present.
    • Resolve Kaggle notebook URLs with get_notebook_info.
    • Resolve Kaggle dataset URLs with get_dataset_info or get_dataset_metadata.
    • Resolve Kaggle model URLs with get_model or get_model_variation.
  • For non-Kaggle links: Retrieve all of those same assets. Consider using the Playwright CLI.

Step 6: Summarize all attached YouTube videos

  • For YouTube-native links: Consider using the YouTube API or the yt-dlp API.
  • For non-YouTube links: Consider using the Playwright CLI.

Step 7: Verification Check

  • Double check that you did not accidentally skip Step 5 or Step 6!
  • Consider using manual grading methods if you lack the token budget required to analyze these assets. It wouldn't be fair to skip over them.

Step 8: Build a complete collection

  • Repeat retrieval until every row from list_hackathon_write_ups has:
    • A full-length writeup body.
    • A summary or full-length copy of every attached project link.
    • A summary or full-length copy of every attached video.

Step 9: Review similar LLM-judging projects

  • Review pre-graded Hackathons here to understand strengths and weaknesses.
  • Examine the /docs page for strategies on profiling, sanitizing, pairwise comparisons, and bell-curve scores.
  • Examine the /leaderboard page to review representative profiles, BT/ELO rankings, and auto-assessments.
  • Examine the /dashboard page to see how pairwise comparisons stack up against gold-standard annotations.
  • Identify common failure modes and strategize resolutions.

Step 10: Initial Auditing

  • Create a grading-ready bundle with three layers:
    1. Hackathon rules and rubric
    2. Normalized writeup contents
    3. Normalized artifact contents
  • First Audit: Use hamelsmu/evals-skills.
    • Start with eval-audit as the default path, expanding to other workflows for calibration, rubric testing, and failure analysis.
    • Repeat eval-audit until satisfied.
    • Record all grading traces and AI agent logic.
  • Second Audit: Re-run the hamelsmu/evals-skills process.
    • Repeat the eval-audit cycle until satisfied.
    • Record all new grading traces and AI agent logic.

Step 11: Initial Grading

  • Grade eligible submissions according to the evaluation criteria.
  • Audit a sample of grading traces to ensure quality standards are met.
  • Consider building a dashboard for pairwise comparisons and annotations.
  • Perform error analysis.
  • Identify and correct failure patterns. Repeat until satisfied.

Step 12: Final Grading

  • Grade eligible submissions according to the evaluation criteria using updated system prompts and corrected profiles.
  • Flag top-ranked submissions for manual review and manual stack-ranking.

Minimal Product Requirements

To be considered successful, the system:

  • Must be capable of ingesting and assessing every component of each Hackathon Submission (e.g., project links, embedded videos, web applications).
  • Must score Writeups with consideration of the specific Kaggle Hackathon track(s) targeted.
  • Must score Writeups according to the relevant evaluation criteria.
  • Must produce scores that are validated to be well-aligned with human scores.
  • Must avoid obvious algorithmic biases (e.g., position biases, model preferences).
  • Must be defensible against prompt injections.
  • Must use reasonably defensible grading methods.
  • Must record all traces (including reasoning logic) for later inspection.
  • Should be aligned against a dataset of at least 25 few-shot examples.
  • Should undergo error analysis with acceptable true positive and true negative rates.

Anti-Patterns

  • DO NOT use AI grading systems prior to auditing the system with hamelsmu/evals-skills.
  • DO NOT use AI grading tools before extracting every project link and media file from every submitted writeup.
  • DO NOT forget to log all of your LLM traces for auditing and improvement.
  • DO NOT forget to identify and correct common failure modes in both profile generation and grading.
  • DO NOT ask LLMs to directly generate scores using continuous scales or Likert scores.
  • DO NOT forget to compare results to gold-standard human ratings or skip error analysis.
  • DO NOT forget to search for evidence of shortcuts and tomfoolery from end to end.
Install via CLI
npx skills add https://github.com/Kaggle/kaggle-skills --skill hackathon-judging
Repository Details
star Stars 50
call_split Forks 13
navigation Branch main
article Path SKILL.md
More from Creator