filter - SKILL.md Agent Skill

name: filter description: Pre-filter scraped LinkedIn profiles against Google Sheets exclusion lists (past candidates, blacklist, not-relevant companies) and Supabase dedup. Sends results to Slack for approval. Use when you have scraped profiles that need filtering before enrichment. argument-hint: [scraped-profiles-csv-or-json]

Pre-Filter Skill

Filter scraped LinkedIn profiles against exclusion lists and report results via Slack.

Your Task

Take scraped profiles (from PhantomBuster CSV or JSON) and filter them against:

Past Candidates - by full name match (Google Sheet)
Blacklist - by company name match (Google Sheet)
Not Relevant Companies - by company name match (Google Sheet)
Already Enriched - by LinkedIn URL (Supabase dedup)

Then send a summary to Slack and ask for approval to proceed.

Input

Scraped profiles: CSV or JSON file path (argument or ask user)
Profiles must have at minimum: firstName, lastName, companyName, and a LinkedIn URL field (linkedInProfileUrl or profileUrl)

Configuration

All settings come from config.json:

{
  "filter_sheets": {
    "spreadsheet_id": "...",
    "past_candidates": "Past Candidates",
    "blacklist": "Blacklist",
    "not_relevant_companies": "NotRelevant Companies"
  }
}

Google credentials: google_credentials.json in project root.

Step-by-Step Process

Step 1: Load Profiles

Read the input file (CSV or JSON). Print count and sample.

import pandas as pd
import json

# CSV from PhantomBuster
df = pd.read_csv(input_path)
# OR JSON
# with open(input_path) as f:
#     profiles = json.load(f)
#     df = pd.DataFrame(profiles)

print(f"Loaded {len(df)} profiles")

Step 2: Load Filter Lists from Google Sheets

Use gspread with the service account credentials to load each filter sheet.

import gspread
from google.oauth2.service_account import Credentials

scopes = ['https://www.googleapis.com/auth/spreadsheets.readonly',
           'https://www.googleapis.com/auth/drive.readonly']
creds = Credentials.from_service_account_file('google_credentials.json', scopes=scopes)
gc = gspread.authorize(creds)

config = json.load(open('config.json'))
sheet_id = config['filter_sheets']['spreadsheet_id']
spreadsheet = gc.open_by_key(sheet_id)

Load each list:

# Past Candidates - "Name" column (col 1), match by full name
past_ws = spreadsheet.worksheet(config['filter_sheets']['past_candidates'])
past_names = set(str(n).lower().strip() for n in past_ws.col_values(1)[1:] if n.strip())

# Blacklist - "Company" column (col 1), match by company name
bl_ws = spreadsheet.worksheet(config['filter_sheets']['blacklist'])
blacklist_companies = [str(c).strip() for c in bl_ws.col_values(1)[1:] if c.strip()]

# Not Relevant Companies - ALL columns, match by company name
nr_ws = spreadsheet.worksheet(config['filter_sheets']['not_relevant_companies'])
nr_all_values = nr_ws.get_all_values()
not_relevant = []
for row in nr_all_values[1:]:  # skip header
    for cell in row:
        if cell and cell.strip():
            not_relevant.append(cell.strip())
not_relevant = list(set(not_relevant))

Step 3: Company Name Normalization

Use the same normalization as linkedin-enricher for reliable matching:

def normalize_company(name):
    """Normalize company name for comparison."""
    if not name or not str(name).strip():
        return ''
    name = str(name).lower().strip()
    for suffix in [' ltd', ' inc', ' corp', ' llc', ' limited', ' israel', ' il',
                   ' technologies', ' tech', ' software', ' solutions', ' group']:
        if name.endswith(suffix):
            name = name[:-len(suffix)].strip()
    return name

def matches_company_list(company, company_list):
    """Check if company matches any in the list."""
    if not company or not str(company).strip():
        return False
    company_norm = normalize_company(company)
    if not company_norm:
        return False
    for c in company_list:
        c_norm = normalize_company(c)
        if not c_norm:
            continue
        if company_norm == c_norm:
            return True
        # Substring match for variants like "Bank Leumi" vs "Bank Leumi Le-Israel"
        if len(c_norm) >= 4 and len(company_norm) >= 4:
            if company_norm.startswith(c_norm) or c_norm.startswith(company_norm):
                return True
    return False

Step 4: Apply Filters

Apply filters in order, tracking removals:

removed = {
    'past_candidates': [],
    'blacklist': [],
    'not_relevant': [],
    'already_enriched': [],
}

filtered = []

for _, row in df.iterrows():
    profile = row.to_dict()

    # 1. Past candidates - match by full name
    first = str(profile.get('firstName', '')).lower().strip()
    last = str(profile.get('lastName', '')).lower().strip()
    full_name = f"{first} {last}"
    if full_name in past_names:
        removed['past_candidates'].append(profile)
        continue

    # 2. Blacklist - match by company name
    company = profile.get('companyName') or ''
    if matches_company_list(company, blacklist_companies):
        removed['blacklist'].append(profile)
        continue

    # 3. Not relevant companies - match by company name
    if matches_company_list(company, not_relevant):
        removed['not_relevant'].append(profile)
        continue

    filtered.append(profile)

Step 5: Supabase Dedup (Optional)

If Supabase is configured, check for already-enriched profiles:

from core.db import get_supabase_client, get_recently_enriched_urls
from core.normalizers import normalize_linkedin_url

db = get_supabase_client()
if db:
    enriched_urls = set(get_recently_enriched_urls(db, months=3))
    still_filtered = []
    for p in filtered:
        url = normalize_linkedin_url(p.get('linkedInProfileUrl') or p.get('profileUrl') or '')
        if url and url in enriched_urls:
            removed['already_enriched'].append(p)
        else:
            still_filtered.append(p)
    filtered = still_filtered

Step 6: Report Results

Print a clear summary:

============================================================
  PRE-FILTER RESULTS
============================================================
  Input:                79 profiles
  Output:               62 profiles

  Removed:
    Past candidates:     5
    Blacklist:           3
    Not relevant:        7
    Already enriched:    2
    Total removed:      17
============================================================

Step 7: Send to Slack for Approval

Use the Slack client to send results and ask for approval:

from integrations.slack import get_slack_client, SlackApprovalHandler

slack = get_slack_client()
if slack:
    # Send summary
    slack.send_message(
        f":mag: *[autofleet-fs-tl] Pre-Filter Results*\n"
        f"Input: {len(df)} profiles\n"
        f"Output: {len(filtered)} profiles\n\n"
        f"*Removed:*\n"
        f"  Past candidates: {len(removed['past_candidates'])}\n"
        f"  Blacklist: {len(removed['blacklist'])}\n"
        f"  Not relevant: {len(removed['not_relevant'])}\n"
        f"  Already enriched: {len(removed['already_enriched'])}\n"
        f"  Total: {sum(len(v) for v in removed.values())}"
    )

    # Ask approval
    handler = SlackApprovalHandler(slack)
    approval = handler.request_approval(
        'pre_filter',
        f'[autofleet-fs-tl] Pre-filter complete: {len(df)} -> {len(filtered)} profiles. Proceed to enrichment?',
        f'Removed {sum(len(v) for v in removed.values())} profiles total'
    )

Step 8: Save Output

Save filtered profiles for the next pipeline step:

# Save filtered profiles
output_path = 'data/sessions/autofleet-fs-tl_prefiltered.json'
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(filtered, f, indent=2, default=str)

# Save removed profiles for audit
removed_path = 'data/sessions/autofleet-fs-tl_prefilter_removed.json'
with open(removed_path, 'w', encoding='utf-8') as f:
    json.dump({k: v for k, v in removed.items()}, f, indent=2, default=str)

Important Notes

Past candidates match by NAME (first + last, lowercased), NOT by LinkedIn URL
Company matching uses normalization - removes suffixes like Ltd, Inc, Technologies, etc.
Company matching includes prefix matching - "Bank Leumi" matches "Bank Leumi Le-Israel"
Not Relevant reads ALL columns from the sheet (some sheets have multiple company columns)
Blacklist reads column 1 (Company column)
Always save both filtered AND removed profiles for audit trail
Position ID should be extracted from the input file path or asked from user