name: filter description: Pre-filter scraped LinkedIn profiles against Google Sheets exclusion lists (past candidates, blacklist, not-relevant companies) and Supabase dedup. Sends results to Slack for approval. Use when you have scraped profiles that need filtering before enrichment. argument-hint: [scraped-profiles-csv-or-json]
Pre-Filter Skill
Filter scraped LinkedIn profiles against exclusion lists and report results via Slack.
Your Task
Take scraped profiles (from PhantomBuster CSV or JSON) and filter them against:
- Past Candidates - by full name match (Google Sheet)
- Blacklist - by company name match (Google Sheet)
- Not Relevant Companies - by company name match (Google Sheet)
- Already Enriched - by LinkedIn URL (Supabase dedup)
Then send a summary to Slack and ask for approval to proceed.
Input
- Scraped profiles: CSV or JSON file path (argument or ask user)
- Profiles must have at minimum:
firstName,lastName,companyName, and a LinkedIn URL field (linkedInProfileUrlorprofileUrl)
Configuration
All settings come from config.json:
{
"filter_sheets": {
"spreadsheet_id": "...",
"past_candidates": "Past Candidates",
"blacklist": "Blacklist",
"not_relevant_companies": "NotRelevant Companies"
}
}
Google credentials: google_credentials.json in project root.
Step-by-Step Process
Step 1: Load Profiles
Read the input file (CSV or JSON). Print count and sample.
import pandas as pd
import json
# CSV from PhantomBuster
df = pd.read_csv(input_path)
# OR JSON
# with open(input_path) as f:
# profiles = json.load(f)
# df = pd.DataFrame(profiles)
print(f"Loaded {len(df)} profiles")
Step 2: Load Filter Lists from Google Sheets
Use gspread with the service account credentials to load each filter sheet.
import gspread
from google.oauth2.service_account import Credentials
scopes = ['https://www.googleapis.com/auth/spreadsheets.readonly',
'https://www.googleapis.com/auth/drive.readonly']
creds = Credentials.from_service_account_file('google_credentials.json', scopes=scopes)
gc = gspread.authorize(creds)
config = json.load(open('config.json'))
sheet_id = config['filter_sheets']['spreadsheet_id']
spreadsheet = gc.open_by_key(sheet_id)
Load each list:
# Past Candidates - "Name" column (col 1), match by full name
past_ws = spreadsheet.worksheet(config['filter_sheets']['past_candidates'])
past_names = set(str(n).lower().strip() for n in past_ws.col_values(1)[1:] if n.strip())
# Blacklist - "Company" column (col 1), match by company name
bl_ws = spreadsheet.worksheet(config['filter_sheets']['blacklist'])
blacklist_companies = [str(c).strip() for c in bl_ws.col_values(1)[1:] if c.strip()]
# Not Relevant Companies - ALL columns, match by company name
nr_ws = spreadsheet.worksheet(config['filter_sheets']['not_relevant_companies'])
nr_all_values = nr_ws.get_all_values()
not_relevant = []
for row in nr_all_values[1:]: # skip header
for cell in row:
if cell and cell.strip():
not_relevant.append(cell.strip())
not_relevant = list(set(not_relevant))
Step 3: Company Name Normalization
Use the same normalization as linkedin-enricher for reliable matching:
def normalize_company(name):
"""Normalize company name for comparison."""
if not name or not str(name).strip():
return ''
name = str(name).lower().strip()
for suffix in [' ltd', ' inc', ' corp', ' llc', ' limited', ' israel', ' il',
' technologies', ' tech', ' software', ' solutions', ' group']:
if name.endswith(suffix):
name = name[:-len(suffix)].strip()
return name
def matches_company_list(company, company_list):
"""Check if company matches any in the list."""
if not company or not str(company).strip():
return False
company_norm = normalize_company(company)
if not company_norm:
return False
for c in company_list:
c_norm = normalize_company(c)
if not c_norm:
continue
if company_norm == c_norm:
return True
# Substring match for variants like "Bank Leumi" vs "Bank Leumi Le-Israel"
if len(c_norm) >= 4 and len(company_norm) >= 4:
if company_norm.startswith(c_norm) or c_norm.startswith(company_norm):
return True
return False
Step 4: Apply Filters
Apply filters in order, tracking removals:
removed = {
'past_candidates': [],
'blacklist': [],
'not_relevant': [],
'already_enriched': [],
}
filtered = []
for _, row in df.iterrows():
profile = row.to_dict()
# 1. Past candidates - match by full name
first = str(profile.get('firstName', '')).lower().strip()
last = str(profile.get('lastName', '')).lower().strip()
full_name = f"{first} {last}"
if full_name in past_names:
removed['past_candidates'].append(profile)
continue
# 2. Blacklist - match by company name
company = profile.get('companyName') or ''
if matches_company_list(company, blacklist_companies):
removed['blacklist'].append(profile)
continue
# 3. Not relevant companies - match by company name
if matches_company_list(company, not_relevant):
removed['not_relevant'].append(profile)
continue
filtered.append(profile)
Step 5: Supabase Dedup (Optional)
If Supabase is configured, check for already-enriched profiles:
from core.db import get_supabase_client, get_recently_enriched_urls
from core.normalizers import normalize_linkedin_url
db = get_supabase_client()
if db:
enriched_urls = set(get_recently_enriched_urls(db, months=3))
still_filtered = []
for p in filtered:
url = normalize_linkedin_url(p.get('linkedInProfileUrl') or p.get('profileUrl') or '')
if url and url in enriched_urls:
removed['already_enriched'].append(p)
else:
still_filtered.append(p)
filtered = still_filtered
Step 6: Report Results
Print a clear summary:
============================================================
PRE-FILTER RESULTS
============================================================
Input: 79 profiles
Output: 62 profiles
Removed:
Past candidates: 5
Blacklist: 3
Not relevant: 7
Already enriched: 2
Total removed: 17
============================================================
Step 7: Send to Slack for Approval
Use the Slack client to send results and ask for approval:
from integrations.slack import get_slack_client, SlackApprovalHandler
slack = get_slack_client()
if slack:
# Send summary
slack.send_message(
f":mag: *[autofleet-fs-tl] Pre-Filter Results*\n"
f"Input: {len(df)} profiles\n"
f"Output: {len(filtered)} profiles\n\n"
f"*Removed:*\n"
f" Past candidates: {len(removed['past_candidates'])}\n"
f" Blacklist: {len(removed['blacklist'])}\n"
f" Not relevant: {len(removed['not_relevant'])}\n"
f" Already enriched: {len(removed['already_enriched'])}\n"
f" Total: {sum(len(v) for v in removed.values())}"
)
# Ask approval
handler = SlackApprovalHandler(slack)
approval = handler.request_approval(
'pre_filter',
f'[autofleet-fs-tl] Pre-filter complete: {len(df)} -> {len(filtered)} profiles. Proceed to enrichment?',
f'Removed {sum(len(v) for v in removed.values())} profiles total'
)
Step 8: Save Output
Save filtered profiles for the next pipeline step:
# Save filtered profiles
output_path = 'data/sessions/autofleet-fs-tl_prefiltered.json'
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(filtered, f, indent=2, default=str)
# Save removed profiles for audit
removed_path = 'data/sessions/autofleet-fs-tl_prefilter_removed.json'
with open(removed_path, 'w', encoding='utf-8') as f:
json.dump({k: v for k, v in removed.items()}, f, indent=2, default=str)
Important Notes
- Past candidates match by NAME (first + last, lowercased), NOT by LinkedIn URL
- Company matching uses normalization - removes suffixes like Ltd, Inc, Technologies, etc.
- Company matching includes prefix matching - "Bank Leumi" matches "Bank Leumi Le-Israel"
- Not Relevant reads ALL columns from the sheet (some sheets have multiple company columns)
- Blacklist reads column 1 (Company column)
- Always save both filtered AND removed profiles for audit trail
- Position ID should be extracted from the input file path or asked from user