name: unstructured-data
description: Parses unstructured data from files and writes it to a feature group.
Unstructured Data Extraction
This is a feature pipeline task. Extracting structured fields from unstructured files is a Model-Independent Transformation (MIT): it produces reusable features that many models can later consume from the feature store, not features tied to one model. The lowest-cost feature pipeline is the one you don't have to create, so favour a clean, reusable schema over per-model shortcuts.
Contract
- Input: raw unstructured files (emails, PDFs, logs, transcripts, scraped HTML) and a target DataFrame schema (Pandas, Polars, PySpark).
- Output: a populated Hopsworks feature group holding the structured (reusable) features extracted from the files.
- Pre-condition: the raw files are accessible and a target schema has been agreed with the developer.
Steps
First examine the unstructured files to identify their classes and propose a schema for each class. Once the developer has agreed on a schema, extract structured data from the raw input (emails, PDFs, logs, transcripts, scraped HTML) into a DataFrame (Pandas, Polars, PySpark) that matches that schema:
- Read the schema first. Note required vs optional fields, enums, and format constraints (dates, currencies, IDs). The schema is the contract — never emit a key it doesn't define.
- Scan the input for each field. Prefer explicit values over inferred ones. If a required field is genuinely absent, use null rather than guessing.
- Normalize as you extract: trim whitespace, coerce dates to ISO 8601, strip currency symbols into numeric + code, collapse enum synonyms to their canonical value.
- For each class of unstructured file, create a DataFrame from its schema and populate it with the records extracted from the files of that class.
- Write each DataFrame to a feature group in Hopsworks, creating the feature group if it does not already exist (get_or_create).
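The last two steps can be sketched roughly as follows, assuming the Hopsworks Python client and Pandas are installed. The column list, feature group name, and primary key are hypothetical placeholders for whatever schema was agreed with the developer, and `rows_to_records` and `write_feature_group` are illustrative helper names, not part of any library.

```python
from typing import Any

# Hypothetical schema for one file class (support emails); the real column
# list is whatever was agreed with the developer.
EMAIL_COLUMNS = ["message_id", "sent_at", "sender", "amount", "currency"]


def rows_to_records(rows: list[dict[str, Any]], columns: list[str]) -> list[dict[str, Any]]:
    """Project each extracted row onto the schema.

    Keys the schema does not define are dropped; missing fields become None.
    """
    return [{col: row.get(col) for col in columns} for row in rows]


def write_feature_group(records, columns, name, primary_key, version=1):
    """Build a Pandas DataFrame and insert it into a Hopsworks feature group."""
    # Imported lazily so rows_to_records works without these packages.
    import hopsworks
    import pandas as pd

    df = pd.DataFrame.from_records(records, columns=columns)

    project = hopsworks.login()  # credentials come from the environment or a prompt
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name=name,  # hypothetical, e.g. "support_emails"
        version=version,
        primary_key=primary_key,  # hypothetical, e.g. ["message_id"]
        description="Structured fields extracted from unstructured files",
    )
    fg.insert(df)
    return fg
```

A call might look like `write_feature_group(rows_to_records(rows, EMAIL_COLUMNS), EMAIL_COLUMNS, "support_emails", ["message_id"])`, with one such call per file class.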
When the input is ambiguous, pick the most conservative interpretation and note the ambiguity in a top-level "_extraction_notes" field.
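A minimal sketch of the scan, normalize, and note-ambiguity steps, using only the standard library. The invoice schema, the "Label: value" input layout, and every helper name here are assumptions made for illustration; it also assumes `_extraction_notes` is a column the agreed schema defines. Real inputs will need a richer scanner.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical schema for one file class (invoice emails).
FIELDS = ("invoice_id", "issued_on", "amount", "currency", "status")
CURRENCY_SYMBOLS = {"€": "EUR", "£": "GBP"}  # "$" is left out: it is ambiguous
STATUS_SYNONYMS = {"paid": "paid", "settled": "paid", "unpaid": "unpaid", "open": "unpaid"}
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y")  # unambiguous layouts only
# Assumes "," is a thousands separator and "." the decimal point.
AMOUNT_RE = re.compile(r"\d{1,3}(,\d{3})*(\.\d+)?|\d+(\.\d+)?")


def scan_fields(text):
    """Collect 'Label: value' lines whose label maps to a schema field."""
    found = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower().replace(" ", "_")
        if sep and key in FIELDS and value.strip():
            found.setdefault(key, value.strip())  # first explicit value wins
    return found


def normalize_date(raw, notes):
    """Return an ISO 8601 date string, or None plus a note if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    notes.append(f"issued_on: could not parse {raw!r} unambiguously")
    return None


def normalize_money(raw, notes):
    """Split '€1,200.50' or '1200.50 EUR' into (Decimal amount, currency code)."""
    text, code = raw.strip(), None
    match = re.search(r"\b[A-Z]{3}\b", text)
    if match:
        code = match.group()
        text = text.replace(code, "")
    for symbol, iso in CURRENCY_SYMBOLS.items():
        if symbol in text:
            code = code or iso
            text = text.replace(symbol, "")
    if "$" in text:
        text = text.replace("$", "")
        if code is None:
            notes.append("currency: '$' does not identify a single currency")
    text = text.strip()
    if not AMOUNT_RE.fullmatch(text):
        notes.append(f"amount: could not parse {raw!r}")
        return None, code
    return Decimal(text.replace(",", "")), code


def extract_invoice(text):
    """Build one schema-shaped record; absent or ambiguous fields stay None."""
    notes = []
    raw = scan_fields(text)
    record = dict.fromkeys(FIELDS)
    record["invoice_id"] = raw.get("invoice_id")
    if "issued_on" in raw:
        record["issued_on"] = normalize_date(raw["issued_on"], notes)
    if "amount" in raw:
        record["amount"], record["currency"] = normalize_money(raw["amount"], notes)
    if "status" in raw:
        record["status"] = STATUS_SYNONYMS.get(raw["status"].lower())
        if record["status"] is None:
            notes.append(f"status: unknown value {raw['status']!r}")
    record["_extraction_notes"] = notes
    return record
```

In this sketch a date such as `03/04/2024` is left null with a note, because day-first and month-first readings are both plausible, and a bare `$` yields a parsed amount but a null currency code.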
Next Steps
- To create and write the target feature group, use hops-fg.
- For RAG over the extracted text, chunk it and compute vector embeddings (both are MITs run in the feature pipeline), store them in an embedding index (hops-fg embeddings section), and serve similarity (kNN) search via hops-fv.
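As a rough illustration of the chunking step only, here is a word-window chunker with overlap. The function name, the default sizes, and the choice to split on whitespace are assumptions; computing embeddings and writing them to the embedding index are left to hops-fg.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some words twice.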