name: hops-fv
description: Use when writing Python code that creates, queries, or manages Hopsworks feature views via the hsfs SDK. Auto-invoke when user builds feature views, selects features, applies transformations, creates training data, retrieves feature vectors, or asks about feature view best practices (labels, filters, joins, transformations, online serving, embeddings).
Hopsworks Feature Views — Python SDK Best Practices
A feature view defines a set of features from one or more feature groups, joined together via a Query. It is the read interface of the feature store — the V between F and T/I in the FTI pattern.
Contract
- Input: one or more feature groups, a feature selection (select/join) forming a Query, an optional label column, and optional transformation functions.
- Output: a named, versioned feature view that produces reproducible training datasets, online feature vectors for serving, and offline batch data.
- Pre-condition: the source feature groups already exist. For online serving every feature group in the view must be online_enabled (sole exception: all-on-demand views). The label, if any, must be in the query selection.
What Is a Feature View
A feature view defines a set of features from one or more feature groups, joined together via a Query. It is the single interface used for:
- Creating reproducible training datasets
- Retrieving online feature vectors for model serving
- Batch scoring with offline data
It is metadata-only (stores no data) and is the feature store's mechanism for preventing training/serving skew: it returns the same ordered features and applies the same model-dependent (MDT) and on-demand (ODT) transformations in training and inference pipelines.
Smoke-test
Verify state with the hops CLI (cheap pre/post-flight):
hops fv list # list feature views (id, name, version, labels)
hops fv info <name> --version 1 # metadata + schema; flags the label column
hops td list <fv-name> --version 1 # training-dataset versions
hops fv get <name> --version 1 --entry "pk=val" # one online feature vector, no Python
Non-interactive delete needs flags: hops fv delete <name> --version 1 --yes --force.
Build the view + training data from the CLI
The whole F→T handoff can run from the CLI, no Python — this is what the terminal
kickoff flow uses. Register any custom transforms first (the --transform flag
resolves them by name from the transformation store, so an unregistered udf
fails):
hops transformation create --file transformations.py # register udfs first
hops fv create <name> --feature-group <derived_fg>:1 \
--transform <fn>:<col> --labels <label> # --join "<fg> LEFT <on>" repeatable
hops td compute <fv> <fv_version> --split "train:0.8,test:0.2" # positional = FEATURE-VIEW version
hops td list <fv> # TD version auto-increments; read it back here
hops fv create --feature-group takes name[:version]; --transform and --join
are repeatable. hops td compute takes the FV version as a required positional —
the training-dataset version it writes auto-increments, so read it from hops td list
rather than assuming 1.
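When the CLI flow is driven from a script, the repeatable flags can be assembled programmatically. The helper below is a minimal sketch that only builds the argument list from the flags described above; `fv_create_argv` and the feature view, feature group, and udf names are hypothetical, and it assumes a single label.

```python
import shlex


def fv_create_argv(name, feature_group, transforms=(), label=None, joins=()):
    """Build the argv for `hops fv create` from the flags described above.

    transforms: iterable of (function_name, column) pairs -> --transform fn:col
    joins: iterable of join strings such as "<fg> LEFT <on>" -> --join "..."
    """
    argv = ["hops", "fv", "create", name, "--feature-group", feature_group]
    for fn, col in transforms:
        argv += ["--transform", f"{fn}:{col}"]  # repeatable flag
    for join in joins:
        argv += ["--join", join]                # repeatable flag
    if label is not None:
        argv += ["--labels", label]
    return argv


argv = fv_create_argv(
    "fraud_fv",                        # hypothetical feature view name
    "transactions_derived:1",          # hypothetical name[:version]
    transforms=[("log_amount", "amount")],  # hypothetical registered udf
    label="is_fraud",
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once the hops CLI is on the PATH and the udf has been registered.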
Ask the user (only when state is ambiguous)
- Label column: which selected feature is the prediction target (or none, for an unsupervised / retrieval view). It must be in the query selection.
- Which features: which feature groups and columns to select, and how they join.
- Online vs offline source FGs: whether the view needs online serving. If yes, every source feature group must be online_enabled; confirm before relying on init_serving().
- Before deleting: fv.delete() / hops fv delete --yes --force is irreversible; confirm the exact name and version with the user, and never tear down a feature view you created as a side effect (temp or test ones included) unless they asked.
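One way to make the delete confirmation explicit in code is a small guard that refuses to call `fv.delete()` unless the caller passes the exact name and version the user confirmed. This is a sketch, not part of the SDK: `delete_feature_view` is a hypothetical helper, and it assumes the feature view object exposes `name` and `version` attributes.

```python
def delete_feature_view(fv, confirmed_name, confirmed_version):
    """Delete fv only if the confirmed name and version match it exactly."""
    if (fv.name, fv.version) != (confirmed_name, confirmed_version):
        raise ValueError(
            f"Refusing to delete {fv.name} v{fv.version}: "
            f"confirmation was for {confirmed_name} v{confirmed_version}"
        )
    fv.delete()  # irreversible, so only reached after an exact match
```

A typo in the name or a stale version number then fails loudly instead of removing the wrong view.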
Creating a Feature View
1. Build a Query (Feature Selection)
Feature selection starts from a root feature group (often, but not necessarily, the one holding the label). From the root you can reach any feature group connected by a join key path; a feature group with no path from the root cannot be included. Use select(), select_all(), or select_except() on a feature group to create a Query, then join additional queries.
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()
# Get feature groups
transactions_fg = fs.get_feature_group("transactions", version=1)
users_fg = fs.get_feature_group("users", version=1)
# Select specific features
query = (
transactions_fg.select(["user_id", "amount", "merchant", "timestamp"])
.join(users_fg.select(["user_id", "age", "country"]), on=["user_id"])
)
Feature Selection Methods
| Method | Description |
|---|---|
| fg.select(["col1", "col2"]) | Select specific features |
| fg.select_all() | Select all features |
| fg.select_except(["col1"]) | Select all except named features |
| fg.select_all(include_primary_key=False, include_event_time=False) | Select all, excluding keys/timestamps |
Join Types
query = fg1.select_all().join(fg2.select_all(), on=["shared_key"], join_type="left")
Supported join types: "left" (default), "inner", "right", "full", "cross", "left_semi_join".
For different key names on each side:
query = fg1.select_all().join(
fg2.select_all(),
left_on=["user_id"],
right_on=["customer_id"],
join_type="inner",
)
Use prefix to avoid column name clashes when joining feature groups with overlapping column names:
query = fg1.select_all().join(fg2.select_all(), on=["id"], prefix="fg2_")
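To decide whether a prefix is needed before building the join, the overlap between the two selections can be checked up front. The helper below is a plain-Python sketch working on lists of column names; `clashing_columns` and the column names are hypothetical, and it assumes the join keys themselves are not counted as clashes.

```python
def clashing_columns(left_cols, right_cols, join_keys):
    """Return column names present on both sides, ignoring the join keys."""
    keys = set(join_keys)
    return sorted((set(left_cols) & set(right_cols)) - keys)


left = ["id", "amount", "updated_at"]
right = ["id", "country", "updated_at"]

clashes = clashing_columns(left, right, join_keys=["id"])
print(clashes)  # ['updated_at']
```

A non-empty result is the signal to pass prefix= on the join.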
2. Create the Feature View
feature_view = fs.create_feature_view(
name="fraud_detection_fv",
version=1,
description="Features for fraud detection model",
query=query,
labels=["is_fraud"],
inference_helper_columns=["merchant"],
training_helper_columns=["timestamp"],
transformation_functions=[...], # see Transformations section
)
Or get-or-create (idempotent):
feature_view = fs.get_or_create_feature_view(
name="fraud_detection_fv",
version=1,
query=query,
labels=["is_fraud"],
)
Set a description. Pass description= on the feature view so it is not an empty envelope in the UI. Per-feature descriptions come from the source feature groups, so describe the columns at the FG (see hops-fg), not here.
The label must be in the query selection.
labels=[...] only marks which already-selected columns are targets; it does not add them. If the label is not in your select(...) (or is dropped by select_except([...])), create fails with FeatureStoreException: Feature name '<label>' could not be found in query. Select the label, then name it in labels=. (The examples above assume is_fraud is part of query.)
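A cheap pre-flight check catches this before the create call. The function below is a sketch over plain lists of names; `missing_labels` is a hypothetical helper, not an SDK call.

```python
def missing_labels(selected_features, labels):
    """Return the labels that are not part of the query selection."""
    selected = set(selected_features)
    return [label for label in labels if label not in selected]


selected = ["user_id", "amount", "merchant", "timestamp"]
absent = missing_labels(selected, ["is_fraud"])
print(absent)  # ['is_fraud']
```

An empty list means every label is selected; anything else should be added to the select(...) call before creating the feature view.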
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| name | str | Feature view name |
| query | Query | Query defining feature selection and joins |
| version | int | Version number (auto-increments if None) |
| labels | list[str] | Which selected features are the prediction target. Must be present in the query selection (labels= marks, it does not select). Excluded from feature vectors at inference |
| inference_helper_columns | list[str] | Features not used in model but available during inference (e.g., for post-processing). Excluded from get_feature_vector(), available via get_inference_helper() |
| training_helper_columns | list[str] | Features not in model schema but useful during training (e.g., for sampling, or for slicing evaluation data by a sensitive attribute like gender to check for bias without training on it). Excluded at inference time |
| transformation_functions | list | Model-dependent transformations (see below) |
| logging_enabled | bool | Enable feature vector logging |
Online Model Serving Requirement
All feature groups in the feature view must be online_enabled for online feature vector retrieval. If any feature group is not online-enabled, init_serving() will raise an error.
The only exception: if all features in the feature view are on-demand (computed at runtime by transformation functions, not stored in any feature group), then online-enabled is not required.
# This will fail at init_serving() if users_fg is NOT online_enabled
feature_view = fs.create_feature_view(
name="my_fv",
query=transactions_fg.select_all().join(users_fg.select_all()),
...
)
feature_view.init_serving() # raises FeatureStoreException
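The requirement can be checked before calling init_serving(). The sketch below assumes each feature group object exposes `name` and `online_enabled` attributes; `offline_only` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real feature groups so the example runs without a cluster.

```python
from types import SimpleNamespace


def offline_only(feature_groups):
    """Names of feature groups that are not online-enabled."""
    return [fg.name for fg in feature_groups if not fg.online_enabled]


# Stand-ins for real feature group objects (hypothetical values).
transactions_fg = SimpleNamespace(name="transactions", online_enabled=True)
users_fg = SimpleNamespace(name="users", online_enabled=False)

blockers = offline_only([transactions_fg, users_fg])
print(blockers)  # ['users']
```

A non-empty result names the feature groups to fix (see hops-fg) before online serving can work.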
Vector Embeddings in Feature Views
Feature groups with an embedding_index (vector embeddings) can be included in a feature view. However:
- Embedding feature groups are skipped from standard online feature store lookups; their features are retrieved from the vector database instead.
- You cannot use embedding features as precomputed features (via passed_features) in get_feature_vector(); they are resolved from the vector DB by primary key.
- Use find_neighbors() on the feature view to perform similarity search, which then automatically retrieves the full feature vector for each neighbor.
from hsfs.embedding import EmbeddingIndex, EmbeddingFeature, SimilarityFunctionType
# Feature group with embeddings
embedding_index = EmbeddingIndex()
embedding_index.add_embedding(name="user_vector", dimension=384)
embedding_fg = fs.get_or_create_feature_group(
name="user_embeddings",
embedding_index=embedding_index,
primary_key=["user_id"],
online_enabled=True,
stream=True,
...
)
# Include in feature view
fv = fs.create_feature_view(
name="recommendation_fv",
query=embedding_fg.select_all().join(profile_fg.select_all(), on=["user_id"]),
...
)
# Similarity search — returns full feature vectors for nearest neighbors
fv.init_serving()
neighbors = fv.find_neighbors(
embedding=[0.1, 0.2, ...], # query vector
k=10,
filter=(embedding_fg.active == True), # optional filter
)
Filters
Filters can be applied at multiple levels: on queries (which define feature views), when creating training data, and during batch/online retrieval.
Filter Operators
Features support Python comparison operators that produce Filter objects:
| Operator | Meaning |
|---|---|
| fg.col == value | Equals |
| fg.col != value | Not equals |
| fg.col > value | Greater than |
| fg.col >= value | Greater than or equal |
| fg.col < value | Less than |
| fg.col <= value | Less than or equal |
| fg.col.isin([v1, v2]) | In list |
| fg.col.like("pattern%") | SQL LIKE pattern |
Combine with & (AND) and | (OR). Always use parentheses:
combined = (fg.amount > 100) & (fg.status == "active")
either = (fg.country == "US") | (fg.country == "CA")
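The parentheses matter because of Python's operator precedence: & binds tighter than comparison operators. The sketch below uses only the standard library's ast module to show how the two spellings parse; the names amount and status are placeholders, not real features.

```python
import ast

unparenthesized = ast.parse("amount > 100 & status == 'active'", mode="eval").body
parenthesized = ast.parse("(amount > 100) & (status == 'active')", mode="eval").body

# Without parentheses: a single chained comparison whose middle operand
# is the expression `100 & status`.
print(type(unparenthesized).__name__)               # Compare
print(ast.unparse(unparenthesized.comparators[0]))  # 100 & status

# With parentheses: `&` combines two separate comparisons.
print(type(parenthesized).__name__)                 # BinOp
```

So an unparenthesized filter does not mean "amount over 100 AND status is active"; it tries to AND the number with the feature first.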
Filters on the Query (Feature View Definition)
Filters applied to the query are baked into the feature view and always active:
query = (
transactions_fg.select_all()
.filter(transactions_fg.amount > 0)
.join(
users_fg.select_all().filter(users_fg.active == True),
on=["user_id"],
)
)
fv = fs.create_feature_view(name="my_fv", query=query, ...)
Filters on Training Data (extra_filter)
Apply additional filters when creating training data without changing the feature view definition:
version, job = fv.create_training_data(
extra_filter=(transactions_fg.merchant != "test_merchant") & (users_fg.age >= 18),
start_time="2025-01-01",
end_time="2025-06-01",
)
The extra_filter is stored with the training dataset metadata and automatically reapplied when reading with get_batch_data().
Time-Based Filters
Training data methods support start_time / end_time parameters that filter on the feature group's event_time column:
version, job = fv.create_training_data(
start_time="2025-01-01",
end_time="2025-06-01",
description="H1 2025 training data",
)
Filters on Vector Similarity Search
neighbors = fv.find_neighbors(
embedding=[0.1, 0.2, 0.3],
k=5,
filter=(fg.category == "electronics") & (fg.price < 1000),
)
Transformations
Transformations on a feature view are the T in FTI. Two kinds:
- Model-dependent (MDT): statistics-based (scalers, encoders, imputers), specific to one model, attached here via transformation_functions= on create_feature_view; applied as the last step before the model, at both training and serving. Training-dataset statistics (mean, min/max, encoding maps) are stored with the training dataset, so init_serving / init_batch_scoring take a training_dataset_version to apply the exact same MDT at inference and avoid skew.
- On-demand (ODT): computed at request time from request_parameters, registered at the feature group (not the FV, since they also run in feature pipelines); auto-included when this FV selects them.
from hsfs.builtin_transformations import standard_scaler, label_encoder, impute_mean
fv = fs.create_feature_view(
name="my_fv", version=1, query=query, labels=["target"],
transformation_functions=[impute_mean("age"), standard_scaler("age"), label_encoder("country")],
)
# on-demand at serving:
fv.get_feature_vector(entry={"user_id": 123}, request_parameters={"current_location": "NYC"})
A transform renames its output to <fn>_<col>_, and a udf is frozen into the FV at create (fixing it forces a new FV version + retrain). Custom udfs import from hopsworks (from hopsworks import udf), and a default-mode udf must run on both a scalar (online) and a Series (offline) or it 500s on the first online predict.
Full transformation reference — built-in tables, the @udf decorator, execution modes, statistics, on-demand, and the transformation store: see hops-transformations.
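Because transformed columns are renamed, downstream code that looks up a column in the training DataFrame needs the new name. The helper below is a sketch that only applies the <fn>_<col>_ pattern stated above to a single input column; `transformed_name` is hypothetical, and transformations with several inputs or outputs are not covered.

```python
def transformed_name(function_name, column):
    """Output column name for a single-column transform: <fn>_<col>_."""
    return f"{function_name}_{column}_"


name = transformed_name("standard_scaler", "age")
print(name)  # standard_scaler_age_
```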
Labels
Labels are features used as the prediction target. They are:
- Included in training data (returned as a separate DataFrame)
- Excluded from feature vectors at inference time
- Excluded from get_batch_data() output by default
fv = fs.create_feature_view(
name="churn_fv",
query=query,
labels=["churned"], # single label
# labels=["label1", "label2"], # multi-label
)
# Training: labels returned separately
X_train, X_test, y_train, y_test = fv.get_train_test_split(training_dataset_version=1)
# Inference: labels excluded automatically
vector = fv.get_feature_vector(entry={"customer_id": 42})
Training Data
Create and Materialize Training Data
# Full dataset
version, job = fv.create_training_data(
description="Q1 2025 training data",
start_time="2025-01-01",
end_time="2025-04-01",
data_format="parquet",
statistics_config=True,
extra_filter=(fg.active == True),
)
# Train/test split (random)
version, job = fv.create_train_test_split(
test_size=0.2,
seed=42,
data_format="parquet",
)
# Train/test split (time-based)
version, job = fv.create_train_test_split(
train_start="2024-01-01",
train_end="2025-01-01",
test_start="2025-01-01",
test_end="2025-04-01",
)
# Train/validation/test split
version, job = fv.create_train_validation_test_split(
validation_size=0.1,
test_size=0.1,
seed=42,
)
Retrieve Materialized Training Data
X, y = fv.get_training_data(
training_dataset_version=1,
dataframe_type="polars",
primary_key=False, # exclude primary keys
event_time=False, # exclude event time
training_helper_columns=False,
)
X_train, X_test, y_train, y_test = fv.get_train_test_split(
training_dataset_version=1,
dataframe_type="polars",
)
X_train, X_val, X_test, y_train, y_val, y_test = fv.get_train_validation_test_split(
training_dataset_version=1,
dataframe_type="pandas",
)
In-Memory Training Data (No Materialization)
For quick iteration, get training data directly as DataFrames without materializing to storage:
X, y = fv.training_data(
start_time="2025-01-01",
end_time="2025-04-01",
dataframe_type="polars",
)
X_train, X_test, y_train, y_test = fv.train_test_split(
test_size=0.2,
dataframe_type="polars",
)
These still create metadata for reproducibility but skip writing to storage.
Training Data Parameters
| Parameter | Type | Description |
|---|---|---|
| start_time / end_time | str, datetime, int | Filter by event time (inclusive start, exclusive end) |
| extra_filter | Filter / Logic | Additional filter expression |
| data_format | str | "parquet", "csv", "tfrecord", "avro", "orc", "json" |
| coalesce | bool | Write to single file (default: False) |
| seed | int | Random seed for reproducible splits |
| statistics_config | bool / dict | Compute statistics for transformations |
| spine | DataFrame | Spine for point-in-time joins (Spark only) |
| test_size | float | Fraction for test set (0-1) |
| validation_size | float | Fraction for validation set (0-1) |
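For random splits, it helps to sanity-check the fractions before launching a job. The sketch below assumes the training share is whatever remains after the test and validation fractions are taken out; `split_fractions` is a hypothetical helper, not an SDK call.

```python
def split_fractions(test_size, validation_size=0.0):
    """Validate split sizes and return the implied share of each split."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    if not 0 <= validation_size < 1:
        raise ValueError("validation_size must be between 0 and 1")
    train = 1.0 - test_size - validation_size
    if train <= 0:
        raise ValueError("test_size + validation_size leaves no training data")
    return {"train": train, "validation": validation_size, "test": test_size}


shares = split_fractions(test_size=0.1, validation_size=0.1)
print({split: round(share, 3) for split, share in shares.items()})
```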
Online Feature Vector Retrieval
Initialize Serving
fv.init_serving(
training_dataset_version=1, # version with transformation statistics
external=None, # auto-detect environment
)
Single Feature Vector
vector = fv.get_feature_vector(
entry={"user_id": 123},
return_type="pandas", # "list" (default), "pandas", "polars", "numpy"
)
Batch Feature Vectors
vectors = fv.get_feature_vectors(
entry=[{"user_id": 123}, {"user_id": 456}],
return_type="pandas",
)
Passed Features (Runtime Values)
Provide feature values from the application that override or supplement stored features:
vector = fv.get_feature_vector(
entry={"user_id": 123},
passed_features={"device_type": "mobile", "session_duration": 45.2},
)
Feature value priority (highest to lowest):
- request_parameters: on-demand transformation inputs
- passed_features: runtime application values
- Online feature store: stored values
- On-demand computation: computed features
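The lookup order for the first three sources can be pictured with a ChainMap, which searches its mappings left to right. This is only an illustration of the priority rule, not how the SDK resolves values internally; `resolve_inputs` and the feature values are hypothetical, and on-demand computed features are left out.

```python
from collections import ChainMap


def resolve_inputs(request_parameters, passed_features, stored):
    """Layer the sources so earlier mappings win on key conflicts."""
    return ChainMap(request_parameters, passed_features, stored)


values = resolve_inputs(
    request_parameters={"current_location": "NYC"},
    passed_features={"device_type": "mobile", "current_location": "SF"},
    stored={"device_type": "desktop", "age": 31},
)
print(values["current_location"])  # NYC (request parameter wins)
print(values["device_type"])       # mobile (passed feature beats stored)
print(values["age"])               # 31 (only the online store has it)
```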
Request Parameters (On-Demand Features)
vector = fv.get_feature_vector(
entry={"user_id": 123},
request_parameters={"query_text": "running shoes"},
)
Inference Helper Columns
Retrieved separately from the feature vector:
helpers = fv.get_inference_helper(
entry={"user_id": 123},
return_type="dict", # "pandas" (default), "dict", "polars"
)
Control Transformations
# Get untransformed feature vector
raw_vector = fv.get_feature_vector(
entry={"user_id": 123},
transform=False,
)
# Skip on-demand features
vector = fv.get_feature_vector(
entry={"user_id": 123},
on_demand_features=False,
)
Batch Scoring (Offline)
# Initialize batch scoring (optional — called automatically)
fv.init_batch_scoring(training_dataset_version=1)
# Get batch data with transformations applied
batch_df = fv.get_batch_data(
start_time="2025-03-01",
end_time="2025-04-01",
dataframe_type="polars",
transformed=True, # apply model-dependent transformations
primary_key=False,
event_time=False,
inference_helper_columns=False,
)
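The two calls above can be wrapped into one scoring step so the training dataset version is always pinned explicitly. This is a sketch using only the calls shown in this section; `score_batch` is a hypothetical helper, and `model` is assumed to be any object with a `predict` method.

```python
def score_batch(fv, model, start_time, end_time, training_dataset_version):
    """Read a transformed batch for the given window and score it."""
    # Pin the statistics used by model-dependent transformations.
    fv.init_batch_scoring(training_dataset_version=training_dataset_version)
    batch_df = fv.get_batch_data(
        start_time=start_time,
        end_time=end_time,
        transformed=True,  # apply model-dependent transformations
    )
    return model.predict(batch_df)
```

Passing the version on every call keeps batch inference tied to the same statistics the model was trained with.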
Complete Example: End-to-End Feature View Pipeline
import hopsworks
from hsfs.builtin_transformations import standard_scaler, label_encoder, impute_mean
from hopsworks import udf
# 1. Connect
project = hopsworks.login()
fs = project.get_feature_store()
# 2. Get feature groups (all must be online_enabled for online serving)
transactions_fg = fs.get_feature_group("transactions", version=1)
users_fg = fs.get_feature_group("users", version=1)
# 3. Build query with feature selection and filters
query = (
transactions_fg.select(["user_id", "amount", "merchant", "category", "timestamp", "is_fraud"])
.filter(transactions_fg.amount > 0)
.join(
users_fg.select(["user_id", "age", "country", "account_age_days"]),
on=["user_id"],
)
)
# 4. Define custom transformation
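@udf(float, drop=["amount"])
def log_amount(amount):
    # numpy handles both a scalar (online) and a pandas Series (offline);
    # math.log1p would raise on a Series
    import numpy as np
    return np.log1p(amount)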
# 5. Create feature view
fv = fs.get_or_create_feature_view(
name="fraud_detection_fv",
version=1,
query=query,
labels=["is_fraud"],
inference_helper_columns=["merchant"],
training_helper_columns=["timestamp"],
transformation_functions=[
impute_mean("age"),
standard_scaler("age"),
standard_scaler("account_age_days"),
log_amount("amount"),
label_encoder("category"),
label_encoder("country"),
],
)
# 6. Create training data
version, job = fv.create_train_test_split(
test_size=0.2,
seed=42,
statistics_config=True,
description="Fraud detection training data",
)
# 7. Retrieve training data
X_train, X_test, y_train, y_test = fv.get_train_test_split(
training_dataset_version=version,
dataframe_type="pandas",
)
# 8. Train model (user's code)
model = train_model(X_train, y_train)
# 9. Online serving
fv.init_serving(training_dataset_version=version)
vector = fv.get_feature_vector(
entry={"user_id": 42},
return_type="list",
)
prediction = model.predict([vector])
# 10. Batch scoring
batch_df = fv.get_batch_data(
start_time="2025-03-01",
end_time="2025-04-01",
dataframe_type="pandas",
)
batch_predictions = model.predict(batch_df)
Quick Reference
| Task | Code |
|---|---|
| Create feature view | fs.create_feature_view(name=..., query=..., labels=[...]) |
| Get feature view | fs.get_feature_view("name", version=1) |
| Select features | fg.select(["col1", "col2"]) |
| Select all except | fg.select_except(["col1"]) |
| Join feature groups | fg1.select_all().join(fg2.select_all(), on=["key"]) |
| Filter query | .filter((fg.col > 10) & (fg.col2 == "x")) |
| Create training data | fv.create_training_data(start_time=..., end_time=...) |
| Train/test split | fv.create_train_test_split(test_size=0.2) |
| Get training data | fv.get_training_data(training_dataset_version=1) |
| In-memory training data | fv.training_data(start_time=..., end_time=...) |
| Init online serving | fv.init_serving(training_dataset_version=1) |
| Get feature vector | fv.get_feature_vector(entry={"pk": val}) |
| Batch feature vectors | fv.get_feature_vectors(entry=[{"pk": v1}, {"pk": v2}]) |
| Get inference helpers | fv.get_inference_helper(entry={"pk": val}) |
| Batch scoring | fv.get_batch_data(start_time=..., end_time=...) |
| Similarity search | fv.find_neighbors(embedding=[...], k=10) |
| Delete feature view | fv.delete() |
Next Steps
- Train a model on this view's training data: hops-train.
- Batch scoring: hops-batch-inference. Online serving: hops-online-inference.
- Need to create or fix the source feature groups first: hops-fg.