name: brasa-template-writer description: Write new brasa YAML templates from scratch or migrate legacy templates (with reader.function and handler-based fields) to the modern pipeline-based format. Use when the user asks to create a new template, write a template, migrate a legacy template, convert an old template, or work with YAML template definitions. Also trigger when the user mentions template creation, template migration, or refers to legacy/old templates.
Brasa Template Writer
Write new templates from scratch or migrate legacy templates to modern pipeline-based format.
Mode Detection
- Write new: User asks to create/write/add a new template
- Migrate: User asks to migrate/convert/update a legacy template (one with
reader.functionorhandler:field blocks)
Workflow
Write New Template
Determine template type:
- Single-dataset reader: one set of fields, reads a single file format
- Multi-dataset reader: multiple datasets from one source (e.g., XML with multiple tags)
- ETL pipeline: transforms data from upstream datasets using SQL or steps
If user provides a full spec (URL, format, fields), generate directly
If info is missing, ask one question at a time:
- What data source? (URL, institution, file format)
- What file format? (CSV, FWF, JSON, XML, Excel)
- What fields? (names, types, descriptions)
- What layer? (input, staging, curated)
- Partitioning? (typically
[refdate])
Generate the template YAML
Save to
templates/in appropriate subdirectory based on data source:templates/b3/for B3 data (subdirs: equities, futures, indexes, raw, options)templates/anbima/for ANBIMAtemplates/bcb/for BCB/SGStemplates/cvm/for CVMtemplates/td/for Tesouro Direto
Migrate Legacy Template
- Read the legacy template from
templates/legacy/or wherever it lives - Analyze the reader function (see "Analyzing the Reader Function" below) to understand how fields are parsed
- Apply all transformations (see Migration Rules below)
- Save the new template to the appropriate
templates/subdirectory - Add a YAML comment at top noting it was migrated from the legacy version
Analyzing the Reader Function
The legacy reader.function contains valuable information about field parsing that should inform your field type definitions. Always examine the function implementation in brasa/readers/ before migrating.
Common patterns to look for:
| Code Pattern | Field Type |
|---|---|
pd.to_numeric(df[col].str.replace(",", "."), ...) |
type: numeric(decimal=",") |
pd.to_numeric(df[col].str.replace(".", "").str.replace(",", "."), ...) |
type: numeric(decimal=",", thousands=".") |
pd.to_datetime(df[col], format='%d/%m/%Y', ...) |
type: date(format='%d/%m/%Y') |
pd.to_datetime(df[col], dayfirst=True, ...) |
type: date (with dayfirst in reader context) |
pd.to_numeric(df[col], ...) |
type: numeric (plain numeric parsing) |
str.replace(",", ".") on numeric field |
indicates comma as decimal separator → use decimal="," |
str.replace(".", "").str.replace(",", ".") |
indicates European format (dot=thousands, comma=decimal) → use thousands=".", decimal="," |
Example: If the reader function has:
df["volume"] = pd.to_numeric(df["volume"].str.replace(",", "."), errors="coerce")
df["price"] = pd.to_numeric(df["price"].str.replace(",", "."), errors="coerce")
Then in the YAML, these should be:
- name: volume
type: numeric(decimal=",")
- name: price
type: numeric(decimal=",")
Not just type: numeric with no parameters.
Migration Rules
Field Type Migration
Legacy (handler:) |
Modern (type:) |
|---|---|
handler: {type: numeric, dec: 2.0} |
type: numeric(dec=2.0) |
handler: {type: numeric, dec: 0.0} |
type: integer |
handler: {type: Date, format: '%Y%m%d'} |
type: date(format='%Y%m%d') |
handler: {type: POSIXct, format: '%H%M%S'} |
type: datetime(format='%H%M%S') |
handler: {type: character} |
type: character |
handler: {type: factor, ...} |
type: character |
- Remove the
handler:block entirely from each field - Keep
name:,description:,width:(if FWF), and anytag:attributes - If
handler.decis a field reference (e.g.,dec: num_casas_decimais_2), add a comment noting the dynamic decimal and usetype: numericwithout dec parameter
Sign Field Migration
For numeric fields with sign: some_column:
- Convert the field to
type: numeric(dec=N)(without sign) - Add a YAML comment:
# sign: originally from <sign_column> - After the
apply_fieldsstep in the pipeline, add steps:
# Apply sign columns to their target numeric fields
- step: custom_simple
code: |
import numpy as np
sign_map = {
'cot_primeiro_negocio': 'sinal_cot_primeiro_negocio',
# ... list all sign->target pairs
}
for target, sign_col in sign_map.items():
if target in df.columns and sign_col in df.columns:
mask = df[sign_col].str.strip() == '-'
df.loc[mask, target] = -df.loc[mask, target]
# Drop sign columns (no longer needed)
- step: drop_columns
columns: [sinal_cot_primeiro_negocio, sinal_cot_menor_negocio, ...]
Also remove the sign fields from the fields: list.
Structural Migration
- Replace
reader: { function: ... }withreader: { pipeline: [...] } - Choose pipeline steps based on
filetype:FWF→read_fwf(dtype: str) → filter if needed →apply_fieldsCSV→read_csv→apply_fieldsJSON→read_json→apply_fieldsExcel/XLS→read_excel→apply_fields
- Add
writer:block withlayer: inputandpartitioning: [refdate] - Add
downloader:block if the legacy template has URL info or if user provides it - Remove
filename:andfiletype:top-level keys (these are inferred by the pipeline)
Field Type Reference
| Type | Parameters | Example |
|---|---|---|
character / string / char |
none | type: character |
integer / int |
none | type: integer |
numeric / number |
dec, sign, thousands, decimal |
type: numeric(dec=2.0) |
date |
format (default: %Y-%m-%d) |
type: date(format='%Y%m%d') |
datetime / posixct |
format (default: %Y-%m-%d %H:%M:%S) |
type: datetime(format='%H%M%S') |
time |
format (default: %H:%M:%S) |
type: time(format='%H%M') |
boolean / bool |
none | type: boolean |
Syntax Rules
- Parameters in parentheses:
typename(key=value) - String values in single quotes:
date(format='%Y%m%d') - Multiple params comma-separated:
numeric(dec=2, decimal=',') numeric(dec=0.0)for integers stored as fixed-width numbers → preferintegerin new templates
Pipeline Steps Reference
Comprehensive reference of all registered pipeline steps organized by category.
I/O Steps (Reading Data)
| Step | Parameters | Description |
|---|---|---|
read_csv |
separator, skip, header, names, converters |
Read CSV file with optional custom separator and header |
read_fwf |
colspecs, names, skip, dtype |
Read fixed-width format files (plain or gzip); derives column specs from field widths |
read_json |
orient, path |
Read JSON file (supports gzip) into DataFrame |
read_excel |
sheet, skip, header |
Read Excel file into DataFrame |
Column Manipulation Steps
| Step | Parameters | Description |
|---|---|---|
set_columns |
names (required) |
Set column names for DataFrame |
rename_columns |
mapping (required) |
Rename columns using dict mapping |
select_columns |
columns (required) |
Select specific columns to keep |
drop_columns |
columns (required) |
Drop columns from data |
add_column |
name, value, from (with where and key), only_if_missing |
Add new column with static or dynamic value |
reorder_columns |
order (required), keep_rest |
Reorder columns in DataFrame |
Data Transformation Steps
| Step | Parameters | Description |
|---|---|---|
apply_fields |
errors (coerce/raise/ignore), set_columns |
Apply field type definitions using Fieldset |
apply_fields_multi |
errors |
Apply field definitions to multiple DataFrames in dict |
parse_numeric |
columns (required), errors |
Parse string columns as numeric using context settings |
parse_date |
columns (required), format, errors |
Parse string columns as dates |
parse_datetime |
columns (required), format, errors |
Parse string columns as datetime values |
fill_na |
columns, value, method (ffill/bfill) |
Fill NA/NaN values |
drop_duplicates |
subset, keep (first/last/False) |
Remove duplicate rows |
drop_na |
columns, how (any/all) |
Drop rows with NA/NaN values |
filter_rows |
column, operator (eq/ne/gt/lt/etc), value |
Filter rows based on conditions |
forward_fill_column |
column (required), condition |
Forward fill values in column |
extract_regex |
column, pattern (required), output, group |
Extract values using regex capture groups |
concat_columns |
columns, output (required), separator |
Concatenate multiple columns into one |
melt |
id_vars, value_vars, var_name, value_name |
Unpivot DataFrame from wide to long format |
sort |
by (required), ascending, descending, na_position |
Sort data by columns |
make_date |
year_column, month_column, day_column (required), output, errors |
Create date column from components |
str_replace |
column, pattern (required), replacement, output, regex |
Replace pattern in string column |
cast |
column, dtype (required), errors |
Cast column(s) to specific type |
ETL Pipeline Steps
| Step | Parameters | Description |
|---|---|---|
load |
template OR (input, layer) |
Load a dataset as PyArrow Dataset |
concat_datasets |
inputs (required), layer (required), columns |
Concatenate multiple datasets vertically |
dataset_filter |
where (required) |
Filter rows by equality conditions |
dataset_select |
columns (required) |
Select columns from dataset |
select_fields |
(uses context.fields) | Select columns based on field names |
dataset_sort |
by (required), descending |
Sort PyArrow dataset |
dataset_drop_columns |
columns (required) |
Drop columns from dataset |
dataset_rename_columns |
mapping (required) |
Rename columns in dataset |
dataset_drop_duplicates |
subset, keep |
Remove duplicate rows from dataset |
dataset_fill_na |
value, method, columns |
Fill missing values in dataset |
to_dataframe |
(none) | Convert PyArrow Dataset/Table to pandas DataFrame |
sql_query |
datasets (required), query (required) |
Execute SQL on datasets in in-memory DuckDB |
future_maturity_to_date |
code_column, date_column (required), maturity_day, calendar |
Convert future maturity codes to dates |
following_bizday |
date_column, adjusted_column (required), calendar |
Adjust dates to following business day |
bizdays |
from_column, to_column, output_column (required), calendar |
Calculate business days between dates |
implied_rate |
price_column, rate_column, days_to_maturity_column (required), compounding, forward_price |
Calculate implied interest rate from price |
flatten_columns |
columns (required), separator |
Flatten delimited values into separate rows |
B3-Specific Steps
| Step | Parameters | Description |
|---|---|---|
b3_read_bvbg028_xml |
(uses datasets config) | Read/parse B3 BVBG028 gzipped XML file (returns Dict[str, DataFrame]) |
b3_read_bvbg086_xml |
(uses field tags) | Read/parse B3 BVBG086 gzipped XML file |
b3_read_bvbg087_xml |
(uses datasets config) | Read/parse B3 BVBG087 gzipped XML file |
b3_read_company_info_json |
(uses datasets config) | Read B3 company info gzipped JSON (returns Dict[str, DataFrame]) |
b3_read_company_details_json |
(none) | Read B3 company details JSON, expands otherCodes array |
b3_add_columns_from_json_fields |
mapping (required) |
Parse B3 JSON fields and add as columns |
b3_parse_refdate_from_html |
xpath, attribute, store_as |
Parse reference date from B3 HTML page |
b3_forward_fill_commodity |
column |
Forward fill commodity names in B3 settlement prices |
b3_extract_commodity_code |
column |
Extract commodity code from commodity name |
b3_create_symbol |
commodity_column, maturity_column, output_column |
Create futures symbol by concatenating commodity and maturity |
Canonical Examples
These templates serve as models for creating new templates of each type.
Example 1: Single-Dataset FWF Reader
Fixed-width format (FWF) file with type filtering and field conversion.
id: b3-cotahist-daily
description: Cotações Históricas do Pregão de Ações - Arquivo Diário
downloader:
verify_ssl: false
function: brasa.downloaders.datetime_download
url: https://bvmf.bmfbovespa.com.br/InstDados/SerHist/COTAHIST_D%d%m%Y.ZIP
format: zip
args:
refdate: ~
reader:
encoding: latin1
locale: en
pipeline:
- step: read_fwf
dtype: str
- step: filter_rows
column: regtype
operator: eq
value: '01'
- step: apply_fields
errors: coerce
writer:
layer: input
partitioning: [refdate]
fields:
- name: regtype
description: Tipo de registro
type: character
width: 2
- name: refdate
description: Data do pregão
type: date(format='%Y%m%d')
width: 8
- name: symbol
description: Código de negociação
type: character
width: 12
- name: open
description: Preço de abertura
type: numeric(dec=2.0)
width: 13
- name: high
description: Preço máximo
type: numeric(dec=2.0)
width: 13
- name: low
description: Preço mínimo
type: numeric(dec=2.0)
width: 13
- name: close
description: Preço último negócio
type: numeric(dec=2.0)
width: 13
- name: volume
description: Volume total negociado
type: numeric(dec=2.0)
width: 18
- name: traded_contracts
description: Quantidade negociada
type: integer
width: 18
Key features:
read_fwfreads fixed-width format (widths derived from field definitions)filter_rowskeeps only type "01" recordsapply_fieldswitherrors: coerceconverts to proper typesdatetime_downloadwithrefdateparameter
Example 2: Single-Dataset CSV Reader
CSV with custom separator, encoding, column renaming, and context variable injection.
id: cvm-companies-registration
description: Cadastro de companhias abertas da CVM
downloader:
function: brasa.downloaders.simple_download
verify_ssl: false
extra-key: date
url: https://dados.cvm.gov.br/dados/CIA_ABERTA/CAD/DADOS/cad_cia_aberta.csv
format: csv
reader:
locale: pt
encoding: latin1
pipeline:
- step: read_csv
separator: ";"
- step: add_column
from:
where: extra_key
name: refdate
- step: rename_columns
mapping:
CNPJ_CIA: cnpj_cia
DENOM_SOCIAL: denom_social
DT_REG: dt_reg
SETOR: setor
SUBSETOR: subsetor
- step: apply_fields
errors: coerce
writer:
layer: input
partitioning: [refdate]
fields:
- name: cnpj_cia
description: CNPJ da companhia
type: character
- name: denom_social
description: Denominação social
type: character
- name: dt_reg
description: Data de registro
type: date(format='%Y-%m-%d')
- name: setor
description: Setor econômico
type: character
- name: subsetor
description: Subsetor econômico
type: character
- name: refdate
description: Data de referência
type: date
Key features:
read_csvwith custom separator (semicolon)add_columnfromextra_key(download metadata)rename_columnsmapping for CSV headersencoding: latin1for non-UTF8 fileslocale: ptfor Portuguese number formatting
Example 3: Multi-Dataset XML Reader
XML file with multiple datasets (equities and options) extracted from different tags.
id: b3-bvbg028
description: Arquivo de Preços de Mercado
downloader:
function: brasa.downloaders.datetime_download
url: https://www.b3.com.br/pesquisapregao/download?filelist=IN%y%m%d.zip
format: zip
args:
refdate: ~
reader:
locale: en
pipeline:
- step: b3_read_bvbg028_xml
- step: apply_fields_multi
writer:
partitioning: [refdate]
datasets:
equities:
tag: EqtyInf
fields:
- name: refdate
description: Data de referência
tag: RptParams/RptDtAndTm/Dt
type: date
- name: symbol
description: Código de negociação
tag: InstrmInf/EqtyInf/TckrSymb
type: character
- name: isin
description: Código ISIN
tag: InstrmInf/EqtyInf/ISIN
type: character
- name: corporation_name
description: Razão social
tag: InstrmInf/EqtyInf/CrpnNm
type: character
- name: open
description: Preço de abertura
tag: InstrmInf/EqtyInf/FrstPric
type: numeric
- name: close
description: Preço de fechamento
tag: InstrmInf/EqtyInf/LastPric
type: numeric
options_on_equities:
tag: OptnOnEqtsInf
fields:
- name: refdate
description: Data de referência
tag: RptParams/RptDtAndTm/Dt
type: date
- name: symbol
description: Código de negociação
tag: InstrmInf/OptnOnEqtsInf/TckrSymb
type: character
- name: exercise_price
description: Preço de exercício
tag: InstrmInf/OptnOnEqtsInf/ExrcPric
type: numeric
- name: maturity_date
description: Data de vencimento
tag: InstrmInf/OptnOnEqtsInf/XprtnDt
type: date
Key features:
b3_read_bvbg028_xmlis a B3-specific step that parses XML into multiple DataFramesdatasets:block defines multiple output datasets with different XML tags- Each dataset has its own
fields:withtag:attributes for XPath extraction apply_fields_multiapplies field conversions to all datasets
Example 4: ETL with SQL
ETL template that loads upstream datasets and transforms via SQL query.
id: b3-equities-returns
description: Dataset de retornos de ações
etl:
pipeline:
- step: sql_query
datasets:
- staging.b3-cotahist
- staging.b3-equities-spot-market
query: |
WITH equity_symbols AS (
SELECT DISTINCT symbol
FROM 'staging.b3-equities-spot-market'
WHERE security_category IN (1, 11, 13)
)
SELECT
t.refdate,
t.symbol,
(t.close / LAG(t.close) OVER (PARTITION BY t.symbol ORDER BY t.refdate)) - 1 AS pct_return,
LN(t.close / LAG(t.close) OVER (PARTITION BY t.symbol ORDER BY t.refdate)) AS log_return
FROM 'staging.b3-cotahist' t
INNER JOIN equity_symbols s ON t.symbol = s.symbol
WHERE ROW_NUMBER() OVER (PARTITION BY t.symbol ORDER BY t.refdate) > 1
ORDER BY t.refdate, t.symbol
- step: apply_fields
errors: coerce
writer:
layer: staging
fields:
- name: refdate
description: Data de referência
type: date
- name: symbol
description: Símbolo do ativo
type: character
- name: pct_return
description: Retorno percentual
type: numeric
- name: log_return
description: Retorno logarítmico
type: numeric
Key features:
etl:block with pipeline instead ofreader:sql_queryloads upstream datasets and executes SQL- Uses CTEs, window functions, and JOINs for transformations
- Writes to
staginglayer instead ofinput - No
downloader:needed (data from upstream templates)