name: create-data-source
description: Create a new PySpark data source implementation. Use when adding a new connector, data source, or integration to the project.
Create Data Source
Overview
This skill walks you through adding a new PySpark data source to the repository. A data source typically consists of:
- DataSource: The main entry point, defining capabilities and schema.
- Reader: Logic for reading data (batch/stream).
- Writer: Logic for writing data (batch/stream).
- Tests: Unit and integration tests.
- Documentation: Usage guide and API reference.
Workflow
- Define Requirements: Determine whether the source supports reading, writing, or both, and whether it is batch or streaming.
- Implementation: Create the implementation file in pyspark_datasources/.
- Registration: Register the new source in pyspark_datasources/__init__.py.
- Dependencies: Add any required libraries to pyproject.toml.
- Testing: Create a test file in tests/.
- Documentation: Add documentation in docs/datasources/ and update mkdocs.yml and README.md.
Implementation Details
1. Create Implementation File
Create a new file pyspark_datasources/<name>.py. Use the templates in templates.md.
- Implement the DataSource class.
- Implement DataSourceReader (if reading).
- Implement DataSourceWriter (if writing).
- Define the schema in the DataSource class.
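As a rough illustration of how these pieces fit together, here is a minimal read-only sketch. It is not taken from templates.md: the names ExampleDataSource and ExampleReader, the "example" format name, and the rows option are all hypothetical. The fallback stubs exist only so the sketch can be exercised without PySpark 4.0+ installed.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Stand-ins (assumption: only used when PySpark 4.0+ is unavailable).
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class ExampleDataSource(DataSource):
    """Hypothetical source that emits a configurable number of synthetic rows."""

    @classmethod
    def name(cls):
        # Short name used in spark.read.format("example").
        return "example"

    def schema(self):
        # DDL string describing the columns the reader yields.
        return "id INT, label STRING"

    def reader(self, schema):
        return ExampleReader(self.options)


class ExampleReader(DataSourceReader):
    def __init__(self, options):
        # Option values arrive as strings, so convert explicitly.
        self.rows = int(options.get("rows", "3"))

    def read(self, partition):
        # Yield one tuple per row, in the same column order as the schema.
        for i in range(self.rows):
            yield (i, f"item-{i}")
```

With a live Spark session, calling spark.dataSource.register(ExampleDataSource) and then spark.read.format("example").load() should exercise the same code path. A writer would follow the same shape, with writer() returning a DataSourceWriter subclass.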
2. Register Data Source
Add the new class to pyspark_datasources/__init__.py:
from .<name> import <Name>DataSource
3. Add Dependencies
If the data source requires external libraries:
- Add them to [project.optional-dependencies] in pyproject.toml.
- Update the all group to include the new dependencies.
4. Add Tests
Create tests/test_<name>.py.
- Use unittest.mock to mock external services/libraries.
- Test registration, reading, and writing logic.
- See templates.md for test structure.
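The mocking approach can be sketched as follows. This is not the structure from templates.md: ExampleReader and its client.list_items() call are hypothetical stand-ins for a reader that wraps an external service, and only the read path is covered here.

```python
import unittest
from unittest import mock


class ExampleReader:
    """Hypothetical reader that pulls records from an injected service client."""

    def __init__(self, client):
        self.client = client

    def read(self, partition):
        # Convert each service record into a tuple matching the schema order.
        for item in self.client.list_items():
            yield (item["id"], item["name"])


class ExampleReaderTest(unittest.TestCase):
    def test_read_yields_tuples(self):
        # Replace the external service with a mock returning canned records.
        client = mock.Mock()
        client.list_items.return_value = [{"id": 1, "name": "a"}]

        rows = list(ExampleReader(client).read(None))

        self.assertEqual(rows, [(1, "a")])
        client.list_items.assert_called_once_with()

    def test_read_handles_empty_result(self):
        client = mock.Mock()
        client.list_items.return_value = []
        self.assertEqual(list(ExampleReader(client).read(None)), [])
```

Injecting the client keeps the test independent of network access. When the real reader imports its library inside a method, mock.patch on that import path is the usual alternative.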
5. Add Documentation
- Create docs/datasources/<name>.md.
- Add the new page to nav in mkdocs.yml.
- Add installation and usage examples to README.md and docs/data-sources-guide.md.
Checklist
Use the checklist in checklist.md to track your progress.
Resources
- Python Data Source API Documentation
- Existing implementations in pyspark_datasources/ (e.g., github.py, salesforce.py).