DataFusion
https://arrow.apache.org/datafusion
This backend is experimental and is subject to backwards incompatible changes.
Install
Install Ibis and dependencies for the Apache DataFusion backend.

Install with the datafusion extra:

pip install 'ibis-framework[datafusion]'

And connect:

import ibis

con = ibis.datafusion.connect()

Adjust connection parameters as needed.
Install for Apache DataFusion:

conda install -c conda-forge ibis-datafusion

And connect:

import ibis

con = ibis.datafusion.connect()

Adjust connection parameters as needed.
Install for Apache DataFusion:

mamba install -c conda-forge ibis-datafusion

And connect:

import ibis

con = ibis.datafusion.connect()

Adjust connection parameters as needed.
Connect
ibis.datafusion.connect
con = ibis.datafusion.connect()

con = ibis.datafusion.connect(
    config={"table1": "path/to/file.parquet", "table2": "path/to/file.csv"}
)

ibis.datafusion.connect is a thin wrapper around ibis.backends.datafusion.Backend.do_connect.
Connection Parameters
do_connect
do_connect(self, config=None)
Create a DataFusion backend for use with Ibis.
Parameters
Name | Type | Description | Default
---|---|---|---
config | Mapping[str, str \| Path] \| SessionContext \| None | Mapping of table names to files. | None
Examples
>>> import ibis
>>> config = {"t": "path/to/file.parquet", "s": "path/to/file.csv"}
>>> ibis.datafusion.connect(config)
datafusion.Backend
compile
compile(self, expr, params=None, **kwargs)
Compile an expression.
create_database
create_database(self, name, force=False)
Create a new database.
Parameters
Name | Type | Description | Default
---|---|---|---
name | str | Name of the new database. | required
force | bool | If False, an exception is raised if the database already exists. | False
create_schema
create_schema(self, name, database=None, force=False)
Create a schema named name in database.

Parameters

Name | Type | Description | Default
---|---|---|---
name | str | Name of the schema to create. | required
database | str \| None | Name of the database in which to create the schema. If None, the current database is used. | None
force | bool | If False, an exception is raised if the schema exists. | False
create_table
create_table(self, *_, **__)
Create a new table.
Parameters
Name | Type | Description | Default
---|---|---|---
name | str | Name of the new table. | required
obj | pd.DataFrame \| pa.Table \| ir.Table \| None | An Ibis table expression or pandas table that will be used to extract the schema and the data of the new table. If not provided, schema must be given. | None
schema | ibis.Schema \| None | The schema for the new table. Only one of schema or obj can be provided. | None
database | str \| None | Name of the database where the table will be created, if not the default. | None
temp | bool | Whether the table is temporary. | False
overwrite | bool | Whether to clobber existing data. | False

Returns

Type | Description
---|---
Table | The table that was created.
create_view
create_view(self, *_, **__)
Create a new view from an expression.
Parameters
Name | Type | Description | Default
---|---|---|---
name | str | Name of the new view. | required
obj | ir.Table | An Ibis table expression that will be used to create the view. | required
database | str \| None | Name of the database where the view will be created; if not provided, the database's default is used. | None
overwrite | bool | Whether to clobber an existing view with the same name. | False

Returns

Type | Description
---|---
Table | The view that was created.
drop_database
drop_database(self, name, force=False)
Drop the database named name.

Parameters

Name | Type | Description | Default
---|---|---|---
name | str | Database to drop. | required
force | bool | If False, an exception is raised if the database does not exist. | False
drop_schema
drop_schema(self, name, database=None, force=False)
Drop the schema named name in database.

Parameters

Name | Type | Description | Default
---|---|---|---
name | str | Name of the schema to drop. | required
database | str \| None | Name of the database to drop the schema from. If None, the current database is used. | None
force | bool | If False, an exception is raised if the schema does not exist. | False
drop_table
drop_table(self, *_, **__)
Drop a table.
Parameters
Name | Type | Description | Default
---|---|---|---
name | str | Name of the table to drop. | required
database | str \| None | Name of the database where the table exists, if not the default. | None
force | bool | If False, an exception is raised if the table does not exist. | False
drop_view
drop_view(self, *_, **__)
Drop a view.
Parameters
Name | Type | Description | Default
---|---|---|---
name | str | Name of the view to drop. | required
database | str \| None | Name of the database where the view exists, if not the default. | None
force | bool | If False, an exception is raised if the view does not exist. | False
execute
execute(self, expr, params=None, limit='default', **kwargs)
Execute an expression.
has_operation
has_operation(cls, operation)
Return whether the backend implements support for operation.

Parameters

Name | Type | Description | Default
---|---|---|---
operation | type[ops.Value] | A class corresponding to an operation. | required

Returns

Type | Description
---|---
bool | Whether the backend implements the operation.
Examples
>>> import ibis
>>> import ibis.expr.operations as ops
>>> ibis.sqlite.has_operation(ops.ArrayIndex)
False
>>> ibis.postgres.has_operation(ops.ArrayIndex)
True
list_databases
list_databases(self, like=None)
List existing databases in the current connection.
Parameters
Name | Type | Description | Default
---|---|---|---
like | str \| None | A pattern in Python's regex format to filter returned database names. | None

Returns

Type | Description
---|---
list[str] | The database names that exist in the current connection, that match the like pattern if provided.
list_schemas
list_schemas(self, like=None, database=None)
List existing schemas in the current connection.
Parameters
Name | Type | Description | Default
---|---|---|---
like | str \| None | A pattern in Python's regex format to filter returned schema names. | None
database | str \| None | The database to list schemas from. If None, the current database is searched. | None

Returns

Type | Description
---|---
list[str] | The schema names that exist in the current connection, that match the like pattern if provided.
list_tables
list_tables(self, like=None, database=None)
List the available tables.
read_csv
read_csv(self, path, table_name=None, **kwargs)
Register a CSV file as a table in the current database.
Parameters
Name | Type | Description | Default
---|---|---|---
path | str \| Path | The data source. A string or Path to the CSV file. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
**kwargs | Any | Additional keyword arguments passed to the DataFusion loading function. | {}

Returns

Type | Description
---|---
ir.Table | The just-registered table.
read_delta
read_delta(self, source_table, table_name=None, **kwargs)
Register a Delta Lake table as a table in the current database.
Parameters
Name | Type | Description | Default
---|---|---|---
source_table | str \| Path | The data source. Must be a directory containing a Delta Lake table. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
**kwargs | Any | Additional keyword arguments passed to deltalake.DeltaTable. | {}

Returns

Type | Description
---|---
ir.Table | The just-registered table.
read_parquet
read_parquet(self, path, table_name=None, **kwargs)
Register a parquet file as a table in the current database.
Parameters
Name | Type | Description | Default
---|---|---|---
path | str \| Path | The data source. | required
table_name | str \| None | An optional name to use for the created table. This defaults to a sequentially generated name. | None
**kwargs | Any | Additional keyword arguments passed to the DataFusion loading function. | {}

Returns

Type | Description
---|---
ir.Table | The just-registered table.
register
register(self, source, table_name=None, **kwargs)
Register a data set with table_name located at source.
Parameters
Name | Type | Description | Default
---|---|---|---
source | str \| Path \| pa.Table \| pa.RecordBatch \| pa.Dataset \| pd.DataFrame | The data source(s). May be a path to a file or directory of Parquet/CSV files, a pandas DataFrame, or a PyArrow table, dataset, or record batch. | required
table_name | str \| None | The name of the table. | None
kwargs | Any | DataFusion-specific keyword arguments. | {}
Examples
Register a csv:
>>> import ibis
>>> conn = ibis.datafusion.connect()
>>> conn.register("path/to/data.csv", "my_table")
>>> conn.table("my_table")
Register a PyArrow table:
>>> import pyarrow as pa
>>> tab = pa.table({"x": [1, 2, 3]})
>>> conn.register(tab, "my_table")
>>> conn.table("my_table")
Register a PyArrow dataset:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/table")
>>> conn.register(dataset, "my_table")
>>> conn.table("my_table")
table
table(self, name, schema=None)
Get an ibis expression representing a DataFusion table.
Parameters
Name | Type | Description | Default
---|---|---|---
name | str | The name of the table to retrieve. | required
schema | sch.Schema \| None | An optional schema for the table. | None

Returns

Type | Description
---|---
Table | A table expression.
to_pyarrow_batches
to_pyarrow_batches(self, expr, *, params=None, limit=None, chunk_size=1000000, **kwargs)
Execute expression and return a RecordBatchReader.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default
---|---|---|---
expr | ir.Expr | Ibis expression to export to PyArrow. | required
limit | int \| str \| None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None
params | Mapping[ir.Scalar, Any] \| None | Mapping of scalar parameter expressions to value. | None
chunk_size | int | Maximum number of rows in each returned record batch. | 1000000
kwargs | Any | Keyword arguments. | {}

Returns

Type | Description
---|---
RecordBatchReader | The results as a PyArrow record batch reader.