PySpark
https://spark.apache.org/docs/latest/api/python
Install
Install Ibis and dependencies for the PySpark backend:
Install with the pyspark extra:
pip install 'ibis-framework[pyspark]'
And connect:
import ibis
con = ibis.pyspark.connect()
1. Adjust connection parameters as needed.
Install for PySpark:
conda install -c conda-forge ibis-pyspark
And connect:
import ibis
con = ibis.pyspark.connect()
1. Adjust connection parameters as needed.
Install for PySpark:
mamba install -c conda-forge ibis-pyspark
And connect:
import ibis
con = ibis.pyspark.connect()
1. Adjust connection parameters as needed.
Connect
ibis.pyspark.connect
con = ibis.pyspark.connect(session=session)
ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.
Connection Parameters
do_connect
do_connect(self, session=None, mode='batch', **kwargs)
Create a PySpark Backend for use with Ibis.
Parameters
Name | Type | Description | Default |
---|---|---|---|
session | SparkSession | None | A SparkSession instance. | None |
mode | ConnectionMode | Can be either “batch” or “streaming”. If “batch”, every source, sink, and query executed within this connection will be interpreted as a batch workload. If “streaming”, every source, sink, and query executed within this connection will be interpreted as a streaming workload. | 'batch' |
kwargs | Additional keyword arguments used to configure the SparkSession. | {} |
Examples
>>> import ibis
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)
<ibis.backends.pyspark.Backend at 0x...>
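The same wrapper accepts the mode argument described in the parameters table above; a minimal sketch of opening a streaming-mode connection, reusing the session from the example:
>>> con = ibis.pyspark.connect(session, mode="streaming")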
ibis.connect
URL format
In addition to ibis.pyspark.connect, you can also connect to PySpark by passing a properly-formatted PySpark connection URL to ibis.connect:
= ibis.connect(f"pyspark://{warehouse-dir}?spark.app.name=CountingSheep&spark.master=local[2]") con
pyspark.Backend
compile
compile(self, expr, limit=None, params=None, pretty=False)
Compile an Ibis expression to a SQL string.
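A minimal sketch of compiling an expression to SQL, assuming con is connected and a table named penguins exists (hypothetical name):
>>> sql_text = con.compile(con.table("penguins").count())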
compute_stats
compute_stats(self, name, database=None, noscan=False)
Issue a COMPUTE STATISTICS command for a given table.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name | required |
database | str | None | Database name | None |
noscan | bool | If True, collect only basic statistics for the table (number of rows, size in bytes). | False |
connect
connect(self, *args, **kwargs)
Connect to the database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
*args | Mandatory connection parameters, see the docstring of do_connect for details. | () |
**kwargs | Extra connection parameters, see the docstring of do_connect for details. | {} |
Notes
This creates a new backend instance with saved args and kwargs, then calls reconnect and finally returns the newly created and connected backend instance.
Returns
Name | Type | Description |
---|---|---|
BaseBackend | An instance of the backend |
create_database
create_database(self, name, *, catalog=None, path=None, force=False)
Create a new Spark database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Database name | required |
catalog | str | None | Catalog to create database in (defaults to current_catalog) | None |
path | str | Path | None | Path where to store the database data; otherwise uses Spark default | None |
force | bool | Whether to append IF NOT EXISTS to the database creation SQL | False |
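A minimal sketch, assuming con is a PySpark backend connection; the database name is hypothetical:
>>> con.create_database("analytics", force=True)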
create_table
create_table(self, name, obj=None, *, schema=None, database=None, temp=None, overwrite=False, format='parquet')
Create a new table in Spark.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the new table. | required |
obj | ir.Table | pd.DataFrame | pa.Table | pl.DataFrame | pl.LazyFrame | None | If passed, creates table from SELECT statement results | None |
schema | sch.SchemaLike | None | Mutually exclusive with obj, creates an empty table with a schema | None |
database | str | None | Database name. To specify a table in a separate catalog, you can pass in the catalog and database as a string "catalog.database", or as a tuple of strings ("catalog", "database"). | None |
temp | bool | None | Whether the new table is temporary (unsupported) | None |
overwrite | bool | If True, overwrite existing data | False |
format | str | Format of the table on disk | 'parquet' |
Returns
Name | Type | Description |
---|---|---|
Table | The newly created table. |
Examples
>>> con.create_table("new_table_name", table_expr) # quartodoc: +SKIP
create_view
create_view(self, name, obj, *, database=None, overwrite=False)
Create a temporary Spark view from a table expression.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | View name | required |
obj | ir.Table | Expression to use for the view | required |
database | str | None | Database name | None |
overwrite | bool | Replace an existing view of the same name if it exists | False |
Returns
Name | Type | Description |
---|---|---|
Table | The created view |
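A minimal sketch, assuming an existing table named penguins (hypothetical name):
>>> t = con.table("penguins")
>>> con.create_view("penguins_view", t, overwrite=True)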
disconnect
disconnect(self)
Close the connection to the backend.
drop_database
drop_database(self, name, *, catalog=None, force=False)
Drop a Spark database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Database name | required |
catalog | str | None | Catalog containing database to drop (defaults to current_catalog) | None |
force | bool | If False, Spark raises an exception if the database is not empty or does not exist | False |
drop_table
drop_table(self, name, database=None, force=False)
Drop a table.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the table to drop. | required |
database | str | None | Name of the database where the table exists, if not the default. | None |
force | bool | If False, an exception is raised if the table does not exist. | False |
drop_view
drop_view(self, name, *, database=None, force=False)
Drop a view.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the view to drop. | required |
database | str | None | Name of the database where the view exists, if not the default. | None |
force | bool | If False, an exception is raised if the view does not exist. | False |
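A minimal sketch of dropping a view and a table; the names are hypothetical and force=True suppresses the error if they do not exist:
>>> con.drop_view("penguins_view", force=True)
>>> con.drop_table("penguins_staging", force=True)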
execute
execute(self, expr, params=None, limit='default', **kwargs)
Execute an expression.
from_connection
from_connection(cls, session, mode='batch', **kwargs)
Create a PySpark Backend from an existing SparkSession instance.
Parameters
Name | Type | Description | Default |
---|---|---|---|
session | SparkSession | A SparkSession instance. | required |
mode | ConnectionMode | Can be either “batch” or “streaming”. If “batch”, every source, sink, and query executed within this connection will be interpreted as a batch workload. If “streaming”, every source, sink, and query executed within this connection will be interpreted as a streaming workload. | 'batch' |
kwargs | Additional keyword arguments used to configure the SparkSession. | {} |
get_schema
get_schema(self, table_name, *, catalog=None, database=None)
Return a Schema object for the indicated table and database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. May be fully qualified | required |
catalog | str | None | Catalog to use | None |
database | str | None | Database to use; defaults to the active database. | None |
Returns
Name | Type | Description |
---|---|---|
Schema | An ibis schema |
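A minimal sketch, assuming a table named penguins exists (hypothetical name):
>>> schema = con.get_schema("penguins")
>>> schema.names  # tuple of column names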
has_operation
has_operation(cls, operation)
Return whether the backend implements support for operation.
Parameters
Name | Type | Description | Default |
---|---|---|---|
operation | type[ops.Value] | A class corresponding to an operation. | required |
Returns
Name | Type | Description |
---|---|---|
bool | Whether the backend implements the operation. |
Examples
>>> import ibis
>>> import ibis.expr.operations as ops
>>> ibis.sqlite.has_operation(ops.ArrayIndex)
False
>>> ibis.postgres.has_operation(ops.ArrayIndex)
True
insert
insert(self, table_name, obj, database=None, overwrite=False)
Insert data into a table.
Ibis does not use the word schema to refer to database hierarchy.
A collection of tables is referred to as a database. A collection of databases is referred to as a catalog.
These terms are mapped onto the corresponding features in each backend (where available), regardless of whether the backend itself uses the same terminology.
Parameters
Name | Type | Description | Default |
---|---|---|---|
table_name | str | The name of the table into which the data will be inserted | required |
obj | pd.DataFrame | ir.Table | list | dict | The source data or expression to insert | required |
database | str | None | Name of the attached database that the table is located in. For backends that support multi-level table hierarchies, you can pass in a dotted string path like "catalog.database" or a tuple of strings like ("catalog", "database"). | None |
overwrite | bool | If True then replace existing contents of table | False |
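A minimal sketch of inserting a pandas DataFrame whose columns match the target table; the table and column names are hypothetical:
>>> import pandas as pd
>>> df = pd.DataFrame({"species": ["Gentoo"], "year": [2009]})
>>> con.insert("penguins", df)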
list_catalogs
list_catalogs(self, like=None)
List existing catalogs in the current connection.
Ibis does not use the word schema to refer to database hierarchy.
A collection of tables is referred to as a database. A collection of databases is referred to as a catalog.
These terms are mapped onto the corresponding features in each backend (where available), regardless of whether the backend itself uses the same terminology.
Parameters
Name | Type | Description | Default |
---|---|---|---|
like | str | None | A pattern in Python’s regex format to filter returned catalog names. | None |
Returns
Name | Type | Description |
---|---|---|
list[str] | The catalog names that exist in the current connection, that match the like pattern if provided. |
list_databases
list_databases(self, like=None, catalog=None)
List existing databases in the current connection.
Ibis does not use the word schema to refer to database hierarchy.
A collection of tables is referred to as a database. A collection of databases is referred to as a catalog.
These terms are mapped onto the corresponding features in each backend (where available), regardless of whether the backend itself uses the same terminology.
Parameters
Name | Type | Description | Default |
---|---|---|---|
like | str | None | A pattern in Python’s regex format to filter returned database names. | None |
catalog | str | None | The catalog to list databases from. If None, the current catalog is searched. | None |
Returns
Name | Type | Description |
---|---|---|
list[str] | The database names that exist in the current connection, that match the like pattern if provided. |
list_tables
list_tables(self, like=None, database=None)
List the tables in the database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
like | str | None | A pattern to use for listing tables. | None |
database | str | None | Database to list tables from. Default behavior is to show tables in the current catalog and database. To specify a table in a separate catalog, you can pass in the catalog and database as a string "catalog.database", or as a tuple of strings ("catalog", "database"). | None |
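A minimal sketch; the regex pattern is hypothetical:
>>> con.list_tables()
>>> con.list_tables(like="penguin.*")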
raw_sql
raw_sql(self, query, **kwargs)
read_csv
read_csv(self, source_list, table_name=None, **kwargs)
Register a CSV file as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
source_list | str | list[str] | tuple[str] | The data source(s). May be a path to a file or directory of CSV files, or an iterable of CSV files. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
kwargs | Any | Additional keyword arguments passed to PySpark loading function. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
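A minimal sketch, assuming a CSV file at the hypothetical path shown; header is forwarded to the PySpark CSV reader:
>>> t = con.read_csv("data/penguins.csv", table_name="penguins", header=True)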
read_csv_dir
read_csv_dir(self, path, table_name=None, watermark=None, **kwargs)
Register a CSV directory as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | Path | The data source. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
watermark | Watermark | None | Watermark strategy for the table. | None |
kwargs | Any | Additional keyword arguments passed to PySpark loading function. https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.csv.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
read_delta
read_delta(self, path, table_name=None, **kwargs)
Register a Delta Lake table as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | Path | The path to the Delta Lake table. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
kwargs | Any | Additional keyword arguments passed to PySpark. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.load.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
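A minimal sketch, assuming a Delta Lake table at the hypothetical path shown and a Spark session with the Delta connector available:
>>> events = con.read_delta("/data/events_delta", table_name="events")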
read_json
read_json(self, source_list, table_name=None, **kwargs)
Register a JSON file as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
source_list | str | Sequence[str] | The data source(s). May be a path to a file or directory of JSON files, or an iterable of JSON files. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
kwargs | Any | Additional keyword arguments passed to PySpark loading function. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.json.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
read_json_dir
read_json_dir(self, path, table_name=None, watermark=None, **kwargs)
Register a JSON directory as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | Path | The data source. A directory of JSON files. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
watermark | Watermark | None | Watermark strategy for the table. | None |
kwargs | Any | Additional keyword arguments passed to PySpark loading function. https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.json.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
read_kafka
read_kafka(self, table_name=None, *, watermark=None, auto_parse=False, schema=None, options=None)
Register a Kafka topic as a table.
Parameters
Name | Type | Description | Default |
---|---|---|---|
table_name | str | None | An optional name to use for the created table. This defaults to a sequentially generated name. | None |
watermark | Watermark | None | Watermark strategy for the table. | None |
auto_parse | bool | Whether to parse Kafka messages automatically. If False, the source is read as binary keys and values. If True, the key is discarded and the value is parsed using the provided schema. | False |
schema | sch.Schema | None | Schema of the value of the Kafka messages. | None |
options | Mapping[str, str] | None | Additional arguments passed to PySpark as .option(“key”, “value”). https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html | None |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
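A minimal sketch, assuming a streaming-mode connection and a reachable Kafka broker; the broker address, topic, and schema are hypothetical, and the option keys follow Spark's Kafka integration guide linked above:
>>> sch = ibis.schema({"user_id": "string", "amount": "float64"})
>>> t = con.read_kafka(
...     table_name="purchases",
...     auto_parse=True,
...     schema=sch,
...     options={"kafka.bootstrap.servers": "localhost:9092", "subscribe": "purchases"},
... )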
read_parquet
read_parquet(self, path, table_name=None, **kwargs)
Register a parquet file as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | Path | The data source. May be a path to a file or directory of parquet files. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
kwargs | Any | Additional keyword arguments passed to PySpark. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
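A minimal sketch, assuming a parquet file at the hypothetical path shown:
>>> t = con.read_parquet("data/penguins.parquet", table_name="penguins")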
read_parquet_dir
read_parquet_dir(self, path, table_name=None, watermark=None, schema=None, **kwargs)
Register a parquet directory as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
path | str | Path | The data source. A directory of parquet files. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
watermark | Watermark | None | Watermark strategy for the table. | None |
schema | sch.Schema | None | Schema of the parquet source. | None |
kwargs | Any | Additional keyword arguments passed to PySpark. https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.parquet.html | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
reconnect
reconnect(self)
Reconnect to the database already configured with connect.
register
register(self, source, table_name=None, **kwargs)
Register a data source as a table in the current database.
Parameters
Name | Type | Description | Default |
---|---|---|---|
source | str | Path | Any | The data source(s). May be a path to a file or directory of parquet/csv files, or an iterable of CSV files. | required |
table_name | str | None | An optional name to use for the created table. This defaults to a randomly generated name. | None |
**kwargs | Any | Additional keyword arguments passed to PySpark loading functions for CSV or parquet. | {} |
Returns
Name | Type | Description |
---|---|---|
ir.Table | The just-registered table |
register_options
register_options(cls)
Register custom backend options.
rename_table
rename_table(self, old_name, new_name)
Rename an existing table.
Parameters
Name | Type | Description | Default |
---|---|---|---|
old_name | str | The old name of the table. | required |
new_name | str | The new name of the table. | required |
sql
sql(self, query, schema=None, dialect=None)
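sql wraps a SQL query string as an Ibis table expression; a minimal sketch with a trivial query, assuming con is connected:
>>> one = con.sql("SELECT 1 AS one")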
table
table(self, name, database=None)
Construct a table expression.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name | required |
database | tuple[str, str] | str | None | Database name | None |
Returns
Name | Type | Description |
---|---|---|
Table | Table expression |
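A minimal sketch, assuming a table named penguins; the catalog and database names are hypothetical:
>>> t = con.table("penguins")
>>> t2 = con.table("penguins", database=("spark_catalog", "analytics"))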
to_csv
to_csv(self, expr, path, *, params=None, **kwargs)
Write the results of executing the given expression to a CSV file.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Table | The ibis expression to execute and persist to CSV. | required |
path | str | Path | The destination. A string or Path to the CSV file to write. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
kwargs | Any | Additional keyword arguments passed to pyarrow.csv.CSVWriter | {} |
to_csv_dir
to_csv_dir(self, expr, path, params=None, limit=None, options=None)
Write the results of executing the given expression to a CSV directory.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | The ibis expression to execute and persist to CSV. | required |
path | str | Path | The destination. A string or Path to the CSV directory to write. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
options | Mapping[str, str] | None | Additional keyword arguments passed to pyspark.sql.streaming.DataStreamWriter | None |
Returns
Name | Type | Description |
---|---|---|
StreamingQuery | None | Returns a Pyspark StreamingQuery object if in streaming mode, otherwise None |
to_delta
to_delta(self, expr, path, params=None, limit=None, **kwargs)
Write the results of executing the given expression to a Delta Lake table.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Table | The ibis expression to execute and persist to a Delta Lake table. | required |
path | str | Path | The destination. A string or Path to the Delta Lake table to write. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
**kwargs | Any | Additional keyword arguments passed to pyspark.sql.DataFrameWriter. | {} |
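A minimal sketch, assuming an existing table named penguins and a Spark session with the Delta connector available; the output path is hypothetical:
>>> con.to_delta(con.table("penguins"), "/tmp/penguins_delta")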
to_kafka
to_kafka(self, expr, *, auto_format=False, options=None, params=None, limit='default')
Write the results of executing the given expression to a Kafka topic.
This method does not return outputs. Streaming queries are run continuously in the background.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | The ibis expression to execute and persist to a Kafka topic. | required |
auto_format | bool | Whether to format the Kafka messages before writing. If False, the output is written as-is. If True, the output is converted into JSON and written as the value of the Kafka messages. | False |
options | Mapping[str, str] | None | PySpark Kafka write arguments. https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html | None |
params | Mapping | None | Mapping of scalar parameter expressions to value. | None |
limit | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | 'default' |
Returns
Name | Type | Description |
---|---|---|
StreamingQuery | A Pyspark StreamingQuery object |
to_pandas
to_pandas(self, expr, *, params=None, limit=None, **kwargs)
Execute an Ibis expression and return a pandas DataFrame, Series, or scalar.
This method is a wrapper around execute.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | Ibis expression to execute. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
kwargs | Any | Keyword arguments | {} |
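A minimal sketch, assuming a table named penguins exists (hypothetical name):
>>> df = con.to_pandas(con.table("penguins"), limit=10)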
to_pandas_batches
to_pandas_batches(self, expr, *, params=None, limit=None, chunk_size=1000000, **kwargs)
Execute an Ibis expression and return an iterator of pandas DataFrames.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | Ibis expression to execute. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
chunk_size | int | Maximum number of rows in each returned DataFrame batch. This may have no effect depending on the backend. | 1000000 |
kwargs | Any | Keyword arguments | {} |
Returns
Name | Type | Description |
---|---|---|
Iterator[pd.DataFrame] | An iterator of pandas DataFrames. |
to_parquet
to_parquet(self, expr, path, *, params=None, **kwargs)
Write the results of executing the given expression to a parquet file.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Table | The ibis expression to execute and persist to parquet. | required |
path | str | Path | The destination. A string or Path to the parquet file to write. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
**kwargs | Any | Additional keyword arguments passed to pyarrow.parquet.ParquetWriter | {} |
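A minimal sketch, assuming an existing table named penguins and a hypothetical output path:
>>> con.to_parquet(con.table("penguins"), "/tmp/penguins.parquet")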
to_parquet_dir
to_parquet_dir(self, expr, path, params=None, limit=None, options=None)
Write the results of executing the given expression to a parquet directory.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | The ibis expression to execute and persist to parquet. | required |
path | str | Path | The destination. A string or Path to the parquet directory to write. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
options | Mapping[str, str] | None | Additional keyword arguments passed to pyspark.sql.streaming.DataStreamWriter | None |
Returns
Name | Type | Description |
---|---|---|
StreamingQuery | None | Returns a Pyspark StreamingQuery object if in streaming mode, otherwise None |
to_polars
to_polars(self, expr, *, params=None, limit=None, **kwargs)
Execute expression and return results as a polars DataFrame.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | Ibis expression to export to polars. | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
kwargs | Any | Keyword arguments | {} |
Returns
Name | Type | Description |
---|---|---|
dataframe | A polars DataFrame holding the results of the executed expression. |
to_pyarrow
to_pyarrow(self, expr, params=None, limit=None, **kwargs)
Execute expression and return results as a pyarrow table.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | Ibis expression to export to pyarrow | required |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
kwargs | Any | Keyword arguments | {} |
Returns
Name | Type | Description |
---|---|---|
Table | A pyarrow table holding the results of the executed expression. |
to_pyarrow_batches
to_pyarrow_batches(self, expr, *, params=None, limit=None, chunk_size=1000000, **kwargs)
Execute expression and return an iterator of pyarrow record batches.
This method is eager and will execute the associated expression immediately.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | Ibis expression to export to pyarrow | required |
limit | int | str | None | An integer to effect a specific row limit. A value of None means “no limit”. The default is in ibis/config.py. | None |
params | Mapping[ir.Scalar, Any] | None | Mapping of scalar parameter expressions to value. | None |
chunk_size | int | Maximum number of rows in each returned record batch. | 1000000 |
Returns
Name | Type | Description |
---|---|---|
RecordBatchReader | Collection of pyarrow RecordBatches. |
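A minimal sketch, assuming a table named penguins exists (hypothetical name); the reader is iterated batch by batch:
>>> reader = con.to_pyarrow_batches(con.table("penguins"), chunk_size=100_000)
>>> for batch in reader:
...     print(batch.num_rows)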
to_torch
to_torch(self, expr, *, params=None, limit=None, **kwargs)
Execute an expression and return results as a dictionary of torch tensors.
Parameters
Name | Type | Description | Default |
---|---|---|---|
expr | ir.Expr | Ibis expression to execute. | required |
params | Mapping[ir.Scalar, Any] | None | Parameters to substitute into the expression. | None |
limit | int | str | None | An integer to effect a specific row limit. A value of None means no limit. | None |
kwargs | Any | Keyword arguments passed into the backend’s to_torch implementation. | {} |
Returns
Name | Type | Description |
---|---|---|
dict[str, torch.Tensor] | A dictionary of torch tensors, keyed by column name. |
truncate_table
truncate_table(self, name, database=None)
Delete all rows from a table.
Ibis does not use the word schema to refer to database hierarchy.
A collection of tables is referred to as a database. A collection of databases is referred to as a catalog. These terms are mapped onto the corresponding features in each backend (where available), regardless of whether the backend itself uses the same terminology.
Parameters
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name | required |
database | str | None | Name of the attached database that the table is located in. For backends that support multi-level table hierarchies, you can pass in a dotted string path like "catalog.database" or a tuple of strings like ("catalog", "database"). | None |