PySpark

Install

Install ibis and dependencies for the PySpark backend:

pip install 'ibis-framework[pyspark]'
conda install -c conda-forge ibis-pyspark
mamba install -c conda-forge ibis-pyspark

Connect

API

Create a client by passing a SparkSession instance to ibis.pyspark.connect.

See ibis.backends.pyspark.Backend.do_connect for connection parameter information.

ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.

Connection Parameters

do_connect(self, session)

Create a PySpark Backend for use with Ibis.

Parameters:

Name Type Description Default
session SparkSession

A SparkSession instance

required

Examples:

>>> import ibis
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)
<ibis.backends.pyspark.Backend at 0x...>
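
For a self-contained starting point, the connection step can be wrapped in a small helper. This is only a sketch: the helper name connect_local_spark, the application name, and the local[1] master setting are assumptions made for illustration and are not part of Ibis. The imports are deferred into the function so the definition can be loaded on a machine without Spark installed.

```python
def connect_local_spark(app_name="ibis-pyspark-example"):
    """Build a local SparkSession and wrap it in an Ibis PySpark backend.

    Hypothetical helper; the name and defaults are illustrative only.
    """
    # Deferred imports: both packages are only needed when the helper is called.
    import ibis
    from pyspark.sql import SparkSession

    session = (
        SparkSession.builder
        .master("local[1]")  # assumption: run Spark locally with one worker thread
        .appName(app_name)
        .getOrCreate()
    )
    # connect takes the SparkSession as its only required argument.
    return ibis.pyspark.connect(session)
```

Calling connect_local_spark() should return a backend object like the one shown in the example above, provided pyspark and the Ibis PySpark extras are installed.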

Backend API

Backend (BaseSQLBackend)

Attributes

current_database property readonly

Return the name of the current database.

Backends that don't support different databases will return None.

Returns:

Type Description
str | None

Name of the current database.

db_identity: str cached inherited property writable

Return the identity of the database.

Multiple connections to the same database will return the same value for db_identity.

The default implementation assumes connection parameters uniquely specify the database.

Returns:

Type Description
str

Database identity

tables cached inherited property writable

An accessor for tables in the database.

Tables may be accessed by name using either index or attribute access:

Examples:

>>> con = ibis.sqlite.connect("example.db")
>>> people = con.tables['people']  # access via index
>>> people = con.tables.people  # access via attribute
version property readonly

Return the version of the backend engine.

For database servers, return the server version.

For others, such as SQLite and pandas, return the version of the underlying library or application.

Returns:

Type Description
str

The backend version

Classes

Options (BaseModel) pydantic-model
Attributes
treat_nan_as_null: bool pydantic-field

Treat NaNs in floating point expressions as NULL.

Methods

add_operation(self, operation) inherited

Add a translation function to the backend for a specific operation.

Operations are defined in ibis.expr.operations, and a translation function receives the translator object and an expression as parameters, and returns a value depending on the backend. For example, in SQL backends, a NullLiteral operation could be translated to the string "NULL".

Examples:

>>> @ibis.sqlite.add_operation(ibis.expr.operations.NullLiteral)
... def _null_literal(translator, expression):
...     return 'NULL'
close(self)

Close Spark connection and drop any temporary objects.

compile(self, expr, timecontext=None, params=None, *args, **kwargs)

Compile an ibis expression to a PySpark DataFrame object.

compute_stats(self, name, database=None, noscan=False)

Issue a COMPUTE STATISTICS command for a given table.

Parameters:

Name Type Description Default
name str

Table name

required
database str | None

Database name

None
noscan bool

If True, collect only basic statistics for the table (number of rows, size in bytes).

False
connect(self, *args, **kwargs) inherited

Connect to the database.

Parameters:

Name Type Description Default
args None

Connection parameters

()
kwargs None

Additional connection parameters

{}

Returns:

Type Description
BaseBackend

An instance of the backend

create_database(self, name, path=None, force=False)

Create a new Spark database.

Parameters:

Name Type Description Default
name str

Database name

required
path str | Path | None

Path where the database data is stored; if omitted, the Spark default location is used

None
create_table(self, table_name, obj=None, schema=None, database=None, force=False, format='parquet')

Create a new table in Spark.

Parameters:

Name Type Description Default
table_name str

Table name

required
obj ir.Table | pd.DataFrame | None

If passed, creates table from select statement results

None
schema sch.Schema | None

Mutually exclusive with obj, creates an empty table with a schema

None
database str | None

Database name

None
force bool

If True, do not raise an error if a table with the indicated name already exists

False
format str

Table format

'parquet'
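
The parameters above can be combined into a short create, read, and clean-up round trip. The sketch below is an assumption-laden illustration: create_and_read_back is a hypothetical helper, con is taken to be a backend created with ibis.pyspark.connect, and table_expr is any Ibis table expression. It uses only create_table, table, execute, and drop_table as listed on this page.

```python
def create_and_read_back(con, table_name, table_expr):
    """Create a table from an expression, read it back, then drop it.

    Hypothetical helper; `con` is assumed to be an Ibis PySpark backend.
    """
    # Materialize the expression as a new table (default format: parquet).
    con.create_table(table_name, table_expr)
    try:
        # Look the new table up by name and execute it to fetch its rows.
        return con.execute(con.table(table_name))
    finally:
        # Always remove the table again, even if reading it back failed.
        con.drop_table(table_name)
```

The try/finally keeps the catalog clean when the read fails, which is mostly useful in tests and throwaway scripts.
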

Examples:

>>> con.create_table('new_table_name', table_expr)
create_view(self, name, expr, database=None, can_exist=False, temporary=False)

Create a Spark view from a table expression.

Parameters:

Name Type Description Default
name str

View name

required
expr ir.Table

Expression to use for the view

required
database str | None

Database name

None
can_exist bool

Replace an existing view of the same name if it exists

False
temporary bool

Whether the view is temporary

False
database(self, name=None) inherited

Return a Database object for the database called name.

DEPRECATED: database is deprecated; use equivalent methods in the backend

Parameters:

Name Type Description Default
name str | None

Name of the database to return the object for.

None

Returns:

Type Description
Database

A database object for the specified database.

drop_database(self, name, force=False)

Drop a Spark database.

Parameters:

Name Type Description Default
name str

Database name

required
force bool

If False, Spark throws an exception if the database is not empty or does not exist

False
drop_table(self, name, database=None, force=False)

Drop a table.

drop_table_or_view(self, name, database=None, force=False)

Drop a Spark table or view.

Parameters:

Name Type Description Default
name str

Table or view name

required
database str | None

Database name

None
force bool

If False, the database may throw an exception if the table or view does not exist

False

Examples:

>>> table = 'my_table'
>>> db = 'operations'
>>> con.drop_table_or_view(table, db, force=True)
drop_view(self, name, database=None, force=False)

Drop a view.

execute(self, expr, timecontext=None, params=None, limit='default', **kwargs)

Execute an expression.
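
The difference between compile and execute can be seen by running both on the same expression. The following is a minimal sketch, assuming con is a backend created with ibis.pyspark.connect and that a table with the given name exists; compile_and_execute is a hypothetical helper and the five-row limit is arbitrary.

```python
def compile_and_execute(con, table_name):
    """Return both the compiled and the executed form of a small query.

    Hypothetical helper; `con` is assumed to be an Ibis PySpark backend.
    """
    # Build a lazy Ibis expression: nothing is computed yet.
    expr = con.table(table_name).limit(5)

    # compile turns the expression into a PySpark DataFrame object.
    spark_df = con.compile(expr)

    # execute runs the expression and returns the materialized result.
    result = con.execute(expr)
    return spark_df, result
```

Compiling is handy for inspecting or further transforming the query with Spark's own API, while execute is the usual way to pull results back into Python.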

exists_database(self, name) inherited

Return whether a database name exists in the current connection.

DEPRECATED: exists_database is deprecated as of v2.0; use name in client.list_databases()

Parameters:

Name Type Description Default
name str

Database to check for existence

required

Returns:

Type Description
bool

Whether name exists

exists_table(self, name, database=None) inherited

Return whether a table name exists in the database.

DEPRECATED: exists_table is deprecated as of v2.0; use name in client.list_tables()

Parameters:

Name Type Description Default
name str

Table name

required
database str | None

Database to check if given

None

Returns:

Type Description
bool

Whether name is a table
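
Since exists_database and exists_table are deprecated in favor of membership tests, the replacements suggested above can be written as two small functions. This is a sketch rather than an Ibis API: database_exists and table_exists are hypothetical names, and con stands for any backend that provides list_databases and list_tables.

```python
def database_exists(con, name):
    """Replacement for the deprecated exists_database."""
    return name in con.list_databases()


def table_exists(con, name, database=None):
    """Replacement for the deprecated exists_table."""
    # database=None means "look in the current database".
    return name in con.list_tables(database=database)
```

Both helpers list every name before testing membership, so on catalogs with very many tables it may be worth passing a like pattern to narrow the listing first.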

explain(self, expr, params=None) inherited

Explain an expression.

Return the query plan associated with the indicated expression or SQL query.

Returns:

Type Description
str

Query plan

has_operation(cls, operation)

Return whether the backend implements support for operation.

Parameters:

Name Type Description Default
operation type[ops.Value]

A class corresponding to an operation.

required

Examples:

>>> import ibis
>>> import ibis.expr.operations as ops
>>> ibis.sqlite.has_operation(ops.ArrayIndex)
False
>>> ibis.postgres.has_operation(ops.ArrayIndex)
True

Returns:

Type Description
bool

Whether the backend implements the operation.
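
Because has_operation answers one operation at a time, checking a whole set of operations before porting a query can be done with a short loop. The helper below is a hypothetical sketch; backend is assumed to be anything exposing has_operation, such as ibis.pyspark.

```python
def unsupported_operations(backend, operations):
    """Return the operation classes that `backend` reports as unimplemented.

    Hypothetical helper built on has_operation; not part of Ibis.
    """
    return [op for op in operations if not backend.has_operation(op)]
```

A call such as unsupported_operations(ibis.pyspark, [ops.ArrayIndex]) would then return an empty list when everything in the list is supported.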

insert(self, table_name, obj=None, database=None, overwrite=False, values=None, validate=True)

Insert data into an existing table.

Examples:

>>> table = 'my_table'
>>> con.insert(table, table_expr)
>>> # Completely overwrite contents
>>> con.insert(table, table_expr, overwrite=True)
list_databases(self, like=None)

List existing databases in the current connection.

Parameters:

Name Type Description Default
like None

A pattern in Python's regex format to filter returned database names.

None

Returns:

Type Description
list[str]

The database names that exist in the current connection, that match the like pattern if provided.

list_tables(self, like=None, database=None)

Return the list of table names in the current database.

For some backends, the tables may be files in a directory, or other equivalent entities in a SQL database.

Parameters:

Name Type Description Default
like str

A pattern in Python's regex format.

None
database str

The database to list tables of, if not the current one.

None

Returns:

Type Description
list[str]

The list of the table names that match the pattern like.
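
The like argument of list_tables and list_databases is a Python regular expression, not a SQL LIKE pattern. The standalone sketch below shows how such a pattern selects names, using only the standard library. It assumes the pattern may match anywhere in the name (re.search); the exact matching rule used by the backend is not stated on this page, and filter_names is a hypothetical stand-in rather than Ibis code.

```python
import re


def filter_names(names, like=None):
    """Keep the names matching the regex `like`; keep all names if it is None."""
    if like is None:
        return list(names)
    pattern = re.compile(like)
    # Assumption: a match anywhere in the name counts (re.search semantics).
    return [name for name in names if pattern.search(name)]


tables = ["events_2021", "events_2022", "users"]
print(filter_names(tables, like=r"^events_"))  # ['events_2021', 'events_2022']
print(filter_names(tables, like=r"2022$"))     # ['events_2022']
```

Anchoring the pattern with ^ or $ avoids surprises if the backend matches substrings, since an anchored pattern behaves the same under either rule.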

raw_sql(self, query)

Execute a query string.

Could have unexpected results if the query modifies the behavior of the session in a way unknown to Ibis; be careful.

Parameters:

Name Type Description Default
query str

DML or DDL statement

required

Returns:

Type Description
_PySparkCursor

Backend cursor

set_database(self, name)

DEPRECATED: set_database is deprecated as of v2.0; use a new connection to the database

sql(self, query) inherited

Convert a SQL query to an Ibis table expression.

Parameters:

Name Type Description Default
query str

SQL string

required

Returns:

Type Description
ir.Table

Table expression

table(self, name, database=None)

Return a table expression from a table or view in the database.

Parameters:

Name Type Description Default
name str

Table name

required
database str | None

Database in which the table resides

None

Returns:

Type Description
ir.Table

Table named name from database

truncate_table(self, table_name, database=None)

Delete all rows from an existing table.

Parameters:

Name Type Description Default
table_name str

Table name

required
database str | None

Database name

None
verify(self, expr, params=None) inherited

Verify expr is an expression that can be compiled.

DEPRECATED: verify is deprecated as of v2.0; compile and capture TranslationError instead
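
The replacement suggested by the deprecation notice, compiling and catching TranslationError, can be sketched as follows. The helper name can_compile is hypothetical, and the import path ibis.common.exceptions is an assumption about where TranslationError is defined; the import is deferred so the definition loads without Ibis installed.

```python
def can_compile(con, expr, params=None):
    """Return True if `con` can compile `expr`, False on a TranslationError.

    Hypothetical stand-in for the deprecated verify method.
    """
    # Assumption: TranslationError is exposed by ibis.common.exceptions.
    from ibis.common.exceptions import TranslationError

    try:
        con.compile(expr, params=params)
    except TranslationError:
        return False
    return True
```

Other exception types are deliberately left uncaught, so failures unrelated to translation still surface to the caller.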


Last update: March 1, 2022