Skip to content

PySpark

Install

Install ibis and dependencies for the PySpark backend:

pip install 'ibis-framework[pyspark]'
conda install -c conda-forge ibis-pyspark
mamba install -c conda-forge ibis-pyspark

Connect

API

Create a client by passing in PySpark things to ibis.pyspark.connect.

See ibis.backends.pyspark.Backend.do_connect for connection parameter information.

ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.

Connection Parameters

do_connect(self, session)

Create a PySpark Backend for use with Ibis.

Parameters:

Name Type Description Default
session pyspark.sql.SparkSession

A SparkSession instance

required

Examples:

>>> import ibis
>>> import pyspark
>>> session = pyspark.sql.SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)

Backend API

Backend (BaseSQLBackend)

Attributes

current_database property readonly

Return the name of the current database.

Backends that don't support different databases will return None.

Returns:

Type Description
str | None

Name of the current database.

version property readonly

Return the version of the backend engine.

For database servers, return the server version.

For others such as SQLite and pandas return the version of the underlying library or application.

Returns:

Type Description
str

The backend version

Methods

add_operation(self, operation) inherited

Add a translation function to the backend for a specific operation.

Operations are defined in ibis.expr.operations, and a translation function receives the translator object and an expression as parameters, and returns a value depending on the backend. For example, in SQL backends, a NullLiteral operation could be translated to the string "NULL".

Examples:

>>> @ibis.sqlite.add_operation(ibis.expr.operations.NullLiteral)
... def _null_literal(translator, expression):
...     return 'NULL'
close(self)

Close Spark connection and drop any temporary objects.

compile(self, expr, timecontext=None, params=None, *args, **kwargs)

Compile an ibis expression to a PySpark DataFrame object.

compute_stats(self, name, database=None, noscan=False)

Issue a COMPUTE STATISTICS command for a given table.

Parameters:

Name Type Description Default
name str

Table name

required
database str | None

Database name

None
noscan bool

If True, collect only basic statistics for the table (number of rows, size in bytes).

False
connect(self, *args, **kwargs) inherited

Connect to the database.

Parameters:

Name Type Description Default
args None

Connection parameters

()
kwargs None

Additional connection parameters

{}

Returns:

Type Description
BaseBackend

An instance of the backend

create_database(self, name, path=None, force=False)

Create a new Spark database.

Parameters:

Name Type Description Default
name str

Database name

required
path str | Path | None

Path where to store the database data; otherwise uses Spark default

None
create_table(self, table_name, obj=None, schema=None, database=None, force=False, format='parquet')

Create a new table in Spark.

Parameters:

Name Type Description Default
table_name str

Table name

required
obj ir.TableExpr | pd.DataFrame | None

If passed, creates table from select statement results

None
schema sch.Schema | None

Mutually exclusive with obj, creates an empty table with a schema

None
database str | None

Database name

None
force bool

If true, create table if table with indicated name already exists

False
format str

Table format

'parquet'

Examples:

>>> con.create_table('new_table_name', table_expr)
create_view(self, name, expr, database=None, can_exist=False, temporary=False)

Create a Spark view from a table expression.

Parameters:

Name Type Description Default
name str

View name

required
expr ir.TableExpr

Expression to use for the view

required
database str | None

Database name

None
can_exist bool

Replace an existing view of the same name if it exists

False
temporary bool

Whether the table is temporary

False
database(self, name=None) inherited

Return a Database object for the name database.

DEPRECATED: database is deprecated; use equivalent methods in the backend

Parameters:

Name Type Description Default
name str | None

Name of the database to return the object for.

None

Returns:

Type Description
Database

A database object for the specified database.

drop_database(self, name, force=False)

Drop a Spark database.

Parameters:

Name Type Description Default
name str

Database name

required
force bool

If False, Spark throws exception if database is not empty or database does not exist

False
drop_table(self, name, database=None, force=False)

Drop a table.

drop_table_or_view(self, name, database=None, force=False)

Drop a Spark table or view.

Parameters:

Name Type Description Default
name str

Table or view name

required
database str | None

Database name

None
force bool

Database may throw exception if table does not exist

False

Examples:

>>> table = 'my_table'
>>> db = 'operations'
>>> con.drop_table_or_view(table, db, force=True)
drop_view(self, name, database=None, force=False)

Drop a view.

execute(self, expr, timecontext=None, params=None, limit='default', **kwargs)

Execute an expression.

exists_database(self, name) inherited

Return whether a database name exists in the current connection.

DEPRECATED: exists_database is deprecated as of v2.0; use name in client.list_databases()

Parameters:

Name Type Description Default
name str

Database to check for existence

required

Returns:

Type Description
bool

Whether name exists

exists_table(self, name, database=None) inherited

Return whether a table name exists in the database.

DEPRECATED: exists_table is deprecated as of v2.0; use name in client.list_tables()

Parameters:

Name Type Description Default
name str

Table name

required
database str | None

Database to check if given

None

Returns:

Type Description
bool

Whether name is a table

explain(self, expr, params=None) inherited

Explain an expression.

Return the query plan associated with the indicated expression or SQL query.

Returns:

Type Description
str

Query plan

insert(self, table_name, obj=None, database=None, overwrite=False, values=None, validate=True)

Insert data into an existing table.

Examples:

>>> table = 'my_table'
>>> con.insert(table, table_expr)
Completely overwrite contents
>>> con.insert(table, table_expr, overwrite=True)
list_databases(self, like=None)

List existing databases in the current connection.

Parameters:

Name Type Description Default
like None

A pattern in Python's regex format to filter returned database names.

None

Returns:

Type Description
list[str]

The database names that exist in the current connection, that match the like pattern if provided.

list_tables(self, like=None, database=None)

Return the list of table names in the current database.

For some backends, the tables may be files in a directory, or other equivalent entities in a SQL database.

Parameters:

Name Type Description Default
like str

A pattern in Python's regex format.

None
database str

The database to list tables of, if not the current one.

None

Returns:

Type Description
list[str]

The list of the table names that match the pattern like.

raw_sql(self, query)

Execute a query string.

Could have unexpected results if the query modifies the behavior of the session in a way unknown to Ibis; be careful.

Parameters:

Name Type Description Default
query str

DML or DDL statement

required

Returns:

Type Description
_PySparkCursor

Backend cursor

set_database(self, name)

DEPRECATED: set_database is deprecated as of v2.0; use a new connection to the database

sql(self, query) inherited

Convert a SQL query to an Ibis table expression.

Parameters:

Name Type Description Default
query str

SQL string

required

Returns:

Type Description
ir.TableExpr

Table expression

table(self, name, database=None)

Return a table expression from a table or view in the database.

Parameters:

Name Type Description Default
name str

Table name

required
database str | None

Database in which the table resides

None

Returns:

Type Description
ir.TableExpr

Table named name from database

truncate_table(self, table_name, database=None)

Delete all rows from an existing table.

Parameters:

Name Type Description Default
table_name str

Table name

required
database str | None

Database name

None
verify(self, expr, params=None) inherited

Verify expr is an expression that can be compiled.

DEPRECATED: verify is deprecated as of v2.0; compile and capture TranslationError instead


Last update: March 1, 2022