PySpark
Install
Install ibis and dependencies for the PySpark backend:
pip install 'ibis-framework[pyspark]'
conda install -c conda-forge ibis-pyspark
mamba install -c conda-forge ibis-pyspark
Connect
API
Create a client by passing a PySpark SparkSession to ibis.pyspark.connect.
See ibis.backends.pyspark.Backend.do_connect for connection parameter information.
ibis.pyspark.connect is a thin wrapper around ibis.backends.pyspark.Backend.do_connect.
Connection Parameters
do_connect(self, session)
Create a PySpark Backend for use with Ibis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
session | pyspark.sql.SparkSession | A SparkSession instance | required |
Examples:
>>> import ibis
>>> import pyspark
>>> session = pyspark.sql.SparkSession.builder.getOrCreate()
>>> ibis.pyspark.connect(session)
Backend API
Backend (BaseSQLBackend)
Attributes
current_database
property
readonly
Return the name of the current database.
Backends that don't support different databases will return None.
Returns:
Type | Description |
---|---|
str | None | Name of the current database. |
version
property
readonly
Return the version of the backend engine.
For database servers, return the server version.
For others, such as SQLite and pandas, return the version of the underlying library or application.
Returns:
Type | Description |
---|---|
str | The backend version |
Methods
add_operation(self, operation)
inherited
Add a translation function to the backend for a specific operation.
Operations are defined in ibis.expr.operations, and a translation function receives the translator object and an expression as parameters, and returns a value depending on the backend. For example, in SQL backends, a NullLiteral operation could be translated to the string "NULL".
Examples:
>>> @ibis.sqlite.add_operation(ibis.expr.operations.NullLiteral)
... def _null_literal(translator, expression):
...     return 'NULL'
close(self)
Close Spark connection and drop any temporary objects.
compile(self, expr, timecontext=None, params=None, *args, **kwargs)
Compile an Ibis expression to a PySpark DataFrame object.
compute_stats(self, name, database=None, noscan=False)
Issue a COMPUTE STATISTICS command for a given table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name | required |
database | str | None | Database name | None |
noscan | bool | If True, collect only basic statistics for the table (number of rows, size in bytes). | False |
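For orientation, Spark SQL exposes statistics collection through a statement of the form `ANALYZE TABLE ... COMPUTE STATISTICS [NOSCAN]`. The sketch below only builds such a string to show how the three parameters could map onto it. `compute_stats_sql` is a hypothetical helper, not part of Ibis, and the exact statement Ibis issues may differ.

```python
def compute_stats_sql(name, database=None, noscan=False):
    """Build a Spark SQL ANALYZE TABLE statement.

    Hypothetical helper for illustration only; not part of the Ibis API.
    """
    # Qualify the table with its database when one is given.
    qualified = f"`{database}`.`{name}`" if database else f"`{name}`"
    statement = f"ANALYZE TABLE {qualified} COMPUTE STATISTICS"
    # NOSCAN requests only the basic statistics that need no full table scan.
    if noscan:
        statement += " NOSCAN"
    return statement


print(compute_stats_sql("my_table", database="operations", noscan=True))
```

With a real connection, the equivalent call would be `con.compute_stats('my_table', database='operations', noscan=True)`, where `con` is assumed to come from `ibis.pyspark.connect(session)`.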
connect(self, *args, **kwargs)
inherited
Connect to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
args | None | Connection parameters | () |
kwargs | None | Additional connection parameters | {} |
Returns:
Type | Description |
---|---|
BaseBackend | An instance of the backend |
create_database(self, name, path=None, force=False)
Create a new Spark database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Database name | required |
path | str | Path | None | Path where the database data is stored; if omitted, the Spark default is used | None |
create_table(self, table_name, obj=None, schema=None, database=None, force=False, format='parquet')
Create a new table in Spark.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name | required |
obj | ir.TableExpr | pd.DataFrame | None | If passed, creates the table from the select statement results | None |
schema | sch.Schema | None | Mutually exclusive with obj; creates an empty table with this schema | None |
database | str | None | Database name | None |
force | bool | If True, create the table even if a table with the indicated name already exists | False |
format | str | Table format | 'parquet' |
Examples:
>>> con.create_table('new_table_name', table_expr)
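As a slightly fuller sketch of the same call, the helper below creates a table from an in-memory object and then reads it back as a table expression. `load_frame` is a hypothetical name, `con` is assumed to be a backend returned by `ibis.pyspark.connect`, and only the `create_table` and `table` parameters documented on this page are used.

```python
def load_frame(con, table_name, frame, database=None):
    """Create `table_name` from `frame` and return a table expression for it.

    Hypothetical helper; `con` is assumed to be an Ibis PySpark backend and
    `frame` a pandas DataFrame or an Ibis table expression.
    """
    # obj and schema are mutually exclusive, so only obj is passed here.
    con.create_table(table_name, obj=frame, database=database, format="parquet")
    # Look the new table up again so callers get an Ibis expression back.
    return con.table(table_name, database=database)
```

To create an empty table instead, pass `schema=` rather than `obj=`.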
create_view(self, name, expr, database=None, can_exist=False, temporary=False)
Create a Spark view from a table expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | View name | required |
expr | ir.TableExpr | Expression to use for the view | required |
database | str | None | Database name | None |
can_exist | bool | Replace an existing view of the same name if it exists | False |
temporary | bool | Whether the view is temporary | False |
database(self, name=None)
inherited
Return a Database object for the name database.
DEPRECATED: database is deprecated; use equivalent methods in the backend
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | None | Name of the database to return the object for. | None |
Returns:
Type | Description |
---|---|
Database | A database object for the specified database. |
drop_database(self, name, force=False)
Drop a Spark database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Database name | required |
force | bool | If False, Spark throws an exception if the database is not empty or does not exist | False |
drop_table(self, name, database=None, force=False)
Drop a table.
drop_table_or_view(self, name, database=None, force=False)
Drop a Spark table or view.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Table or view name | required |
database | str | None | Database name | None |
force | bool | If False, the database may throw an exception if the table does not exist | False |
Examples:
>>> table = 'my_table'
>>> db = 'operations'
>>> con.drop_table_or_view(table, db, force=True)
drop_view(self, name, database=None, force=False)
Drop a view.
execute(self, expr, timecontext=None, params=None, limit='default', **kwargs)
Execute an expression.
exists_database(self, name)
inherited
Return whether a database name exists in the current connection.
DEPRECATED: exists_database is deprecated as of v2.0; use name in client.list_databases()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Database to check for existence | required |
Returns:
Type | Description |
---|---|
bool | Whether |
exists_table(self, name, database=None)
inherited
Return whether a table name exists in the database.
DEPRECATED: exists_table is deprecated as of v2.0; use name in client.list_tables()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name | required |
database | str | None | Database to check if given | None |
Returns:
Type | Description |
---|---|
bool | Whether |
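The deprecation notes for exists_database and exists_table both point to membership tests on the list methods. A minimal sketch of those replacements follows, assuming `con` is a backend returned by `ibis.pyspark.connect`; the helper names `database_exists` and `table_exists` are hypothetical.

```python
def database_exists(con, name):
    """Replacement for the deprecated exists_database (hypothetical helper)."""
    # list_databases() returns a list of names, so a membership test suffices.
    return name in con.list_databases()


def table_exists(con, name, database=None):
    """Replacement for the deprecated exists_table (hypothetical helper)."""
    # Pass the database through so tables outside the current one can be checked.
    return name in con.list_tables(database=database)
```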
explain(self, expr, params=None)
inherited
Explain an expression.
Return the query plan associated with the indicated expression or SQL query.
Returns:
Type | Description |
---|---|
str | Query plan |
insert(self, table_name, obj=None, database=None, overwrite=False, values=None, validate=True)
Insert data into an existing table.
Examples:
>>> table = 'my_table'
>>> con.insert(table, table_expr)
>>> # Completely overwrite contents
>>> con.insert(table, table_expr, overwrite=True)
list_databases(self, like=None)
List existing databases in the current connection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
like | None | A pattern in Python's regex format to filter returned database names. | None |
Returns:
Type | Description |
---|---|
list[str] | The database names that exist in the current connection, that match the |
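To make the `like` parameter concrete, the sketch below filters a list of names with Python's `re` module. It assumes the pattern may match anywhere in the name (`re.search` semantics); whether Ibis anchors the pattern is not stated here, so treat that as an assumption. `filter_names` is a hypothetical helper, not an Ibis function.

```python
import re


def filter_names(names, like=None):
    """Keep the names that match the regex `like` (hypothetical helper).

    Assumes the pattern may match anywhere in the name; use ^ and $ to
    anchor it explicitly.
    """
    if like is None:
        # No pattern means no filtering.
        return list(names)
    pattern = re.compile(like)
    return [name for name in names if pattern.search(name)]


databases = ["default", "sales_2020", "sales_2021", "ops"]
print(filter_names(databases, like=r"^sales_"))
```

The same kind of pattern applies to the `like` argument of list_tables.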
list_tables(self, like=None, database=None)
Return the list of table names in the current database.
For some backends, the tables may be files in a directory, or other equivalent entities in a SQL database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
like | str | A pattern in Python's regex format. | None |
database | str | The database to list tables of, if not the current one. | None |
Returns:
Type | Description |
---|---|
list[str] | The list of the table names that match the pattern |
raw_sql(self, query)
Execute a query string.
Use with care: results may be unexpected if the query modifies the behavior of the session in a way unknown to Ibis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | DML or DDL statement | required |
Returns:
Type | Description |
---|---|
_PySparkCursor | Backend cursor |
set_database(self, name)
DEPRECATED: set_database is deprecated as of v2.0; use a new connection to the database
sql(self, query)
inherited
Convert a SQL query to an Ibis table expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | SQL string | required |
Returns:
Type | Description |
---|---|
ir.TableExpr | Table expression |
table(self, name, database=None)
Return a table expression from a table or view in the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name | required |
database | str | None | Database in which the table resides | None |
Returns:
Type | Description |
---|---|
ir.TableExpr | Table named |
truncate_table(self, table_name, database=None)
Delete all rows from an existing table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name | required |
database | str | None | Database name | None |
verify(self, expr, params=None)
inherited
Verify expr is an expression that can be compiled.
DEPRECATED: verify is deprecated as of v2.0; compile and capture TranslationError instead
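The suggested replacement for verify, compiling and capturing TranslationError, could look roughly like the sketch below. `can_compile` is a hypothetical helper and `con` is assumed to be a backend returned by `ibis.pyspark.connect`. The fallback exception class exists only so the sketch loads without Ibis installed.

```python
try:
    from ibis.common.exceptions import TranslationError
except ImportError:
    # Stand-in so this sketch can be read and run without Ibis installed.
    class TranslationError(Exception):
        pass


def can_compile(con, expr, params=None):
    """Return True if `expr` compiles on `con` (hypothetical helper)."""
    try:
        con.compile(expr, params=params)
    except TranslationError:
        # The backend has no translation for some operation in the expression.
        return False
    return True
```

Only TranslationError is caught, so unrelated failures still surface to the caller.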