A deeper dive into the Redshift Warehouse - backend mechanics, table size/sortkey/distkey info, and more!
Databases within Databases
The Periscope Warehouse is a managed Redshift solution which has the ability to store data from multiple origin data sources. Each Redshift cluster can contain multiple databases, and the Periscope Warehouse stores all Periscope-relevant tables pulled in from the Cache UI in a database titled "site_#####" where the numbers indicate your Site ID.
If you're logged in via psql, you can type
SELECT datname FROM pg_database;
to see the list of databases. The relevant database in the Warehouse which is connected to Periscope can be found in the cluster info page, using the above command/query, or within Periscope's IDE, using the following query:
Periscope Warehouse Naming Conventions
How does the Warehouse enable cross-database joins between tables coming from different databases? It does so by putting all tables within the "site_####" database, so the joins are no longer actually cross database - all tables reside within the same database of the Warehouse. First, the origin database is given an ID, and the schema name is concatenated to the database ID, which combines the schema and db alias into a single schema within the "site_####" db.
Let's walk through an example.
Suppose I had a Periscope Warehouse with a db "site_1" which pulls in data from a postgres db called "production_replica," (assume the alias "db_5" is given by the Warehouse) and a table "public.daily_active_users" within "production_replica".
Within the Periscope IDE, if we write the query:
select * from production_replica.public.daily_active_users
the Periscope Warehouse will translate this query on the backend to the actual Warehouse table names :
select * from site_1.db_5_public.daily_active_users
As you can see above, the database name and schema name are concatenated together (with an underscore in between) to create a unique schema name within the warehouse.
This can be seen any time under the "Query" tab by running a select * against a table.
Warehouse Names, Sortkeys, and Distkeys
The following is a useful Periscope Warehouse query which will yield the database name (within the Periscope Warehouse), the schema name (within the origin database), and the primary sortkey and diststyle of each table in the Periscope Warehouse (specifically, the site_#### database). Additionally, the table size (GB) and a running cumulative table size are generated.
with tbl_ids as (select distinct oid from pg_class c where relowner>1 and relkind='r'), stp as (select id,sum(rows)sum_r,sum(sorted_rows)sum_sr,min(rows)min_r, max(rows)max_r,nvl(count(distinct slice),0)pop_slices from stv_tbl_perm where id in (select oid from tbl_ids) and slice<6400 group by id), colenc as (select attrelid,sum(case when a.attencodingtype=0 then 0 else 1 end) as encoded_cols,count(*)as cols from pg_attribute a where a.attrelid in (select oid from tbl_ids) and a.attnum>0 group by a.attrelid), cluster_info as (select count(distinct node) node_count from stv_slices) select ti.database, coalesce(regexp_substr(ti.schema, 'db_[0-9]+'), ti.schema) as origin_database_aliased_id, right(ti.schema, len(ti.schema) - len(regexp_substr(ti.schema, 'db_[0-9]+_'))) as schema_name, ti."table"as tablename, ti.diststyle, ti.sortkey1, round(1.0*ti.size / 1024,2) current_size_gb, sum(round(1.0*ti.size / 1024,2)) over (order by size desc rows unbounded preceding) as cumulative_size_on_disk from svv_table_info ti left join stp on stp.id=ti.table_id left join colenc on colenc.attrelid=ti.table_id cross join cluster_info where ti.schema not in('pg_internal') order by ti.size desc