A deeper dive into the Redshift Warehouse - backend mechanics, table size/sortkey/distkey info, and more!

Databases within Databases

The Periscope Warehouse is a managed Redshift solution which has the ability to store data from multiple origin data sources. Each Redshift cluster can contain multiple databases, and the Periscope Warehouse stores all Periscope-relevant tables pulled in from the Cache UI in a database titled "site_#####" where the numbers indicate your Site ID. 

If you're logged in via psql, you can type 




SELECT datname FROM pg_database;

to see the list of databases. The relevant database in the Warehouse which is connected to Periscope can be found in the cluster info page, using the above command/query, or within Periscope's IDE, using the following query:

select current_database()


Periscope Warehouse Naming Conventions

How does the Warehouse enable cross-database joins between tables coming from different databases? It does so by putting all tables within the "site_####" database, so the joins are no longer actually cross database - all tables reside within the same database of the Warehouse. First, the origin database is given an ID, and the schema name is concatenated to the database ID, which combines the schema and db alias into a single schema within the "site_####" db.

Let's walk through an example.

Suppose I had a Periscope Warehouse with a db "site_1" which pulls in data from a postgres db called "production_replica," (assume the alias "db_5" is given by the Warehouse) and a table "public.daily_active_users" within "production_replica". 

Within the Periscope IDE, if we write the query: 

select * from production_replica.public.daily_active_users

the Periscope Warehouse will translate this query on the backend to the actual Warehouse table names :

select * from site_1.db_5_public.daily_active_users

As you can see above, the database name and schema name are concatenated together (with an underscore in between) to create a unique schema name within the warehouse.

This can be seen any time under the "Query" tab by running a select * against a table.



Warehouse Names, Sortkeys, and Distkeys

The following is a useful Periscope Warehouse query which will yield the database name (within the Periscope Warehouse), the schema name (within the origin database), and the primary sortkey and diststyle of each table in the Periscope Warehouse (specifically, the site_#### database). Additionally, the table size (GB) and a running cumulative table size are generated. 

  tbl_ids as
    (select distinct oid
     from pg_class c
     where relowner>1
       and relkind='r'),
  stp as
    (select id,sum(rows)sum_r,sum(sorted_rows)sum_sr,min(rows)min_r,
       max(rows)max_r,nvl(count(distinct slice),0)pop_slices
     from stv_tbl_perm
     where id in (select oid from tbl_ids)
       and slice<6400
     group by id),
  colenc as
    (select attrelid,sum(case when a.attencodingtype=0 then 0 else 1 end)
        as encoded_cols,count(*)as cols
     from pg_attribute a
     where a.attrelid in (select oid from tbl_ids)
       and a.attnum>0
     group by a.attrelid),
  cluster_info as
    (select count(distinct node) node_count
     from stv_slices)
       coalesce(regexp_substr(ti.schema, 'db_[0-9]+'), ti.schema) as origin_database_aliased_id,
       right(ti.schema, len(ti.schema) - len(regexp_substr(ti.schema, 'db_[0-9]+_')))  as schema_name,
       ti."table"as tablename,
       round(1.0*ti.size / 1024,2) current_size_gb,
       sum(round(1.0*ti.size / 1024,2)) over (order by size desc rows unbounded preceding) as cumulative_size_on_disk

from svv_table_info ti
left join stp on stp.id=ti.table_id
left join colenc on colenc.attrelid=ti.table_id
cross join cluster_info
where ti.schema not in('pg_internal')
order by ti.size desc

Reply Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
reply to topic
Like1 Follow
  • 11 days agoLast active
  • 812Views
  • 1 Following