Upserting data with existing many-to-many connections

This is a strategy question - but one I've been grappling with all week and could really use some wisdom-as-a-service.


I am trying to piece together a clean, actionable PostgreSQL database with sanitized records from a ton of messy, noisy raw source tables.

Say we have contacts, companies, stores. Many contacts can work at a company. Multiple companies can be involved in a store (owner, maintenance, etc.) There are many-to-many (M:N) junctions between contacts_companies and companies_stores.

Table relationships:

contacts >-< contacts_companies >-< companies >-< companies_stores >-< stores

unique on: contacts:email, companies:name, stores:formatted_address

The multiple data sources have all sorts of overlaps between the above tables. Some have just store info, some have some contact and store info, etc. As I go through each data source, data will be cleansed and then upserted to the appropriate table.


The upsert operation is the problem. Inserting with an ON CONFLICT is fine for a specific table, but what about the junctions between?

To illustrate the problem, say we have a data source that has the following records:

{owner: "Alice A.", email: "A@Alice_stores.net", store_address:"1 main st"}

This record has

  • a contact {name: Alice A, email: A@Alice_stores.net},
  • a company {name: Alice Stores}
  • a store {address: 1 main st}

The database might already have a store for 1 main st, but not a contact/company for Alice.

If we individually upsert the contact, company and store, we won't know what proper IDs to place in the junction tables:

  • If INSERT, use new id for FK in junction.
  • If Update, use existing id for fk in junction.


What is the best strategy for upserting multiple tables from various sources when there are M:N relationships across the board?

One strategy that occurred to me is to query all tables/junctions, hold them in memory and perform all conditionals/updates in memory and then have a more forceful upsert. But this can quickly get out of hand...


One way could be using returning clause with cte to make all upserts in one statement. For illustration purposes I use 1 many-to-many relationship.

--- test tables with m2m 
create table test1(test1_id bigserial not null primary key, uq_value1 text ,constraint unique_uq_val1 unique (uq_value1));

create table test2(test2_id bigserial not null primary key, uq_value2 text ,constraint unique_uq_val2 unique (uq_value2));

create table test1_test2 (test1_id bigint not null, test2_id bigint not null, 
primary key (test1_id, test2_id),
constraint fk_test1  foreign key (test1_id) references test1(test1_id ),
constraint fk_test2  foreign key (test2_id) references test2(test2_id )
-- insert value into the first table , or do a dummy update if it's already there
ins1 as 
(insert  into test1(uq_value1) values('foo') on conflict on constraint unique_uq_val1 do update 
    set  uq_value1=test1.uq_value1
returning test1_id 
-- insert value into the second table , or do a dummy update if it's already there
ins2 as 
(insert  into test2(uq_value2) values('bar') on conflict on constraint unique_uq_val2  do update 
    set  uq_value2=test2.uq_value2 
returning test2_id 
-- select PK of inserted records (since there is exactly one record in each insert,
-- cross join is used
sel_1 as 
select test1_id,test2_id
from ins1  
cross join ins2 
-- finally insert into link tables if such a record doesn't exist : 
insert into test1_test2(test1_id, test2_id)
select * from sel_1  a 
where not exists( select null from test1_test2 b where (b.test1_id, b.test2_id) = (a.test1_id, a.test2_id))
Upserting data with existing many-to-many connections
See more ...