LLM Data Privacy - Synthetic Data vs. Tokenization?
The best way to protect sensitive data in LLMS - synthetic data and tokenization?
April 23rd, 2024
PGDUMP (or pg_dump) is a Postgres utility (source) that comes bundled with every PostgreSQL installation. It is a command-line tool that connects to a PostgreSQL server and constructs queries to extract database metadata and/or table data, including all of the constraints. This exported file can then be used to recreate the database on the same or another PostgreSQL server. This is really useful for backups and data migrations.
Let's take a step-by-step look at exactly how PGDUMP works under the hood and then at the end go through an example.
When you run the pg_dump command, the first thing it does is establish a connection to the PostgreSQL server using the provided connection string. It uses the same libpq library as the psql command-line client. This is nice because it ensures compatability across versions.
To ensure a consistent snapshot of the database, pg_dump leverages PostgreSQL's MVCC (Multi-Version Concurrency Control). This means it can safely back up the database without locking out concurrent users. This is usually fine for most operations but large transactions might be impacted.
Once connected, pg_dump queries the PostgreSQL system catalogs to retrieve metadata about the target database. This includes information on schemas, tables, data types, indexes, constraints, views, functions, and more. It fetches all the database object definitions and their properties.
Let's go through these in more detail.
pg_database
system catalog to get general properties of the database like encoding, tablespace, etc.pg_extension
to get a list of installed extensions in the database.pg_namespace
to get a list of all schemas in the database.pg_type
to get a list of data types defined in that schema.pg_class
and pg_attribute
to get a list of tables and their column definitions.pg_constraint
and pg_index
to retrieve constraints (primary keys, foreign keys, etc.)
and index definitions.pg_inherits
system catalog.pg_class
again to identify views and sequences in each schema.pg_proc
to get functions, procedures, triggers, etc. defined in each schema.pg_operator
and pg_opclass
for user-defined operators and operator classes.pg_description
to retrieve comments on database objects.It's important to note that pg_dump doesn't query the tables directly. Instead, it queries the system catalogues which contain all of the metadata representing the database. The order also matters. For example, data types are queried before tables since tables may use custom types. It then buffers this metadata in memory as it gets ready for step 2.
Now that we have the metadata, pg_dump then starts the process of extracting the raw data for each table in the database. There are configurations for pg_dump where you only get the schema. If so, then this step is effectively skipped. If we want the data as well, then here's a breakdown of what happens:
SELECT \* FROM schema.table
for each table.COPY
command. For example, COPY schema.table (col1, col2) FROM stdin; string1 string2 \.
Tables with foreign keys or inheritance hierarchies are dumped in an order that preserves referential integrity. Data is dumped sorted by OID (object identifier) to maintain consistency across runs. Lastly, pg_dump has special handling for large objects. It typically uses the lo_export function to manage these objects.
At this point, pg_dump has, in memory, the contents of the pg_dump output file. Just in time for step 3!
As pg_dump fetches metadata definitions and table data, it incrementally constructs the dump file by writing out sections in this order:
The output is written as plain text with SQL statements intermixed with PGDUMP directives that indicate which database objects a section applies to.
By default, pg_dump outputs the complete SQL script to standard output (stdout). You can optionally specify an output file to write to instead.
Additionally, pg_dump can also compress the output SQL file using the specified compression format (gzip, bzip2, etc). The compressed file can then be transferred and decompressed on the destination system.
Let's go through an example of using pg_dump to back up the "neosync_test" sample database:
$ pg_dump -U postgres neosync_test > neosync_test
This command connects to the PostgreSQL server as the "postgres" user and dumps the complete "neosync_test" database as plain text SQL into the file "neosync_test".
Here's a high-level look at what the dump will look like:
--
-- PostgreSQL database dump
--
-- Dumped from database version 16.0 (Debian 16.0-1.pgdg120+1)
-- Dumped by pg_dump version 16.2 (Homebrew)
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
--
-- Name: neosync_test; Type: SCHEMA; Schema: -; Owner: postgres
--
CREATE SCHEMA neosync_test;
ALTER SCHEMA neosync_test OWNER TO postgres;
--
-- Name: uuid-ossp; Type: EXTENSION; Schema: -; Owner: -
--
CREATE EXTENSION IF NOT EXISTS "uuid-ossp" WITH SCHEMA public;
--
-- Name: EXTENSION "uuid-ossp"; Type: COMMENT; Schema: -; Owner:
--
COMMENT ON EXTENSION "uuid-ossp" IS 'generate universally unique identifiers (UUIDs)';
SET default_tablespace = '';
SET default_table_access_method = heap;
--
-- Name: users; Type: TABLE; Schema: neosync_test; Owner: postgres
--
CREATE TABLE neosync_test.users (
email character varying(60) NOT NULL,
id uuid DEFAULT public.uuid_generate_v4() NOT NULL,
first_name character varying(20) DEFAULT 'evis'::character varying NOT NULL,
last_name character varying(255) DEFAULT 'drenova'::character varying NOT NULL,
weight integer DEFAULT 30 NOT NULL,
middle_name character varying DEFAULT 'bj'::character varying NOT NULL
);
ALTER TABLE neosync_test.users OWNER TO postgres;
--
-- Data for Name: users; Type: TABLE DATA; Schema: neosync_test; Owner: postgres
--
COPY neosync_test.users (email, id, first_name, last_name, weight, middle_name) FROM stdin;
evis@gmail.com a293f9fe-96dc-4e3d-84bd-1b966356e54b evis hello 30 bj
johndoe@gmail.com 152983c0-824f-4fc7-896e-5d56c8560e6c 1234567890 hello 30 bj
evistest@hotmrgergail.com 54f01b8b-c423-4c3b-b1eb-d2cd15271d13 evis joijfio32j3JIO 30 bj
evispgdump@hotmrgergail.com 54f01b8b-c423-4c3b-b1eb-d2cd15271d19 evis joijfio32j3JIO 30 bj
\.
--
-- Name: users users_email_key; Type: CONSTRAINT; Schema: neosync_test; Owner: postgres
--
ALTER TABLE ONLY neosync_test.users
ADD CONSTRAINT users_email_key UNIQUE (email);
--
-- Name: users users_pkey; Type: CONSTRAINT; Schema: neosync_test; Owner: postgres
--
ALTER TABLE ONLY neosync_test.users
ADD CONSTRAINT users_pkey PRIMARY KEY (id);
--
-- PostgreSQL database dump complete
--
The dump contains the DDL to create all schemas, tables, data types, and other objects first. Then it has the raw INSERT statements populated with data for each table. Finally, it includes the definitions for constraints, indexes, views, functions, and other database objects.
When this dump file is run through psql on another PostgreSQL server, it will accurately reconstruct the complete "neosync_test" database from scratch.
Pg_dump is used by thousands of companies every day to back up their databases. Conceptually, it's pretty simple but the implementation is quite complex. The source code is about 18k lines long! It's fun to take a deep dive under the covers and see exactly what's going on. I hope this technical deep dive should give you a solid understanding of the detailed process pg_dump goes through.
The best way to protect sensitive data in LLMS - synthetic data and tokenization?
April 23rd, 2024
A walkthrough of how we reduced the time it takes to generate data in Neosync by 50% + benchmarks.
April 10th, 2024