LLM Data Privacy - Synthetic Data vs. Tokenization?
The best way to protect sensitive data in LLMs - synthetic data and tokenization?
April 23rd, 2024
In our earlier blog post, we talked about Postgres Anonymizer (PG Anonymizer) and how engineering teams can use it to anonymize sensitive data in their Postgres databases. While PG Anonymizer works well for many use cases, there are some where it doesn't work so well, and in those cases you may need an alternative. In this blog post, we're going to review some alternatives to PG Anonymizer along with their strengths and weaknesses.
Let's jump in.
Neosync is an open source synthetic test data platform that anonymizes and generates synthetic data and orchestrates it across environments.
Pros:
Cons:
One of the most commonly used open source libraries is Faker. The Python library, itself inspired by earlier Perl, Ruby, and PHP implementations, has counterparts in Go, JavaScript, C++ and other languages, although not all of them are equal in flexibility and extensibility. We find that the Python version is still the most built-out.
For simple projects, Faker is very easy to get up and running. In Python, you can have Faker working in just 3 lines of code.
from faker import Faker
fake = Faker()
print(fake.name())
# e.g. John Doe (the output is random)
That said, this is a very bare-bones implementation. In reality, you'll have to write a bit of customization to get it to generate many rows of data and to fit the schema and database that you're working with.
Pros:
Cons:
YData is a startup that works with machine learning engineers and AI companies to help them generate synthetic data, mainly for machine learning use cases. They have a clean Python SDK that is easy to use and can quickly generate synthetic data and run data quality checks. Additionally, they have a data cataloging tool that helps teams understand their data.
Pros:
Cons:
Tonic AI is a company that mainly focuses on creating and orchestrating test data for developers. They've been in the market since 2019 and are established in the space. They have a strong data anonymization feature set and support most databases. Let's take a look at their pros and cons.
Pros:
Cons:
Gretel AI is another synthetic data company, and it is more similar to YData than to Tonic. Gretel supports workflows for machine learning engineers and developers and can generate synthetic data for tabular and relational databases.
Pros:
Cons:
In this blog post, we covered a few alternatives to PG Anonymizer and their pros and cons. Depending on your use case, PG Anonymizer may work just fine, but if you need advanced data anonymization features, orchestration across databases, and more control over your synthetic data, then one of these alternative tools may do the job.