Version: 0.1.0
DuckDB Version: ≥ v1.4.2
The Anofox Tabular extension provides comprehensive data quality validation, anomaly detection, and data diffing capabilities directly within DuckDB. All validation and analysis computations are performed by native C++ implementations with optional integration to external libraries.
- 9 Production-Ready Modules: Email, postal, phone, money, VAT, PII, metrics, anomalies, and diffing
- 62 SQL Functions: Complete validation and analysis toolkit
- Zero Friction: SQL-native with no external services required (except optional libpostal for address parsing)
- Blazing Fast: Vectorized C++17 implementation processes millions of rows per second
- Self-Contained: Embedded validation patterns; no API keys or network calls required
Functions follow consistent naming patterns:
anofox_tab_*prefix for all extension functions (with aliases without prefix)anofox_tab_*_is_validsuffix for validation functionsanofox_tab_*_formatsuffix for formatting functionsanofox_tab_metric_*prefix for data quality metricsanofox_tab_diff_*prefix for data diffing operations
All functions have aliases without the anofox_tab_ prefix. For example:
anofox_tab_email_is_validcan also be called asemail_is_validanofox_tab_vatcan also be called asvat
Important: All functions use positional parameters, NOT named parameters (:= syntax).
Common Parameter Types:
email: Email address -VARCHARaddress: Street address -VARCHARnumber: Phone number -VARCHARregion: ISO region code -VARCHARmoney: Money struct -STRUCT(amount DECIMAL(18,3), currency VARCHAR)(exact decimal amounts; NaN/Inf unrepresentable, overflow raises)vat_string: VAT number -VARCHARtable_name: Source table -VARCHARcolumn_name: Column name -VARCHAR
- Email Validation
- Address Parsing & Normalization
- Phone Number Validation
- Money & Currency Operations
- VAT Validation
- PII Detection
- Data Quality Metrics
- Anomaly Detection
- Data Diffing
- Configuration
- Function Coverage Matrix
- Notes
Multi-stage email verification with configurable validation modes.
Quick boolean validation of email addresses.
Signature:
anofox_tab_email_is_valid(email VARCHAR [, mode VARCHAR]) → BOOLEANParameters:
email: Email address to validatemode: Validation mode -'regex'(default),'dns', or'smtp'
Returns:
BOOLEAN:trueif email is valid,falseotherwise
Example:
SELECT anofox_tab_email_is_valid('user@example.com', 'regex');
-- or using alias:
SELECT email_is_valid('user@example.com', 'regex');
-- Returns: trueDetailed email validation with stage information, failure reasons, MX hosts, and SMTP transcripts.
Signature:
anofox_tab_email_validate(email VARCHAR [, mode VARCHAR]) → STRUCTReturns:
STRUCT(
valid BOOLEAN,
stage VARCHAR, -- 'regex', 'dns', or 'smtp'
reason VARCHAR, -- Failure reason if invalid
mx_hosts VARCHAR[], -- MX record hosts (DNS/SMTP modes)
smtp_transcript VARCHAR[] -- SMTP conversation log (SMTP mode)
)Example:
SELECT anofox_tab_email_validate('support@example.org', 'smtp');
-- or using alias:
SELECT email_validate('support@example.org', 'smtp');
-- Returns: {valid: true, stage: 'smtp', reason: NULL, mx_hosts: [...], smtp_transcript: [...]}Returns current email validation configuration settings.
Signature:
anofox_tab_email_config() → TABLE(key VARCHAR, value VARCHAR)Returns:
- Configuration keys:
dns_timeout_ms,dns_tries,smtp_port,smtp_connect_timeout_ms,smtp_read_timeout_ms,smtp_helo_domain,smtp_mail_from,trace_enabled,trace_level
Example:
SELECT * FROM anofox_tab_email_config();
-- or using alias:
SELECT * FROM email_config();Powered by libpostal, a statistical NLP library for parsing addresses.
Parse unstructured addresses into structured components.
Signature:
anofox_tab_postal_parse_address(address VARCHAR) → STRUCTReturns:
STRUCT(
house_number VARCHAR,
road VARCHAR,
city VARCHAR,
state VARCHAR,
postcode VARCHAR,
country VARCHAR
)Example:
SELECT anofox_tab_postal_parse_address('620 Bolger Place, The Burren, NSW 4726');
-- Returns: {house_number: '620', road: 'Bolger Place', city: 'The Burren', state: 'NSW', postcode: '4726', country: NULL}Generate normalized address variants for fuzzy matching.
Signature:
anofox_tab_postal_expand_address(address VARCHAR) → LIST<VARCHAR>Returns:
LIST<VARCHAR>: Array of normalized address variants
Example:
SELECT anofox_tab_postal_expand_address('123 Main St');
-- Returns: ['123 Main Street', '123 Main St', '123 main street', ...]Returns library initialization status and data availability.
Signature:
anofox_tab_postal_status() → TABLEReturns:
- Status information including
initialized,data_present,data_directory
Example:
SELECT * FROM anofox_tab_postal_status();Download and extract libpostal data (~500MB) to the configured data directory.
Signature:
anofox_tab_postal_load_data() → BOOLEANReturns:
BOOLEAN:trueif data was successfully loaded,falseotherwise
Example:
SELECT anofox_tab_postal_load_data();International phone parsing via libphonenumber, Google's library for parsing and formatting phone numbers.
Parse and validate phone numbers with detailed information.
Signature:
anofox_tab_phonenumber_parse(number VARCHAR, region VARCHAR) → STRUCTReturns:
STRUCT(
valid BOOLEAN,
country_code INTEGER,
national_number VARCHAR,
region_code VARCHAR,
type VARCHAR
)Example:
SELECT anofox_tab_phonenumber_parse('+1 (415) 555-1234', 'US');
-- Returns: {valid: true, country_code: 1, national_number: '4155551234', region_code: 'US', type: 'FIXED_LINE_OR_MOBILE'}Format phone numbers in different styles.
Signature:
anofox_tab_phonenumber_format(number VARCHAR, region VARCHAR, format VARCHAR) → VARCHARParameters:
format: Format style -'E164','INTERNATIONAL','NATIONAL', or'RFC3966'(case-insensitive). Any other value raises anInvalid Inputerror.
Returns:
VARCHAR: Formatted phone number, orNULLif the number cannot be parsed as a valid phone number
Example:
SELECT anofox_tab_phonenumber_format('4155551234', 'US', 'INTERNATIONAL');
-- Returns: '+1 415-555-1234'Extract ISO region code from phone number.
Signature:
anofox_tab_phonenumber_region(number VARCHAR, region VARCHAR) → VARCHARReturns:
VARCHAR: ISO region code (e.g., 'US', 'GB', 'DE')
Example:
SELECT anofox_tab_phonenumber_region('+1 415-555-1234', 'US');
-- Returns: 'US'Full validation using length and prefix information.
Signature:
anofox_tab_phonenumber_is_valid(number VARCHAR, region VARCHAR) → BOOLEANReturns:
BOOLEAN:trueif phone number is valid
Example:
SELECT anofox_tab_phonenumber_is_valid('+1 415-555-1234', 'US');
-- Returns: trueQuick possibility check using length-only analysis.
Signature:
anofox_tab_phonenumber_is_possible(number VARCHAR, region VARCHAR) → BOOLEANReturns:
BOOLEAN:trueif phone number is possible (may not be valid)
Example:
SELECT anofox_tab_phonenumber_is_possible('4155551234', 'US');
-- Returns: trueRegion-specific validation.
Signature:
anofox_tab_phonenumber_is_valid_for_region(number VARCHAR, region VARCHAR) → BOOLEANReturns:
BOOLEAN:trueif phone number is valid for the specified region
Fuzzy matching between two phone numbers.
Signature:
anofox_tab_phonenumber_match(number1 VARCHAR, number2 VARCHAR, region VARCHAR) → VARCHARReturns:
VARCHAR: Match type -'EXACT_MATCH','NSN_MATCH','SHORT_NSN_MATCH', or'NO_MATCH'
Example:
SELECT anofox_tab_phonenumber_match('+1 415-555-1234', '4155551234', 'US');
-- Returns: 'EXACT_MATCH'Generate example phone number for a region.
Signature:
anofox_tab_phonenumber_example(region VARCHAR) → VARCHARReturns:
VARCHAR: Example phone number for the region
Example:
SELECT anofox_tab_phonenumber_example('US');
-- Returns: '+1 650-253-0000'Returns library status and default region.
Signature:
anofox_tab_phonenumber_status() → TABLEReturns:
- Status information including
initialized,default_region
International monetary value handling with currency-aware arithmetic and formatting.
Money amounts are exact DECIMAL(18,3) values: arithmetic never drifts, NaN/Inf are unrepresentable, and results outside the DECIMAL(18,3) range raise an out-of-range error instead of overflowing to Inf. money() takes a DOUBLE amount (must be finite, rounded half away from zero to 3 decimals); money_from_cents() constructs exactly from integer subunits.
10 major currencies: USD, EUR, GBP, JPY, CAD, AUD, CHF, CNY, INR, BRL
Create a money value from amount and currency code.
Signature:
anofox_tab_money (alias: money(amount DOUBLE, currency_code VARCHAR) → STRUCT(amount DECIMAL(18,3), currency VARCHAR)Example:
SELECT anofox_tab_money (alias: money(100.50, 'USD');
-- Returns: {amount: 100.5, currency: 'USD'}Create a money value from an integer amount in the currency's smallest unit (e.g. cents). The amount is divided by the currency's subunit_to_unit (100 for USD/EUR, 1 for JPY).
Signature:
anofox_tab_money (alias: money_from_cents(cents BIGINT, currency_code VARCHAR) → STRUCT(amount DECIMAL(18,3), currency VARCHAR)Example:
SELECT anofox_tab_money (alias: money_from_cents(10050, 'USD');
-- Returns: {amount: 100.5, currency: 'USD'}Extract amount from money struct.
Signature:
anofox_tab_money (alias: money_amount(money STRUCT) → DECIMAL(18,3)Example:
SELECT anofox_tab_money (alias: money_amount(anofox_tab_money (alias: money(100.50, 'USD'));
-- Returns: 100.5Extract currency code from money struct.
Signature:
anofox_tab_money (alias: money_currency(money STRUCT) → VARCHARExample:
SELECT anofox_tab_money (alias: money_currency(anofox_tab_money (alias: money(100.50, 'USD'));
-- Returns: 'USD'Check if currency code is valid.
Signature:
anofox_tab_is_valid_currency (alias: is_valid_currency(code VARCHAR) → BOOLEANExample:
SELECT anofox_tab_is_valid_currency (alias: is_valid_currency('USD');
-- Returns: trueGet currency symbol (e.g., '$', '€').
Signature:
anofox_tab_currency_ (alias: currency_symbol(code VARCHAR) → VARCHARExample:
SELECT anofox_tab_currency_ (alias: currency_symbol('USD');
-- Returns: '$'Get currency name (e.g., 'United States Dollar').
Signature:
anofox_tab_currency_ (alias: currency_name(code VARCHAR) → VARCHARExample:
SELECT anofox_tab_currency_ (alias: currency_name('USD');
-- Returns: 'United States Dollar'Format money for display. The number of decimals is derived from the currency's subunit_to_unit (JPY has none, USD/EUR have two). The 'symbol' and 'long' styles localize the amount with the currency's thousands separator and decimal mark; the 'code' style stays canonical (dot decimal mark, no grouping).
Signature:
anofox_tab_money (alias: money_format(money STRUCT, style VARCHAR) → VARCHARParameters:
style: Format style -'symbol'(e.g.$1,234,567.89,1.234.567,89 €,¥1,235),'code'(e.g.1234567.89 USD), or'long'(e.g.1.234,50 Euro)
Example:
SELECT anofox_tab_money (alias: money_format(anofox_tab_money (alias: money(150.00, 'EUR'), 'symbol');
-- Returns: '150,00 €'Check if amount > 0.
Signature:
anofox_tab_money (alias: money_is_positive(money STRUCT) → BOOLEANCheck if amount < 0.
Signature:
anofox_tab_money (alias: money_is_negative(money STRUCT) → BOOLEANCheck if amount == 0.
Signature:
anofox_tab_money (alias: money_is_zero(money STRUCT) → BOOLEANGet absolute value (sign removed).
Signature:
anofox_tab_money (alias: money_abs(money STRUCT) → STRUCT(amount DECIMAL(18,3), currency VARCHAR)Add two money values (same currency required).
Signature:
anofox_tab_money (alias: money_add(money1 STRUCT, money2 STRUCT) → STRUCT(amount DECIMAL(18,3), currency VARCHAR)Example:
SELECT anofox_tab_money (alias: money_add(
anofox_tab_money (alias: money(100.00, 'EUR'),
anofox_tab_money (alias: money(50.00, 'EUR')
);
-- Returns: {amount: 150.0, currency: 'EUR'}Subtract money2 from money1 (same currency required).
Signature:
anofox_tab_money (alias: money_subtract(money1 STRUCT, money2 STRUCT) → STRUCT(amount DECIMAL(18,3), currency VARCHAR)Multiply money by a scalar factor.
Signature:
anofox_tab_money (alias: money_multiply(money STRUCT, factor DOUBLE) → STRUCT(amount DECIMAL(18,3), currency VARCHAR)Check if amount is within range.
Signature:
anofox_tab_money (alias: money_in_range(money STRUCT, min DOUBLE, max DOUBLE) → BOOLEANExample:
SELECT anofox_tab_money (alias: money_in_range(anofox_tab_money (alias: money(100.00, 'USD'), 0.01, 99999.99);
-- Returns: trueCheck if two money values have same currency.
Signature:
anofox_tab_money (alias: money_same_currency(money1 STRUCT, money2 STRUCT) → BOOLEANEuropean VAT number validation for regulatory compliance and data quality.
29 countries supported (28 EU + UK): AT, BE, BG, CY, CZ, DE, DK, EE, EL, ES, FI, FR, HR, HU, IE, IT, LT, LU, LV, MT, NL, PL, PT, RO, SE, SI, SK, UK
Parse VAT string into country and digits.
Signature:
anofox_tab_vat (alias: vat(vat_string VARCHAR) → STRUCT(country VARCHAR, digits VARCHAR)Example:
SELECT anofox_tab_vat (alias: vat('DE123456789');
-- Returns: {country: 'DE', digits: '123456789'}Check if country code is valid VAT country. Codes are case-insensitive and the
VAT aliases EL (Greece) and XI (Northern Ireland / GB) are accepted.
Signature:
anofox_tab_is_valid_vat_country (alias: is_valid_vat_country(code VARCHAR) → BOOLEANExample:
SELECT anofox_tab_is_valid_vat_country (alias: is_valid_vat_country('DE');
-- Returns: trueNormalize VAT string (uppercase, remove punctuation).
Signature:
anofox_tab_vat (alias: vat_normalize(vat_string VARCHAR) → VARCHARExample:
SELECT anofox_tab_vat (alias: vat_normalize('de 123-456-789');
-- Returns: 'DE123456789'Validate VAT syntax against country pattern.
Signature:
anofox_tab_vat (alias: vat_is_valid_syntax(vat_string VARCHAR) → BOOLEANParse VAT into country and normalized digits.
Signature:
anofox_tab_vat (alias: vat_split(vat_string VARCHAR) → STRUCT(country VARCHAR, digits VARCHAR)Check if VAT has valid country prefix.
Signature:
anofox_tab_vat (alias: vat_exists(vat_string VARCHAR) → BOOLEANCheck if country is EU member. Codes are case-insensitive and the VAT aliases
EL/XI are normalized (e.g. vat_is_eu_member('EL') is true, 'XI' is false).
Signature:
anofox_tab_vat (alias: vat_is_eu_member(country_code VARCHAR) → BOOLEANGet full country name.
Signature:
anofox_tab_vat (alias: vat_country_name(country_code VARCHAR) → VARCHARExample:
SELECT anofox_tab_vat (alias: vat_country_name('DE');
-- Returns: 'Germany'Format VAT for display. Supported styles (case-insensitive):
'plain'— digits only, without country prefix'iso'— VAT country prefix + digits (Greece usesEL, Northern IrelandXI)
Unparseable VAT numbers return NULL; unknown styles raise an error.
Signature:
anofox_tab_vat (alias: vat_format(vat_string VARCHAR, style VARCHAR) → VARCHARExample:
SELECT anofox_tab_vat_format('de 123-456-789', 'iso');
-- Returns: 'DE123456789'
SELECT anofox_tab_vat_format('GR123456783', 'plain');
-- Returns: '123456783'Full validation: syntax check plus country check-digit (checksum) validation where implemented.
Check digits are verified for: AT, BE, DE, DK, ES, FI, FR, GR/EL, IE, IT, LU, NL, PL, PT, SE, SI. Other countries are validated by syntax only. French VAT keys containing letters are accepted without check-digit verification.
Signature:
anofox_tab_vat (alias: vat_is_valid(vat_string VARCHAR) → BOOLEANExample:
SELECT anofox_tab_vat_is_valid('DE111111125');
-- Returns: true
SELECT anofox_tab_vat_is_valid('DE123456789');
-- Returns: false (valid syntax, invalid check digit)Detect and mask Personally Identifiable Information (PII) in text data. Supports multiple PII types with configurable masking strategies.
| Type | Description | Example |
|---|---|---|
| Email addresses | user@example.com |
|
| CREDIT_CARD | Credit card numbers (Visa, MC, Amex, Discover) | 4111-1111-1111-1111 |
| US_SSN | US Social Security Numbers | 123-45-6789 |
| IBAN | International Bank Account Numbers | DE89370400440532013000 |
| IP_ADDRESS | IPv4 addresses | 192.168.1.100 |
| URL | HTTP/HTTPS URLs | https://example.com |
| DE_TAX_ID | German Tax ID (Steueridentifikationsnummer) | 12345678901 |
| MAC_ADDRESS | Network hardware addresses (MAC) | 00:1A:2B:3C:4D:5E |
| UK_NINO | UK National Insurance Number | AB123456C |
| US_PASSPORT | US Passport numbers | A12345678 |
| PHONE | International phone numbers | +1-555-123-4567 |
| API_KEY | API keys (AWS, GitHub, generic) | AKIAIOSFODNN7EXAMPLE |
| CRYPTO_ADDRESS | Cryptocurrency addresses (Bitcoin, Ethereum) | 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa |
| NAME | Person names (NER-based, PER entities) | John Smith |
| ORGANIZATION | Company/organization names (NER-based, ORG entities) | Microsoft, Google Inc |
| LOCATION | Geographic locations (NER-based, LOC entities) | Paris, New York |
| MISC | Miscellaneous entities (NER-based, MISC entities) | French, Nobel Prize |
US_PASSPORT:
- May match 9-digit sequences (invoice numbers, IDs)
- No checksum validation available (passport numbers have none)
- Use context to interpret overlapping matches with US_SSN
PHONE:
- Requires distinctive features (+ prefix or parentheses format)
- Pattern-based detection only, not full libphonenumber validation
- For comprehensive phone parsing, use
anofox_phonenumber_parse()
API_KEY:
- Generic pattern uses entropy threshold (≥3.5 bits/char for 32+ chars)
- AWS (
AKIA...) and GitHub (ghp_,gho_, etc.) patterns are highly specific - Possible false positives with random high-entropy strings
CRYPTO_ADDRESS:
- Bitcoin legacy (P2PKH) and P2SH addresses validated with Base58 checksum (double SHA-256)
- Bitcoin SegWit (bc1) addresses use format validation only (Bech32 checksum deferred)
- Ethereum addresses validated by hex format (no EIP-55 checksum for all-lowercase)
- Partial masking shows first 4 + last 4 characters
NAME, ORGANIZATION, LOCATION, MISC (NER-based entities):
- All four entity types use ML-based NER (Named Entity Recognition) with OpenVINO + DistilBERT
- Model trained on CoNLL-2003 dataset with 92% F1 score
- OpenVINO is required and installed via vcpkg with AUTO device selection (CPU/GPU/NPU)
- Model downloads automatically from HuggingFace on first use (~66 MB quantized ONNX)
- Use
SELECT * FROM anofox_ner_status()to check NER availability - Confidence threshold: 0.7 (70%) for entity acceptance
- NAME has dictionary fallback if NER unavailable; ORGANIZATION, LOCATION, MISC require NER
- Platform Availability:
- ✅ Linux x64 (glibc), macOS x64, macOS ARM64 - Full NER detection available
- ❌ Windows x64 - OpenVINO not installed, NAME uses dictionary fallback, ORG/LOC/MISC unavailable
- ❌ Linux ARM64 - vcpkg OpenVINO build timeout issues, excluded from CI
- ❌ Linux musl (Alpine) - OpenVINO requires glibc, excluded from CI
- Entity types:
- NAME: Person names (PER entities) - e.g., "John Smith"
- ORGANIZATION: Company/org names (ORG entities) - e.g., "Microsoft"
- LOCATION: Geographic locations (LOC entities) - e.g., "Paris"
- MISC: Miscellaneous entities (events, languages) - e.g., "French", "Nobel Prize"
- Case sensitive: all-caps text may produce fragmented or no results
| Strategy | Description | Example Output |
|---|---|---|
redact |
Replace with type label | [EMAIL], [US_SSN] |
partial |
Show partial value | te**@example.com, ***-**-6789 |
asterisk |
Replace with asterisks (same length) | **************** |
hash |
Replace with SHA-256 hash (truncated) | a1b2c3d4e5f6... |
Detect all PII in input text and return matches as JSON.
Signature:
anofox_tab_pii_detect(text VARCHAR) → VARCHARReturns:
VARCHAR: JSON array of detected PII matches with type, position, and confidence
Example:
SELECT pii_detect('Contact: john.doe@example.com, SSN: 123-45-6789');
-- Returns: [{"type":"EMAIL","text":"john.doe@example.com","start":9,"end":29,"confidence":1.00},
-- {"type":"US_SSN","text":"123-45-6789","start":36,"end":47,"confidence":1.00}]
-- NER-based entity detection example:
SELECT pii_detect('John Smith from Microsoft visited Paris to study French');
-- Returns: [{"type":"NAME","text":"John Smith","start":0,"end":10,"confidence":1.00},
-- {"type":"ORGANIZATION","text":"Microsoft","start":16,"end":25,"confidence":1.00},
-- {"type":"LOCATION","text":"Paris","start":34,"end":39,"confidence":1.00},
-- {"type":"MISC","text":"French","start":49,"end":55,"confidence":1.00}]Mask all detected PII in text using specified strategy.
Signature:
anofox_tab_pii_mask(text VARCHAR [, strategy VARCHAR]) → VARCHARParameters:
text: Input text containing potential PIIstrategy: Masking strategy -'redact'(default),'partial','asterisk', or'hash'
Returns:
VARCHAR: Text with PII masked according to strategy
Examples:
-- Default (redact) strategy
SELECT pii_mask('Email: test@example.com');
-- Returns: 'Email: [EMAIL]'
-- Partial masking
SELECT pii_mask('SSN: 123-45-6789', 'partial');
-- Returns: 'SSN: ***-**-6789'
-- Asterisk masking
SELECT pii_mask('Card: 4111111111111111', 'asterisk');
-- Returns: 'Card: ****************'Check if text contains any PII.
Signature:
anofox_tab_pii_contains(text VARCHAR) → BOOLEANReturns:
BOOLEAN:trueif any PII detected,falseotherwise
Example:
SELECT pii_contains('Contact us at support@company.com');
-- Returns: true
SELECT pii_contains('Hello, world!');
-- Returns: falseCount the number of PII matches in text.
Signature:
anofox_tab_pii_count(text VARCHAR) → BIGINTReturns:
BIGINT: Number of PII instances detected
Example:
SELECT pii_count('Email: a@b.com, SSN: 123-45-6789, Card: 4111111111111111');
-- Returns: 3Show all registered PII recognizers with metadata.
Signature:
pii_status() → TABLE(pii_type, recognizer_name, enabled, pattern_info)Returns:
pii_type(VARCHAR): PII type identifier (EMAIL, CREDIT_CARD, etc.)recognizer_name(VARCHAR): Human-readable nameenabled(BOOLEAN): Always true (no disable mechanism)pattern_info(VARCHAR): Description of validation method
Example:
SELECT * FROM pii_status() ORDER BY pii_type;
-- Shows all 17 PII types with their recognizer infoShow NER (Named Entity Recognition) model status for NAME detection.
Signature:
anofox_ner_status() → TABLE(onnx_available, model_status, model_path, model_size_mb, status_message)Returns:
onnx_available(BOOLEAN): Whether OpenVINO is compiled in (column name kept for backward compatibility)model_status(VARCHAR): NOT_LOADED, DOWNLOADING, LOADED, FAILED, or NOT_AVAILABLEmodel_path(VARCHAR): Path to ONNX model file (or "N/A")model_size_mb(DOUBLE): Model file size in MB (0 if not loaded)status_message(VARCHAR): Human-readable status description
Example:
SELECT * FROM anofox_ner_status();
-- onnx_available | model_status | model_path | model_size_mb | status_message
-- true | LOADED | /home/user/.duckdb/extensions/anofox/ner/distilbert-en/model_quantized.onnx | 66.5 | Model loaded successfullyNER configuration options:
| Option | Default | Description |
|---|---|---|
anofox_ner_model |
distilbert-en |
NER model to use. Only models with downloadable assets are accepted; the value cannot be changed after the model has been loaded (restart required). |
anofox_ner_device |
AUTO |
OpenVINO inference device (AUTO, CPU, GPU, GPU.0, ...). Takes effect on next model load. |
anofox_ner_cache_size |
10000 |
LRU cache size for NER results (0 disables caching). |
NER input limits:
- Inputs are truncated at the model's maximum sequence length (512 tokens for DistilBERT). Entities beyond the first ~512 tokens are not detected.
- The built-in WordPiece tokenizer is ASCII-focused: UTF-8 multibyte text is kept intact inside words, but non-ASCII punctuation/whitespace is not recognized as a word boundary. The bundled DistilBERT model is English-only.
Scan entire table for PII across VARCHAR columns.
Signature:
pii_scan_table(table_name VARCHAR [, columns VARCHAR])
→ TABLE(column_name, pii_type, match_count, sample_values, confidence)Parameters:
table_name: Name of table to scancolumns: Optional comma-separated column names (scans all VARCHAR if omitted)
Returns:
column_name(VARCHAR): Column containing PIIpii_type(VARCHAR): Type of PII detectedmatch_count(BIGINT): Number of matches foundsample_values(VARCHAR[]): Up to 5 sample valuesconfidence(DOUBLE): Average confidence (always 1.0)
Examples:
-- Scan all columns
SELECT * FROM pii_scan_table('users');
-- Scan specific columns
SELECT * FROM pii_scan_table('users', 'email,phone');
-- Find high-risk columns
SELECT column_name, COUNT(DISTINCT pii_type) as pii_types
FROM pii_scan_table('customer_data')
GROUP BY column_name
HAVING COUNT(DISTINCT pii_type) >= 2;Row-level PII audit with masking for detailed inspection.
Signature:
pii_audit_table(table_name VARCHAR [, columns VARCHAR])
→ TABLE(row_id, column_name, pii_type, original_value, masked_value, start_pos, end_pos, confidence)Parameters:
table_name: Name of table to auditcolumns: Optional comma-separated column names (audits all VARCHAR if omitted)
Returns:
row_id(BIGINT): Row number in the tablecolumn_name(VARCHAR): Column containing PIIpii_type(VARCHAR): Type of PII detectedoriginal_value(VARCHAR): Original text valuemasked_value(VARCHAR): Masked version of the valuestart_pos(BIGINT): Start position of PII in textend_pos(BIGINT): End position of PII in textconfidence(DOUBLE): Detection confidence (0.0-1.0)
Example:
SELECT * FROM pii_audit_table('customer_data') WHERE pii_type = 'US_SSN';Detect only specific types of PII for targeted scans:
| Function | Alias | Returns |
|---|---|---|
anofox_tab_pii_detect_emails |
pii_detect_emails |
List of EMAIL matches |
anofox_tab_pii_detect_phones |
pii_detect_phones |
List of PHONE matches |
anofox_tab_pii_detect_credit_cards |
pii_detect_credit_cards |
List of CREDIT_CARD matches |
anofox_tab_pii_detect_ssns |
pii_detect_ssns |
List of US_SSN matches |
anofox_tab_pii_detect_names |
pii_detect_names |
List of NAME matches (NER-based) |
anofox_tab_pii_detect_ibans |
pii_detect_ibans |
List of IBAN matches |
Example:
SELECT pii_detect_emails('Contact: john@example.com and jane@example.com');
-- Returns only EMAIL matchesValidate specific PII formats with checksum verification:
| Function | Alias | Description |
|---|---|---|
anofox_tab_pii_is_valid_ssn |
pii_is_valid_ssn |
Validate US SSN format (XXX-XX-XXXX) |
anofox_tab_pii_is_valid_iban |
pii_is_valid_iban |
Validate IBAN with MOD-97 checksum |
anofox_tab_pii_is_valid_credit_card |
pii_is_valid_credit_card |
Validate credit card with Luhn algorithm |
anofox_tab_pii_is_valid_nino |
pii_is_valid_nino |
Validate UK National Insurance Number |
anofox_tab_pii_is_valid_de_tax_id |
pii_is_valid_de_tax_id |
Validate German Tax ID (11 digits + checksum) |
anofox_tab_pii_is_valid_crypto_address |
pii_is_valid_crypto_address |
Validate Bitcoin/Ethereum address |
Example:
SELECT pii_is_valid_credit_card('4111111111111111'); -- true (Luhn valid)
SELECT pii_is_valid_iban('DE89370400440532013000'); -- true (MOD-97 valid)Type-specific masking for a single PII type.
Signature:
pii_mask_column(value VARCHAR, pii_type VARCHAR, strategy VARCHAR) → VARCHARExample:
SELECT pii_mask_column('123-45-6789', 'US_SSN', 'partial');
-- Returns: ***-**-6789Mask all PII in text with optional strategy.
Signature:
pii_redact_column(value VARCHAR [, strategy VARCHAR]) → VARCHARExample:
SELECT pii_redact_column('Email: test@example.com');
-- Returns: Email: [EMAIL]Display current PII configuration settings.
Signature:
pii_config() → TABLE(option, value, description)Example:
SELECT * FROM pii_config();| Option | Type | Default | Description |
|---|---|---|---|
anofox_pii_min_confidence |
DOUBLE | 0.5 | Minimum confidence threshold (0.0-1.0) |
anofox_pii_default_mask_strategy |
VARCHAR | 'redact' | Default masking strategy |
anofox_pii_enabled_types |
VARCHAR | '' | Comma-separated list of enabled PII types (empty = all) |
anofox_pii_deep_validation |
BOOLEAN | false | Enable deep email/phone validation |
Example:
SET anofox_pii_min_confidence = 0.7;
SET anofox_pii_default_mask_strategy = 'partial';Track essential data quality dimensions.
Validate row count against thresholds.
Signature:
anofox_tab_metric_ (alias: metric_volume(table_name VARCHAR [, min_rows BIGINT, max_rows BIGINT]) → TABLEReturns:
TABLE: Row count validation results
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_volume('orders', 1000, 1000000);Check null percentage in a column.
Signature:
anofox_tab_metric_ (alias: metric_null_rate(table_name VARCHAR, column_name VARCHAR [, max_null_rate DOUBLE]) → TABLEReturns:
TABLE: Null rate validation results
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_null_rate('users', 'email', 0.05);Validate cardinality (distinct count) of a column.
Signature:
anofox_tab_metric_ (alias: metric_distinct_count(table_name VARCHAR, column_name VARCHAR [, min BIGINT, max BIGINT]) → TABLEReturns:
TABLE: Distinct count validation results
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_distinct_count('products', 'sku', 100, NULL);Check required columns exist in table.
Signature:
anofox_tab_metric_ (alias: metric_schema(table_name VARCHAR, required_cols LIST<VARCHAR>) → TABLEReturns:
TABLE: Schema validation results
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_schema('table', ['id', 'created_at']);Validate data recency (maximum age check).
Signature:
anofox_tab_metric_ (alias: metric_freshness(table_name VARCHAR, ts_col VARCHAR, max_age INTERVAL [, ref_time TIMESTAMP]) → TABLEReturns:
TABLE: Freshness validation results
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_freshness('events', 'timestamp', INTERVAL '1 hour');Detect outliers via z-score method (assumes normal distribution).
Signature:
anofox_tab_metric_ (alias: metric_zscore(table_name VARCHAR, column_name VARCHAR [, threshold DOUBLE]) → TABLEParameters:
threshold: Z-score threshold (default: 3.0)
Returns:
TABLE: Outlier detection results with z-scores
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_zscore('transactions', 'amount', 3.0);Detect outliers via IQR method (non-parametric, robust to distribution).
Signature:
anofox_tab_metric_ (alias: metric_iqr(table_name VARCHAR, column_name VARCHAR [, multiplier DOUBLE]) → TABLEParameters:
multiplier: IQR multiplier (default: 1.5)
Returns:
TABLE: Outlier detection results with IQR scores
Example:
SELECT * FROM anofox_tab_metric_ (alias: metric_iqr('transactions', 'amount', 1.5);Unsupervised anomaly detection algorithms for finding outliers in data.
Industry-grade anomaly detection with isotree-inspired features:
Core Capabilities:
- Categorical Support - Auto-detect VARCHAR columns with random subset splitting
- Extended IF (ndim) - Hyperplane splits for diagonal/curved anomaly patterns
- Density Scoring - Alternative metric based on points-to-volume ratio
- Sample Weights - Weighted sampling for imbalanced datasets
- SCiForest - Information-gain guided splitting with configurable candidates
Univariate Isolation Forest for single column anomaly detection.
Full Signature:
metric_isolation_forest(
table_name VARCHAR,
column_name VARCHAR,
n_trees BIGINT, -- 1-500, default 100
sample_size BIGINT, -- 1-10000, default 256
contamination DOUBLE, -- 0.0-0.5, default 0.1
output_mode VARCHAR, -- 'summary' or 'scores'
ndim BIGINT, -- 1-N, default 1 (Extended IF)
coef_type VARCHAR, -- 'uniform' or 'normal'
scoring_metric VARCHAR, -- 'depth', 'density', or 'adj_depth'
weight_column VARCHAR, -- Column for sample weights (NULL = uniform)
ntry BIGINT, -- 1-100, default 1 (SCiForest)
prob_pick_avg_gain DOUBLE -- 0.0-1.0, default 0.0
) → TABLEParameters:
| Parameter | Range | Default | Description |
|---|---|---|---|
n_trees |
1-500 | 100 | Number of isolation trees |
sample_size |
1-10000 | 256 | Subsample size per tree |
contamination |
0.0-0.5 | 0.1 | Expected anomaly fraction |
output_mode |
- | 'scores' | 'summary' or 'scores' |
ndim |
1-N | 1 | Hyperplane dimensions (Extended IF) |
coef_type |
- | 'uniform' | 'uniform' or 'normal' |
scoring_metric |
- | 'depth' | 'depth', 'density', 'adj_depth' |
weight_column |
- | NULL | Sample weight column name |
ntry |
1-100 | 1 | Split candidates (SCiForest) |
prob_pick_avg_gain |
0.0-1.0 | 0.0 | Gain-based selection probability |
Returns:
TABLE: Anomaly detection results withrow_id,anomaly_score,is_anomaly
Examples:
-- Basic usage (backward compatible)
SELECT * FROM metric_isolation_forest(
'sales_data', 'amount', 100, 256, 0.1, 'scores'
) WHERE is_anomaly = true;
-- Extended IF with hyperplane splits
SELECT * FROM metric_isolation_forest(
'transactions', 'amount', 100, 256, 0.1, 'scores',
3, -- ndim: 3-dimensional hyperplanes
'normal', -- coef_type: normal distribution
'depth' -- scoring_metric
);
-- SCiForest with gain-based selection
SELECT * FROM metric_isolation_forest(
'events', 'value', 100, 256, 0.05, 'scores',
1, 'uniform', 'depth',
NULL, -- weight_column
10, -- ntry: evaluate 10 candidates
0.5 -- prob_pick_avg_gain: 50% gain-based
);Multivariate Isolation Forest for multiple column anomaly detection.
Full Signature:
metric_isolation_forest_multivariate(
table_name VARCHAR,
columns VARCHAR, -- Comma-separated column names
n_trees BIGINT,
sample_size BIGINT,
contamination DOUBLE,
output_mode VARCHAR,
ndim BIGINT,
coef_type VARCHAR,
scoring_metric VARCHAR,
weight_column VARCHAR,
ntry BIGINT,
prob_pick_avg_gain DOUBLE
) → TABLEParameters:
columns: Comma-separated column names (e.g.,'amount, quantity, suspicious_amount')- All other parameters same as univariate version
Returns:
TABLE: Anomaly detection results withrow_id,anomaly_score,is_anomaly
Examples:
-- Basic multivariate
SELECT * FROM metric_isolation_forest_multivariate(
'customer_events', 'purchase_amount, session_duration, page_views',
100, 256, 0.1, 'scores'
) ORDER BY anomaly_score DESC LIMIT 10;
-- Extended IF with density scoring
SELECT * FROM metric_isolation_forest_multivariate(
'transactions', 'amount, quantity, duration',
100, 256, 0.05, 'scores',
3, -- ndim
'normal', -- coef_type
'density' -- scoring_metric: density-based scoring
);Univariate DBSCAN clustering for single column anomaly detection.
Signature:
anofox_tab_dbscan(
table_name VARCHAR,
column_name VARCHAR,
eps DOUBLE, -- optional, default 0.5
min_pts BIGINT, -- optional, default 5
output_mode VARCHAR -- optional, default 'summary'
) → TABLEParameters:
eps: Neighborhood radius, must be > 0.0 (default: 0.5)min_pts: Minimum points for dense region, must be >= 1 (default: 5)output_mode:'summary'(aggregate stats) or'clusters'(per-row results)
All parameters after column_name are optional.
Returns:
'summary': one row withstatus(failwhen noise points exist),cluster_count,noise_count,total_count,noise_rate,largest_cluster_size,eps,min_pts,message'clusters': one row per non-NULL input row withrow_id,value,cluster_id(-1= noise),point_type,neighbor_count,anomaly_score,is_anomalypoint_type:'CORE'(dense region),'BORDER'(cluster edge), or'NOISE'(outlier)
Example:
SELECT * FROM dbscan(
'transactions', 'amount', 10.0, 5, 'clusters'
) WHERE point_type = 'NOISE';Multivariate DBSCAN clustering for multiple column anomaly detection.
Signature:
anofox_tab_dbscan_mv(
table_name VARCHAR,
columns VARCHAR,
eps DOUBLE, -- optional, default 0.5
min_pts BIGINT, -- optional, default 5
output_mode VARCHAR -- optional, default 'summary'
) → TABLEParameters:
columns: Comma-separated column names
Returns:
'summary': same columns asanofox_tab_dbscanplusn_columns(novaluecolumn in'clusters'mode)'clusters': one row per non-NULL input row withrow_id,cluster_id,point_type,neighbor_count,anomaly_score,is_anomaly
Detects outliers using decision tree conditioning, providing human-readable explanations for why specific values are anomalous based on conditional distributions.
Key Features:
- Detects outliers in context (e.g., "salary is high for a Junior Developer")
- Returns natural language explanations with statistical backing
- Uses robust statistics (median + MAD) resistant to outliers
- Supports both numeric and categorical columns
Explainable outlier detection with conditional distributions.
Signature:
-- Simple (3 params, uses defaults)
outlier_tree(table_name VARCHAR, columns VARCHAR, output_mode VARCHAR) → TABLE
-- Full (9 params)
outlier_tree(
table_name VARCHAR,
columns VARCHAR,
output_mode VARCHAR,
max_depth INTEGER,
max_perc_outliers DOUBLE,
min_size_numeric INTEGER,
min_size_categ INTEGER,
z_norm DOUBLE,
z_outlier DOUBLE
) → TABLEParameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
table_name |
VARCHAR | required | Source table name |
columns |
VARCHAR | required | Comma-separated column names to analyze |
output_mode |
VARCHAR | 'summary' | 'summary' or 'outliers' |
max_depth |
INTEGER | 4 | Maximum tree depth |
max_perc_outliers |
DOUBLE | 0.01 | Max fraction of outliers per cluster |
min_size_numeric |
INTEGER | 25 | Min cluster size for numeric targets |
min_size_categ |
INTEGER | 75 | Min cluster size for categorical targets |
z_norm |
DOUBLE | 2.67 | Z-threshold for confidence intervals |
z_outlier |
DOUBLE | 8.0 | Z-threshold for outlier flagging |
Summary Mode Returns:
| Column | Type | Description |
|---|---|---|
status |
VARCHAR | 'pass' or 'fail' |
total_rows |
BIGINT | Rows analyzed |
outlier_count |
BIGINT | Outliers detected |
columns_analyzed |
INTEGER | Columns analyzed |
clusters_evaluated |
BIGINT | Clusters evaluated |
max_depth_reached |
INTEGER | Max tree depth reached |
message |
VARCHAR | Summary message |
Outliers Mode Returns:
| Column | Type | Description |
|---|---|---|
row_id |
BIGINT | Row index (1-indexed) |
column_name |
VARCHAR | Column with outlier |
outlier_value |
VARCHAR | The anomalous value |
cluster_mean |
DOUBLE | Mean in cluster |
cluster_sd |
DOUBLE | SD in cluster |
cluster_size |
BIGINT | Rows in cluster |
z_score |
DOUBLE | Robust z-score |
lower_bound |
DOUBLE | Lower CI bound |
upper_bound |
DOUBLE | Upper CI bound |
conditions |
VARCHAR | JSON array of split conditions |
explanation |
VARCHAR | Human-readable explanation |
outlier_score |
DOUBLE | Rarity score (lower = more anomalous) |
Examples:
-- Summary mode: quick pass/fail check
SELECT * FROM outlier_tree('employees', 'department,salary', 'summary');
-- Outliers mode: get detailed explanations
SELECT row_id, column_name, outlier_value, explanation
FROM outlier_tree('employees', 'department,salary,years_exp', 'outliers');
-- Returns explanations like:
-- "Value 150000 for column 'salary' is unusually high (expected: 52333 ± 7413)
-- when job_title = 'Junior Developer'"
-- With custom parameters for small datasets
SELECT * FROM outlier_tree(
'test_data', 'category,value', 'outliers',
4, -- max_depth
0.5, -- max_perc_outliers
3, -- min_size_numeric
2, -- min_size_categ
2.67, -- z_norm
3.0 -- z_outlier
);Compare tables and identify changes for migration validation and regression testing.
Detailed row-level diff between two tables or views.
Signature:
anofox_tab_diff_joindiff(
source VARCHAR,
target VARCHAR,
primary_key VARCHAR | LIST<VARCHAR>
[, compare_columns LIST<VARCHAR>]
[, include_all BOOLEAN]
) → TABLEParameters:
source: Source table or view nametarget: Target table or view nameprimary_key: Primary key column(s) for matching rows (single name or list)compare_columns: Optional list of columns to compare (default: all non-key columns present in both tables)include_all: Includeunchangedrows in the output (default:false)
Semantics:
- Primary keys are matched NULL-safely (
IS NOT DISTINCT FROM), so rows with NULL key values are compared across sides instead of being misreported as added/removed. - Primary key and compare columns are validated against both table schemas at bind time; missing columns or tables produce a clear binder error.
Returns:
TABLE(
diff_type VARCHAR, -- 'added', 'removed', 'changed', 'unchanged'
<primary key columns>, -- COALESCE of source/target key values
<target columns> -- remaining target columns (NULL for removed rows)
)Example:
SELECT * FROM diff_joindiff('source_tbl', 'target_tbl', ['user_id', 'date'])
WHERE diff_type IN ('added', 'changed')
LIMIT 100;Row-level diff between two tables or views. Currently computes exactly the
same result as diff_joindiff (without compare_columns/include_all).
Signature:
anofox_tab_diff_hashdiff(
source VARCHAR,
target VARCHAR,
primary_key VARCHAR | LIST<VARCHAR>
) → TABLEParameters:
source: Source table or view nametarget: Target table or view nameprimary_key: Primary key column(s) for matching rows (single name or list)
Note: The bisection_threshold and bisection_factor parameters of the
hash/bisection algorithm are not implemented. Passing them raises a binder
error instead of being silently ignored.
Example:
SELECT * FROM diff_hashdiff('source_tbl', 'target_tbl', ['id']);Generate per-column statistics, pairwise correlations, and table-wide quality summaries. Full support for complex types (LIST, MAP, STRUCT, UNION).
Per-column statistical profile of a table — one output row per column.
Signature:
anofox_tab_profile_table(
table_name VARCHAR
[, cols LIST<VARCHAR>]
[, sample_size BIGINT]
[, exact BOOLEAN]
) → TABLEParameters:
table_name: Table or view namecols: Optional column filter (default: all columns)sample_size: Maximum rows to scan (default: 1,000,000)exact: Disable sampling even for large tables (default:false)
Output Schema (27 columns):
| Column | Type | Description |
|---|---|---|
column_name |
VARCHAR | Column name |
column_type |
VARCHAR | DuckDB column type |
row_count |
BIGINT | Total rows in sample |
null_count |
BIGINT | Number of NULL values |
null_rate |
DOUBLE | Fraction of NULLs |
distinct_count |
BIGINT | Number of distinct non-NULL values |
distinct_rate |
DOUBLE | Distinct count / row count |
min_val |
VARCHAR | Minimum value as string (NULL for STRUCT/MAP/UNION) |
max_val |
VARCHAR | Maximum value as string (NULL for STRUCT/MAP/UNION) |
mean |
DOUBLE | Mean (numeric columns only) |
median |
DOUBLE | Median (numeric columns only) |
stddev |
DOUBLE | Standard deviation (numeric columns only) |
p25 |
DOUBLE | 25th percentile (numeric columns only) |
p75 |
DOUBLE | 75th percentile (numeric columns only) |
skewness |
DOUBLE | Skewness (numeric columns only) |
kurtosis |
DOUBLE | Kurtosis (numeric columns only) |
top_values |
LIST(STRUCT) | Most-frequent values with counts |
avg_length |
DOUBLE | Avg string length / list element count / map cardinality |
min_length |
BIGINT | Min string length / list element count / map cardinality |
max_length |
BIGINT | Max string length / list element count / map cardinality |
pattern_summary |
VARCHAR | Detected pattern: email, phone, uuid, url, ip_address, iso_date, numeric_string, mixed, list, map, struct, union |
is_unique |
BOOLEAN | All non-NULL values are distinct |
is_constant |
BOOLEAN | All non-NULL values are the same |
zero_count |
BIGINT | Number of zero values (numeric columns only) |
negative_count |
BIGINT | Number of negative values (numeric columns only) |
is_sampled |
BOOLEAN | True if a sample was taken |
actual_sample_size |
BIGINT | Number of rows actually scanned |
Pattern Detection Priority: email → phone → uuid → url → ip_address → iso_date → numeric_string → mixed. A pattern is assigned when ≥ 80% of non-NULL values match. Phone detection uses the libphonenumber validator (US region hint).
Example:
-- Profile all columns
SELECT column_name, null_rate, pattern_summary, mean
FROM profile_table('orders');
-- Profile specific columns with a 500K row sample
SELECT * FROM profile_table('orders', ['amount', 'customer_id'], 500000);Single-row table-wide quality overview.
Signature:
anofox_tab_profile_summary(table_name VARCHAR) → TABLEOutput Schema (11 columns, one row):
| Column | Type | Description |
|---|---|---|
row_count |
BIGINT | Total rows in the table |
column_count |
BIGINT | Total columns in the table |
numeric_columns |
BIGINT | Number of numeric columns |
string_columns |
BIGINT | Number of string columns |
temporal_columns |
BIGINT | Number of date/timestamp columns |
boolean_columns |
BIGINT | Number of boolean columns |
complex_columns |
BIGINT | Number of complex-type columns (LIST, MAP, STRUCT, UNION) |
total_nulls |
BIGINT | Total NULL cells across all columns |
total_null_rate |
DOUBLE | Fraction of cells that are NULL |
duplicate_row_count |
BIGINT | Number of duplicate rows |
estimated_memory_bytes |
BIGINT | Estimated in-memory size (row_count × column_count × 8 bytes) |
Example:
SELECT * FROM profile_summary('orders');Pairwise Pearson and Spearman correlations for numeric columns.
Signature:
anofox_tab_profile_correlations(
table_name VARCHAR
[, cols LIST<VARCHAR>]
) → TABLEOutput Schema:
| Column | Type | Description |
|---|---|---|
column_a |
VARCHAR | First column name |
column_b |
VARCHAR | Second column name |
pearson |
DOUBLE | Pearson correlation coefficient (−1 to 1) |
spearman |
DOUBLE | Spearman rank correlation coefficient (−1 to 1) |
n |
BIGINT | Number of non-NULL row pairs used |
Example:
SELECT column_a, column_b, pearson
FROM profile_correlations('orders')
WHERE abs(pearson) > 0.7;Set options via SQL or DuckDB's configuration file.
SET anofox_tab_email_default_validation = 'regex'; -- Default: regex
SET anofox_tab_email_regex_pattern = '<your-pattern>'; -- RFC 5322 inspired
SET anofox_tab_email_dns_timeout_ms = 1000; -- DNS timeout per try (1-5000ms)
SET anofox_tab_email_dns_tries = 1; -- DNS retry count (1-10)
SET anofox_tab_email_smtp_port = 25; -- SMTP port
SET anofox_tab_email_smtp_connect_timeout_ms = 5000; -- TCP connect timeout
SET anofox_tab_email_smtp_read_timeout_ms = 5000; -- Read/write timeout
SET anofox_tab_email_smtp_helo_domain = 'duckdb.local'; -- HELO/EHLO domain
SET anofox_tab_email_smtp_mail_from = 'validator@duckdb.local'; -- MAIL FROM addressSET anofox_tab_postal_data_path = '.duckdb/extensions/libpostal'; -- Data directory
-- Download libpostal data on first use
SELECT anofox_tab_postal_load_data();
-- or using alias:
SELECT postal_load_data();SET anofox_tab_phonenumber_default_region = 'US'; -- Default region codeThe default region must be a supported 2-letter ISO region code; unknown codes are rejected. The value is session-scoped and snapshotted when a query is bound, so changing it mid-query has no effect on running queries.
SET anofox_tab_trace_enabled = true; -- Enable/disable logging
SET anofox_tab_trace_level = 'info'; -- trace|debug|info|warn|error|critical|offAnonymous usage telemetry helps improve the extension. No personal data or query content is collected.
What is collected:
- Extension load events (extension name, version, platform)
- Function execution counts (function name only, no arguments or data)
Disable via environment variable (before loading extension):
export DATAZOO_DISABLE_TELEMETRY=1Disable via SQL (after loading extension):
SET anofox_telemetry_enabled = false;Configure API key (advanced):
SET anofox_telemetry_key = 'your_custom_key';| Category | Count | Function Types |
|---|---|---|
| Email Validation | 3 | Scalar (2), Table (1) |
| Address Parsing | 4 | Scalar (2), Table (2) |
| Phone Numbers | 9 | Scalar (8), Table (1) |
| Money & Currency | 17 | Scalar functions |
| VAT Validation | 10 | Scalar functions |
| PII Detection | 20 | 15 scalar + 5 table functions |
| Data Quality Metrics | 8 | Table functions |
| Anomaly Detection | 5 | Table functions |
| Data Diffing | 2 | Table functions |
| Total | 78 |
| Type | Count | Examples |
|---|---|---|
| Scalar Functions | 60 | anofox_tab_email_is_valid (alias: email_is_valid), anofox_tab_pii_detect (alias: pii_detect), anofox_tab_vat_is_valid (alias: vat_is_valid) |
| Table Functions | 18 | anofox_tab_metric_volume (alias: volume), anofox_tab_metric_isolation_forest (alias: isolation_forest), anofox_tab_outlier_tree (alias: outlier_tree) |
| Module | Functions | Status | Dependencies |
|---|---|---|---|
| Email Validation | 3 | Stable | c-ares (optional DNS), OpenSSL (optional SMTP) |
| Address Parsing | 4 | Stable | libpostal (required) |
| Phone Numbers | 9 | Stable | libphonenumber (embedded) |
| Money & Currency | 17 | Stable | None |
| VAT Validation | 10 | Stable | None |
| PII Detection | 20 | Stable | OpenVINO (for NER), OpenSSL (for hash masking) |
| Data Quality Metrics | 8 | Stable | None |
| Anomaly Detection | 5 | Stable | None |
| Data Diffing | 2 | Stable | None |
-
All validation calculations are performed by native C++ implementations with optional integration to external libraries (libpostal, libphonenumber).
-
Positional parameters only: Functions do NOT support named parameters (
:=syntax). Parameters must be provided in the order specified. -
NULL handling:
- Missing values in input will cause functions to return NULL
- Table functions handle NULLs explicitly in their output
-
Performance:
- Scalar functions: Vectorized execution, optimized for large datasets
- Table functions: SQL-based implementation, efficient for large tables
- Automatic parallelization across CPU cores where applicable
-
External dependencies:
- libpostal: Required for address parsing. Data (~500MB) is downloaded automatically on first use via
anofox_tab_postal_load_data(). - c-ares: Optional for DNS email validation
- OpenSSL: Optional for SMTP email validation
- libphonenumber: Embedded implementation, no external dependencies
- libpostal: Required for address parsing. Data (~500MB) is downloaded automatically on first use via
-
Money struct format: All money functions use
STRUCT(amount DECIMAL(18,3), currency VARCHAR)format. Currency codes must match supported currencies (USD, EUR, GBP, JPY, CAD, AUD, CHF, CNY, INR, BRL). -
VAT country codes: Use ISO 3166-1 alpha-2 country codes (e.g., 'DE', 'FR', 'GB'). The extension supports 29 countries (28 EU + UK).
-
Anomaly detection parameters:
- Isolation Forest: Higher
n_treesandsample_sizeimprove accuracy but increase computation time - DBSCAN:
epsandmin_ptsshould be tuned based on data density - Use
'summary'output mode for large tables to reduce memory usage
- Isolation Forest: Higher
-
Data diffing:
anofox_tab_diff_joindiff(aliasdiff_joindiff): detailed row-level changesanofox_tab_diff_hashdiff(aliasdiff_hashdiff): currently identical todiff_joindiff; the bisection parameters are not implemented and rejected- Both functions support compound primary keys and match keys NULL-safely
Last Updated: 2025-01-XX
API Version: 0.1.0