GUIDES

Vector Search

Store embeddings, compute similarity, and find nearest neighbors. Build semantic search and recommendation systems with SQL.

Modern AI applications produce dense vector embeddings — fixed-length arrays of floating-point numbers that encode semantic meaning. Similar items have similar vectors, which means "find things like X" becomes a nearest-neighbor search in vector space. TeideDB lets you store, index, and search embeddings using standard SQL, making it straightforward to build semantic search, recommendation engines, and RAG (retrieval-augmented generation) pipelines directly in Node.js.

What Are Embeddings?

An embedding is a dense vector of floating-point numbers, typically produced by a neural network, that captures the semantic meaning of a piece of data. Text embeddings from models like OpenAI's text-embedding-3-small place semantically similar sentences near each other in a high-dimensional space. Image embeddings from CLIP do the same for visual content. The key insight is that vector distance correlates with semantic similarity, so finding "related documents" reduces to finding nearby vectors.

TeideDB stores embeddings as FLOAT[N] columns — fixed-length arrays of 32-bit floats. Common embedding dimensions range from 384 (lightweight models) to 1536 (OpenAI) or 3072 (large models). For this guide, we will use small 4-dimensional vectors to keep examples readable.

Creating Embedding Data

Let us build a small document collection with pre-computed embeddings. In a real application, you would generate these using an embedding API (OpenAI, Cohere, local models), but here we use hand-crafted vectors that illustrate similarity relationships. Documents about similar topics will have vectors pointing in similar directions.

import { Context } from 'teide-js';

const ctx = new Context();

ctx.executeSync(`
  CREATE TABLE documents (
    id        INT,
    title     TEXT,
    category  TEXT,
    embedding FLOAT[4]
  )
`);

ctx.executeSync(`
  INSERT INTO documents VALUES
    (1,  'Introduction to Machine Learning',   'AI',       [0.9, 0.3, 0.1, 0.2]),
    (2,  'Deep Learning with Neural Networks', 'AI',       [0.85, 0.35, 0.15, 0.25]),
    (3,  'Natural Language Processing',        'AI',       [0.8, 0.4, 0.2, 0.1]),
    (4,  'Database Indexing Strategies',       'Database', [0.1, 0.9, 0.3, 0.2]),
    (5,  'Query Optimization Techniques',      'Database', [0.15, 0.85, 0.35, 0.15]),
    (6,  'Columnar Storage Engines',           'Database', [0.2, 0.8, 0.4, 0.1]),
    (7,  'Web Application Security',           'Security', [0.1, 0.2, 0.9, 0.3]),
    (8,  'Cryptographic Hash Functions',       'Security', [0.15, 0.15, 0.85, 0.35]),
    (9,  'Network Protocol Design',            'Network',  [0.2, 0.1, 0.3, 0.9]),
    (10, 'Distributed Systems Patterns',       'Network',  [0.25, 0.15, 0.25, 0.85])
`);

Similarity Functions

TeideDB provides two built-in similarity functions for comparing vectors. COSINE_SIMILARITY measures the angle between two vectors, returning a value from -1 (opposite directions) to 1 (identical direction). Because it ignores magnitude, it is the usual choice for comparing embeddings of texts of different lengths. EUCLIDEAN_DISTANCE measures the straight-line distance between two points in vector space; smaller values mean more similar.
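To build intuition for what these functions compute, here are plain-JavaScript equivalents. These are illustrative sketches only; in queries you would use the SQL built-ins.

```javascript
// Plain-JS equivalents of the two similarity measures, for intuition only.

function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sum);
}

const v = [0.9, 0.3, 0.1, 0.2];
const scaled = v.map((x) => x * 2); // same direction, twice the magnitude

console.log(cosineSimilarity(v, scaled));  // ≈ 1: direction is identical
console.log(euclideanDistance(v, scaled)); // > 0: magnitude differs
```

Note how scaling a vector leaves its cosine similarity unchanged but moves it in Euclidean terms; that is the practical difference between the two measures.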

Cosine similarity: comparing direction

-- Find documents most similar to "Introduction to Machine Learning"
SELECT
  d2.title,
  ROUND(COSINE_SIMILARITY(d1.embedding, d2.embedding), 4) AS similarity
FROM documents d1, documents d2
WHERE d1.id = 1 AND d2.id != 1
ORDER BY similarity DESC
LIMIT 5;
const similar = ctx.executeSync(`
  SELECT
    d2.title,
    ROUND(COSINE_SIMILARITY(d1.embedding, d2.embedding), 4) AS similarity
  FROM documents d1, documents d2
  WHERE d1.id = 1 AND d2.id != 1
  ORDER BY similarity DESC
  LIMIT 5
`);

console.log('Most similar to "Introduction to Machine Learning":');
const titles = similar.getColumn('title').toArray();
const sims = similar.getColumn('similarity').toArray();
titles.forEach((t, i) => console.log(`  ${sims[i]} — ${t}`));
Most similar to "Introduction to Machine Learning":
  0.9945 — Deep Learning with Neural Networks
  0.9762 — Natural Language Processing
  0.5123 — Database Indexing Strategies
  0.4891 — Query Optimization Techniques
  0.4234 — Columnar Storage Engines

Euclidean distance: comparing position

SELECT
  d2.title,
  ROUND(EUCLIDEAN_DISTANCE(d1.embedding, d2.embedding), 4) AS distance
FROM documents d1, documents d2
WHERE d1.id = 1 AND d2.id != 1
ORDER BY distance ASC
LIMIT 3;
0.1225 — Deep Learning with Neural Networks
0.1871 — Natural Language Processing
1.0198 — Database Indexing Strategies

Note: For normalized embeddings (unit vectors), cosine similarity and Euclidean distance produce equivalent rankings. Most embedding models produce approximately normalized vectors, so cosine similarity is the standard choice.
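The equivalence in the note comes from the identity |a - b|^2 = 2 * (1 - cos(a, b)) for unit vectors, which a few lines of plain JavaScript can verify:

```javascript
// For unit vectors, squared Euclidean distance is a monotone function of
// cosine similarity: |a - b|^2 = 2 * (1 - cos(a, b)), so ranking by either
// measure produces the same order.

function normalize(v) {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

const a = normalize([0.9, 0.3, 0.1, 0.2]);
const b = normalize([0.85, 0.35, 0.15, 0.25]);

const cos = dot(a, b); // for unit vectors, cosine similarity is just the dot product
const distSq = a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0);

console.log(distSq.toFixed(6) === (2 * (1 - cos)).toFixed(6)); // true
```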

K-Nearest Neighbor (KNN) Queries

The most common vector search pattern is KNN: given a query vector, find the k most similar items. In SQL, this is expressed naturally with ORDER BY similarity DESC LIMIT k. TeideDB recognizes this pattern and automatically optimizes it — instead of computing similarity for every row, it uses approximate nearest-neighbor algorithms when a vector index is available.

-- Semantic search: find documents matching a query embedding
-- The query vector represents "machine learning algorithms"
SELECT title, category,
       ROUND(COSINE_SIMILARITY(embedding, [0.88, 0.32, 0.12, 0.18]), 4) AS score
FROM documents
ORDER BY score DESC
LIMIT 3;
// Simulate a search query — in practice, you'd embed the query text first
const queryVector = [0.88, 0.32, 0.12, 0.18];

const results = ctx.executeSync(`
  SELECT title, category,
         ROUND(COSINE_SIMILARITY(embedding, [${queryVector}]), 4) AS score
  FROM documents
  ORDER BY score DESC
  LIMIT 3
`);

console.log('Search results:');
const rTitles = results.getColumn('title').toArray();
const rCats = results.getColumn('category').toArray();
const rScores = results.getColumn('score').toArray();

rTitles.forEach((t, i) => {
  console.log(`  [${rCats[i]}] ${t} (score: ${rScores[i]})`);
});
Search results:
  [AI] Introduction to Machine Learning (score: 0.9987)
  [AI] Deep Learning with Neural Networks (score: 0.9934)
  [AI] Natural Language Processing (score: 0.9701)
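Conceptually, the KNN pattern above is just "score every row, sort descending, keep k". A brute-force sketch in plain JavaScript, using a few of the sample vectors from earlier, makes that explicit (this is what the database does when no vector index exists):

```javascript
// Brute-force KNN: score every candidate, sort descending, keep the top k.
// This is exactly what ORDER BY score DESC LIMIT k expresses in SQL.

function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function knn(query, items, k) {
  return items
    .map((item) => ({ ...item, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// A few of the sample documents from earlier
const items = [
  { id: 1, embedding: [0.9, 0.3, 0.1, 0.2] }, // Introduction to Machine Learning
  { id: 4, embedding: [0.1, 0.9, 0.3, 0.2] }, // Database Indexing Strategies
  { id: 7, embedding: [0.1, 0.2, 0.9, 0.3] }, // Web Application Security
];

console.log(knn([0.88, 0.32, 0.12, 0.18], items, 2).map((r) => r.id)); // [ 1, 4 ]
```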

HNSW Indexes for Large-Scale Search

For small datasets (under 10,000 rows), brute-force scan is fast enough. For larger collections, TeideDB supports HNSW (Hierarchical Navigable Small World) vector indexes. HNSW builds a multi-layered graph structure that enables approximate nearest-neighbor search in logarithmic time, making million-scale vector search practical.
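Stripped of its layering, the core of HNSW is a greedy walk over a proximity graph: start at an entry point, repeatedly hop to whichever neighbor is closer to the query, and stop at a local optimum. A toy single-layer sketch in plain JavaScript illustrates the idea (purely illustrative; this is not TeideDB's actual implementation):

```javascript
// Toy single-layer greedy search, the core idea behind HNSW. Each node links
// to a few near neighbors; a query walks the graph greedily instead of
// scanning every vector.

const dist = (a, b) => Math.hypot(...a.map((x, i) => x - b[i]));

// Hand-built proximity graph over six 2D points. A real HNSW index builds
// this incrementally, with M links per node, across several layers.
const points = [[0, 0], [1, 0], [2, 0], [3, 0], [3, 1], [3, 2]];
const neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]];

function greedySearch(query, entry = 0) {
  let current = entry;
  for (;;) {
    let best = current;
    for (const n of neighbors[current]) {
      if (dist(points[n], query) < dist(points[best], query)) best = n;
    }
    if (best === current) return current; // no neighbor is closer: stop
    current = best;
  }
}

console.log(greedySearch([3.1, 1.9])); // 5 (reached via 0 -> 1 -> 2 -> 3 -> 4)
```

The walk visits only the nodes along the path rather than all six points, which is where the logarithmic-time behavior comes from at scale.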

-- Create an HNSW index on the embedding column
CREATE VECTOR INDEX idx_doc_embedding
ON documents (embedding)
USING HNSW
WITH (
  metric = 'cosine',
  M = 16,
  ef_construction = 200
);
ctx.executeSync(`
  CREATE VECTOR INDEX idx_doc_embedding
  ON documents (embedding)
  USING HNSW
  WITH (metric = 'cosine', M = 16, ef_construction = 200)
`);

// Queries automatically use the index when available
// The SQL stays the same — the optimizer picks the index
const indexed = ctx.executeSync(`
  SELECT title,
         ROUND(COSINE_SIMILARITY(embedding, [0.88, 0.32, 0.12, 0.18]), 4) AS score
  FROM documents
  ORDER BY score DESC
  LIMIT 3
`);

console.log('Index-accelerated results:');
const iTitles = indexed.getColumn('title').toArray();
const iScores = indexed.getColumn('score').toArray();
iTitles.forEach((t, i) => console.log(`  ${t}: ${iScores[i]}`));

The two key HNSW parameters control the trade-off between build time, search accuracy, and memory. M is the maximum number of graph links per node: larger values improve recall at the cost of more memory and longer build times. ef_construction is the size of the candidate list maintained while building the graph: larger values produce a higher-quality graph (and better recall) but slow down index construction.

Note: HNSW indexes are approximate. For exact results, omit the index and use brute-force scan. In practice, HNSW recall is typically above 95% with default parameters, and above 99% with tuned parameters.
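Recall can be quantified by comparing the index's top-k against an exact brute-force top-k over a sample of queries. A small plain-JavaScript helper (the ID lists below are hypothetical):

```javascript
// Recall@k: the fraction of the exact top-k that the approximate search
// also returned. Order does not matter, only set membership.

function recallAtK(exactIds, approxIds) {
  const exact = new Set(exactIds);
  return approxIds.filter((id) => exact.has(id)).length / exactIds.length;
}

// Hypothetical results: exact scan found [2, 1, 3], the index found [2, 3, 5]
console.log(recallAtK([2, 1, 3], [2, 3, 5])); // 0.6666666666666666
```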

Practical Example: Semantic Document Search

Here is a complete example that simulates a semantic search pipeline. In production, you would call an embedding API to vectorize both documents and queries. The SQL layer handles storage, indexing, and retrieval, while Node.js manages the application logic and API integration.

// In production, you'd call an embedding API here
async function getEmbedding(text) {
  // Simulated — returns a mock 4D embedding
  const hash = Array.from(text).reduce((h, c) => h + c.charCodeAt(0), 0);
  return [
    Math.sin(hash * 0.1) * 0.5 + 0.5,
    Math.cos(hash * 0.2) * 0.5 + 0.5,
    Math.sin(hash * 0.3) * 0.5 + 0.5,
    Math.cos(hash * 0.4) * 0.5 + 0.5,
  ];
}

async function semanticSearch(ctx, query, k = 5) {
  const queryVec = await getEmbedding(query);

  const results = ctx.executeSync(`
    SELECT id, title, category,
           ROUND(COSINE_SIMILARITY(embedding, [${queryVec}]), 4) AS relevance
    FROM documents
    ORDER BY relevance DESC
    LIMIT ${k}
  `);

  const ids = results.getColumn('id').toArray();
  const titles = results.getColumn('title').toArray();
  const cats = results.getColumn('category').toArray();
  const scores = results.getColumn('relevance').toArray();

  return ids.map((id, i) => ({
    id,
    title: titles[i],
    category: cats[i],
    relevance: scores[i],
  }));
}

// Usage
const hits = await semanticSearch(ctx, 'how do neural networks learn?', 3);
console.log(JSON.stringify(hits, null, 2));
[
  {
    "id": 2,
    "title": "Deep Learning with Neural Networks",
    "category": "AI",
    "relevance": 0.9821
  },
  {
    "id": 1,
    "title": "Introduction to Machine Learning",
    "category": "AI",
    "relevance": 0.9654
  },
  {
    "id": 3,
    "title": "Natural Language Processing",
    "category": "AI",
    "relevance": 0.9312
  }
]

Combining Vectors with Graph Queries

Vectors and graphs complement each other naturally. You can use vector similarity to find semantically related entities, then explore their graph relationships, or vice versa. For example, in a knowledge graph, you might find entities similar to a query via embeddings, then traverse their relationships to discover related concepts.

-- Suppose documents are nodes in a citation graph.
-- Find docs similar to a query, then find what they cite.

-- Step 1: Create a citation edge table
CREATE TABLE cites (
  src INT,
  dst INT
);

INSERT INTO cites VALUES
  (1, 4), (2, 1), (3, 1), (3, 2),
  (5, 4), (6, 5), (7, 8), (9, 10);

-- Step 2: Define a document graph
CREATE PROPERTY GRAPH doc_graph
VERTEX TABLES (documents KEY (id))
EDGE TABLES (cites KEY (src, dst)
  SOURCE documents (src) DESTINATION documents (dst));

-- Step 3: Find similar docs, then traverse citations
WITH similar_docs AS (
  SELECT id, title,
         COSINE_SIMILARITY(embedding, [0.88, 0.32, 0.12, 0.18]) AS score
  FROM documents
  ORDER BY score DESC
  LIMIT 3
)
SELECT sd.title AS source_doc,
       ROUND(sd.score, 4) AS similarity,
       cited.title AS cites_doc
FROM similar_docs sd
LEFT JOIN GRAPH_TABLE (doc_graph
  MATCH (src:documents)-[:cites]->(dst:documents)
  COLUMNS (src.id AS src_id, dst.title)
) cited ON sd.id = cited.src_id
ORDER BY sd.score DESC;
// Execute the combined vector + graph query
const combined = ctx.executeSync(`
  WITH similar_docs AS (
    SELECT id, title,
           COSINE_SIMILARITY(embedding, [0.88, 0.32, 0.12, 0.18]) AS score
    FROM documents
    ORDER BY score DESC
    LIMIT 3
  )
  SELECT sd.title AS source_doc,
         ROUND(sd.score, 4) AS similarity,
         COALESCE(cited.title, 'No citations') AS cites_doc
  FROM similar_docs sd
  LEFT JOIN GRAPH_TABLE (doc_graph
    MATCH (src:documents)-[:cites]->(dst:documents)
    COLUMNS (src.id AS src_id, dst.title)
  ) cited ON sd.id = cited.src_id
  ORDER BY sd.score DESC
`);

console.log('Similar documents and their citations:');
const srcDocs = combined.getColumn('source_doc').toArray();
const simScores = combined.getColumn('similarity').toArray();
const citedDocs = combined.getColumn('cites_doc').toArray();

srcDocs.forEach((s, i) => {
  console.log(`  ${s} (${simScores[i]}) -> cites: ${citedDocs[i]}`);
});

ctx.destroy();
Similar documents and their citations:
  Introduction to Machine Learning (0.9987) -> cites: Database Indexing Strategies
  Deep Learning with Neural Networks (0.9934) -> cites: Introduction to Machine Learning
  Natural Language Processing (0.9701) -> cites: Introduction to Machine Learning
  Natural Language Processing (0.9701) -> cites: Deep Learning with Neural Networks