From CSV to Insights
A practical walkthrough: load a CSV file, explore the data, filter rows, aggregate results, and extract insights — all from Node.js.
This guide takes you through a realistic data analysis workflow using TeideDB's Node.js bindings. We will start with a raw CSV file containing sales records, then progressively build up queries to filter, aggregate, rank, and export results. By the end, you will have a complete script that transforms raw data into actionable business insights.
Loading a CSV File
TeideDB can ingest CSV files directly into in-memory columnar tables using the read_csv function. This parses the file, infers column types, and stores the data in a compressed columnar format optimized for analytical queries. The entire operation runs on the dedicated Teide thread, so the V8 event loop stays responsive even for large files.
Assume we have a file called sales.csv with columns: id, date, category, product, price, quantity, and region.
import { Context } from 'teide-js';
const ctx = new Context();
// Load the CSV into a table named "sales"
ctx.executeSync(`
CREATE TABLE sales AS
SELECT * FROM read_csv('sales.csv')
`);
console.log('Table loaded successfully.');
Note: The file path passed to read_csv is resolved relative to the current working directory of the Node.js process, not relative to the script file. Use an absolute path if your script may be invoked from different directories.
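The type inference that read_csv performs happens inside the engine, but the idea is easy to sketch in plain JavaScript. The function below is an illustration of the general technique, not TeideDB's actual algorithm: scan a column's values and settle on the narrowest type that fits every one of them.

```javascript
// Illustrative type inference (not TeideDB's actual algorithm): start with the
// narrowest candidate type and widen whenever a value does not fit it.
function inferColumnType(values) {
  let type = 'integer';
  for (const raw of values) {
    const v = raw.trim();
    if (v === '') continue; // treat empty cells as nulls
    if (type === 'integer' && !/^-?\d+$/.test(v)) type = 'float';
    if (type === 'float' && Number.isNaN(Number(v))) type = 'string';
    if (type === 'string') break; // cannot widen any further
  }
  return type;
}
```

Running it on a sample column such as inferColumnType(['19.99', '5', '12.50']) yields 'float': one value rules out integer, but every value still parses as a number.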
Exploring the Data
Before diving into analysis, it is good practice to understand the shape of your data. How many rows does the table contain? What do the first few records look like? TeideDB supports standard SQL for exploration, so you can use familiar queries to get oriented. The executeSync method returns a Table object that you can inspect column by column.
-- Count total rows
SELECT COUNT(*) AS total_rows FROM sales;
-- Preview first 5 rows
SELECT * FROM sales LIMIT 5;
// Row count
const countResult = ctx.executeSync('SELECT COUNT(*) AS total_rows FROM sales');
const totalRows = countResult.getColumn('total_rows').toArray()[0];
console.log(`Total rows: ${totalRows}`);
// Preview
const preview = ctx.executeSync('SELECT * FROM sales LIMIT 5');
for (const name of preview.columnNames()) {
console.log(` ${name}: ${preview.getColumn(name).toArray()}`);
}
Filtering Rows
With a feel for the data, let us narrow down to a specific segment. Suppose we want to focus on premium electronics — items in the "Electronics" category priced above $100. SQL's WHERE clause handles this naturally. TeideDB pushes predicates down to the columnar scan, so only matching rows are materialized, keeping memory usage low even on large datasets.
SELECT product, price, quantity, price * quantity AS line_total
FROM sales
WHERE category = 'Electronics'
AND price > 100
ORDER BY line_total DESC
LIMIT 10;
const electronics = ctx.executeSync(`
SELECT product, price, quantity, price * quantity AS line_total
FROM sales
WHERE category = 'Electronics'
AND price > 100
ORDER BY line_total DESC
LIMIT 10
`);
const products = electronics.getColumn('product').toArray();
const totals = electronics.getColumn('line_total').toArray();
products.forEach((p, i) => {
console.log(` ${p}: $${totals[i].toFixed(2)}`);
});
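The predicate pushdown mentioned above is easy to picture with a plain-JavaScript sketch (plain arrays stand in for TeideDB's columnar buffers; this illustrates the technique, not the engine's actual code): scan only the filter columns, record the indices of matching rows, and materialize nothing else.

```javascript
// Columnar filter sketch: scan the filter columns once and collect matching
// row indices; only those indices are then used to read the projected columns.
function filterIndices(category, price) {
  const matches = [];
  for (let i = 0; i < category.length; i++) {
    if (category[i] === 'Electronics' && price[i] > 100) matches.push(i);
  }
  return matches;
}
```

For example, filterIndices(['Electronics', 'Toys', 'Electronics'], [250, 30, 80]) returns [0]: only the first row satisfies both predicates, so only that row would ever be materialized.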
Aggregating Results
Individual rows tell a story, but aggregate statistics reveal the bigger picture. Let us compute summary metrics for each product category: total revenue, average price, and number of transactions. The GROUP BY clause partitions rows by category, and aggregate functions (SUM, AVG, COUNT) collapse each partition into a single summary row. TeideDB evaluates aggregates in a single pass over the columnar data using vectorized operations.
SELECT
category,
COUNT(*) AS num_transactions,
SUM(price * quantity) AS total_revenue,
AVG(price) AS avg_price
FROM sales
GROUP BY category
ORDER BY total_revenue DESC;
const summary = ctx.executeSync(`
SELECT
category,
COUNT(*) AS num_transactions,
SUM(price * quantity) AS total_revenue,
AVG(price) AS avg_price
FROM sales
GROUP BY category
ORDER BY total_revenue DESC
`);
const cats = summary.getColumn('category').toArray();
const revenue = summary.getColumn('total_revenue').toArray();
const avgPrice = summary.getColumn('avg_price').toArray();
const txns = summary.getColumn('num_transactions').toArray();
cats.forEach((cat, i) => {
console.log(
` ${cat}: $${revenue[i].toFixed(2)} revenue, ` +
`${txns[i]} transactions, $${avgPrice[i].toFixed(2)} avg price`
);
});
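The single-pass evaluation can likewise be pictured procedurally. In this illustrative sketch (plain arrays in place of columnar buffers), one loop walks the rows while a Map of per-category accumulators plays the role of the aggregate functions; the averages are derived from the running sums at the end.

```javascript
// Single-pass GROUP BY sketch: one scan over the columns, one Map of accumulators.
function summarize(category, price, quantity) {
  const groups = new Map();
  for (let i = 0; i < category.length; i++) {
    let g = groups.get(category[i]);
    if (!g) {
      g = { count: 0, revenue: 0, priceSum: 0 };
      groups.set(category[i], g);
    }
    g.count += 1;
    g.revenue += price[i] * quantity[i];
    g.priceSum += price[i];
  }
  // Derive AVG(price) from the running sums, then mirror ORDER BY ... DESC.
  return [...groups]
    .map(([cat, g]) => ({
      category: cat,
      num_transactions: g.count,
      total_revenue: g.revenue,
      avg_price: g.priceSum / g.count,
    }))
    .sort((a, b) => b.total_revenue - a.total_revenue);
}
```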
Finding Top Products
To surface the most impactful products, we can combine aggregation with sorting and limiting. This is a common pattern for "top-N" reports: group, compute a metric, sort descending, and take the first few rows. The query below finds the five products generating the most total revenue across all categories.
SELECT
product,
category,
SUM(price * quantity) AS total_revenue,
SUM(quantity) AS units_sold
FROM sales
GROUP BY product, category
ORDER BY total_revenue DESC
LIMIT 5;
const topProducts = ctx.executeSync(`
SELECT
product,
category,
SUM(price * quantity) AS total_revenue,
SUM(quantity) AS units_sold
FROM sales
GROUP BY product, category
ORDER BY total_revenue DESC
LIMIT 5
`);
console.log('Top 5 products by revenue:');
const names = topProducts.getColumn('product').toArray();
const rev = topProducts.getColumn('total_revenue').toArray();
const units = topProducts.getColumn('units_sold').toArray();
names.forEach((name, i) => {
console.log(` ${i + 1}. ${name} — $${rev[i].toFixed(2)} (${units[i]} units)`);
});
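Stripped of SQL, the top-N pattern is just sort-and-slice. An illustrative plain-JavaScript version over precomputed per-product totals (the sample data here is hypothetical):

```javascript
// Top-N sketch: sort per-product totals descending by revenue and keep n rows.
function topN(totals, n) {
  return Object.entries(totals)
    .sort(([, a], [, b]) => b - a)
    .slice(0, n)
    .map(([product, revenue]) => ({ product, revenue }));
}
```

For instance, topN({ Laptop: 5000, Mouse: 300, Monitor: 1200 }, 2) keeps Laptop and Monitor, in that order.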
Combining Queries for Deeper Analysis
Real analysis often requires combining multiple operations. Subqueries let you use the result of one query as input to another. In this example, we find each region's best-selling category by first computing per-region, per-category revenue, then filtering to only the top category in each region using a correlated subquery. This kind of nested analysis is where SQL really shines compared to procedural code.
SELECT r.region, r.category, r.total_revenue
FROM (
SELECT
region,
category,
SUM(price * quantity) AS total_revenue
FROM sales
GROUP BY region, category
) r
WHERE r.total_revenue = (
SELECT MAX(sub.total_revenue)
FROM (
SELECT region, category, SUM(price * quantity) AS total_revenue
FROM sales
GROUP BY region, category
) sub
WHERE sub.region = r.region
)
ORDER BY r.total_revenue DESC;
const regionalBest = ctx.executeSync(`
SELECT r.region, r.category, r.total_revenue
FROM (
SELECT region, category, SUM(price * quantity) AS total_revenue
FROM sales
GROUP BY region, category
) r
WHERE r.total_revenue = (
SELECT MAX(sub.total_revenue)
FROM (
SELECT region, category, SUM(price * quantity) AS total_revenue
FROM sales
GROUP BY region, category
) sub
WHERE sub.region = r.region
)
ORDER BY r.total_revenue DESC
`);
console.log('Top category per region:');
const regions = regionalBest.getColumn('region').toArray();
const bestCats = regionalBest.getColumn('category').toArray();
const bestRev = regionalBest.getColumn('total_revenue').toArray();
regions.forEach((region, i) => {
console.log(` ${region}: ${bestCats[i]} ($${bestRev[i].toFixed(2)})`);
});
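To sanity-check the SQL against a small sample, the same logic can be written procedurally: one pass over the per-region, per-category aggregates (the inner query's output), keeping the best category seen so far for each region. This sketch assumes those aggregates are already available as plain row objects:

```javascript
// Procedural equivalent of the correlated subquery: best category per region.
function bestCategoryPerRegion(rows) {
  const best = new Map(); // region -> { category, revenue }
  for (const { region, category, revenue } of rows) {
    const current = best.get(region);
    if (!current || revenue > current.revenue) {
      best.set(region, { category, revenue });
    }
  }
  return best;
}
```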
Exporting Results
Once you have computed your insights, you will typically want to export the data for use elsewhere — as JSON for an API response, as CSV for a spreadsheet, or as typed arrays for further numerical processing. TeideDB's zero-copy Series accessors give you direct access to the underlying columnar buffers as JavaScript TypedArrays, making export both fast and memory-efficient.
// Export as JSON array
const result = ctx.executeSync(`
SELECT category, SUM(price * quantity) AS revenue
FROM sales
GROUP BY category
ORDER BY revenue DESC
`);
const columns = result.columnNames();
// Materialize each column once up front; calling toArray() per cell inside
// the loop would convert the same column over and over again.
const data = columns.map((col) => result.getColumn(col).toArray());
const numRows = result.getColumn(columns[0]).length;
const rows = [];
for (let i = 0; i < numRows; i++) {
const row = {};
columns.forEach((col, j) => {
row[col] = data[j][i];
});
rows.push(row);
}
console.log(JSON.stringify(rows, null, 2));
// Clean up when done
ctx.destroy();
Note: Always call ctx.destroy() when you are finished to release the native heap and shut down the Teide background thread. In TypeScript, a using declaration (using ctx = new Context()) gives you automatic cleanup via Symbol.dispose.
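For the CSV export mentioned at the start of this section, the row objects built above can be serialized with a small helper. This is a minimal sketch that handles commas, quotes, and newlines inside values; for anything beyond simple exports, a dedicated CSV library is the safer choice.

```javascript
// Minimal CSV serializer for an array of flat row objects.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const cols = Object.keys(rows[0]);
  // Quote a field only when it contains a comma, quote, or newline.
  const escape = (v) => {
    const s = String(v);
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const lines = [cols.join(',')];
  for (const row of rows) {
    lines.push(cols.map((c) => escape(row[c])).join(','));
  }
  return lines.join('\n');
}
```

Calling toCsv(rows) on the export above produces a category,revenue header followed by one line per category, ready to write to disk or stream in an HTTP response.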