Why store document files in Google Cloud Storage instead of the database?

Storing binary files in a relational database like PostgreSQL inflates table sizes and degrades query performance. The post describes keeping only the GCS storage path (key) in PostgreSQL while the actual files live in Google Cloud Storage, organized by entity type, entity ID, document type, version, and filename. Access is controlled through signed URLs with configurable TTLs — 1 hour for in-app viewing and 24 hours for email download links — so files are never publicly accessible even if the path is known.

Why implement document versioning from day one?

The post calls out versioning as the single most common post-go-live request: users frequently upload the wrong file and need to replace it without losing the previous copy. The system creates a new version row for each replacement and treats the highest version number as the active document, retaining all prior versions indefinitely. Since implementing versioning, the team at Commsult has never had to perform a data recovery because of a bad document upload.

How does OCR full-text search work across uploaded PDFs?

Raw PDF files are not natively searchable, so the system runs Google Cloud Vision API OCR on every uploaded PDF or image and stores the extracted text as a PostgreSQL tsvector in a dedicated document_search table. Queries are matched with PostgreSQL's tsquery type (built with to_tsquery), which the post says handles both Indonesian and English text. The post notes that an auditor's search across all invoices from a specific vendor within a date range returns results with highlighted excerpts in under a second.

How does the DMS handle Indonesian regulatory compliance document expiry?

Indonesian regulations require tracking certificates with hard expiry dates — PKP certificates, SIUP, API import licenses, SNI certifications, ISO certificates, and staff qualification certificates. The system stores an optional expires_at field per document and sends automated email alerts at 60, 30, and 7 days before expiry. The procurement module also blocks new PO creation for any vendor whose PKP certificate has lapsed, replacing a manual calendar-reminder process that frequently missed expirations.

What storage cost controls are in place for a growing document archive?

The post warns that a company archiving invoices, contracts, photos, and reports for five years can easily accumulate 500 GB to 1 TB, with Google Cloud Storage in the asia-southeast2 (Jakarta) region costing roughly $20 per month per terabyte plus egress fees. To control costs, documents older than one year are moved to Coldline storage, which is several times cheaper per GB for rarely accessed archives but adds retrieval fees and a 90-day minimum storage duration. Per-document upload size is capped at 50 MB, and per-entity storage is capped at 500 MB to prevent accidental video uploads.

Document Management in a Custom ERP: From File Uploads to Searchable Archives

According to Deloitte, 79% of business leaders report that their team's productivity is hindered by disconnection between systems. Nowhere is this truer than document management: invoices emailed as PDFs, contracts stored on someone's hard drive, technical drawings scattered across Google Drive folders. When documents live outside the ERP, they can't be linked to transactions, can't trigger workflows, and can't be found quickly during an audit. In my freelance ERP consulting work, I built an integrated document management module that links documents directly to ERP entities — contracts to vendors, invoices to AP entries, technical drawings to product records. This post covers the architecture.

Document Model and Metadata

Every document in the DMS has: document_id, entity_type (vendor, product, invoice, project, employee), entity_id (the linked ERP record), document_type (contract, invoice, certificate, drawing, photo), filename, storage_path (Google Cloud Storage key), file_size, mime_type, version (integer, starting at 1), uploaded_by, uploaded_at, and an optional expires_at (for compliance documents with expiry dates). This metadata enables filtering documents by type and entity, expiry tracking for compliance, version history, and attribution of every upload to a user.
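As a rough illustration, the metadata above could be typed as follows. This is a sketch, not the actual schema: the camelCase property names, the DocumentRecord interface, and the documentsFor helper are assumptions introduced here.

```typescript
// Hypothetical TypeScript shape for one row of document metadata.
type EntityType = 'vendor' | 'product' | 'invoice' | 'project' | 'employee';
type DocumentType = 'contract' | 'invoice' | 'certificate' | 'drawing' | 'photo';

interface DocumentRecord {
  documentId: string;
  entityType: EntityType;
  entityId: string;        // the linked ERP record
  documentType: DocumentType;
  filename: string;
  storagePath: string;     // GCS object key, never the file bytes
  fileSize: number;        // bytes
  mimeType: string;
  version: number;         // integer, starting at 1
  uploadedBy: string;
  uploadedAt: Date;
  expiresAt?: Date;        // only set for compliance documents
}

// Filter by linked entity and, optionally, by document type.
function documentsFor(
  docs: DocumentRecord[],
  entityType: EntityType,
  entityId: string,
  documentType?: DocumentType,
): DocumentRecord[] {
  return docs.filter(
    (d) =>
      d.entityType === entityType &&
      d.entityId === entityId &&
      (documentType === undefined || d.documentType === documentType),
  );
}
```

Keeping expiresAt optional mirrors the rule that only compliance documents carry an expiry date.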

File Storage with Google Cloud Storage

All document files are stored in Google Cloud Storage, not in the database. I store only the GCS key (path) in PostgreSQL. Files are organized by folder structure: [entity_type]/[entity_id]/[document_type]/[version]/[filename]. Access is via GCS signed URLs with configurable TTL: 1 hour for in-app viewing, 24 hours for download links sent via email. The bucket itself stays private and the ERP issues a signed URL only after authenticating the user, so files are not publicly accessible even if someone knows the storage path. A signed URL works for anyone who holds it until it expires, which is why the TTLs are kept short.
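The path layout and the two TTLs can be sketched as plain functions. This is a minimal sketch under stated assumptions: buildStoragePath, signedUrlOptions, and the filename sanitizing step are hypothetical helpers, not code from the actual module.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// TTLs from the text: 1 hour for in-app viewing, 24 hours for emailed links.
const SIGNED_URL_TTL_MS = {
  view: 1 * HOUR_MS,
  emailDownload: 24 * HOUR_MS,
} as const;
type UrlPurpose = keyof typeof SIGNED_URL_TTL_MS;

// Builds [entity_type]/[entity_id]/[document_type]/[version]/[filename].
function buildStoragePath(
  entityType: string,
  entityId: string,
  documentType: string,
  version: number,
  filename: string,
): string {
  // Assumption: keep only the base name so a crafted filename
  // cannot inject extra path segments.
  const safeName = filename.split(/[\\/]/).pop() || 'unnamed';
  return [entityType, entityId, documentType, `v${version}`, safeName].join('/');
}

// Options in the shape accepted by getSignedUrl in @google-cloud/storage.
function signedUrlOptions(purpose: UrlPurpose, now: number = Date.now()) {
  return {
    version: 'v4' as const,
    action: 'read' as const,
    expires: now + SIGNED_URL_TTL_MS[purpose],
  };
}
```

With the official Node.js client, the options object would be passed to bucket.file(path).getSignedUrl(options) after the ERP has authenticated the caller.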

ERP Document Management Architecture

  ERP Entity (vendor, invoice, project, employee)
       │  linked via entity_type + entity_id
       ▼
  ┌─────────────────────────────────────────────────────┐
  │            documents table (PostgreSQL)              │
  │                                                     │
  │  id           │ UUID                                │
  │  entity_type  │ 'vendor' | 'invoice' | 'project'   │
  │  entity_id    │ UUID (FK to entity)                 │
  │  document_type│ 'contract' | 'invoice' | 'cert'     │
  │  storage_path │ GCS key (not the file itself)       │
  │  version      │ INT (increments on replacement)     │
  │  expires_at   │ TIMESTAMPTZ (nullable)              │
  │  text_content │ tsvector (for full-text search)     │
  └──────────────────────┬──────────────────────────────┘
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐
  │  GCS Storage │  │  OCR Extract │  │  Expiry Monitor  │
  │  (files)     │  │  Vision API  │  │  → alerts 60/30/ │
  │  Signed URLs │  │  → tsvector  │  │    7 days before │
  └──────────────┘  └──────────────┘  └──────────────────┘

  Full-text search:
  SELECT * FROM documents
  WHERE text_content @@ to_tsquery('english', 'Sumber & Makmur')
  AND uploaded_at BETWEEN '2025-01-01' AND '2025-03-31';

From my experience building custom ERP systems for clients: implement document versioning from day one. The most common document management request after go-live is 'I uploaded the wrong version of the contract — can I replace it without losing the old one?' If you don't have versioning, the answer is ugly. My system creates a new version row for each replacement, retaining all previous versions. The 'active' version is the highest version number. I've never had to do a data recovery because of a bad document upload since implementing versioning.
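The versioning rule (a new row per replacement, highest version number is active) is small enough to sketch directly. The VersionRow type and both helper names are assumptions for illustration.

```typescript
// Hypothetical subset of a documents row, limited to what versioning needs.
interface VersionRow {
  id: string;
  version: number;
  storagePath: string;
}

// Next version for a (entity, document type) group: 1 when nothing exists yet.
function nextVersion(rows: VersionRow[]): number {
  return rows.reduce((max, r) => Math.max(max, r.version), 0) + 1;
}

// The active document is simply the row with the highest version number.
function activeVersion(rows: VersionRow[]): VersionRow | undefined {
  return rows.reduce<VersionRow | undefined>(
    (best, r) => (best === undefined || r.version > best.version ? r : best),
    undefined,
  );
}
```

Because old rows are never updated or deleted, rolling back is a matter of reading an earlier row. One design consideration, offered as an assumption rather than something the system is known to do: a unique constraint on (entity_type, entity_id, document_type, version) would stop two concurrent uploads from claiming the same version number.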

Document Workflow Integration

Documents should trigger and participate in workflows. When a vendor uploads an NPWP certificate via the vendor portal, the document appears in the ERP procurement team's review queue. When an AP invoice is approved, the attached PDF invoice automatically archives to the vendor's document folder. When a project contract is uploaded (document type CONTRACT), the system extracts the contract value and populates the project budget. These workflow integrations turn the DMS from a filing cabinet into an active participant in business processes.
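One way to picture these integrations is a pure function that maps a document event to the follow-up actions it should trigger. The event names, action names, and the actionsFor function are all hypothetical; the actual module may wire this up differently, for example through NestJS event listeners.

```typescript
// Hypothetical events raised by the DMS.
type DocumentEvent =
  | { kind: 'vendor_certificate_uploaded'; documentId: string; vendorId: string }
  | { kind: 'ap_invoice_approved'; documentId: string; vendorId: string }
  | { kind: 'project_contract_uploaded'; documentId: string; projectId: string; contractValue?: number };

// Hypothetical follow-up actions for other ERP modules to execute.
type WorkflowAction =
  | { action: 'enqueue_procurement_review'; documentId: string }
  | { action: 'archive_to_vendor_folder'; documentId: string; vendorId: string }
  | { action: 'set_project_budget'; projectId: string; amount: number };

function actionsFor(event: DocumentEvent): WorkflowAction[] {
  switch (event.kind) {
    case 'vendor_certificate_uploaded':
      return [{ action: 'enqueue_procurement_review', documentId: event.documentId }];
    case 'ap_invoice_approved':
      return [{ action: 'archive_to_vendor_folder', documentId: event.documentId, vendorId: event.vendorId }];
    case 'project_contract_uploaded':
      // Only populate the budget when a contract value was actually extracted.
      return event.contractValue === undefined
        ? []
        : [{ action: 'set_project_budget', projectId: event.projectId, amount: event.contractValue }];
  }
}
```

Keeping the mapping free of side effects makes each rule easy to unit test before it is connected to queues or other modules.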

OCR and Full-Text Search

Raw PDF files are not searchable unless you extract their text. I use Google Cloud Vision API for OCR on uploaded PDFs and images: the text content is extracted and stored in a document_search table as a PostgreSQL tsvector. Full-text search uses PostgreSQL's tsquery type (built with to_tsquery and matched with the @@ operator), which handles Indonesian and English text. When an auditor asks 'find all invoices from PT Sumber Makmur between January and March 2025', the search runs in under a second and returns the matching documents with highlighted excerpts.

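// NestJS: Document upload with GCS storage + OCR
// Imports for this excerpt. GetUser, User, and UploadDocumentDto are app-specific,
// and the queue decorators assume @nestjs/bull.
import { basename } from 'path';
import { Body, Post, UploadedFile, UseInterceptors } from '@nestjs/common';
import { FileInterceptor } from '@nestjs/platform-express';
import { Process } from '@nestjs/bull';
import { Job } from 'bull';

// Controller method (lives inside the document controller class)
@Post('/documents/upload')
@UseInterceptors(FileInterceptor('file'))
async uploadDocument(
  @UploadedFile() file: Express.Multer.File,
  @Body() dto: UploadDocumentDto,
  @GetUser() user: User,
) {
  // 1. Determine storage path and version
  const existing = await this.docRepo.findLatestVersion(
    dto.entityType, dto.entityId, dto.documentType
  );
  const version = (existing?.version ?? 0) + 1;
  // Use only the base name so a crafted filename cannot add path segments
  const storagePath = [
    dto.entityType, dto.entityId, dto.documentType,
    `v${version}`, basename(file.originalname)
  ].join('/');

  // 2. Upload to GCS
  await this.storageService.upload(storagePath, file.buffer, file.mimetype);

  // 3. Save the document metadata in PostgreSQL
  const doc = await this.docRepo.save({
    entityType:  dto.entityType,
    entityId:    dto.entityId,
    documentType: dto.documentType,
    storagePath, version,
    fileSize:    file.size,
    mimeType:    file.mimetype,
    uploadedBy:  user.id,
    expiresAt:   dto.expiresAt,
  });

  // 4. Queue OCR extraction; the job itself runs asynchronously in the processor below
  await this.ocrQueue.add('extract-text', { documentId: doc.id, storagePath });

  return { documentId: doc.id, version, storagePath };
}

// OCR processor (lives inside the queue processor class):
// updates the tsvector used for full-text search
@Process('extract-text')
async extractText(job: Job<{ documentId: string; storagePath: string }>) {
  const imageBytes = await this.storageService.download(job.data.storagePath);
  // Note: textDetection accepts image bytes. PDFs have to go through the Vision
  // file annotation methods (for example asyncBatchAnnotateFiles) instead.
  const [result] = await this.visionClient.textDetection({
    image: { content: imageBytes },
  });
  const text = result.fullTextAnnotation?.text ?? '';

  // Bind the OCR text as a query parameter. Interpolating it into the SQL string
  // breaks on quotes and opens the door to SQL injection.
  // Assumes docRepo exposes TypeORM's createQueryBuilder.
  await this.docRepo
    .createQueryBuilder()
    .update()
    .set({ textContent: () => "to_tsvector('english', :text)" })
    .where('id = :id', { id: job.data.documentId })
    .setParameter('text', text)
    .execute();
}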
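The auditor search described in this section, including the highlighted excerpts, could be built as a parameterized query along these lines. This is a sketch under assumptions: buildDocumentSearch is a hypothetical helper, raw_text is a hypothetical column holding the plain OCR text (ts_headline needs the original text, not the tsvector), and the 'english' configuration is taken from the code above.

```typescript
interface SearchParams {
  terms: string;       // free text, e.g. "Sumber Makmur"
  entityType: string;  // e.g. "invoice"
  from: string;        // ISO date, inclusive
  to: string;          // ISO date, inclusive
}

interface SqlQuery {
  text: string;
  values: string[];
}

function buildDocumentSearch(p: SearchParams): SqlQuery {
  const text = [
    'SELECT d.id, d.storage_path,',
    // ts_headline produces the excerpt with matched terms highlighted.
    "       ts_headline('english', d.raw_text, q) AS excerpt",
    // plainto_tsquery turns free text into an AND of its terms.
    "FROM documents d, plainto_tsquery('english', $1) AS q",
    'WHERE d.text_content @@ q',
    '  AND d.entity_type = $2',
    '  AND d.uploaded_at BETWEEN $3 AND $4',
    'ORDER BY ts_rank(d.text_content, q) DESC',
  ].join('\n');
  // User input travels only in values, never inside the SQL text.
  return { text, values: [p.terms, p.entityType, p.from, p.to] };
}
```

The returned object has the { text, values } shape that node-postgres accepts in client.query. For sub-second results on a large archive, a GIN index on the tsvector column is the usual companion to this kind of query.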

How I Approached This

The DMS backend is a NestJS DocumentModule with a StorageService (wraps GCS), DocumentService (metadata CRUD), and SearchService (PostgreSQL full-text). The front end is a React document viewer using react-pdf for PDF rendering and a folder tree view built with react-arborist. Documents open in a side panel without leaving the current ERP page — the user can view a contract while editing the vendor record it's linked to. I also implemented a bulk download feature that creates a ZIP archive of selected documents via streaming.

Document storage costs accumulate faster than expected. A company that stores all invoices, contracts, photos, and reports for 5 years can easily accumulate 500 GB to 1 TB of files. Google Cloud Storage costs for 1 TB in the asia-southeast2 (Jakarta) region are approximately $20/month for storage plus egress costs for downloads. Implement a document lifecycle policy: move documents older than 1 year to Coldline storage (several times cheaper per GB for rarely accessed archives, in exchange for retrieval fees and a 90-day minimum storage duration). Also enforce maximum file sizes (I cap at 50 MB per document and 500 MB per entity) to prevent engineers from accidentally uploading video files.
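Both cost controls can be expressed in a few lines. The sketch below assumes a lifecycle rule object that mirrors the JSON shape of a GCS lifecycle configuration file, and a hypothetical checkUpload guard for the two size caps; neither is taken from the actual codebase.

```typescript
const MB = 1024 * 1024;
const MAX_DOCUMENT_BYTES = 50 * MB;   // per-document cap from the text
const MAX_ENTITY_BYTES = 500 * MB;    // per-entity cap from the text

// Lifecycle rule: move Standard objects older than 365 days to Coldline.
const lifecycleConfig = {
  rule: [
    {
      action: { type: 'SetStorageClass', storageClass: 'COLDLINE' },
      condition: { age: 365, matchesStorageClass: ['STANDARD'] },
    },
  ],
};

type UploadCheck = { ok: true } | { ok: false; reason: string };

// Rejects an upload before any bytes are sent to GCS.
function checkUpload(fileBytes: number, entityBytesUsed: number): UploadCheck {
  if (fileBytes > MAX_DOCUMENT_BYTES) {
    return { ok: false, reason: 'file exceeds the 50 MB per-document cap' };
  }
  if (entityBytesUsed + fileBytes > MAX_ENTITY_BYTES) {
    return { ok: false, reason: 'upload would exceed the 500 MB per-entity cap' };
  }
  return { ok: true };
}
```

Because the lifecycle rule runs inside GCS, the storage paths saved in PostgreSQL stay valid after an object changes storage class.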

Compliance Document Tracking

Indonesian regulatory compliance requires tracking documents with expiry dates: PKP certificates (renewable annually), SIUP (business license), API (import license), SNI certifications, ISO certificates, and staff qualification certificates. The DMS sends email alerts 60 days, 30 days, and 7 days before a compliance document expires. The procurement module checks document validity before issuing POs to vendors — if a vendor's PKP certificate has expired, new PO creation is blocked. This automated compliance tracking replaced a manual calendar reminder system that frequently missed expirations.
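The alert schedule and the PO block reduce to simple date arithmetic. The sketch below assumes a daily job that calls alertDueToday for each compliance document, and it assumes a certificate remains valid through its expiry date; the function names are hypothetical.

```typescript
const ALERT_DAYS = [60, 30, 7] as const;
const DAY_MS = 24 * 60 * 60 * 1000;

// Whole calendar days from today until expiry, compared in UTC.
function daysUntil(expiresAt: Date, today: Date): number {
  const utcDay = (d: Date) =>
    Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate());
  return Math.round((utcDay(expiresAt) - utcDay(today)) / DAY_MS);
}

// Returns 60, 30, or 7 when an alert is due today, otherwise undefined.
function alertDueToday(expiresAt: Date, today: Date): number | undefined {
  const remaining = daysUntil(expiresAt, today);
  return ALERT_DAYS.find((d) => d === remaining);
}

// Procurement guard: block new POs once the vendor's PKP certificate has lapsed.
function canCreatePurchaseOrder(pkpExpiresAt: Date, today: Date): boolean {
  return daysUntil(pkpExpiresAt, today) >= 0;
}
```

Matching on exact day counts means each alert fires once, provided the job runs every day. A production version would likely also record which alerts were sent so that a missed run can catch up.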

Audit and eDiscovery Support

During an audit (tax audit by DJP, or client-required compliance audit), auditors request specific documents: all purchase invoices for a period, all contracts with a specific vendor, evidence of approval for transactions above a threshold. The DMS supports: bulk export by date range, entity type, and document type; audit trail showing who accessed each document and when; and a read-only auditor access level that grants view-only access to specific document sets without exposing the rest of the ERP. The eDiscovery export runs as a background job and emails the requester a download link when complete.
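The three audit features (filtered bulk export, an access trail, and scoped read-only access) can be sketched with in-memory data. All type and function names here are assumptions for illustration; the real system would back them with database queries and a background job.

```typescript
interface ArchivedDoc {
  id: string;
  entityType: string;
  documentType: string;
  uploadedAt: Date;
}

// A document set defined by date range plus optional entity and document type.
interface DocumentSetFilter {
  entityType?: string;
  documentType?: string;
  from: Date;
  to: Date;
}

interface AccessLogEntry {
  documentId: string;
  userId: string;
  accessedAt: Date;
}

// Bulk export selection by date range, entity type, and document type.
function selectForExport(docs: ArchivedDoc[], f: DocumentSetFilter): ArchivedDoc[] {
  return docs.filter(
    (d) =>
      d.uploadedAt >= f.from &&
      d.uploadedAt <= f.to &&
      (f.entityType === undefined || d.entityType === f.entityType) &&
      (f.documentType === undefined || d.documentType === f.documentType),
  );
}

// Read-only auditor access: a document is visible only if a granted set covers it.
function auditorCanView(doc: ArchivedDoc, grantedSets: DocumentSetFilter[]): boolean {
  return grantedSets.some((g) => selectForExport([doc], g).length === 1);
}

// Audit trail: append-only record of who accessed which document and when.
function recordAccess(
  log: AccessLogEntry[],
  documentId: string,
  userId: string,
  at: Date,
): AccessLogEntry[] {
  return [...log, { documentId, userId, accessedAt: at }];
}
```

Reusing the export filter as the definition of an auditor's granted set keeps "what can be exported" and "what can be viewed" consistent.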
