A production-ready hybrid search engine built with Go and Python, using only free and open-source tools. Combines BM25 keyword ranking with semantic vector search via Reciprocal Rank Fusion (RRF).
┌─────────────────────────────────────────────────────────────────────┐
│ Search Engine Stack │
│ │
│ ┌─────────────────────────────────┐ ┌──────────────────────────┐ │
│ │ Python Layer │ │ Go Layer │ │
│ │ │ │ │ │
│ │ ┌──────────────────────────┐ │ │ ┌────────────────────┐ │ │
│ │ │ python-crawler :8000 │ │ │ │ go-api :8080 │ │ │
│ │ │ • Tavily, Exa, Serper │ │ │ │ • Fiber REST API │ │ │
│ │ │ • NewsAPI, GitHub, SO │ │ │ │ • Rate limiting │ │ │
│ │ │ • Unsplash, BS4 │ │ │ │ • Hybrid ranking │ │ │
│ │ │ • Playwright (JS pages) │ │ │ │ • Redis cache │ │ │
│ │ └────────────┬─────────────┘ │ │ └────────┬───────────┘ │ │
│ │ │ Redis Queue │ │ │ OpenSearch │ │
│ │ ┌────────────▼─────────────┐ │ │ ┌────────▼───────────┐ │ │
│ │ │ python-nlp :8001 │ │ │ │ go-indexer :8081 │ │ │
│ │ │ • spaCy NER │ │ │ │ • Bulk indexing │ │ │
│ │ │ • NLTK tokenisation │───┼──┼─▶│ • Postgres meta │ │ │
│ │ │ • all-MiniLM embeddings │ │ │ │ • Queue consumer │ │ │
│ │ │ • HTML cleaning │ │ │ └────────────────────┘ │ │
│ │ └──────────────────────────┘ │ └──────────────────────────┘ │
│ └─────────────────────────────────┘ │
│ │
│ ┌──────────────────┐ ┌───────────────────┐ ┌─────────────────┐ │
│ │ OpenSearch :9200│ │ PostgreSQL :5432 │ │ Redis :6379 │ │
│ │ • kNN vectors │ │ • Doc metadata │ │ • Work queues │ │
│ │ • BM25 search │ │ • Search logs │ │ • Result cache │ │
│ │ • Inverted idx │ │ • Crawl jobs │ │ • Visited URLs │ │
│ └──────────────────┘ └───────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
API Keys / Seed Query
│
▼
python-crawler ──── fetches from Tavily/Exa/Serper/NewsAPI/GitHub/SO/Unsplash
│ scrapes URLs with BS4 or Playwright
│ Redis queue: "queue:crawled"
▼
python-nlp ──────── cleans HTML → tokenises → NER → embeds (384-dim)
│
│ Redis queue: "queue:processed"
▼
go-indexer ──────── bulk writes to OpenSearch + Postgres metadata
│
▼
go-api ──────────── GET /api/v1/search?q=...&mode=hybrid
├── keyword search (OpenSearch BM25/multi_match)
├── semantic search (OpenSearch kNN)
└── Reciprocal Rank Fusion → ranked results
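
The two retrieval branches in the flow above can be sketched as OpenSearch request bodies. This is an illustration only, not the project's Go code: the field boosts come from the notes in this README (`title^3`, `keyphrases^2`, `content^1`), the vector field name `embedding` is assumed from the document payload shown in the indexer section, and the function names and default sizes are hypothetical.

```python
from typing import Any


def keyword_query(text: str, size: int = 10) -> dict[str, Any]:
    """BM25 branch: a multi_match query with per-field boosts."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title^3", "keyphrases^2", "content^1"],
            }
        },
    }


def semantic_query(vector: list[float], k: int = 10) -> dict[str, Any]:
    """kNN branch: nearest neighbours of a 384-dim query embedding."""
    if len(vector) != 384:
        raise ValueError(f"expected 384 dimensions, got {len(vector)}")
    return {
        "size": k,
        # "embedding" is an assumed field name for the knn_vector mapping.
        "query": {"knn": {"embedding": {"vector": vector, "k": k}}},
    }
```

Each body would be sent to the index's `_search` endpoint; the two ranked ID lists that come back are what the fusion step combines.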
- Docker 24+ and Docker Compose v2
- 4 GB RAM minimum (8 GB recommended — NLP model is ~500 MB)
git clone https://github.com/your-org/search-engine.git
cd search-engine
# Create your .env file
cp configs/.env.example .env
# Edit .env and add your API keys (see "Free API Keys" section below)
cd docker
docker compose up --build

First start downloads the NLP model (~500 MB). Subsequent starts are fast.
curl http://localhost:8080/health
curl http://localhost:8000/health
curl http://localhost:8001/health
curl http://localhost:8081/health
# Crawl documents from enabled APIs
curl -X POST http://localhost:8000/crawl \
-H "Content-Type: application/json" \
-d '{"query": "machine learning", "sources": ["github", "stackoverflow"], "limit": 5}'

Documents flow automatically:
crawler → Redis → NLP → Redis → Indexer → OpenSearch
# Hybrid search (default)
curl "http://localhost:8080/api/v1/search?q=machine+learning"
# Keyword only (BM25)
curl "http://localhost:8080/api/v1/search?q=neural+networks&mode=keyword"
# Semantic only (vector similarity)
curl "http://localhost:8080/api/v1/search?q=deep+learning+transformers&mode=semantic"
# With filters
curl "http://localhost:8080/api/v1/search?q=golang&source=github&page=2&page_size=5"

| Parameter | Type | Default | Description |
|---|---|---|---|
| `q` | string | required | Search query |
| `mode` | string | `hybrid` | `keyword` / `semantic` / `hybrid` |
| `page` | int | `1` | Page number |
| `page_size` | int | `10` | Results per page (max 50) |
| `source` | string | | Filter by source (github, newsapi, etc.) |
| `language` | string | | Filter by language (en, fr, etc.) |
| `date_from` | string | | ISO date filter from |
| `date_to` | string | | ISO date filter to |
| `min_score` | float | `0` | Minimum relevance score |
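
As a small client-side sketch, the parameters in the table can be assembled into a request URL with the standard library. The base URL is taken from the curl examples; the helper name and the decision to validate `mode` and `page_size` before sending are assumptions, and the server may handle out-of-range values differently.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/api/v1/search"  # from the curl examples
VALID_MODES = {"keyword", "semantic", "hybrid"}


def build_search_url(q: str, mode: str = "hybrid", page: int = 1,
                     page_size: int = 10, **filters: object) -> str:
    """Build a search URL, checking values against the documented parameters."""
    if not q:
        raise ValueError("q is required")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if not 1 <= page_size <= 50:
        raise ValueError("page_size must be between 1 and 50")
    params = {"q": q, "mode": mode, "page": page, "page_size": page_size}
    # Optional filters: source, language, date_from, date_to, min_score.
    params.update({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("machine learning", source="github"))
```

`urlencode` takes care of escaping, so a query containing spaces is encoded with `+` exactly as in the curl examples.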
Response:
{
"query": "machine learning",
"total": 42,
"page": 1,
"page_size": 10,
"total_pages": 5,
"mode": "hybrid",
"duration_ms": 48,
"hits": [
{
"id": "a3f9b1c2d4e5",
"url": "https://github.com/example/ml-repo",
"title": "Machine Learning Framework",
"snippet": "…a flexible machine learning framework for production use…",
"score": 0.0312,
"keyword_score": 8.42,
"semantic_score": 0.91,
"source": "github",
"published_date": "2024-03-15T00:00:00Z",
"keyphrases": ["machine learning", "neural networks"]
}
]
}

Retrieve a full document by its ID.
Service health check including downstream dependencies.
Prometheus metrics endpoint.
{
"query": "golang search engine",
"sources": ["tavily", "github", "stackoverflow", "newsapi"],
"limit": 10,
"crawl_urls": true
}
{
"urls": ["https://example.com", "https://example.org"],
"concurrency": 5
}

Returns queue depths and visited URL count.
{
"documents": [
{"url": "https://example.com", "title": "Example", "content": "Raw text..."}
]
}
{"text": "your query or document text"}

Returns {"embedding": [...384 floats...], "dimension": 384}
Directly index pre-processed documents (bypasses queue):
{
"documents": [
{
"id": "abc123",
"url": "https://example.com",
"title": "Example Document",
"content": "Document content...",
"tokens": ["document", "content"],
"embedding": [0.1, 0.2, ...],
"word_count": 2,
"language": "en",
"source": "manual"
}
]
}

All external data sources use free tiers (no credit card required):
| Service | Free Tier | Sign-up URL |
|---|---|---|
| Tavily | 1,000 req/month | https://tavily.com |
| Exa | 1,000 req/month | https://exa.ai |
| Serper.dev | 2,500 queries free | https://serper.dev |
| NewsAPI | 100 req/day (developer) | https://newsapi.org |
| GitHub | 5,000 req/hr (with token) | https://github.com/settings/tokens |
| StackExchange | No key needed (60 req/hr) | https://stackapps.com |
| Unsplash | 50 req/hr (demo) | https://unsplash.com/developers |
The system works with zero API keys — GitHub and StackOverflow work unauthenticated. Add keys to unlock higher rate limits.
search-engine/
├── services/
│ ├── python-crawler/ # Data ingestion service
│ │ ├── crawler/ # BS4, Playwright, robots.txt
│ │ ├── sources/ # One file per API source
│ │ ├── tests/
│ │ ├── main.py # FastAPI app
│ │ ├── requirements.txt
│ │ └── Dockerfile
│ │
│ ├── python-nlp/ # NLP processing service
│ │ ├── nlp/ # cleaner, tokenizer, embedder, NER
│ │ ├── tests/
│ │ ├── main.py # FastAPI app + queue worker
│ │ ├── requirements.txt
│ │ └── Dockerfile
│ │
│ ├── go-indexer/ # Document indexing service
│ │ ├── internal/
│ │ │ ├── indexer/ # OpenSearch bulk indexing
│ │ │ ├── database/ # Postgres pool + migrations
│ │ │ ├── models/ # Shared data types
│ │ │ └── queue/ # Redis consumer
│ │ ├── tests/
│ │ ├── main.go
│ │ ├── go.mod
│ │ └── Dockerfile
│ │
│ └── go-api/ # Search API service
│ ├── internal/
│ │ ├── handlers/ # search, document, health
│ │ ├── ranking/ # BM25, semantic, hybrid RRF
│ │ ├── middleware/ # rate limiting, logging, recovery
│ │ ├── cache/ # Redis result cache
│ │ └── models/ # Request/response types
│ ├── tests/
│ ├── main.go
│ ├── go.mod
│ └── Dockerfile
│
├── shared/
│ ├── configs/
│ │ └── prometheus.yml
│ └── scripts/
│ ├── init_db.sql # Postgres schema
│ ├── setup.sh # First-run setup
│ └── integration_test.sh # Live stack smoke tests
│
├── configs/
│ └── .env.example # All configurable variables
│
└── docker/
└── docker-compose.yml # Full stack definition
cd services/go-api
go test ./tests/... -v
cd services/go-indexer
go test ./tests/... -v
cd services/python-nlp
pip install pytest pytest-asyncio
pytest tests/ -v
cd services/python-crawler
pytest tests/ -v
bash shared/scripts/integration_test.sh
cd docker
docker compose --profile monitoring up -d

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / admin)
cd docker
docker compose --profile dashboards up -d

- OpenSearch Dashboards: http://localhost:5601
All settings are controlled via environment variables (see configs/.env.example).
| Variable | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence-Transformers model name |
| `NLP_BATCH_SIZE` | `16` | Documents per embedding batch |
| `MAX_TEXT_CHARS` | `8000` | Max chars per document for NLP |
| `RATE_RPS` | `20` | API rate limit (requests/second/IP) |
| `RATE_BURST` | `50` | Rate limit burst size |
| `USE_PLAYWRIGHT` | `false` | Enable JS-rendered page crawling |
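
This README does not say which algorithm the rate limiter uses. A rate combined with a burst size is commonly implemented as a token bucket, so the sketch below shows how `RATE_RPS` and `RATE_BURST` would interact under that assumption. The class name and the injectable clock are hypothetical, and a real limiter would keep one bucket per client IP.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second and holds at most `burst` tokens."""

    def __init__(self, rate: float = 20.0, burst: int = 50,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means the request is throttled."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the defaults, a client can send up to 50 requests at once, after which it is held to roughly 20 requests per second until the bucket refills.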
- The NLP service runs a background queue worker consuming `queue:crawled` and producing `queue:processed`. No additional orchestration needed.
- OpenSearch is configured with `knn_vector` mapping (384-dim) for semantic search. Changing the embedding model requires re-indexing.
- Redis is used for three purposes: work queues, visited URL deduplication (SET), and search result caching (JSON with 5-min TTL).
- BM25 ranking within OpenSearch is augmented by field boosting (`title^3`, `keyphrases^2`, `content^1`).
- Hybrid search uses Reciprocal Rank Fusion (k=60), which is robust without requiring score normalisation.