BrainWave: Semantic Search trained on ProductHunt and Crunchbase

Nov 06, 2024

If you’re like me, you have 10 new ideas per week.

Ideas by themselves aren’t worth much. They need to be validated, then executed relentlessly.

Last summer, I built a tool that helps you validate ideas quickly by checking whether your idea already exists. It runs a semantic search over more than 3.5M companies & products indexed from ProductHunt and Crunchbase. It’s called BrainWave.

It allows you to find all the startups working on an idea using AI vector search. It’s market research on steroids for founders & investors.

Here’s how it works

Instead of performing a traditional keyword search on Google or ProductHunt, which mostly returns matches for the exact keywords you typed, you can describe your idea to BrainWave in plain language. It will find every idea resembling yours… even if it was indexed using different keywords. That’s called semantic search.

Note: the existence of competitors doesn’t mean you shouldn’t work on an idea. Most successful companies already had competitors when they started (MySpace before Facebook, AltaVista before Google). But having a complete overview of your competitors can help you refine the idea and find an angle in your execution that might be missing from existing solutions. Creativity has a lot to do with remixing ideas.

I like to say it’s like being the Professor X of startups. Imagine knowing who is working on anything just by thinking about it. Telepathy made real.

Here’s how I built it

1. Obtaining the datasets

That was the hardest part. Both resources turned out to be public and free, but it took me a little while to track them down.

ProductHunt has an API.
Crunchbase has a limited dataset you can download for free.

I had to write a few Python scripts to loop over the PH API and properly parse the Crunchbase CSV. ChatGPT helped me write them faster, including the common-sense stuff like respecting rate limits. In total, it took about 3 days to pull everything. I ended up with a giant CSV containing millions of companies and products.
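A loop like that can be sketched as follows. This is not the original script: `fetch_page` is a hypothetical callable standing in for one API request, the cursor-based paging and the `id`/`name`/`description` columns are assumptions, and the one-second interval is a placeholder rather than ProductHunt's actual limit.

```python
import csv
import time
from typing import Callable, Iterable, Iterator, Optional

# Assumed page shape: (rows, next_cursor); next_cursor is None on the last page.
Page = tuple[list[dict], Optional[str]]


def fetch_all(
    fetch_page: Callable[[Optional[str]], Page],
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Walk a cursor-paginated API, leaving at least min_interval seconds
    between the start of one request and the start of the next."""
    cursor = None
    while True:
        started = time.monotonic()
        rows, cursor = fetch_page(cursor)
        yield from rows
        if cursor is None:
            break
        # Sleep only for whatever is left of the interval.
        elapsed = time.monotonic() - started
        sleep(max(0.0, min_interval - elapsed))


def write_csv(rows: Iterable[dict], path: str,
              fields=("id", "name", "description")) -> int:
    """Stream rows into a CSV file and return how many were written."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Because `fetch_all` is a generator, `write_csv(fetch_all(my_fetch_page), "products.csv")` streams rows to disk page by page instead of holding millions of records in memory.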

2. Indexing it all in a vector database

I used Pinecone to host my vectors and OpenAI’s text-embedding-3-small model to generate the embeddings. The process is simple:

For each line in my CSV:

- Generate an embedding for the product/startup name and description, simply by calling the OpenAI API. It’s cheap and fast.
  > In this step, we ask OpenAI’s API for a vector (an array of 1536 numbers, or “dimensions”) representing the mathematical “meaning” of the text we send it. This is what we call an “embedding.” Each dimension contributes to capturing the nuance of the text’s meaning.
- Store the embedding in a vector database, by sending the result to Pinecone through their API.
  > In this step, Pinecone receives our embedding and indexes it in a database already optimized for semantic search. Many other solutions support vector search, including Elasticsearch, MongoDB Atlas, and a bunch of open-source options, but I chose Pinecone for its simplicity. Indexing everything cost me $40.
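The two steps above can be sketched in Python roughly like this. It is an illustration under assumptions, not the code behind BrainWave: the column names, the batch size of 100, and the `"name: description"` text format are made up for the example, and the embedder and index are passed in so the loop can be tried without API keys.

```python
from typing import Callable, Iterable, Iterator, Sequence

EMBEDDING_MODEL = "text-embedding-3-small"  # the model named in the post

# An embedder turns a list of texts into a list of vectors, in the same order.
Embedder = Callable[[Sequence[str]], list[list[float]]]


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Group rows into lists of at most `size` items."""
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def text_for(row: dict) -> str:
    """Text that gets embedded for one row (assumed format)."""
    return f"{row['name']}: {row['description']}"


def index_rows(rows: Iterable[dict], embed: Embedder, index,
               batch_size: int = 100) -> int:
    """Embed rows batch by batch and upsert them; returns the row count."""
    total = 0
    for batch in batched(rows, batch_size):
        vectors = embed([text_for(row) for row in batch])
        index.upsert(vectors=[
            {
                "id": row["id"],
                "values": vector,
                "metadata": {"name": row["name"],
                             "description": row["description"]},
            }
            for row, vector in zip(batch, vectors)
        ])
        total += len(batch)
    return total


def openai_embedder() -> Embedder:
    """Embedder backed by the OpenAI API. Assumes the `openai` package is
    installed and OPENAI_API_KEY is set in the environment."""
    from openai import OpenAI  # imported lazily so the rest runs without it

    client = OpenAI()

    def embed(texts: Sequence[str]) -> list[list[float]]:
        response = client.embeddings.create(model=EMBEDDING_MODEL,
                                            input=list(texts))
        return [item.embedding for item in response.data]

    return embed
```

With the Pinecone client, `index` would be something like `Pinecone(api_key=...).Index("brainwave")`, where the index name is hypothetical. Sending rows in batches rather than one by one keeps the number of API round trips manageable across millions of rows.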

3. Building the frontend on Cloudflare Workers

I built a simple static website and hosted it on Cloudflare Workers for low maintenance. The page is plain HTML with a basic CSS framework and vanilla JS functions (old school, I know). They call a proxy endpoint in my Worker, which runs the user’s query through the same process: turn the query into an embedding with OpenAI’s API, then query Pinecone’s API for the nearest matches.

Why do we have to call OpenAI’s API again? Because our Pinecone DB only understands vectors. To search the database, we have to search with a vector, so we need to translate the user’s query into a vector using OpenAI before sending it to Pinecone. Luckily, both APIs are fast, so chaining the two calls one after the other doesn’t take more than 500ms.
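Here is a rough Python sketch of that two-step lookup. The real endpoint runs inside a Cloudflare Worker, so this only shows the logic. `InMemoryIndex` is a made-up stand-in that ranks by brute-force cosine similarity and returns results in the dict shape I am assuming the Pinecone client exposes (`matches` with `id`, `score`, `metadata`); a hosted index would use an approximate nearest-neighbor structure instead.

```python
import math


def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Toy stand-in for a hosted vector index (illustration only)."""

    def __init__(self):
        self._items = {}

    def upsert(self, vectors):
        for item in vectors:
            self._items[item["id"]] = item

    def query(self, vector, top_k=10, include_metadata=True):
        # Score every stored vector against the query, best first.
        scored = sorted(
            ((cosine(vector, item["values"]), item)
             for item in self._items.values()),
            key=lambda pair: pair[0],
            reverse=True,
        )
        matches = []
        for score, item in scored[:top_k]:
            match = {"id": item["id"], "score": score}
            if include_metadata:
                match["metadata"] = item.get("metadata", {})
            matches.append(match)
        return {"matches": matches}


def search(query, embed, index, top_k=10):
    """Step 1: turn the query text into a vector. Step 2: ask the index
    for the stored vectors nearest to it."""
    [vector] = embed([query])
    response = index.query(vector=vector, top_k=top_k, include_metadata=True)
    return [(m["id"], m["score"], m["metadata"]) for m in response["matches"]]
```

The query has to be embedded with the same model used at indexing time. Vectors produced by different models live in different spaces, so their similarity scores would be meaningless.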

Here’s what it means for creativity

Imagine a world where your creative process is augmented by AIs that have indexed the entirety of humanity’s creations and can dynamically surface the ideas most pertinent to your creative flow. This would make Google as we know it look like a dinosaur.

Ultimately, that’s one of the greatest use cases I see for AI assisting humanity. It’s not so much about replacing us in the creative process, or making us dependent on it to think originally, but about helping us filter and search through the ever-growing Web of collective consciousness to better understand the status quo & improve it.

Since so much of creativity is about remixing ideas, understanding what’s out there can be a big deal. I built this for startup founders and investors, but similar applications of vector databases could be valuable for writers, researchers, musicians… ultimately any creator in any field.

Check it out → brainwave.vc & tell me what you think

Louison’s newsletter