
01. Corydora - the AI Fishkeeping Copilot

Tonight I started on the world’s first "AI Fishkeeping Copilot", lovingly named Corydora.


The plan is to build a Q&A chatbot with DSPy, and share my journey along the way. This is the first of what will hopefully be a several-part series, mostly made up of my notes and thoughts on how I’m approaching things.

I might share the code at the end, so folks can make their own DSPy chatbots, but I will not be sharing the data, as I do not have permission to modify or redistribute the content I’m using. This is purely for personal entertainment and research purposes.

But first, I need to scrape and generate embeddings for a bunch of authoritative fishkeeping content!

To start, I’m going to use the authoritative help-center and blog content from AquariumCoop.com, which is where I learned most of my fishkeeping knowledge. Their YouTube channel is great.

Anyway, I’m not showing off all of the code in this post, but I managed to scrape a bunch of text content, generate embeddings, and get it all indexed into Pinecone.
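For the curious, here’s a rough sketch of what that embed-and-index step could look like, assuming OpenAI embeddings and the current Pinecone client; the index name, model choice, and metadata fields are placeholders, not my actual code.

```python
import json

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("corydora")  # hypothetical index name

with open("pages.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # embed the extracted text for each scraped page
        embedding = client.embeddings.create(
            model="text-embedding-3-small",  # placeholder model choice
            input=record["text"],
        ).data[0].embedding
        # upsert the vector along with some useful metadata
        index.upsert(vectors=[{
            "id": f"page-{i}",
            "values": embedding,
            "metadata": {"url": record.get("url"), "title": record.get("title")},
        }])
```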

My favorite part of projects like this is when I can take all of the embedding vectors and visualize them in 3D space. So here’s a quick video of me poking around the TensorFlow Projector tool.
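If you want to poke around your own vectors, the Projector (projector.tensorflow.org) loads plain TSV files: one with a tab-separated vector per row, and an optional metadata file with one label per row. A tiny sketch, assuming your vectors and labels are already in memory:

```python
import csv

# vectors: one embedding list per page; labels: a title per vector (both assumed)
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
labels = ["Example page A", "Example page B"]

# one tab-separated vector per row
with open("vectors.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(vectors)

# one label per row, aligned with vectors.tsv
with open("metadata.tsv", "w", newline="") as f:
    for label in labels:
        f.write(label + "\n")

# then load both files at projector.tensorflow.org
```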


A note on scraping text content from the web

I originally started out with a crazy Scrapy setup, but it was totally overkill and the results weren’t amazing: lots of text and data quality issues.

After some research I discovered @adbarbaresi's excellent Trafilatura library.

I’m using it to parse all of the URLs from a series of sitemap.xml files; it then iterates over everything, parses out interesting metadata, and dumps each page out as a JSON row in a JSONL file.

With very limited effort (like 2 functions and a for loop), I was able to use Trafilatura to grab all of the text content from 524 pages.
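In case it’s useful, here’s a sketch of that flow using Trafilatura’s sitemap helper; the site URL and output filename are placeholders:

```python
import trafilatura
from trafilatura.sitemaps import sitemap_search

# collect every URL listed in the site's sitemap.xml files
urls = sitemap_search("https://www.aquariumcoop.com")

with open("pages.jsonl", "w") as out:
    for url in urls:
        html = trafilatura.fetch_url(url)
        if not html:
            continue
        # JSON output bundles the main text with metadata (title, date, url, ...)
        row = trafilatura.extract(html, output_format="json", with_metadata=True)
        if row:
            out.write(row + "\n")
```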

Then I used LangChain’s SemanticChunker. This idea of semantic chunking is super cool to me: using sentence embeddings to figure out where to split text into coherent, semantically meaningful chunks. The idea is that chunking your text by semantic meaning leads to embeddings that cluster more meaningfully.
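Using it is only a few lines. Here’s a minimal sketch assuming OpenAI embeddings (the chunker just needs an embedding model to measure semantic distance between sentences):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

page_text = "...the extracted text of one scraped page..."  # placeholder

# splits wherever the embedding distance between sentences spikes
splitter = SemanticChunker(OpenAIEmbeddings())
chunks = splitter.create_documents([page_text])

for chunk in chunks:
    print(len(chunk.page_content), chunk.page_content[:80])
```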

The idea, as far as I’m aware, originated with Greg Kamradt in his awesome “5 Levels of Text Splitting” notebook.

It’s worth watching his walkthrough of the notebook on YouTube.
