Hi, I'm Lindsay Vass and I created thesauropod.us in three weeks while I was an Insight Data Science Fellow. I'm an avid podcast listener myself and wanted a better way to find new podcasts to listen to. I used natural language processing techniques to measure how similar podcasts are to one another based on their content.
Under the Hood:
Data: Podcast descriptions, episode titles, and episode descriptions were scraped from the iTunes website and API and stored in a PostgreSQL database. I pre-processed the text data by tokenizing it, removing common words and phrases, and stemming each word. I also excluded any podcasts whose most recent episode was more than 45 days old. This guarantees that all the results you get on thesauropod.us are from active podcasts.
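The preprocessing step can be sketched in a few lines. This is a simplified stand-in, not the site's actual code: a short stop-word list replaces NLTK's full English list, and a crude suffix-stripping function replaces NLTK's Porter stemmer; the regex tokenizer is likewise a minimal substitute.

```python
import re

# Toy stop-word list; the real pipeline uses NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "about"}

def stem(token):
    """Strip a few common suffixes (a very rough sketch of Porter stemming)."""
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Interviews about the science of cooking"))
# → ['interview', 'science', 'cook']
```

Each podcast's descriptions and episode text would pass through a function like this before any weighting or modeling happens.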
Algorithm: I performed two transformations on the tokenized text using gensim. First, I applied TF-IDF weighting, which gives rare words more weight than common ones, and filtered the vocabulary so that each word appeared in at least 5 podcasts but in no more than 50% of them. Second, I used Latent Semantic Indexing (LSI) to reduce the feature space from 50,000+ words to 100 topics. Finally, I measured the similarity between pairs of podcasts by calculating the cosine of the angle between their topic vectors.
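The TF-IDF weighting and cosine-similarity steps can be illustrated in pure Python on a toy corpus. This sketch skips the LSI dimensionality reduction (gensim's LsiModel handles that at scale) and the corpus of three token lists is hypothetical; it just shows why two podcasts sharing distinctive words score high while unrelated ones score near zero.

```python
import math
from collections import Counter

# Toy corpus of preprocessed podcast token lists (hypothetical data).
corpus = [
    ["science", "space", "astronomy"],
    ["science", "space", "rocket"],
    ["cooking", "recipe", "food"],
]

def tfidf_vector(doc, corpus):
    """Weight each term by tf * idf, where idf = log(N / document frequency)."""
    n_docs = len(corpus)
    tf = Counter(doc)
    return {
        term: count * math.log(n_docs / sum(term in d for d in corpus))
        for term, count in tf.items()
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

vecs = [tfidf_vector(d, corpus) for d in corpus]
print(round(cosine(vecs[0], vecs[1]), 3))  # two science podcasts: 0.214
print(cosine(vecs[0], vecs[2]))            # science vs. cooking: 0.0
```

In the real pipeline, the cosine is taken between 100-dimensional LSI topic vectors rather than raw TF-IDF vectors, which lets podcasts score as similar even when they share topics but not exact words.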
Backend: This project is coded in Python and relies on a number of packages for the heavy lifting, including BeautifulSoup, PycURL, NLTK, pandas, and gensim. I used psycopg2 to interface with PostgreSQL.
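The database layer follows Python's standard DB-API, so it can be sketched with the stdlib's sqlite3 module, which shares psycopg2's cursor/execute/fetch interface (psycopg2 uses %s placeholders where sqlite3 uses ?). The table and column names here are hypothetical, not the site's actual schema.

```python
import sqlite3

# In-memory database stands in for the PostgreSQL server; with psycopg2
# the connect() call would take host/dbname/user arguments instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE podcasts (id INTEGER PRIMARY KEY, title TEXT, description TEXT)"
)
# Parameterized insert, as with psycopg2 (which would use %s placeholders).
cur.execute(
    "INSERT INTO podcasts (title, description) VALUES (?, ?)",
    ("Example Show", "A show about examples."),
)
conn.commit()
cur.execute("SELECT title FROM podcasts")
print(cur.fetchone()[0])  # → Example Show
```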
The source code is available on GitHub.