Semantic analysis between documents

Ram
Feb 18, 2022

Given two or more documents, understand how they are semantically related to each other. Do they cover the same topic, and to what degree: e.g., 20%, 50%, etc.?

Here are some ways I’d go about it.

  1. Intersection between the topics of two documents

Use LDA (Latent Dirichlet Allocation) to get the top K topics in both documents and measure the overlap. More overlap = more similarity. You can further modify this implementation to compare Word2Vec vectors instead of plain words. For instance, if LDA comes up with the following topics for doc1: guitar, strings, amplifier, and these for doc2: lyrics, songs, Bollywood, then Word2vec similarity can work better than plain intersection of strings. If this works, you can marginally increase the accuracy by replacing Word2vec with something more advanced like BERT embeddings. Since we’re using LDA, you’ll be able to “interpret” the model, meaning you can look at the topics and see if they make sense (unlike option 3).
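A rough sketch of the topic-overlap idea, assuming scikit-learn’s LDA implementation (the library choice, number of topics, and K are my assumptions, not prescribed by the post):

```python
# Sketch of the topic-overlap idea using scikit-learn's LDA.
# n_topics and k are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_overlap(doc1, doc2, n_topics=5, k=2):
    """Fraction of the top-k LDA topics shared by the two documents."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform([doc1, doc2])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    dist = lda.fit_transform(X)           # per-document topic distributions
    top1 = set(np.argsort(dist[0])[-k:])  # indices of doc1's top-k topics
    top2 = set(np.argsort(dist[1])[-k:])
    return len(top1 & top2) / k
```

In practice you’d fit LDA on a larger corpus, and, as suggested above, compare the topics’ Word2Vec (or BERT) embeddings rather than raw topic indices or strings.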

  2. Using lda2vec

Explained really well in this post. The idea is similar to the previous approach, but it adds more context by using word2vec internally. It’s a little tough to implement and has fewer tutorials online, but still worth a shot.

  3. Word2vec + clustering

If you’re not too hung up on interpreting the output, this method might work just as well for you. Compute word2vec vectors for the words in each document, then cluster them. Look at how far apart the cluster centers of the two documents are. Farther apart = less similar (hint hint: cosine similarity). You won’t be able to easily interpret what the actual topics (or words) are, but I’m positive the geometry would surface some interesting results.
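A minimal sketch of this approach. The `vectors` dict below is a toy stand-in; in practice you’d load pretrained word2vec embeddings (e.g. via gensim):

```python
# Sketch of word2vec + clustering. The `vectors` dict is a toy stand-in
# for real pretrained word2vec embeddings.
import numpy as np
from sklearn.cluster import KMeans

def doc_centroid(words, vectors, n_clusters=2):
    """Cluster a document's word vectors and return the mean cluster center."""
    X = np.array([vectors[w] for w in words if w in vectors])
    km = KMeans(n_clusters=min(n_clusters, len(X)), n_init=10, random_state=0)
    km.fit(X)
    return km.cluster_centers_.mean(axis=0)

def cosine(a, b):
    """Cosine similarity: closer to 1 = more similar documents."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d "embeddings", for illustration only.
vectors = {
    "guitar":  np.array([1.0, 0.1]),
    "strings": np.array([0.9, 0.2]),
    "lyrics":  np.array([0.1, 1.0]),
    "songs":   np.array([0.2, 0.9]),
}
sim = cosine(doc_centroid(["guitar", "strings"], vectors),
             doc_centroid(["lyrics", "songs"], vectors))
```

With real embeddings you’d likely use more clusters per document and could also compare clusters pairwise instead of averaging their centers.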

Bonus: I found this link, which uses doc2vec and clustering. It will work well if your data is not domain-specific. Hope this helps! Ping me if you need more help.
