
Introduction

Deliverable: two paragraph-level distance outputs, one for L and one for Q, each with 35 columns.

For each paragraph, we calculate the L1 distance between consecutive sentences, and then take the mean and standard deviation of these distances. For example, say paragraph 1 starts at sentence 1 and ends at sentence 5. First, calculate the L1 distances L1(1,2), L1(2,3), L1(3,4) and L1(4,5), then calculate the mean and standard deviation of the 4 distances. This gives two measures for the paragraph: L1_m and L1_std. Similarly, we calculate the mean and standard deviation for the L2 distance, the cosine distance and the cosine similarity.

We use 6 different embeddings: all dimensions of the BERT embeddings, and 100, 200 and 300 dimensions of the PCA-reduced BERT embeddings (PCA is a dimensionality reduction technique).

In the end, we will have 35 columns for each paragraph: Paragraph ID + #sentences in the paragraph + (cosine_m, cosine_std, cossimilarity_m, cossimilarity_std, L1_m, L1_std, L2_m, L2_std) for each of (all, 100, 200, 300) = 3 + 8*4.

Note: for a paragraph that has only 1 sentence, the std measures are empty.
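The per-paragraph calculation described above can be sketched with NumPy. This is a minimal illustration, not the project's actual code: the function names, the toy 3-dimensional vectors and the use of the population standard deviation (NumPy's default) are all assumptions. With a single sentence there are no consecutive pairs, so the sketch leaves both the mean and the std empty (NaN).

```python
import numpy as np


def consecutive_distances(emb: np.ndarray) -> dict:
    """Distances between each sentence embedding and the next one.

    emb has shape (n_sentences, n_dims); every result has n_sentences - 1 values.
    Assumes no embedding is an all-zero vector (cosine would be undefined).
    """
    a, b = emb[:-1], emb[1:]
    diff = a - b
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cos_sim = np.sum(a * b, axis=1) / norms
    return {
        "cosine": 1.0 - cos_sim,          # cosine distance
        "cossimilarity": cos_sim,         # cosine similarity
        "L1": np.abs(diff).sum(axis=1),   # Manhattan distance
        "L2": np.sqrt((diff ** 2).sum(axis=1)),  # Euclidean distance
    }


def paragraph_stats(emb: np.ndarray) -> dict:
    """Mean and standard deviation of each distance for one paragraph."""
    stats = {}
    for name, d in consecutive_distances(emb).items():
        # No consecutive pairs (1-sentence paragraph) -> leave measures empty.
        stats[f"{name}_m"] = d.mean() if d.size > 0 else np.nan
        stats[f"{name}_std"] = d.std() if d.size > 0 else np.nan
    return stats


# Toy paragraph of 5 "sentences" in 3 dimensions (made-up numbers).
toy = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0]])
result = paragraph_stats(toy)
```

For the toy paragraph, all four consecutive L1 distances equal 1, so L1_m is 1 and L1_std is 0.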

Modeling Approach

Process Flow for Use Case 1

Split paragraphs into sentences using 1) the NLTK sentence tokenizer, 2) the spaCy sentence tokenizer, and 3) two additional split symbols, ":" and "...".
Text preprocessing: lowercasing, removing non-alphanumeric characters, removing null records, and removing sentence records (rows) with fewer than 3 words.
TF-IDF vectorization
LSA over document-term matrix
Cosine distance calculation of adjacent sentences (rows)
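The steps above can be sketched roughly as follows with scikit-learn. This is an illustrative outline under stated assumptions, not the original pipeline: the helper names, the example sentences and the choice of 2 LSA components are hypothetical, LSA is implemented here with TruncatedSVD, and the NLTK/spaCy tokenization step is assumed to have already run.

```python
import re

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def split_on_extra_symbols(sentence):
    """Further split an already-tokenized sentence on ':' and '...'."""
    parts = re.split(r":|\.\.\.", sentence)
    return [p.strip() for p in parts if p.strip()]


def clean_sentences(sentences):
    """Lowercase, strip non-alphanumeric characters, drop null or short rows."""
    cleaned = []
    for s in sentences:
        if s is None:  # null record
            continue
        s = re.sub(r"[^a-z0-9\s]", " ", s.lower())
        s = " ".join(s.split())  # collapse repeated whitespace
        if len(s.split()) >= 3:  # drop sentences with fewer than 3 words
            cleaned.append(s)
    return cleaned


def adjacent_cosine_distances(sentences, n_components=2):
    """TF-IDF -> LSA (truncated SVD) -> cosine distance of adjacent rows."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # sentence-term matrix
    lsa = TruncatedSVD(n_components=n_components, random_state=0)
    topics = lsa.fit_transform(tfidf)
    a, b = topics[:-1], topics[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cos_sim = np.sum(a * b, axis=1) / np.maximum(norms, 1e-12)
    return 1.0 - cos_sim


# Made-up input rows; "Ok." and None are removed by the cleaning step.
raw = [
    "The cat sat on the mat.",
    "The cat slept on the mat!",
    "Ok.",
    None,
    "Stock markets fell sharply today.",
    "Investors sold stock as markets fell.",
]
cleaned = clean_sentences(raw)
distances = adjacent_cosine_distances(cleaned)
```

Each entry of `distances` compares one cleaned sentence with the next, so four surviving sentences yield three distances.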

Process Flow for Use Case 2

Split paragraphs into sentences
Text cleaning
BERT Sentence Encoding
BERT PCA 100
BERT PCA 200
BERT PCA 300
Calculate distance between consecutive sentences in the paragraph
Distances: L1, L2, cosine distance and cosine similarity
Statistics: Mean, SD
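As a rough sketch of the embedding variants in this flow, the snippet below builds the full-dimensional matrix plus its 100, 200 and 300 dimensional PCA reductions. Random vectors stand in for real BERT sentence encodings, and the 768-dimensional size, the function name and the column-naming scheme are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for BERT sentence embeddings: 400 sentences x 768 dimensions.
# (Real embeddings would come from the BERT sentence encoder.)
embeddings = rng.normal(size=(400, 768))


def build_embedding_variants(emb, dims=(100, 200, 300)):
    """Return the full embeddings plus one PCA-reduced copy per target size."""
    variants = {"all": emb}
    for k in dims:
        # PCA needs at least k samples and k features to keep k components.
        variants[str(k)] = PCA(n_components=k, random_state=0).fit_transform(emb)
    return variants


variants = build_embedding_variants(embeddings)

# Hypothetical naming scheme for the 8 measures per embedding variant.
metrics = ("cosine", "cossimilarity", "L1", "L2")
columns = [
    f"{metric}_{stat}_{name}"
    for name in variants
    for metric in metrics
    for stat in ("m", "std")
]
```

With 4 variants and 8 measures each, `columns` holds the 32 distance columns that sit alongside the paragraph identifiers.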

Experimental Setup

IncrementalPCA
GPU to speed up computation
Data chunking
Calculate BERT embeddings for each chunk and store them on disk
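One way the chunked setup might look is sketched below: each chunk of embeddings is written to disk, then scikit-learn's IncrementalPCA is fitted one chunk at a time with partial_fit. The `encode_chunk` function is a hypothetical stand-in that returns random vectors instead of running BERT (on a GPU or otherwise), and the chunk size, file names and component count are assumptions.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.decomposition import IncrementalPCA


def encode_chunk(n_sentences, rng, dim=768):
    """Hypothetical stand-in for BERT encoding: returns random vectors."""
    return rng.normal(size=(n_sentences, dim)).astype(np.float32)


def fit_incremental_pca(chunk_paths, n_components):
    """Fit PCA chunk by chunk so the full matrix never has to fit in memory."""
    ipca = IncrementalPCA(n_components=n_components)
    for path in chunk_paths:
        # Each chunk must contain at least n_components rows.
        ipca.partial_fit(np.load(path))
    return ipca


rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    # Step 1: encode each chunk and store it on disk.
    chunk_paths = []
    for i in range(3):
        path = Path(tmp) / f"chunk_{i}.npy"
        np.save(path, encode_chunk(256, rng))
        chunk_paths.append(path)

    # Step 2: fit PCA incrementally, then reduce every chunk.
    ipca = fit_incremental_pca(chunk_paths, n_components=100)
    reduced = np.vstack([ipca.transform(np.load(p)) for p in chunk_paths])
```

The same loop could be repeated with 200 and 300 components, provided every chunk has at least that many rows.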
