The structure of a discipline crystallizes in its texts. Among those texts, classics on the one hand and encyclopedias on the other play an important role, as they are frequently used as teaching resources. Investigating their structure is therefore of considerable interest.
In this notebook we will have a look at the Stanford Encyclopedia of Philosophy, a formidable resource that contains, at the time of writing, about 1,600 articles.
To learn how it represents the structure of philosophy, I used some techniques borrowed from machine learning. If you are interested in the details, I have put the code below.
The basic idea is simple. Every article is represented in a bag-of-words model, which means that all the words in it are taken out of their context and the number of their occurrences is counted. (The code below simplifies this further and only records whether a word or two-word phrase occurs in an article at all.) These word counts can then be used to calculate a similarity metric, called cosine similarity, between every pair of texts. Texts that use the same words are similar; texts that do not are not.
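To make the idea concrete, here is a minimal sketch of bag-of-words cosine similarity using only the Python standard library. It is not the code used for the graph (that relies on scikit-learn and UMAP's built-in cosine metric); the helper names `bag_of_words` and `cosine_similarity` and the three snippets are made up for illustration, and the tokenization is deliberately naive.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)


# Three made-up snippets, purely for illustration.
logic = bag_of_words("modal logic studies necessity and possibility")
logic_2 = bag_of_words("temporal logic studies time and tense")
ethics = bag_of_words("virtue ethics concerns character")

print(cosine_similarity(logic, logic_2))  # three of six words shared: about 0.5
print(cosine_similarity(logic, ethics))   # no shared words: 0.0
```

Because the measure only looks at the angle between the count vectors, a long and a short article on the same topic can still come out as similar.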
These similarities can now be flattened down (or embedded) into a two-dimensional space using a pretty new and very useful algorithm called UMAP. We do this to get a nice visualization of the groups in our data, as we can see above. Then we use a clustering method called HDBSCAN to color the points that form the groups with the highest density, and plot everything with plotly. Points that were not assigned to a cluster are left light grey.
We can clearly make out sensible groups. In red on the right side, we find a large cluster on the classical history of philosophy. On the far left of the graph we find a cluster of articles on logic, colored green. There are also some smaller clusters, like philosophy of religion at (x=15, y=14), colored dark blue, feminism at (16, 18.5), or Chinese & Indian philosophy at (18, 19). And at (16, 18) we have the large field of political philosophy. But there is a lot more to explore: hover your mouse over the points to see the titles of the articles, or click and drag to select a window to zoom in.
And here, as promised, is the code. We start by importing some stuff:
import pandas as pd
import numpy as np
from random import randint
import datetime

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

# For tables:
from IPython.display import display
pd.set_option('display.max_columns', 500)

# For R (ggplot2):
%load_ext rpy2.ipython

from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn import datasets
from glob import glob
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import load_files
from scipy import sparse
Now, let us load the textual data:
texts = load_files("./trainingdata",
                   description=None,
                   # categories=categories,
                   load_content=True,
                   encoding='utf-8',
                   shuffle=False)  # , random_state=42)
# Binary unigram and bigram features; drop terms found in fewer than 10
# or more than 1000 articles.
count_vect = CountVectorizer(stop_words="english",
                             ngram_range=(1, 2),
                             binary=True,
                             min_df=10,
                             max_df=1000)
X = count_vect.fit_transform(texts.data)
# tfidf_transformer = TfidfTransformer()
# X = tfidf_transformer.fit_transform(X)
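For readers unfamiliar with the `CountVectorizer` settings above, the following standard-library sketch imitates, in a much simplified way, what `ngram_range=(1, 2)`, `binary=True`, `min_df` and `max_df` do. It is an illustration under assumptions, not scikit-learn's implementation: the stop list is a tiny stand-in, the helper names `features` and `build_vocabulary` are hypothetical, and the four toy documents and the thresholds of 2 are made up.

```python
import re
from collections import Counter

# Tiny stand-in stop list; scikit-learn's "english" list is much longer.
STOP_WORDS = {"the", "of", "and", "is", "a", "in"}


def features(text):
    """Return the set of unigrams and bigrams in a text (presence only)."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def build_vocabulary(docs, min_df, max_df):
    """Keep features that occur in at least min_df and at most max_df docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(features(doc))  # each doc counts a feature once
    return sorted(f for f, df in doc_freq.items() if min_df <= df <= max_df)


docs = [
    "The philosophy of mind",
    "Philosophy of mind and the philosophy of language",
    "The philosophy of language",
    "Modal logic",
]
vocab = build_vocabulary(docs, min_df=2, max_df=2)

# One row per document, one 0/1 column per vocabulary entry.
matrix = [[int(f in features(doc)) for f in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

In this toy corpus "philosophy" appears in three documents and is dropped by `max_df=2`, while terms seen only once are dropped by `min_df=2`. The same logic, with thresholds of 10 and 1000, removes very rare and near-ubiquitous terms from the encyclopedia articles.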
Embed with umap:
import umap

embedding = umap.UMAP(n_neighbors=5,    # small => local, large => global: 5-50
                      min_dist=0.001,   # small => local, large => global: 0.001-0.5
                      metric='cosine').fit_transform(X)
embedding = pd.DataFrame(embedding)
embedding.columns = ['x', 'y']
plt.scatter(embedding['x'], embedding['y'], color='grey')
embedding["example"] = texts.target_names
Cluster with hdbscan:
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=25, min_samples=15, gen_min_span_tree=True)
clusterer.fit(embedding[["x", "y"]])
XCLUST = clusterer.labels_
# Subtract one for the noise label (-1) given to unclustered points.
clusternum = len(set(clusterer.labels_)) - 1
# samples.append(clusternum)
dfclust = pd.DataFrame(XCLUST)
dfclust.columns = ['cluster']
print(clusternum)
# Put coordinates, titles and cluster labels side by side, aligned on the embedding's index.
embeddingC = pd.concat([embedding, dfclust], axis=1).reindex(embedding.index)
# display(embeddingC)
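One detail worth spelling out: HDBSCAN gives the label -1 to points it treats as noise, so the expression `len(set(clusterer.labels_)) - 1` is only the number of clusters if at least one point was left unclustered. The sketch below shows a count that does not depend on that; the helper name `summarize_labels` and the example labels are made up, and in the notebook the input would be `clusterer.labels_`.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of clusters, number of noise points, sizes per cluster)."""
    sizes = Counter(int(label) for label in labels)
    n_noise = sizes.pop(-1, 0)  # HDBSCAN marks unclustered points with -1
    return len(sizes), n_noise, dict(sizes)


# Made-up labels for illustration only.
example_labels = [0, 0, 1, -1, 1, 1, -1, 2]
n_clusters, n_noise, sizes = summarize_labels(example_labels)
print(n_clusters, n_noise, sizes)

# Without any noise points the cluster count is still correct.
print(summarize_labels([0, 0, 1])[0])
```

The per-cluster sizes are also a quick sanity check on whether `min_cluster_size` and `min_samples` were chosen sensibly.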
And produce the plotly-graph you can see at the top:
%%R -i embeddingC
# -o myPal
# Only embeddingC is passed into R, so all columns are read from it.
means <- aggregate(embeddingC[, c("x", "y")], list(embeddingC$cluster), median)
means <- data.frame(means)
n = nrow(means)
means <- means[-1, ]

# Make the colors:
mycolors <- c("#293757", "#568D4B", "#D5BB56", "#D26A1B", "#A41D1A")  # Gene Davis
# mycolors <- c("#c03728","#919c4c","#fd8f24","#f5c04a","#e68c7c","#00666b","#142948","#6f5438")
pal <- colorRampPalette(sample(mycolors))
s <- n - 1
myGray <- c('#95a5a6')  # light grey for unclustered points
myNewColors <- sample(pal(s))
myPal <- append(myGray, myNewColors)

library(plotly)
p <- plot_ly(
  type = 'scatter', mode = 'markers',
  x = embeddingC$x, y = embeddingC$y,
  color = as.factor(embeddingC$cluster), colors = myPal,
  text = embeddingC$example, hoverinfo = "text",
  marker = list(size = 8, opacity = 0.4)) %>%
  layout(
    margin = list(l = 50, r = 50, b = 50, t = 80, pad = 4),
    # font = t,
    title = '<b>Stanford Encyclopedia - umap embedded </b> <br><span style="font-size: 16px !important;">...based on the code by McInnes, Healy (2018)</span>',
    xaxis = list(title = 'umap-x', zerolinecolor = toRGB("lightgray")),
    yaxis = list(title = 'umap-y', zerolinecolor = toRGB("lightgray"))) %>%
  config(displayModeBar = F)

htmlwidgets::saveWidget(as_widget(p), selfcontained = TRUE, "graph.html")
McInnes, L., Healy, J. Accelerated Hierarchical Density Based Clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp. 33-42, 2017.
McInnes, L., Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints 1802.03426, 2018.