In this Jupyter notebook we will investigate the macro-structure of philosophical literature, and at the end we will produce the graphic above. As a basis for our investigation I have collected about fifty thousand records from the Web of Science collection, spanning from the forties to this very day. The question of how philosophy is structured has received quite a lot of attention, albeit usually under more specific formulations, for example when people ask whether there is a divide between "analytic" and "continental" philosophy, or whether philosophy is divided more along the lines of "naturalism" and "anti-naturalism". I think that the toolkit I provide below allows us to give answers to these questions. As it is purely data-driven, it is free from most prior assumptions about how philosophy is structured. And because it encompasses a rather large sample of the philosophical literature, it should guide us to a point of view that is clear of the personal expectations we bring from our own intellectual histories.
Let us begin by loading some libraries that we will need. Here I want to draw attention to the great metaknowledge package, which makes handling WOS files so much easier.
import metaknowledge as mk
import pandas as pd
import numpy as np
from random import randint
import datetime
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# plt.rcParams['figure.dpi']= 130
# from matplotlib.pyplot import figure
# plt.style.use('fivethirtyeight')
# sns.set_context('poster')
# sns.set_style('white')
# #sns.set_color_codes()
# plot_kwds = {'alpha' : 0.25, 's' : 10, 'linewidths':0}
#For Tables:
from IPython.display import display
pd.set_option('display.max_columns', 500)
#For R (ggplot2)
%load_ext rpy2.ipython
The WOS files were collected with a threefold snowball-sampling strategy. I started out with eight pretty different journals:
These should provide a pretty diverse entry point into philosophy. For each journal I collected the 500 most cited papers and analysed which journals these papers cited. I then evaluated the citations of the top 500 papers in each of the 30 most cited of those journals (a total of around 15,000 papers). Thus I arrived at a list of fifty philosophy journals that should, by every sensible criterion, be a good approximation of whatever philosophers have been interested in. For each of these journals I downloaded every record in the WOS database, arriving at a collection of 54,491 records.
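The counting step in each snowball round is easy to reproduce with metaknowledge. What follows is only a minimal sketch, not the original collection script: the folder name is hypothetical, and it assumes that the citation dictionary returned by getCitations() exposes a 'journal' key alongside the 'author' and 'citeString' keys used further below.
from collections import Counter
# Minimal sketch of one snowball step: count which journals a seed collection cites.
# "seed_data" is a hypothetical folder containing the WOS records of the seed journals.
seedRC = mk.RecordCollection("seed_data")
journal_counts = Counter()
for R in seedRC:
    cited = R.getCitations().get("journal") # assumed key, analogous to "author"/"citeString" below
    if cited:
        journal_counts.update(j for j in cited if j)
# The 30 most cited journals feed the next sampling round.
print(journal_counts.most_common(30))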
Now, without further ado, let's load our raw data, filter out incomplete records, and print a little summary of it.
date_string = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M")
RC = mk.RecordCollection("datanew")
print(RC.glimpse())
RC2 = mk.RecordCollection()
for R in RC:
    randnr = randint(0, 4)
    if len(R.getCitations().get("author")) >= 3: # and randnr==0 -- apply this condition to downsample the records
        # Here we kick out every paper that cites fewer than 3 authors. Why? Because they
        # are so dissimilar from the others that they only produce noise.
        try:
            R['year']
            #R['abstract'] # Add this when working with abstracts. It removes every paper that has none.
            # This can sometimes remove whole journals that are archived without abstracts, so handle with care.
            RC2.add(R)
        except KeyError:
            pass
    else:
        pass
print(RC2.glimpse())
RC = RC2
drc = pd.DataFrame.from_dict(RC.forNLP(extraColumns=['journal','AU']))
#Build a dataframe of bibliographic data for later use.
The table above gives us some statistics about the data we are working with. As you can see, we had to remove around 6000 records because of missing data, but that should be fine. The summaries show us the most prolific authors, the journals with the most occurrences, and the most cited single works. All of this makes sense so far. We have the incredibly popular David Lewis with multiple mentions among the top cited works, along with some other very well known recent authors, and of course the most influential of the classics: Aristotle, Hume and Kant.
Now we have to decide what structure we are interested in. We have quite a lot of possibilities to work with in the data. Below I have supplied code for three: clustering by specific cited works, clustering by the names of cited authors, and clustering by the content of the abstracts. Here we will look only at the first: the clusters that are generated through the analysis of the citations of specific works. Formally, we are looking for cocitation communities of papers. In real life, this should translate to something between a tradition, in which people cite the intellectual heroes they have in common, and a thematic grouping, in which people interested in a subject cite other people who have worked on that subject. This approach has the advantage of a certain granularity, as it allows us, for example, to differentiate between papers that cite Nozick on political theory and those that are interested in his work on epistemology.
########### Clustering by Cited Works ############
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re
d = []
citedAU = []
citestring =[]
for R in RC:
    d.append(list(set(R.getCitations().get("citeString")))) # To cluster by specific cited works
    citedAU.append(list(set(R.getCitations().get("author"))))
    citestring.append(list(set(R.getCitations().get("citeString"))))
drc["citedAU"] = citedAU
drc["citestring"] = citestring
# Join the cite strings of each paper into one '§'-separated document...
authorslist = ['§'.join(filter(None, x)) for x in list(d)]
# ...and tokenize on that separator, so that every distinct cited work becomes one feature.
vec = TfidfVectorizer(token_pattern=r'(?<=[^|§])[\s\w,\.:;]+(?=[$|§])')
Xrc = vec.fit_transform(authorslist)
#display(pd.DataFrame(Xrc.toarray(), columns=vec.get_feature_names()).transpose()) #To look into the vectors. Beware, can take a bit of RAM
Now we conduct a dimensionality reduction of the data to make what follows faster. Afterwards we make a little plot in R. With our large dataset this is not that interesting, but if we have a small dataset that is very diverse, e.g. a few hundred papers from very different disciplines, the SVD alone can already make out some structure.
from sklearn.decomposition import TruncatedSVD
SVD = TruncatedSVD(n_components=170, n_iter=7, random_state=42)
XSVD = SVD.fit_transform(Xrc)
print(SVD.explained_variance_ratio_.sum())
dSVD = pd.DataFrame(XSVD)
sSVD = dSVD[[0,1]]
sSVD.columns = ['x','y']
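Whether 170 components are a good choice can be checked quickly by looking at the cumulative explained variance. The following little check is not part of the original pipeline, just a sketch on top of the SVD object defined above:
# Optional check: cumulative explained variance over the SVD components,
# to see whether n_components=170 retains an acceptable share of the information.
cumvar = np.cumsum(SVD.explained_variance_ratio_)
plt.plot(range(1, len(cumvar) + 1), cumvar)
plt.xlabel("Number of SVD components")
plt.ylabel("Cumulative explained variance")
plt.show()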
%%R -i sSVD --width 1200 --height 800 -r 140 --bg #F8F4E9
library(hrbrthemes)
library(ggplot2)
library(showtext)
font.add.google(name = "Alegreya Sans SC", family = "SC")
showtext.auto()
p <- ggplot(sSVD, aes(x=sSVD$x, y=sSVD$y)) + geom_point(color="#D65C0F", alpha=0.4,pch=16,cex=2.2)+
labs(x="", y="",
title="The first two components...",
subtitle="...as determined by SciKit learn's SVD-implementation.",
caption="by Maximilian Noichl")+
theme_ipsum()+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank())
p
Now that we have prepared our data, let's find some structure in it. First we will prepare various UMAP mappings.
UMAP is a pretty young technique for dimensionality reduction that can be used in a fashion similar to t-SNE. You can find out all about it here. The first mapping reduces the result of the SVD to 15 dimensions for the clustering. We do this to retain more information and thus make our clusters more informative. But we cannot go much higher, as HDBSCAN tends to fall prey to the curse of dimensionality.
Then we compute a two-dimensional embedding that we will use to look at our clusters in a scatter plot. And finally a one-dimensional embedding that we will plot against the publication dates, so that we have a visualization of the temporal flow of our clusters.
UMAP can be tuned a little. Here I was trying to achieve a quite fine-grained embedding that makes the structure very visible.
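If you want to get a feeling for the two main parameters yourself, a small sweep over a grid of values can help. This is only a sketch with arbitrary example values, and it is slow on the full dataset, so you might want to run it on a subsample of dSVD:
# Optional sketch: a grid of 2D embeddings for a few example parameter combinations.
# The value grids are arbitrary; on ~50,000 records this takes a while.
import umap
neighbor_values = [5, 15, 50]
dist_values = [0.005, 0.05, 0.5]
fig, axes = plt.subplots(len(neighbor_values), len(dist_values), figsize=(12, 12))
for i, nn in enumerate(neighbor_values):
    for j, md in enumerate(dist_values):
        emb = umap.UMAP(n_neighbors=nn, min_dist=md, metric='cosine').fit_transform(dSVD)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], s=2, alpha=0.3, color='grey')
        axes[i, j].set_title("n_neighbors=%d, min_dist=%.3f" % (nn, md), fontsize=8)
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
plt.show()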
import umap
try:
    drc = drc.drop('x',axis=1)
    drc = drc.drop('y',axis=1)
except KeyError:
    pass
embedding15 = umap.UMAP(n_components=15,
n_neighbors=10,#small => local, large => global: 5-50
min_dist=0.05, #small => local, large => global: 0.001-0.5
metric='cosine').fit_transform(dSVD)
embedding15 = pd.DataFrame(embedding15)
#embedding15.columns = ['x','y']
#plt.scatter(embedding['x'], embedding['y'], color='grey')
import umap
try:
    drc = drc.drop('x',axis=1)
    drc = drc.drop('y',axis=1)
except KeyError:
    pass
embedding = umap.UMAP(n_neighbors=14,#small => local, large => global: 5-50
min_dist=0.008, #small => local, large => global: 0.001-0.5
metric='cosine').fit_transform(dSVD)
embedding = pd.DataFrame(embedding)
embedding.columns = ['x','y']
#plt.scatter(embedding['x'], embedding['y'], color='grey')
try:
    drc = drc.drop('xI',axis=1)
except KeyError:
    pass
embeddingI = umap.UMAP(n_components=1,
n_neighbors=30,
min_dist=0.5,
metric='cosine').fit_transform(dSVD)#pd.concat([dSVD, drc['year']],axis=1)
embeddingI = pd.DataFrame(embeddingI)
embeddingI.columns = ['xI']
embeddingI = pd.concat([embeddingI, drc['year']],axis=1)
#plt.scatter(embeddingI['xI'], drc['year'], color='grey')
To look at the two-dimensional embedding, let's plot it with ggplot:
%%R -i embedding --width 1200 --height 800 -r 140 --bg #F5F5F5
library(hrbrthemes)
library(ggplot2)
library(fields)
embedding$density <- fields::interp.surface(
MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")])
p <- ggplot(embedding, aes(x=embedding$x, y=embedding$y,alpha = 1/density))+
guides(alpha=FALSE)+
geom_point(color="#3366cc", pch=16,cex=2.2)+ theme_ipsum_rc()+
labs(x="", y="",
title="The 2d-reduction by UMAP",
subtitle="...based on the code by McInnes, Healy (2018)",
caption="by Maximilian Noichl")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank()
)
p
And now for the clustering. I am using HDBSCAN, a density-based clusterer, which seems to pair quite well with the output of UMAP. I will not explain in detail here how it works, but the documentation is very readable. It should be noted that our clustering can be as fine-grained or as holistic as we want. I have chosen a rather high minimum cluster size so that we get at the big picture.
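If you are curious how sensitive the result is to this choice, a quick sweep like the following (the values are arbitrary examples) just prints the number of clusters HDBSCAN finds, before we run the actual clustering in the next cell:
# Optional sketch: how does the number of clusters react to min_cluster_size?
# The values below are arbitrary examples.
import hdbscan
for mcs in [200, 440, 800]:
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=30).fit(embedding15).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("min_cluster_size=%d: %d clusters" % (mcs, n_clusters))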
try:
    drc = drc.drop('cluster',axis=1)
except KeyError:
    pass
import hdbscan
#(min_cluster_size=500, min_samples=30, gen_min_span_tree=True)
#clusterer = hdbscan.HDBSCAN(min_cluster_size=455, min_samples=35, gen_min_span_tree=True)
clusterer = hdbscan.HDBSCAN(min_cluster_size=440, min_samples=30, gen_min_span_tree=True)
clusterer.fit(embedding15)
XCLUST = clusterer.labels_
clusternum = len(set(clusterer.labels_))-1 # subtract one, as the label -1 marks noise points, not a cluster
dfclust = pd.DataFrame(XCLUST)
dfclust.columns = ['cluster']
#plt.scatter(embedding['x'], embedding['y'], s=10, linewidth=0, c=cluster_colors, alpha=0.25)
#clusterer.condensed_tree_.plot()
#print(clusterer.condensed_tree_.to_pandas().head())
#clusterer.condensed_tree_.plot()
print(clusternum)
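To get a first impression of what the clusters mean, we can average the TF-IDF vectors of the papers in each cluster and look at the highest-weighted cited works. This is only a quick sketch on top of the objects defined above:
# Optional sketch: the most characteristic cited works per cluster, via mean TF-IDF weights.
feature_names = np.array(vec.get_feature_names()) # use get_feature_names_out() on newer scikit-learn
for label in sorted(set(XCLUST)):
    if label == -1:
        continue # -1 marks noise points, not a cluster
    mask = (XCLUST == label)
    mean_tfidf = np.asarray(Xrc[mask].mean(axis=0)).ravel()
    top = feature_names[mean_tfidf.argsort()[::-1][:5]]
    print(label, ' | '.join(top))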
Now let's plot everything with ggplot:
%%R -i embedding,dfclust,embeddingI -o myNewColors --width 1200 --height 1800 -r 140 --bg #F8F4E9
library(hrbrthemes)
library(ggplot2)
library(fields)
options(warn=0) # 0 turns warnings back on
#Get the cluster means:
means <- aggregate(embedding[,c("x","y")], list(dfclust$cluster), median)
means <- data.frame(means)
n=nrow(means)
means <- means[-1,]
#Make the colors:
mycolors <- c("#c03728",
"#919c4c",
"#fd8f24",
"#f5c04a",
"#e68c7c",
"#00666b",
"#142948",
"#6f5438")
pal <- colorRampPalette(sample(mycolors))
s <- n-1
myGray <- c('#95a5a6')
myNewColors <- sample(pal(s))
myPal <- append(myGray,myNewColors)
#get temporal means:
tmeans <- aggregate(embeddingI[,c("xI","year")], list(dfclust$cluster), median)
tmeans <- data.frame(tmeans)
tmeans <- tmeans[-1,]
#get density, to avoid overplotting
embedding$density <- fields::interp.surface(
MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")])
#get temporal density
embeddingI$density <- fields::interp.surface(
MASS::kde2d(embeddingI$xI, embeddingI$year), embeddingI[,c("xI","year")])
p <- ggplot(embedding, aes(x=embedding$x, y=embedding$y, color= as.factor(dfclust$cluster), alpha = 1/density))+
geom_point(pch=20,cex=0.6)+
theme_ipsum_rc()+
scale_color_manual(values = myPal) +
guides(alpha=FALSE, color=FALSE)+
geom_point(data=means, aes(x=means$x, y=means$y), color= myNewColors, alpha = 1,size =7)+
annotate("text", x = means[,c("x")], y = means[,c("y")], label = means[,c("Group.1")], color="white", fontface="bold", size=4, parse = TRUE, hjust=0.5)+
labs(x="", y="",
title="The clusters found by hdbscan...",
subtitle="a density-based clustering algorithm. Embedded with UMAP in two dimensions...")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank())
t <- ggplot(embeddingI, aes(x=embeddingI$xI, y=embeddingI$year, color= as.factor(dfclust$cluster), alpha = 1/density))+
geom_point(pch=20,cex=0.4)+
geom_jitter()+
theme_ipsum_rc()+
scale_color_manual(values = myPal) +
guides(alpha=FALSE, color=FALSE)+
geom_point(data=tmeans, aes(x = xI, y = year), color= myNewColors, alpha = 1,size =7)+
annotate("text", x = tmeans[,c("xI")], y = tmeans[,c("year")], label = tmeans[,c("Group.1")], color="white", fontface="bold", size=4, parse = TRUE, hjust=0.5)+
labs(x="", y="Publication date",
subtitle="...and one dimension, overlayed with publication dates on the y-axis.",
caption="by Maximilian Noichl")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank())
library(gridExtra)
grid.arrange(p,t, ncol = 1, heights = c(1, 1))
# pdf("ClusteringUMap.pdf", width = 12, height = 12) # Open a new pdf file
# grid.arrange(p,t, ncol = 1, heights = c(1, 1)) # Write the grid.arrange in the file
# dev.off()