[Figure: the structure of philosophy, the map we will produce below]

Mapping the Structure of Philosophy

In this Jupyter notebook we will investigate the macro-structure of the philosophical literature, and at the end we will produce the graphic above. As a basis for our investigation I have collected about fifty thousand records from the Web of Science collection, spanning from the 1940s to the present day. The question of how philosophy is structured has received quite a lot of attention, albeit usually under more specific formulations, for example when people ask whether there is a divide between "analytic" and "continental" philosophy, or whether philosophy is divided more along the lines of "naturalism" and "anti-naturalism". I think that the toolkit I provide below allows us to give answers to these questions. As it is purely data-driven, it is free from most prior assumptions about how philosophy is structured. And because it encompasses a rather large sample of the philosophical literature, it should lead us to a point of view that is cleared of the personal expectations each of us brings from our own intellectual history.

Let us begin by loading some libraries that we will need. Here I want to draw attention to the great metaknowledge package, which makes the handling of WOS files so much easier.

In [1]:
import metaknowledge as mk
import pandas as pd

import numpy as np
from random import randint
import datetime


%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt
# plt.rcParams['figure.dpi']= 130
# from matplotlib.pyplot import figure


# plt.style.use('fivethirtyeight')

# sns.set_context('poster')
# sns.set_style('white')
# #sns.set_color_codes()
# plot_kwds = {'alpha' : 0.25, 's' : 10, 'linewidths':0}

#For Tables:
from IPython.display import display
pd.set_option('display.max_columns', 500)

#For R (ggplot2)
%load_ext rpy2.ipython

The WOS files were collected with a threefold snowball-sampling strategy. I started out with nine pretty different journals:

  • Analysis
  • The Papers of the British Society for Phenomenology
  • The Continental Philosophy Review
  • Erkenntnis
  • Ethics
  • The Journal of Speculative Philosophy
  • Mind
  • Philosophical Quarterly
  • Philosophy and Social Criticism

These should provide a pretty diverse entry point into philosophy. For each journal I collected the 500 most cited papers and analysed which journals these papers cited. Of those journals, I then evaluated the citations of the top 500 papers of the 30 most cited ones (a total of around 15,000 papers). Thus I arrived at a list of fifty philosophy journals that should, by every sensible criterion, be a good approximation of whatever philosophers have been interested in. For each of these journals I downloaded every record in the WOS database, arriving at a collection of 54,491 records.
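The snowball sampling itself happened outside of this notebook, but its core step, counting which journals are cited by a given set of records, is simple. Here is a rough sketch of that step; the seed_data directory and the "journal" key on the citation table are assumptions for illustration, not part of the pipeline below.

from collections import Counter
import metaknowledge as mk

# One snowball step (hypothetical): load a set of WOS records and count
# which journals their references point to.
seed = mk.RecordCollection("seed_data")  # assumed directory of WOS exports

journal_counts = Counter()
for R in seed:
    cited_journals = R.getCitations().get("journal")  # assumed key, analogous to "author" below
    if cited_journals:
        journal_counts.update(j for j in cited_journals if j)

# The most cited journals seed the next round of the snowball:
for journal, count in journal_counts.most_common(30):
    print(journal, count)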

Now, without further ado, let's load our raw data, filter out incomplete records, and print a little summary of it.

In [2]:
date_string = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M")


RC = mk.RecordCollection("datanew")
In [3]:
print(RC.glimpse())
RC2 = mk.RecordCollection()

for R in RC:
    randnr = randint(0, 4)
    
    if len(R.getCitations().get("author")) >= 3: # add "and randnr == 0" to downsample the records
        # Here we kick out every paper that cites fewer than 3 authors. Why? Because they
        # are so dissimilar from the others that they only produce noise.

        try:
            R['year']
            #R['abstract']  # Add this when working with abstracts. It removes every paper that has none.
            # This can sometimes remove whole journals that are archived without abstracts, so handle with care.
            RC2.add(R)
        except KeyError:
            pass
    else:
        pass
    

print(RC2.glimpse())


RC = RC2
drc = pd.DataFrame.from_dict(RC.forNLP(extraColumns=['journal','AU']))
#Build a dataframe of bibliographic data for later use.
RecordCollection glimpse made at: 2018-08-02 16:15:38
54491 Records from files-from-datanew

Top Authors
1 SHELAH, S
2 MARGOLIS, J
3 HINTIKKA, J
4 Shelah, S
5 LOWE, EJ
6 RESCHER, N
6 LEWIS, D
7 KITCHER, P
8 [Anonymous]
9 CASTANEDA, HN
9 SOBER, E
10 Nanay, Bence
11 CHISHOLM, RM
11 PARGETTER, R
12 Turri, John
13 Douven, Igor
13 NIELSEN, K
13 JACKSON, F
14 SORENSEN, RA
15 Brueckner, A
16 ODEGARD, D
16 Brogaard, Berit

Top Journals
1 SYNTHESE
2 PHILOSOPHICAL STUDIES
3 JOURNAL OF SYMBOLIC LOGIC
4 PHILOSOPHY AND PHENOMENOLOGICAL RESEARCH
5 PHILOSOPHY OF SCIENCE
6 JOURNAL OF PHILOSOPHY
7 ANALYSIS
8 PHILOSOPHY
9 ETHICS
10 MONIST
10 SOUTHERN JOURNAL OF PHILOSOPHY
11 MIND
12 NOUS
13 AMERICAN PHILOSOPHICAL QUARTERLY
14 REVIEW OF METAPHYSICS
15 AUSTRALASIAN JOURNAL OF PHILOSOPHY
16 CANADIAN JOURNAL OF PHILOSOPHY
17 BRITISH JOURNAL FOR THE PHILOSOPHY OF SCIENCE
18 STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE
19 INQUIRY-AN INTERDISCIPLINARY JOURNAL OF PHILOSOPHY
20 JOURNAL OF PHILOSOPHICAL LOGIC
21 PHILOSOPHICAL QUARTERLY

Top Cited
1 Lewis David, 1986, PLURALITY WORLDS
2 Quine W. V. O., 1960, WORD OBJECT
3 Rawls J., 1971, THEORY JUSTICE
4 Kripke SA., 1980, NAMING NECESSITY
5 Lewis David, 1973, COUNTERFACTUALS
6 Williamson Timothy, 2000, KNOWLEDGE ITS LIMITS
7 Van Fraassen B. C., 1980, SCI IMAGE
8 Parfit D., 1984, REASONS PERSONS
9 Evans G., 1982, VARIETIES REFERENCE
10 Nozick R., 1981, PHILOS EXPLANATIONS
11 Lewis D., 1986, PHILOS PAPERS, VII
12 Davidson Donald, 1980, ESSAYS ACTIONS EVENT
13 ARISTOTLE, NICOMACHEAN ETHICS, V3, P1
14 Ryle Gilbert, 1949, CONCEPT MIND
15 Quine W. V., 1969, ONTOLOGICAL RELATIVI, P126
16 Woodward J., 2003, MAKING THINGS HAPPEN
17 Davidson D., 1984, INQUIRIES TRUTH INTE, P183
18 Nozick R., 1974, ANARCHY STATE UTOPIA, P36
19 Hempel C. G., 1965, ASPECTS SCI EXPLANAT
20 Putnam H., 1981, REASON TRUTH HIST
21 Scanlon T., 1998, WHAT WE OWE EACH OTH
22 Hume D., TREATISE HUMAN NATUR
RecordCollection glimpse made at: 2018-08-02 16:15:53
49235 Records from Empty

Top Authors
1 SHELAH, S
2 Shelah, S
3 HINTIKKA, J
4 MARGOLIS, J
5 KITCHER, P
6 SOBER, E
7 LEWIS, D
8 Nanay, Bence
8 PARGETTER, R
8 Turri, John
9 Douven, Igor
9 RESCHER, N
10 CHISHOLM, RM
10 CASTANEDA, HN
11 JACKSON, F
12 Carter, J. Adam
12 Brogaard, Berit
13 Machery, Edouard
14 NIELSEN, K
14 SORENSEN, RA
14 PETTIT, P
14 KEKES, J

Top Journals
1 SYNTHESE
2 PHILOSOPHICAL STUDIES
3 JOURNAL OF SYMBOLIC LOGIC
4 PHILOSOPHY OF SCIENCE
5 PHILOSOPHY AND PHENOMENOLOGICAL RESEARCH
6 JOURNAL OF PHILOSOPHY
7 ANALYSIS
8 MONIST
9 SOUTHERN JOURNAL OF PHILOSOPHY
10 NOUS
11 AMERICAN PHILOSOPHICAL QUARTERLY
12 PHILOSOPHY
13 ETHICS
14 AUSTRALASIAN JOURNAL OF PHILOSOPHY
15 MIND
16 REVIEW OF METAPHYSICS
17 CANADIAN JOURNAL OF PHILOSOPHY
18 STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE
19 BRITISH JOURNAL FOR THE PHILOSOPHY OF SCIENCE
20 JOURNAL OF PHILOSOPHICAL LOGIC
21 INQUIRY-AN INTERDISCIPLINARY JOURNAL OF PHILOSOPHY
22 ERKENNTNIS

Top Cited
1 Lewis David, 1986, PLURALITY WORLDS
2 Quine W. V. O., 1960, WORD OBJECT
3 Rawls J., 1971, THEORY JUSTICE
4 Kripke SA., 1980, NAMING NECESSITY
5 Lewis David, 1973, COUNTERFACTUALS
6 Williamson Timothy, 2000, KNOWLEDGE ITS LIMITS
7 Van Fraassen B. C., 1980, SCI IMAGE
8 Parfit D., 1984, REASONS PERSONS
9 Evans G., 1982, VARIETIES REFERENCE
10 Nozick R., 1981, PHILOS EXPLANATIONS
11 Lewis D., 1986, PHILOS PAPERS, VII
12 Davidson Donald, 1980, ESSAYS ACTIONS EVENT
13 ARISTOTLE, NICOMACHEAN ETHICS, V3, P1
14 Ryle Gilbert, 1949, CONCEPT MIND
15 Quine W. V., 1969, ONTOLOGICAL RELATIVI, P126
16 Woodward J., 2003, MAKING THINGS HAPPEN
17 Hempel C. G., 1965, ASPECTS SCI EXPLANAT
18 Davidson D., 1984, INQUIRIES TRUTH INTE, P183
19 Nozick R., 1974, ANARCHY STATE UTOPIA, P36
20 Scanlon T., 1998, WHAT WE OWE EACH OTH
20 Putnam H., 1981, REASON TRUTH HIST
21 Dretske F.I., 1981, KNOWLEDGE FLOW INFOR

The summaries above give us some statistics about the data we are working with. As you can see, we had to remove around 5,000 records because of missing data, but that should be fine. They show us the most prolific authors, the journals with the most records, and the most cited individual works. All of this makes sense so far: we have the incredibly popular David Lewis appearing multiple times among the top cited works, along with some other very well-known recent authors, and of course the most influential of the classics, Aristotle and Hume.

Now we have to decide what structure we are interested in; the data offers quite a lot of possibilities. Below I have supplied code for three: clustering by the specific works cited, clustering by the names of cited authors, and clustering by the content of the abstracts. Here we will look only at the first: the clusters that are generated from the citation of specific works. Formally, we are looking for cocitation communities of papers. In real life, this should translate to something between a tradition, in which people cite the intellectual heroes they have in common, and a thematic grouping, in which people interested in a subject cite other people who have worked on that subject. This approach has the advantage of a certain granularity, as it allows us, for example, to differentiate between papers that cite Nozick on political theory and those that are interested in his work on epistemology.

In [4]:
########### Clustering by Cited Works ############

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re

d = []
citedAU = []
citestring = []
for R in RC:

    d.append(list(set(R.getCitations().get("citeString")))) # the specific works a paper cites; this is what we cluster on
    citedAU.append(list(set(R.getCitations().get("author")))) # cited authors, kept for the alternative clustering
    citestring.append(list(set(R.getCitations().get("citeString"))))

drc["citedAU"] = citedAU
drc["citestring"] = citestring

# Join each paper's cited works into a single string, delimited by '§',
# so that the vectorizer below can split them apart again:
authorslist = ['§'.join(filter(None, x)) for x in d]


vec = TfidfVectorizer(token_pattern=r'(?<=[^|§])[\s\w,\.:;]+(?=[$|§])')
Xrc = vec.fit_transform(authorslist)

#display(pd.DataFrame(Xrc.toarray(), columns=vec.get_feature_names()).transpose()) #To look into the vectors. Beware, can take a bit of RAM
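If you want to see what the somewhat hairy token_pattern actually extracts, it helps to run the vectorizer on a tiny toy corpus first. The following sanity check uses made-up citation strings; note that the vectorizer lowercases by default, and that the lookbehind/lookahead in the pattern can clip tokens at the very start and end of a document.

# Sanity check of the token pattern on two fake papers whose cited
# works are joined with '§', just like authorslist above:
toy = ['§'.join(['Lewis D., 1986, PLURALITY WORLDS', 'Quine W., 1960, WORD OBJECT']),
       '§'.join(['Rawls J., 1971, THEORY JUSTICE', 'Quine W., 1960, WORD OBJECT'])]

toyvec = TfidfVectorizer(token_pattern=r'(?<=[^|§])[\s\w,\.:;]+(?=[$|§])')
Xtoy = toyvec.fit_transform(toy)

print(toyvec.get_feature_names())  # the tokens the pattern extracted
print(Xtoy.shape)                  # (number of papers, number of distinct tokens)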

Now we conduct a dimensionality reduction of the data, to make what follows faster. Afterwards we make a little plot in R. For our large dataset this is not that interesting, but if we have small data that is very diverse, e.g. a few hundred papers from very different disciplines, the SVD can already make out some structure.

In [5]:
from sklearn.decomposition import TruncatedSVD
SVD = TruncatedSVD(n_components=170, n_iter=7, random_state=42)

XSVD = SVD.fit_transform(Xrc)
print(SVD.explained_variance_ratio_.sum())
dSVD = pd.DataFrame(XSVD)

sSVD = dSVD[[0,1]]
sSVD.columns = ['x','y']
0.03946636383314643
In [68]:
%%R -i sSVD --width 1200 --height 800 -r 140 --bg #F8F4E9
library(hrbrthemes)
library(ggplot2)

library(showtext)
font.add.google(name = "Alegreya Sans SC", family = "SC")
showtext.auto()


p <- ggplot(sSVD, aes(x = x, y = y)) +
  geom_point(color="#D65C0F", alpha=0.4, pch=16, cex=2.2) +
  labs(x="", y="",
       title="The first two components...",
       subtitle="...as determined by SciKit learn's SVD-implementation.",
       caption="by Maximilian Noichl") +
  theme_ipsum() +
  theme(panel.grid.major = element_line(colour = "lightgrey"), panel.grid.minor = element_blank())

p

UMAP and Clustering

Now that we have prepared our data, let's find some structure in it. First we will prepare various UMAP mappings.

UMAP is a pretty young technique for dimensionality reduction that can be used in a fashion similar to t-SNE. You can find out all about it here. The first mapping reduces the result of the SVD to 15 dimensions for the clustering. We do this to retain more information, which makes our clusters more informative. But we cannot go much higher, as HDBSCAN tends to fall prey to the curse of dimensionality.

Then we compute a two-dimensional embedding, which we will use to look at our clusters in a scatter plot. And finally a one-dimensional embedding, which we will plot against the publication dates, so that we have a visualization of the temporal flow of our clusters.

UMAP can be tuned a little bit. Here I was trying to achieve a quite fine-grained embedding that makes the structure very visible.

In [7]:
import umap

try:
    drc = drc.drop('x',axis=1)
    drc = drc.drop('y',axis=1)

except KeyError:
    pass


embedding15 = umap.UMAP(n_components=15,
                    n_neighbors=10,#small => local, large => global: 5-50
                      min_dist=0.05, #small => local, large => global: 0.001-0.5
                      metric='cosine').fit_transform(dSVD)
embedding15 = pd.DataFrame(embedding15)
#embedding15.columns = ['x','y']
#plt.scatter(embedding['x'], embedding['y'], color='grey')
In [126]:
import umap

try:
    drc = drc.drop('x',axis=1)
    drc = drc.drop('y',axis=1)

except KeyError:
    pass


embedding = umap.UMAP(n_neighbors=14,#small => local, large => global: 5-50
                      min_dist=0.008, #small => local, large => global: 0.001-0.5
                      metric='cosine').fit_transform(dSVD)
embedding = pd.DataFrame(embedding)
embedding.columns = ['x','y']
#plt.scatter(embedding['x'], embedding['y'], color='grey')
In [ ]:
try:
    drc = drc.drop('xI',axis=1)
except KeyError:
    pass

embeddingI = umap.UMAP(n_components=1,
                        n_neighbors=30,
                      min_dist=0.5,
                      metric='cosine').fit_transform(dSVD)#pd.concat([dSVD, drc['year']],axis=1)
embeddingI = pd.DataFrame(embeddingI)
embeddingI.columns = ['xI']
embeddingI = pd.concat([embeddingI, drc['year']],axis=1)

#plt.scatter(embeddingI['xI'], drc['year'], color='grey')

To look at the two-dimensional embedding, let's plot it with ggplot:

In [22]:
%%R -i embedding --width 1200 --height 800 -r 140 --bg #F5F5F5
library(hrbrthemes)
library(ggplot2)
library(fields)
embedding$density <- fields::interp.surface(
  MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")])

p <- ggplot(embedding, aes(x = x, y = y, alpha = 1/density)) +
  guides(alpha = FALSE) +
  geom_point(color="#3366cc", pch=16, cex=2.2) +
  theme_ipsum_rc() +
  labs(x="", y="",
       title="The 2d-reduction by UMAP",
       subtitle="...based on the code by McInnes, Healy (2018)",
       caption="by Maximilian Noichl") +
  theme(panel.grid.major = element_line(colour = "lightgrey"), panel.grid.minor = element_blank())
p

And now for the clustering. I am using HDBSCAN, a density-based clusterer, which seems to pair quite well with the output of UMAP. I will not explain here in detail how it works, but the documentation is very readable. It should be noted that our clustering can be as precise or as holistic as we want; I have chosen a rather high minimum cluster size so that we can get at the big picture. A quick way to see how sensitive the result is to that choice is sketched below.
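The following is a small exploratory sketch, not part of the original pipeline: it sweeps a few values of min_cluster_size over the embedding15 from above and reports the cluster count and the share of noise points. The specific values are illustrative.

import hdbscan

# Hypothetical sensitivity check: sweep min_cluster_size and watch
# how the number of clusters and the noise share respond.
for mcs in [200, 300, 440, 600]:
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=30).fit(embedding15).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_share = (labels == -1).mean()
    print(f"min_cluster_size={mcs}: {n_clusters} clusters, {noise_share:.1%} noise")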

In [79]:
try:
    drc = drc.drop('cluster',axis=1)
except KeyError:
    pass

import hdbscan

#(min_cluster_size=500, min_samples=30, gen_min_span_tree=True)
#clusterer = hdbscan.HDBSCAN(min_cluster_size=455, min_samples=35, gen_min_span_tree=True)

clusterer = hdbscan.HDBSCAN(min_cluster_size=440, min_samples=30, gen_min_span_tree=True)
clusterer.fit(embedding15)
XCLUST = clusterer.labels_
clusternum = len(set(clusterer.labels_)) - 1  # subtract one for the noise label (-1)


dfclust = pd.DataFrame(XCLUST)
dfclust.columns = ['cluster']



#plt.scatter(embedding['x'], embedding['y'], s=10, linewidth=0, c=cluster_colors, alpha=0.25)
#clusterer.condensed_tree_.plot()
#print(clusterer.condensed_tree_.to_pandas().head())
#clusterer.condensed_tree_.plot()
print(clusternum)
22
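Before we plot, it is worth peeking into what the clusters actually contain. The rows of drc are in the same order as the records we embedded and clustered, so we can attach the labels and list the most frequently cited works per cluster. This is a small exploratory sketch added for orientation; it relies only on the citestring column built above.

from collections import Counter

drc['cluster'] = XCLUST  # drc rows follow the same record order as the clustering input

# For each cluster (label -1 is noise), show the three most cited works:
for label in sorted(set(XCLUST)):
    if label == -1:
        continue
    cites = Counter(c for row in drc.loc[drc['cluster'] == label, 'citestring'] for c in row)
    print(label, [work for work, _ in cites.most_common(3)])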

Now let's plot everything in ggplot:

In [80]:
%%R -i embedding,dfclust,embeddingI -o myNewColors --width 1200 --height 1800 -r 140 --bg #F8F4E9

library(hrbrthemes)
library(ggplot2)
library(fields)

options(warn=0) # 0 switches warnings on

#Get the cluster means:
means <- aggregate(embedding[,c("x","y")], list(dfclust$cluster), median)
means <- data.frame(means) 
n=nrow(means)
means <- means[-1,]

#Make the colors: 
mycolors <- c("#c03728",
"#919c4c",
"#fd8f24",
"#f5c04a",
"#e68c7c",
"#00666b",
"#142948",
"#6f5438") 

pal <- colorRampPalette(sample(mycolors))
s <- n-1
myGray <- c('#95a5a6')
myNewColors <- sample(pal(s))
myPal <- append(myGray,myNewColors)


#get temporal means:
tmeans <- aggregate(embeddingI[,c("xI","year")], list(dfclust$cluster), median)
tmeans <- data.frame(tmeans) 
tmeans <- tmeans[-1,]



#get density, to avoid overplotting:
embedding$density <- fields::interp.surface(
  MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")])

#get temporal density:
embeddingI$density <- fields::interp.surface(
  MASS::kde2d(embeddingI$xI, embeddingI$year), embeddingI[,c("xI","year")])


p <- ggplot(embedding, aes(x = x, y = y, color = as.factor(dfclust$cluster), alpha = 1/density)) +
  geom_point(pch=20, cex=0.6) +
  theme_ipsum_rc() +
  scale_color_manual(values = myPal) +
  guides(alpha=FALSE, color=FALSE) +
  geom_point(data=means, aes(x = x, y = y), color= myNewColors, alpha = 1, size = 7) +
  annotate("text", x = means[,c("x")], y = means[,c("y")], label = means[,c("Group.1")],
           color="white", fontface="bold", size=4, parse = TRUE, hjust=0.5) +
  labs(x="", y="",
       title="The clusters found by hdbscan...",
       subtitle="a density-based clustering algorithm. Embedded with UMAP in two dimensions...") +
  theme(panel.grid.major = element_line(colour = "lightgrey"), panel.grid.minor = element_blank())




t <- ggplot(embeddingI, aes(x = xI, y = year, color = as.factor(dfclust$cluster), alpha = 1/density)) +
  geom_point(pch=20, cex=0.4) +
  geom_jitter() +
  theme_ipsum_rc() +
  scale_color_manual(values = myPal) +
  guides(alpha=FALSE, color=FALSE) +
  geom_point(data=tmeans, aes(x = xI, y = year), color= myNewColors, alpha = 1, size = 7) + # tmeans has columns xI and year
  annotate("text", x = tmeans[,c("xI")], y = tmeans[,c("year")], label = tmeans[,c("Group.1")],
           color="white", fontface="bold", size=4, parse = TRUE, hjust=0.5) +
  labs(x="", y="Publication date",
       subtitle="...and one dimension, overlayed with publication dates on the y-axis.",
       caption="by Maximilian Noichl") +
  theme(panel.grid.major = element_line(colour = "lightgrey"), panel.grid.minor = element_blank())


library(gridExtra)
grid.arrange(p,t, ncol = 1,  heights = c(1, 1))

# pdf("ClusteringUMap.pdf", width = 12, height = 12) # Open a new pdf file
# grid.arrange(p,t, ncol = 1,  heights = c(1, 1)) # Write the grid.arrange into the file
# dev.off()