An older version of my philosophy maps. I'm mainly keeping it around because it contains some useful code.

[Figure: plot.png — the labeled cluster map that this notebook builds, shown again at the end.]

In this notebook we will investigate the macro-structure of the philosophical literature. As a basis for our investigation I have collected about fifty thousand records from the Web of Science collection, spanning from the late forties to the present day.

The question of how philosophy is structured has received quite a lot of attention, albeit usually under more specific formulations, for example when people ask whether there is a divide between “analytic” and “continental” philosophy, or whether philosophy is divided more along the lines of “naturalism” and “anti-naturalism”. I think that the toolkit I provide below allows us to answer these questions. As it is purely data-driven, it is free from most prior assumptions about how philosophy is structured. And because it encompasses a rather large sample of the philosophical literature, it should guide us to a point of view that is free from the personal expectations we bring from our own intellectual histories.

In this post we will focus mostly on the visualization aspect, to build some intuitions about the structure of the philosophical literature; we will not attempt to answer specific questions.

This work makes use of several great Python libraries. I would like to note three in particular; a minimal sketch of how they fit together follows the list:

  • metaknowledge, which we will use to parse the Web of Science data.
  • UMAP, which we will use to embed the data.
  • hdbscan, which we will use for the clustering.
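
For orientation, here is that minimal sketch of the pipeline, compressed into a few lines. The actual feature construction and parameter choices are developed step by step below; treat this only as a rough outline.

import metaknowledge as mk
import umap
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer

# Parse a folder of Web of Science export files into records.
RC = mk.RecordCollection("datanew")

# One "document" per record: its cited works, joined into a single string.
docs = ["§".join(filter(None, R.getCitations().get("citeString"))) for R in RC]

# Bag-of-citations features (binary counts, rare citations dropped).
X = CountVectorizer(token_pattern=r"[^§]+", binary=True, min_df=3).fit_transform(docs)

# Embed into two dimensions and cluster the embedding.
emb = umap.UMAP(metric="cosine").fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=440).fit_predict(emb)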

Note: This is the second online version of this project. It differs from the first in the following aspects:

  • It uses a different, and I think more advanced, method of vectorization.
  • I have fixed an error in the data spotted by Fabio Votta.
  • I updated the visualization technique.
  • I cluster on the two-dimensional UMAP embedding, not on the thirty-dimensional one, which works surprisingly well.

Literature

  • McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp. 33-42.

  • McInnes, L., & Healy, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints 1802.03426.

  • McIlroy-Young, R., McLevey, J., & Anderson, J. (2015). metaknowledge: open source software for social networks, bibliometrics, and sociology of knowledge research. URL: http://www.networkslab.org/metaknowledge.

    import metaknowledge as mk
    import pandas as pd
    import numpy as np
    from random import randint
    import datetime
    import scipy as scipy
    
    %matplotlib inline
    
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    #For Tables:
    from IPython.display import display
    from IPython.display import Latex
    pd.set_option('display.max_columns', 500)
    
    #For R (ggplot2)
    %load_ext rpy2.ipython

    The rpy2.ipython extension is already loaded. To reload it, use: %reload_ext rpy2.ipython

The WOS files were collected with a threefold snowball-sampling strategy. I started out with nine rather different journals:

  • Analysis
  • The Papers of the British Society for Phenomenology
  • The Continental Philosophy Review
  • Erkenntnis
  • Ethics
  • The Journal of Speculative Philosophy
  • Mind
  • Philosophical Quarterly
  • Philosophy and Social Criticism

These should provide a pretty diverse entry point into philosophy. For each journal I collected the 500 most cited papers and analysed which journals these papers cited. I then took the 30 most cited of those journals and evaluated the citations of their top 500 papers (a total of around 15,000 papers). Thus I arrived at a list of fifty philosophy journals that should, by every sensible criterion, be a good approximation of whatever philosophers have been interested in. For each of these journals I downloaded every record in the WOS database, arriving at a collection of 56,062 records.
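
A sketch of one such snowball step, for anyone who wants to reproduce it: given a RecordCollection of the seed journals, we count which sources their papers cite most often. Here I simply take the third comma-separated field of the WOS citation string ("AUTHOR, YEAR, SOURCE, ...") as the cited source; that is a rough heuristic rather than an official metaknowledge API, and the folder name is a placeholder.

from collections import Counter
import metaknowledge as mk

seed = mk.RecordCollection("data_seed_journals")  # placeholder folder with the seed-journal exports

cited_sources = Counter()
for R in seed:
    for cite in R.getCitations().get("citeString"):
        parts = cite.split(", ") if cite else []
        if len(parts) >= 3:
            cited_sources[parts[2]] += 1  # third field is usually the cited source

# The most cited sources become the candidates for the next sampling round.
for source, n in cited_sources.most_common(30):
    print(source, n)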

Now, without further ado, let's load our raw data, filter out incomplete records, and print a little summary.

date_string = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M")


RC = mk.RecordCollection("datanew")
# print(RC.glimpse())
RC2 = mk.RecordCollection()

for R in RC:
    randnr = randint(0, 4)
    
    if len(R.getCitations().get("author"))>=1: # add "and randnr==0" to the condition in order to downsample the records
        #Here we kick out every paper that cites no authors at all. Why? Because such
        #records are so dissimilar from the others that they only produce noise.
   
        try:
            R['year']
            #if R['year']>1961:
            #R['abstract']  #Add this when working with abstracts. It removes every paper that has none. 
            #This can sometimes remove whole journals, that are archived without abstracts, so handle with care.
            RC2.add(R)
        except KeyError:
            pass
    else:
        pass
    

print(RC2.glimpse())


RC = RC2
RecordCollection glimpse made at: 2018-09-16 00:13:50
53491 Records from Empty

Top Authors
1 SHELAH, S
2 Shelah, S
2 HINTIKKA, J
3 MARGOLIS, J
4 LOWE, EJ
5 LEWIS, D
6 SOBER, E
7 KITCHER, P
8 RESCHER, N
9 CHISHOLM, RM
9 PARGETTER, R
10 Nanay, Bence
10 JACKSON, F
10 Turri, John
10 CASTANEDA, HN
11 Douven, Igor
12 PETTIT, P
13 Brueckner, A
13 Brogaard, Berit
13 SORENSEN, RA
14 Carter, J. Adam
14 VANFRAASSEN, BC

Top Journals
1 SYNTHESE
2 PHILOSOPHICAL STUDIES
3 JOURNAL OF SYMBOLIC LOGIC
4 PHILOSOPHY OF SCIENCE
5 PHILOSOPHY AND PHENOMENOLOGICAL RESEARCH
6 ANALYSIS
7 JOURNAL OF PHILOSOPHY
8 MONIST
9 PHILOSOPHY
10 SOUTHERN JOURNAL OF PHILOSOPHY
11 ETHICS
12 NOUS
13 AMERICAN PHILOSOPHICAL QUARTERLY
14 MIND
15 AUSTRALASIAN JOURNAL OF PHILOSOPHY
16 REVIEW OF METAPHYSICS
17 CANADIAN JOURNAL OF PHILOSOPHY
18 BRITISH JOURNAL FOR THE PHILOSOPHY OF SCIENCE
19 STUDIES IN HISTORY AND PHILOSOPHY OF SCIENCE
20 JOURNAL OF PHILOSOPHICAL LOGIC
21 INQUIRY-AN INTERDISCIPLINARY JOURNAL OF PHILOSOPHY
22 PHILOSOPHICAL QUARTERLY

Top Cited
1 Lewis David, 1986, PLURALITY WORLDS
2 Quine W. V. O., 1960, WORD OBJECT
3 RAWLS J, 1971, THEORY JUSTICE, P530
4 Lewis David, 1973, COUNTERFACTUALS
5 Kripke SA., 1980, NAMING NECESSITY
6 Williamson Timothy, 2000, KNOWLEDGE ITS LIMITS
7 Parfit D., 1984, REASONS PERSONS
8 Van Fraassen B. C., 1980, SCI IMAGE
9 Evans G., 1982, VARIETIES REFERENCE
10 Nozick R., 1981, PHILOS EXPLANATIONS
11 Lewis D., 1986, PHILOS PAPERS, V2, P159
12 Davidson Donald, 1980, ESSAYS ACTIONS EVENT
13 Aristotle, NICOMACHEAN ETHICS
14 Ryle Gilbert, 1949, CONCEPT MIND
15 Nozick R., 1974, ANARCHY STATE UTOPIA
16 Davidson D, 1984, INQUIRIES TRUTH INTE
17 Woodward J., 2003, MAKING THINGS HAPPEN
18 Putnam H, 1981, REASON TRUTH HIST, P1
19 Hume D., TREATISE HUMAN NATUR
20 Quine W. V., 1969, ONTOLOGICAL RELATIVI
21 Hempel C. G., 1965, ASPECTS SCI EXPLANAT
22 Scanlon T., 1998, WHAT WE OWE EACH OTH

Above we have some statistics about the data we are working with. We have lost roughly 2,000 records where data was missing. The summaries show us the most prolific authors, the journals with the most occurrences, and the most cited single works. All of this makes sense so far. We have the incredibly popular David Lewis with multiple mentions among the top cited works, along with some other very well known recent authors, and of course the most influential of classics, Aristotle and Hume.

Extracting the Features

In order to use UMAP and the clustering algorithm, we have to extract some features to work with.

I have chosen to use two kinds of features:

  • cited works
  • cited authors

Cited works are the precise citation strings that the WOS collection uses. These are very good for capturing the fine-grained structure of the literature, as they can be very specific. They allow us, for example, to differentiate between the epistemological and the political works of Robert Nozick. Cited authors, on the other hand, are to a certain extent redundant, as they are only a less precise form of cited works. But they are valuable for us, as they indicate the general corner to which a paper belongs. This allows us to use much more of our data, as relying only on cited works would force us to kick out many papers that are only weakly linked to the rest. I have experimented with various vectorizations (author, work, words in abstract, title, etc.) and their combinations on labeled data, and combining works and authors seems to work best. Look here for a little proof of concept that shows how well we can differentiate different disciplines.

Both types of features are extracted with scikit-learn and concatenated. Then we filter out everything that is weakly linked, as it tends to “ball up” in the UMAP embedding without contributing useful information.

########### Cited Works - Features ############

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re
drc = pd.DataFrame.from_dict(RC.forNLP(extraColumns=['journal','AU','FU','PD']))

d = []
citedAU = []
citestring =[]
for R in RC:

    d.append(list(set(R.getCitations().get("citeString")))) #Cited works (full citation strings) as the main features
    citedAU.append(list(set(R.getCitations().get("author"))))
    citestring.append(list(set(R.getCitations().get("citeString"))))

drc["citedAU"] = citedAU
drc["citestring"] = citestring
#print(d[0])
authorslist = ['§'.join(filter(None,x)) for x in list(d)] 
#print(authorslist[0])

# vec = TfidfVectorizer(token_pattern=r'(?<=[^|§])[\s\w,\.:;]+(?=[$|§])')
vec = CountVectorizer(token_pattern=r'(?<=[§])[\s\w,\.:;\/\[\]-]+(?=[§])',binary=True, min_df = 3)#, min_df = 1)


Xrc = vec.fit_transform(authorslist)
########### Authors - Features ############

d = []
for R in RC:
    authors = list(set(R.getCitations().get("author")))
#    print(authors)
    authors = filter(None, authors)
    f = []
    for a in authors:
        f.append(' '.join([w for w in a.split(' ')if len(w)>2]))
        
    authors = f#' '.join(f)
    d.append(authors)
authorslist = [';'.join(filter(None,x)) for x in list(d)] 
vec = CountVectorizer(token_pattern=r'(?<=[;])[\s\w]+(?=[;])',binary=True, min_df = 10)

XrcAu = vec.fit_transform(authorslist)
k = [XrcAu,Xrc]
XrcFull = scipy.sparse.hstack(k).tocsr()
###### Filtering #######
from scipy.sparse import coo_matrix, vstack
from scipy.sparse import csr_matrix
import scipy as scipy
row_names = np.array(drc["id"])

newdf=[]
a = 0
# index by name:
for x in range(0,XrcFull.shape[0]): #Xrc.shape[0]):
    row_idx, = np.where(row_names == drc["id"][x])
    if np.diff(XrcFull[row_idx].tocsr().indptr) >= 4:
        if a == 0:
            k = [XrcFull[row_idx]]
        if a != 0:
            k.append(XrcFull[row_idx])
        a = a+1
        newdf.append(drc.loc[x])
        
drc = pd.DataFrame(newdf).reset_index()
M = scipy.sparse.vstack((k))
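
As an aside: because XrcFull is a CSR matrix, the per-row feature counts that the loop above extracts one record at a time are also available in a single call as np.diff(XrcFull.indptr). Assuming the record ids are unique (so that the id lookup just reproduces the row order), the loop could be replaced by the following sketch; it only differs in not keeping the old index as an extra column.

# Number of non-zero citation features per paper.
nnz_per_row = np.diff(XrcFull.indptr)

# Keep only papers with at least four features.
mask = nnz_per_row >= 4
M = XrcFull[mask]
drc = drc[mask].reset_index(drop=True)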

Preliminary dimensionality-reduction with SVD

This is strictly speaking not necessary: we could pass our vectors directly to the UMAP algorithm. But when we use a lot of data, we get a sharper image if we clear out some noise with SVD beforehand. If we were interested more in classification and less in visualization, I would suggest skipping this step, reducing with UMAP to ~thirty dimensions, and clustering on that, as sketched below.
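
For reference, that classification-oriented variant would look roughly like the following sketch. The parameter values are illustrative rather than tuned, and the raw (filtered) feature matrix M is fed to UMAP directly instead of the SVD output.

import umap
import hdbscan

# Reduce the citation features to ~30 dimensions instead of 2...
emb30 = umap.UMAP(n_components=30,
                  n_neighbors=15,      # illustrative value
                  min_dist=0.0,
                  metric='cosine').fit_transform(M)

# ...and cluster in that space rather than in the 2-D picture.
labels30 = hdbscan.HDBSCAN(min_cluster_size=440, min_samples=50).fit_predict(emb30)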

from sklearn.decomposition import TruncatedSVD
SVD = TruncatedSVD(n_components=350, n_iter=7, random_state=42)

XSVD = SVD.fit_transform(M)
print(SVD.explained_variance_ratio_.sum())
dSVD = pd.DataFrame(XSVD)

sSVD = dSVD[[0,1]]
sSVD.columns = ['x','y']
0.2762670795063329
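
The 350 components retain roughly 28 percent of the variance here. If you want to see how that figure grows with the number of components, the cumulative ratio can be inspected directly; a small sketch:

# Cumulative explained variance over the SVD components.
cumvar = np.cumsum(SVD.explained_variance_ratio_)
for k in (50, 100, 200, 350):
    print(k, 'components:', round(cumvar[k - 1], 3))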

Now for the UMAP

UMAP is a pretty young technique for dimensionality reduction, which has the big advantage of being pretty fast. And it seems to preserve global structure quite reliably, which is nice, as it allows us to cluster afterwards. We will plot the 2D embedding with ggplot, so that we have something to look at:

import umap

try:
    drc = drc.drop('x',axis=1)
    drc = drc.drop('y',axis=1)

except KeyError:
    pass


embedding = umap.UMAP(n_neighbors = 7,#small => local, large => global: 5-50
                      min_dist = 0.0005, #small => local, large => global: 0.001-0.5
                      spread = 1.5,
                      metric='cosine').fit_transform(XSVD)
embedding = pd.DataFrame(embedding)
embedding.columns = ['x','y']
%%R -i embedding --width 1200 --height 800 -r 140 --bg #F5F5F5
library(hrbrthemes)
library(ggplot2)
library(fields)
embedding$density <- fields::interp.surface(
  MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")])

p <- ggplot(embedding, aes(x=embedding$x, y=embedding$y,alpha = 1/density))+

 guides(alpha=FALSE)+

geom_point(color="#3366cc", pch=16,cex=1.2)+ theme_ipsum_rc()+
labs(x="", y="",
       title="The 2d-reduction by UMAP",
       subtitle="...based on the code by McInnes, Healy (2018)",
       caption="by Maximilian Noichl")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank()
)
p
C:\Users\user\Anaconda3\lib\site-packages\rpy2\robjects\pandas2ri.py:190: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

[Figure: "The 2d-reduction by UMAP" — scatter plot of the two-dimensional embedding.]

Temporal Embedding

Now we will do the same thing as above, but we will embed the data in just one dimension and then re-embed each year's papers into that structure, using the transform method of the fitted model.

drc['timestamp'] = drc["year"] + drc["PD"].fillna(0).replace('', 0, regex=True)/12
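
Note that in raw WOS exports the PD field often holds a month abbreviation such as 'MAR' rather than a number. If that is the case in your data, you would first have to map it to a month index before the division above; a sketch with a hypothetical helper:

# Hypothetical helper: map WOS month abbreviations to month numbers.
MONTHS = {m: i + 1 for i, m in enumerate(
    ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
     'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC'])}

def month_number(pd_field):
    """Return a numeric month for a WOS 'PD' entry, or 0 if unknown."""
    if not isinstance(pd_field, str):
        return 0
    return MONTHS.get(pd_field.strip()[:3].upper(), 0)

# drc['PD'] = drc['PD'].map(month_number)  # then build the timestamp as above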
n_neighbors = 7
embeddingI = umap.UMAP(n_components=1,
                        n_neighbors=n_neighbors,
                      min_dist=0.008,
                       spread = 0.8,
                      metric='cosine').fit(XSVD)

coordinates = []
for year in range(int(drc["year"].min()),int(drc["year"].max())):
    l = list(np.where(drc["year"] == year)[0])
    L = XSVD[l]     

    emb = embeddingI.transform(L)
    emb = pd.DataFrame(emb)
    emb.columns = ['xI']

    emb["year"] = drc.iloc[l,:]["timestamp"].tolist()
    coordinates.append(emb)

coordinates = pd.concat(coordinates, ignore_index=True)

embeddingI = pd.DataFrame(embeddingI.embedding_)
embeddingI.columns = ['xI']
embeddingI["year"] = drc["timestamp"].tolist()
C:\Users\user\Anaconda3\lib\site-packages\umap_learn-0.3.3-py3.6.egg\umap\spectral.py:229: UserWarning: Embedding a total of 2 separate connected components using meta-embedding (experimental)
%%R -i coordinates --width 1200 --height 800 -r 140 --bg #F5F5F5
library(hrbrthemes)
library(ggplot2)
library(fields)
coordinates$density <- fields::interp.surface(
  MASS::kde2d(coordinates$xI, coordinates$year), coordinates[,c("xI","year")])

p <- ggplot(coordinates, aes(x=coordinates$xI, y=coordinates$year,alpha =1/density))+#1/density

 guides(alpha=FALSE)+

geom_point(color="#3366cc", pch=16,cex=1.2)+ theme_ipsum_rc()+
labs(x="", y="",
       title="The temporal UMAP-embedding",
       subtitle="...based on the code by McInnes, Healy (2018)",
       caption="by Maximilian Noichl")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank()
)
p
C:\Users\user\Anaconda3\lib\site-packages\rpy2\robjects\pandas2ri.py:190: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

[Figure: "The temporal UMAP-embedding" — the one-dimensional embedding plotted against publication date.]

As you can see, the Web of Science started to archive the month of publication only in the late eighties, which is why the plot has these horizontal lines at the bottom, where we can assign only years to the publications.

Clustering with HDBSCAN

Now we use HDBSCAN to cluster our data:

try:
    drc = drc.drop('cluster',axis=1)
except KeyError:
    pass

import hdbscan

#(min_cluster_size=500, min_samples=30, gen_min_span_tree=True)
#clusterer = hdbscan.HDBSCAN(min_cluster_size=455, min_samples=35, gen_min_span_tree=True)

clusterer = hdbscan.HDBSCAN(min_cluster_size=440, min_samples=50, gen_min_span_tree=True)
clusterer.fit(embedding)
XCLUST = clusterer.labels_
clusternum = len(set( clusterer.labels_))-1


dfclust = pd.DataFrame(XCLUST)
dfclust.columns = ['cluster']


print(clusternum)
### Let's play a little sound when we're done:
import winsound
winsound.Beep(550,300)
25
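
Finding good values for min_cluster_size and min_samples takes some experimentation. Since gen_min_span_tree=True is already set, newer versions of hdbscan also expose a relative validity score (an approximation of the DBCV measure) that can be used to compare settings; a sketch with illustrative values:

for mcs in (300, 440, 600):
    c = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=50,
                        gen_min_span_tree=True).fit(embedding)
    n_clusters = len(set(c.labels_)) - (1 if -1 in c.labels_ else 0)
    print(mcs, n_clusters, round(c.relative_validity_, 3))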

Now let's plot everything with ggplot:

%%R -i embedding,dfclust,embeddingI -o myNewColors,tmeans --width 1200 --height 1800 -r 140 --bg #F8F4E9

library(hrbrthemes)
library(ggplot2)
library(fields)
library(plyr)
options(warn=0) # 0 turns warnings on

#Get the cluster means:
means <- aggregate(embedding[,c("x","y")], list(dfclust$cluster), median)
means <- data.frame(means) 
n=nrow(means)
means <- means[-1,]

#Make the colors: 
mycolors <- c("#c03728","#919c4c","#fd8f24","#f5c04a","#e68c7c","#00666b","#142948","#6f5438") 

pal <- colorRampPalette(sample(mycolors))
s <- n-1
myGray <- c('#95a5a6')
myNewColors <- sample(pal(s))
myPal <- append(myGray,myNewColors)

#get temporal means:
tmeans <- aggregate(embeddingI[,c("xI","year")], list(dfclust$cluster), median)
tmeans <- data.frame(tmeans) 
tmeans <- tmeans[-1,]




#get density, to avoid overplotting
embedding$density <- fields::interp.surface(
  MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")])

#get temporal density
embeddingI$density <- fields::interp.surface(
  MASS::kde2d(embeddingI$xI, embeddingI$year), embeddingI[,c("xI","year")])


p <- ggplot(embedding, aes(x=embedding$x, y=embedding$y, color= as.factor(dfclust$cluster), alpha = 1/density))+
geom_point(pch=20,cex=1.6)+ 
theme_ipsum_rc()+
scale_color_manual(values = myPal) +
 guides(alpha=FALSE, color=FALSE)+
geom_point(data=means, aes(x=means$x, y=means$y), color= myNewColors, alpha = 1,size =6)+
annotate("text", x = means[,c("x")], y = means[,c("y")], label = means[,c("Group.1")], color="white", fontface="bold",  size=3.2, parse = TRUE, hjust=0.5)+
labs(x="", y="",
       title="The clusters found by hdbscan...",
       subtitle="a density-based clustering algorithm. Embedded with UMAP in two dimensions...")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank())


t <- ggplot(embeddingI, aes(x=embeddingI$xI, y=embeddingI$year, color= as.factor(dfclust$cluster), alpha = 1/density))+
geom_point(pch=20,cex=1.2)+
theme_ipsum_rc()+
scale_color_manual(values = myPal) +
guides(alpha=FALSE, color=FALSE)+
geom_point(data=tmeans, aes(x=tmeans$xI, y=tmeans$year), color= myNewColors, alpha = 1,size =4.2)+
annotate("text", x = tmeans[,c("xI")], y = tmeans[,c("year")], label = tmeans[,c("Group.1")], color="white", fontface="bold",  size=2.2, parse = TRUE, hjust=0.5)+
labs(x="", y="Publication date",
         subtitle="...and one dimension, overlayed with publication dates on the y-axis.",
       caption="by Maximilian Noichl")+
theme(panel.grid.major = element_line(colour = "lightgrey"),panel.grid.minor = element_blank())


library(gridExtra)
grid.arrange(p,t, ncol = 1,  heights = c(1, 1))

# pdf("ClusteringUMap.pdf", width = 12, height = 12) # Open a new pdf file
# grid.arrange(p,t, ncol = 1,  heights = c(1, 1)) # Write the grid.arrange in the file\n",
# dev.off()

[Figure: the hdbscan clusters, shown in the two-dimensional embedding (top) and in the one-dimensional temporal embedding (bottom).]

That's nice. To get a look into the way the clustering algorithm has structured the data, let's look at the condensed tree.
I messed around a bit in my installation of HDBSCAN, so if you run this on your computer, your tree will probably look quite different. The condensed cluster tree basically tells us when the algorithm found it necessary to break a group apart into two smaller clusters. On the left of the tree we see the clusters that were so far removed from the central structure that they broke off at the very beginning of the clustering process.
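
Besides the plot, the clusterer also exposes a few numbers that help with reading the tree: each selected cluster has a persistence value (how long it survives in the hierarchy), and each point has a membership probability. A small sketch:

# Persistence: how stable each selected cluster is in the hierarchy.
for i, p in enumerate(clusterer.cluster_persistence_):
    print('cluster', i, 'persistence', round(float(p), 3))

# Per-point membership strength; values near 0 mark points at the very edge of their cluster.
weak = (clusterer.probabilities_ < 0.1).sum()
print(weak, 'points are only weakly attached to their cluster')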

import matplotlib.colors
plt.rcParams['figure.figsize'] = [20, 20]
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["#bfc4c6","#bfc4c6","#bfc4c6"])

clusterer.condensed_tree_.plot(cmap = cmap,select_clusters=True,label_clusters = True, selection_palette=myNewColors, 
                               colorbar = False,max_rectangles_per_icicle=80, alpha=0.7,barwidthfactor=5,linecolor='#bfc4c6',linewidth=2)
<matplotlib.axes._subplots.AxesSubplot at 0x18c514a1588>

[Figure: the condensed cluster tree produced by hdbscan.]

What does it mean?

Now, let us look into the clusters to find out what they actually contain. First we will analyze the abstracts of the papers in every cluster by their most common words and bigrams. In the tables below, every column is a cluster and every row holds one of its most common terms. Then we will do the exact same thing with the most cited authors.

drc = pd.concat([drc, dfclust],axis=1)
drc = drc.dropna(subset=['cluster'])
drc = pd.concat([drc, embedding],axis=1)
fullstrsl = []
for x in range(0,clusternum):
    abstracts = list(drc.loc[drc['cluster'] == x]['abstract'])
    abstracts = ";".join(str(x) for x in abstracts).replace('|',' ').replace('paper',' ').replace('argue',' ').replace('account',' ').replace('theory',' ') #kick out common abstract words of no importance
    fullstrsl.append(abstracts)
    
vec = CountVectorizer( stop_words='english')#Choose CountVectorizer for the most common words in the cluster, TfidfVectorizer for the words with the greatest differentiation value.
X = vec.fit_transform(fullstrsl)
#print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names())) #To look into the vectors. Beware, can take a bit of RAM


clusterfeatures = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
fullscore = []
for x in range(0,clusternum):
    scores = zip(vec.get_feature_names(), np.asarray(X[x,:].sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    myscores = sorted_scores[0:20]
    
    scorelist = []
    for s in myscores:
        scorelist.append(s[0])
    fullscore.append(scorelist)
display(pd.DataFrame(fullscore).transpose())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
0 free logic coherence logic logic causal probability newtons conditionals science justification knowledge evolutionary moral kants moral moral fictional cognitive phenomenal properties argument names language language
1 argument truth properties semantics logics explanation models theories causation quantum epistemic epistemic selection aristotle kant rights reasons logic mental experience mental objects view logic quines
2 responsibility logics confirmation language epistemic explanations scientific newton causal scientific belief belief evolution virtue moral argument view argument view consciousness physical properties argument truth view
3 moral paradox dispositions semantic knowledge models evidence natural counterfactual theories knowledge argument species character argument view argument fiction mind argument causal view semantic freges truth
4 agent model analysis analysis models mechanistic science mathematics semantics explanation beliefs view biological view philosophy problem normative aesthetic states experiences problem modal reference wittgensteins logical
5 action classical measures set agents mechanisms problem logic counterfactuals mechanics justified evidence natural ethics view claim reason problem content perceptual argument problem terms frege theories
6 agents logical models natural model explanatory decision set conditional problem problem justification biology philosophical human human value view psychology view causation possible proper logical philosophy
7 principle models measure model belief model bayesian theorem view understanding view assertion social nature nature life good arguments argument content action time modal philosophical argument
8 responsible semantics laws context dynamic science argument philosophy problem causal evidence problem concept aristotles science social problem semantics cognition properties view worlds descriptions view logic
9 determinism notion evidence meaning finite scientific model new argument philosophy perceptual know different social does does claim question theories states physicalism arguments kind problem semantics
10 does theories argument notion modal mechanism theories classical analysis view epistemology cases kinds good reason order theories questions function perception emergence identity semantics argument carnaps
11 cases semantic robustness structure information different belief proof cases physical argument contextualism models knowledge arguments people action way language conscious actions way arguments wittgenstein quine
12 arguments consequence nature interpretation results biology probabilities analysis indicative new truth true genetic model natural morally way epistemic problem mental exclusion world natural philosophy philosophical
13 problem modal dispositional view logical biological view mathematical theories explanations conservatism knows view does way value agents knowledge systems physicalism claim different theories objects science
14 claim paradoxes view results game causation new second possible argument reasons closure information problem role way does sentences representations character philosophy true meaning terms semantic
15 control paraconsistent results argument language theories approach results standard interpretation true epistemology human explain terms theories practical true claim physical events metaphysical problem notion meaning
16 objection relevant causal logical result systems principle does probability model memory does cultural human concept certain belief value science concepts second parts belief conception scientific
17 alternative language bayesian different action cognitive epistemic use true elsevier process arguments model reason claim important arguments belief representation way davidsons does new sense knowledge
18 morally prove probabilistic theories algebras view case order truth explanatory thesis beliefs processes life knowledge justice rational information role claim explanation things logic thought arguments
19 actions calculus problem new games approach representation models particular philosophical according principle role way philosophical make certain different recent visual relation present content use ontological
fullstrsl = []
for x in range(0,clusternum):
    abstracts = list(drc.loc[drc['cluster'] == x]['abstract'])
    abstracts = ";".join(str(x) for x in abstracts).replace('|',' ').replace('paper',' ').replace('reserved',' ').replace('argue',' ').replace('account',' ').replace('theory',' ')
    fullstrsl.append(abstracts)
    
vec = CountVectorizer( stop_words='english', ngram_range=(2, 2))#Choose CountVectorizer for the most common words in the cluster, TfidfVectorizer for the words with the greatest differentiation value.
X = vec.fit_transform(fullstrsl)
#print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names())) #To look into the vectors. Beware, can take a bit of RAM


clusterfeatures = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
fullscore = []
for x in range(0,clusternum):
    scores = zip(vec.get_feature_names(), np.asarray(X[x,:].sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    myscores = sorted_scores[0:20]
    
    scorelist = []
    for s in myscores:
        scorelist.append(s[0])
    fullscore.append(scorelist)
display(pd.DataFrame(fullscore).transpose())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
0 moral responsibility classical logic robustness analysis natural language epistemic logic mechanistic explanation sleeping beauty elsevier rights indicative conditionals quantum mechanics epistemic justification true belief natural selection character traits elsevier rights combination problem moral responsibility fictional characters cognitive science phenomenal character mental causation possible worlds proper names abstraction principles natural language
1 morally responsible sequent calculus laws nature firstorder logic dynamic epistemic mechanistic explanations philosophy science natural philosophy actual causation elsevier rights phenomenal conservatism knowledge attributions evolutionary biology elsevier rights common sense decision making normative reasons aesthetic value mental states perceptual experience exclusion problem material objects natural kind caesar problem philosophy science
2 alternative possibilities logical consequence coherence measures definite descriptions belief revision multiple realization dutch book explicit mathematics counterfactual dependence philosophy science generality problem practical reasoning natural kinds virtue ethics practical reason harm principle practical reasoning epistemic logic social cognition phenomenal concepts physical properties david lewis definite descriptions humes principle ontological commitment
3 consequence argument theories truth bayesian confirmation valued fields modal logic philosophy science elsevier rights second order possible worlds scientific revolutions justified belief knowledge norm group selection moral view human nature incentives argument moral judgments discourse fiction folk psychology phenomenal consciousness mental properties temporal parts kind terms natural language formal semantics
4 principle alternative truth predicate ceteris paribus logical form modal logics scientific explanation degrees belief general relativity subjunctive conditionals scientific theories basing relation epistemic contextualism developmental biology mimetic desire philosophy science inner speech reasons action fictional entities natural selection knowledge argument causal exclusion common sense direct reference singular terms indeterminacy translation
5 free action paraconsistent logics conditional analysis boolean algebras common knowledge causal relations causal decision action distance causal claims measurement problem perceptual experience norm assertion cultural evolution nicomachean ethics transcendental idealism human beings moral realism possible worlds language thought mental states philosophy mind modal realism possible worlds philosophy mathematics philosophy language
6 manipulation arguments semantic paradoxes measures coherence generalized quantifiers firstorder logic elsevier rights expected utility reverse mathematics counterfactual conditionals wave function true belief external world units selection emotional disorder model pa moral judgments intrinsic value fictional realism mental representation mental state emergent properties states affairs complex demonstratives logical objects scientific theories
7 argument incompatibilism liar paradox dispositional properties natural deduction dynamic logic explanatory power scientific realism fixed point semantics counterfactuals common cause gettier problem gettier cases evolutionary processes mechanical problems categorical imperative moral luck moral properties modal logic false belief perceptual experiences downward causation impossible worlds modal logic julius caesar analyticsynthetic distinction
8 free moral modal logic measure coherence logical consequence kripke models multiple realizability scientific representation does imply truth conditions scientific explanation true beliefs justified belief inclusive fitness moral dilemmas natural philosophy ordinary people practical reasons negative existentials mental content conscious experience mental states possible world attitude ascriptions philosophical investigations elsevier rights
9 van inwagens logical pluralism probabilistic measures subset equal epistemic logics causal inference scientific theories space time causal exclusion interpretation quantum beliefs justified perceptual knowledge evolutionary game social psychology natural science sense embodiment moral facts imaginative resistance mental representations phenomenal properties mental physical time travel general terms bad company common sense
10 direct argument relevant logics analysis dispositions truth conditions public announcements case study van fraassen classical mechanics ramsey test philosophers science doxastic justification pragmatic encroachment population genetics undecidable problems pure reason absolute margin practical reason natural language philosophy mind explanatory gap nonreductive physicalism ordinary objects freges puzzle plural logic grellings paradox
11 alternate possibilities relevance logic confirmation measures greater equal action models molecular biology scientific practice order arithmetic david lewiss explanatory power epistemic status transmission failure tree life absolute goodness space time duty vote public reason paradox fiction cognitive architecture naive realism causal powers david lewiss singular thought symmetric godel language learning
12 possibilities pap sequent calculi conjunction fallacy scalar implicatures boolean games causal explanation beauty problem seventeenth century david lewis scientific knowledge inferential justification epistemic closure niche construction character trait death penalty human dignity moral discourse seeing things cognitive neuroscience representational content problem mental negative truths definite description godel logic natural kinds
13 principle alternate strong kleene categorical properties quantifier elimination cylindric algebras cognitive science book arguments natural numbers epistemic modals scientific realism justified beliefs epistemic luck philosophy biology concept character eighteenth century invisible hand political liberalism fictional names jerry fodor phenomenal content mental events proper parts semantic value philosophical problems ontological categories
14 frankfurt cases propositional logic degree coherence real closed et al causal markov quantum mechanics ramseys theorem theories causation elsevier science justified believing concept knowledge species concepts evaluational internalism human rights legal punishment virtue ethics category mistakes nonhuman animals visual experience intentional action composition identity singular terms abstract objects conceptual realism
15 agent morally relevant logic reliability conducive boolean algebra expressive power markov condition van fraassens recursive functions structural equations scientific practice regress problem doxastic justification evolutionary developmental john doris incompleteness theorem moral concern rational requirements classical logic propositional attitudes conscious states donald davidsons modal logic rigid designation alethic functionalism indeterminacy thesis
16 agent causation truth values possible worlds linear logic predicate logic biological sciences case study theorem pairs causal decision case study thought experiment knowledge ascriptions multilevel selection moral behaviour kants argument moral conflict moral error classical music representational content concept strategy exclusion argument properties relations rigid designators alethic pluralism language acquisition
17 compatibilists incompatibilists yablos paradox probabilistic coherence semantic analysis temporal logic causal structure subjective probability classical logic causal explanation science rights belief justified closure principle social sciences moral virtue constitutive rules moral virtue moral judgements fictional discourse cognitive processes phenomenal concept jaegwon kim argument vagueness view names cardinal numbers logical consequence
18 determinism true paraconsistent logic finks masks algebraically closed van benthem philosophers science probability function experimental philosophy conditional probabilities causal processes perceptual beliefs knowledge assertion species concept prima facie critique pure notion social practical rationality figurative language cognitive systems philosophy mind prima facie abstract objects mental states oxford university logical truths
19 perform action modal logics austere quidditism categorial grammar announcement logic constitutive relevance philosophers science firstorder logic modus ponens configuration space process reliabilism epistemic value recent work problem evil early modern able explain reasons belief gods existence connectionist models hard problem causal closure bare particulars modal epistemology person authority principle charity
fullstrsl = []
for x in range(0,clusternum):
    authors = list(drc.loc[drc['cluster'] == x]['citedAU'])
    authors = [item for sublist in authors for item in sublist]
    authors = " §".join(str(x) for x in authors)
    authors = ' '.join( [w for w in authors.split() if len(w)>2] )
    fullstrsl.append(authors)

#print(fullstrsl[1])
vec = CountVectorizer(token_pattern=r'[\s\w\.-]+(?=[$|§])')#Choose CountVectorizer for the most common words in the cluster, TfidfVectorizer for the words with the greatest differentiation value.
X = vec.fit_transform(fullstrsl)

clusterfeatures = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
fullscoreA = []
for x in range(0,clusternum):
    scores = zip(vec.get_feature_names(), np.asarray(X[x,:].sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    myscores = sorted_scores[0:10]
    
    scorelist = []
    for s in myscores:
        scorelist.append(s[0])
    fullscoreA.append(scorelist)
display(pd.DataFrame(fullscoreA).transpose())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
0 fischer anderson armstrong barwise hughes craver jeffrey friedman lewis popper goldman williamson timothy hull aristotle kant nozick rawls walton fodor tye davidson lewis kripke wittgenstein quine
1 mele priest lewis chang van benthem woodward hacking newton stalnaker hempel alston cohen sober plato smith feinberg williams plantinga churchland chalmers kim armstrong salmon dummett davidson
2 frankfurt belnap martin bach fagin cartwright lewis earman lewis david kuhn chisholm hawthorne wilson ross hume mill scanlon hintikka dennett dretske davidson donald lewis david kaplan frege carnap
3 van inwagen peter beall ellis partee halpern bechtel van fraassen church adams salmon pollock derose lewontin cooper locke rawls parfit lewis millikan block lewis sider lewis wright russell
4 pereboom dunn bird tarski baltag machamer giere feferman bennett feyerabend bonjour dretske gould irwin kant immanuel hare nagel kripke dretske jackson mele quine soames wright crispin putnam
5 kane robert meyer fitelson lewis gabbay wimsatt savage simpson hitchcock kuhn thomas lehrer lewis mayr none none nagel thomas hare currie cummins dennett fodor kripke donnellan kripke quine wvo
6 ginet kripke mumford kamp van ditmarsch fodor levi cohen jackson einstein harman goldman dawkins aquinas husserl hart cohen currie gregory goldman byrne goldman fine perry frege gottlob goodman
7 strawson field olsson dowty blackburn glennan kyburg troelstra edgington reichenbach sosa sosa smith vlastos strawson none rawls john vaninwagen searle lycan horgan van inwagen evans dummett michael kripke
8 kane routley bird alexander heim irene segerberg kim vanfraassen kreisel woodward bell bonjour laurence pritchard griffiths hett goodman dworkin none carroll putnam armstrong wilson merricks burge wittgenstein ludwig strawson
9 widerker restall molnar heim troelstra schaffner howson westfall schaffer lakatos dretske williamson sterelny nussbaum aristotle williams gibbard salmon stich harman searle lowe putnam strawson frege
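
The tables above use raw counts, so very common philosophical vocabulary still shows up across many clusters. As the comments in the code already note, swapping in a TfidfVectorizer brings out the terms that are distinctive for a cluster instead; a sketch of that variant for the abstracts:

from sklearn.feature_extraction.text import TfidfVectorizer

# One abstract-"document" per cluster, weighted by tf-idf so that vocabulary
# shared across all clusters is downweighted.
docs = []
for x in range(0, clusternum):
    abstracts = drc.loc[drc['cluster'] == x]['abstract']
    docs.append(' '.join(str(a) for a in abstracts))

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names())

for x in range(0, clusternum):
    row = X[x].toarray().ravel()
    top = terms[row.argsort()[::-1][:10]]
    print(x, ', '.join(top))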

Now for the final part, which puts it all together into a labeled graph. Note that the labels are my interpretation, and reasonable people can very well disagree with them.

%%R -i embedding,dfclust,myNewColors -o labelpol,cltest --bg #fbf8f1
#-h 1600 -w 1600 -r 140 --bg #fbf8f1

# Some imports:
library(hrbrthemes)
library(ggplot2)
library(fields)
# library(ggrepel)
library(ggforce)
#install.packages('ggalt')
library(ggalt)
library(stringr)



options(warn=0) # 0 turns warnings on

#Get the cluster means:
means <- aggregate(embedding[,c("x","y")], list(dfclust$cluster), mean)
means <- data.frame(means)
#And Variance, for the labels:
test <- aggregate(embedding[,c("x")], list(dfclust$cluster), var)
test <- test[-1,]

n=nrow(means)
means <- means[-1,]
# #Make the colors: 
# mycolors <- c("#dd593c",
# "#ead96c",
# "#df4467",
# "#8d2315",
# "#675d69",
# "#70897b",
# "#131541") 

# pal <- colorRampPalette(sample(mycolors))
# s <- n-1
myGray <- c('#95a5a6')
# myNewColors <- sample(pal(s))
myPal <- append(myGray,myNewColors)

#fonts:
# library(showtext)
# font.add.google(name = "Alegreya Sans SC", family = "SC")
# showtext.auto()


#◇

labels <- c("Responsibility & Free Will","Classical Logic & Paradoxes","Formal Th. of Science & Modeling","Ph. of linguistics & semantics",
            "Formal Epistemology","Causality & Scient. Explanation","Probability & Decision Theory","History of Science",
            "Possible Worlds & Counterfactuals",
            "Quantum Mechanics","Epistemology: Foundationalism","Epistemology: Contextualism","Biology & Evolution",
            "Virtue Ethics","Practical Reason","Liberalism & Legal Ph.","Moral Judgement","Art & Aesthetics","Cognitive Science",
            "Ph. of Mind: Phenomenology","Ph. of Mind: Mental Causation","Possible Worlds: Metaphysics","Reference, Proper Names",
            "Reference of Math. Objects","Quinean Ontology")
labelsb <- c("J. M. Fischer, P. van Inwagen","","","",
            "","N. Cartwright, J. Woodward","Dutch Book, Sleeping Beauty,...","Newton",
            "",
            "","","","",
            "Aristotle","Kant & Hume","","","","",
            "","","","",
            "","")
#circular markers:  
library(gridExtra)

circle <- polygon_regular(100)
pointy_points <- function(x, y, size){
  do.call(rbind, mapply(function(x,y,size,id) 
    data.frame(x=size*circle[,1]+x, y=size*circle[,2]+y, id=id),
         x=x,y=y, size=size, id=seq_along(x), SIMPLIFY=FALSE))
}


#get density, to avoid overplotting
embedding$density <- 1/ as.numeric(fields::interp.surface(MASS::kde2d(embedding$x, embedding$y), embedding[,c("x","y")]))

                        
# get for every label, whether it is in the + or - part of the x-axis:                        
xpol <-abs(means[,c("x")])/means[,c("x")]
ypol <-abs(means[,c("y")])/means[,c("y")]


polfact <- 1.5
 
# build a circle for the labels:
r <- 14.5
sequence <- seq(from = 1, to = s, by = 1)
angles <- 360/s*sequence
angle <-(angles*(pi/180))
         
xlabl <- cos(angle)*r
ylabl <- sin(angle)*r
circlecord <- cbind(xlabl,ylabl)

labelpol <-abs(circlecord[,c("xlabl")])/circlecord[,c("xlabl")]




# install.packages("stringr")
# library(stringr)



coord_x=5
coord_y=5
                        
# define circular markers:
circular_annotations <- pointy_points(means$x, means$y, size=test$x*0.25+1)
embedding <- cbind(embedding,dfclust)
filtered <- as.data.frame(subset(embedding, cluster >= 0))
cltest <- filtered
                        
                        
#Let's plot!
p <- ggplot(data=filtered, aes(x=x, y=y, color= as.factor(cluster), alpha='density'))+
geom_point(pch=20,cex=1.5)+#, alpha = 0/density)#+ 
#scale_color_manual(values = myPal)+
scale_x_continuous(limits=c(-18.5,18.5))+
scale_y_continuous(limits=c(-15,15))
#geom_polygon(data=circular_annotations, aes(x,y,group=factor(id), fill = factor(id)),alpha=0.15)+
#scale_fill_manual(values = myNewColors) +     
                       
# guides(alpha=FALSE, color=FALSE, fill=FALSE)+


q <- ggplot_build(p + stat_density2d(n=800,h=c(1.1,1.1)))$data[[2]]
q <- q[str_detect(q$group, "001") == TRUE, ]
# z <- max(test$x)
# print(z)

o <- aggregate(q$x, list(q$group) , min)
z <- subset(q,subset = q$x %in%  c(o$x))

omax <- aggregate(q$x, list(q$group) , max)
zmax <- subset(q,subset = q$x %in%  c(omax$x))

c <- data.frame()
count <- 1
for (val in xpol) {
if(val <0 ){
    c <- rbind(c, z[count,])
} else {
    c <- rbind(c, zmax[count,])
}
    count <- count + 1
}

contactpoints <-  data.frame(c$x,c$y)

#Append every label to its best fit on that circle, using the hungarian algorithm:
require(clue)

distances <- rdist(circlecord,contactpoints)
sol <- solve_LSAP(t(distances))
solo <- data.frame(cbind(mx=(contactpoints[,1]), my=(contactpoints[,2]), cx=(circlecord[sol, 1]), cy=(circlecord[sol, 2])))
                        
xcpol <-abs(solo$cx)/solo$cx



r <- 
p +
geom_point(data=subset(embedding, cluster == -1), aes(x=x, y=y),pch=20,cex=1.5,alpha=0.2, color=myGray)+                      
geom_polygon(data=q, aes(x,y ,group = as.factor(q$group),fill = as.factor(q$group)),color= NA,alpha=0.3,linetype=1,size=0.6)+ # ,linetype=3, color="black"
#scale_fill_manual(values = myPal)+
theme_ipsum()+
# scale_fill_manual(values = myPal)
guides(alpha=FALSE, color=FALSE, fill=FALSE)+
geom_point(data=solo, aes(x=cx, y=cy), color= myNewColors, alpha = 1,pch=16,size=3, stroke = 1)+                      

labs(x="UMAP-x", y="UMAP-y",
    title="The structure of recent Philosophy",
    subtitle="A umap & hdbscan-cluster-analysis of ~ 50000 papers in philosophy that brings out the major groupings of the discipline.",
    caption="by Maximilian Noichl, 2018")+               
theme(panel.grid.major = element_line(colour = "grey", linetype="dotted", size=0.55),panel.grid.minor = element_blank())+
theme(plot.background = element_rect(fill = "#fbf8f1"))+
#expand_limits(x = c(r+10,0-r-10),y = c(r+10,0-r-10))+
#theme(plot.title = element_text(size=27, family="SC", face="plain"))+
coord_fixed()+
annotate("segment", x = c$x, y = c$y, xend = c$x+xcpol*0.5, yend =c$y, color= myNewColors,alpha=0.3, size = 1)+
annotate("segment", x = c$x+xcpol*0.5, y =c$y, xend = solo$cx, yend = solo$cy, color=myNewColors,alpha=0.3, size = 1)+
# theme(plot.margin=grid::unit(c(0,0,0,0), "mm"))+

#annotate("segment", x = (solo$mx+(test$x*0.25+1)*xcpol)+xcpol*0.3, y = solo$my, xend = solo$cx, yend = solo$cy, color= myNewColors, alpha = 0.25, size = 0.7)+        
annotate("text", x = solo$cx+xcpol*0.3, y = solo$cy+0.19, parse = FALSE, label = labels, color="black", fontface="bold", family = "sans", size=3, hjust=abs((xcpol+1)/2-1),vjust=1)+
annotate("text", x = solo$cx+xcpol*0.3, y = solo$cy+0.3-0.5, parse = FALSE, label = labelsb, color="black", fontface="italic", family = "sans", size=3, hjust=abs((xcpol+1)/2-1),vjust=1)+             
#geom_point(data=c, aes(x,y),color="black")+
scale_color_manual(values = myNewColors)+
scale_fill_manual(values = myNewColors)+
theme(plot.background=element_rect(fill=NA, colour=NA))+



NULL

r
ggsave('plot.png', plot = last_plot(),width=15,height=15,dpi=300)

And voilà! Here we have the graphic from the beginning.

[Figure: plot.png — "The structure of recent Philosophy", the labeled cluster map.]