A tweet from RiparianData caught my eye the other day:
https://twitter.com/RiparianData/status/222319315800698880
I built getvouched.com around this idea of “expert and expertise discovery”, using skill-based vouching, adjusted by the distance from searcher to target, as a way to determine rank. So I dug in and found out that Human-Computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.
The HCIR challenge for this year’s symposium includes “hiring,” “assembling a conference program,” and “finding people to deliver patent research or expert testimony,” as summarized by Patrick Durusau.
I was late to the party (as the deadline to get access to the Mendeley data had passed) but William Gunn and Daniel Tunkelang were kind enough to grant me access.
I got the data via Dropbox. It is mostly tab-separated, with one exception: a JSON dump of publications.
I needed to import this into Neo4j, so I followed the examples from Batch Importer Part 2 and Batch Importer Part 3 to do some ETL, but first I needed to load the data into PostgreSQL so I could match up the two formats. I’ve outlined how I did this on the HCIR GitHub repo.
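For anyone curious what that ETL step boils down to, here is a minimal Python sketch. It is not the actual script from the repo. It assigns sequential node ids and emits tab-separated node and relationship rows. The column names (`id`, `type`, `name`, `start`, `end`, `rel_type`) and the `build_import_files` and `to_tsv` helpers are placeholders made up for illustration; the exact header layout the batch importer expects should be checked against its README.

```python
import csv
import io


def build_import_files(rows):
    """Turn (publication_id, author_name) pairs into node and relationship rows."""
    node_ids = {}  # (type, name) -> sequential node id
    nodes, rels = [], []

    def node_id(kind, name):
        # Reuse the id if this node was seen before, otherwise allocate the next one.
        key = (kind, name)
        if key not in node_ids:
            node_ids[key] = len(node_ids) + 1
            nodes.append((node_ids[key], kind, name))
        return node_ids[key]

    for pub, author in rows:
        rels.append((node_id("publication", pub), node_id("author", author), "authored_by"))
    return nodes, rels


def to_tsv(header, rows):
    """Render a header plus rows as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical input rows, standing in for the joined PostgreSQL output.
nodes, rels = build_import_files([("p1", "Ann"), ("p1", "Bob"), ("p2", "Ann")])
nodes_tsv = to_tsv(("id", "type", "name"), nodes)
rels_tsv = to_tsv(("start", "end", "rel_type"), rels)
```

The same pattern would repeat for the other relationship types, one pass per source table.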
What I ended up with was this graph:
publication -[:by_discipline]-> discipline
publication -[:by_country]-> country
publication -[:by_academic_status]-> academic_status
publication -[:authored_by]-> author
publication -[:published_in]-> journal
author -[:has_profile]-> profile
profile -[:interested_in]-> discipline
profile -[:member_of]-> group
profile -[:knows]-> profile
I also created a “vertices” full text index and an “edges” full text index to make my life easier. Just to make sure it imported ok I tested with:
START authors = node:vertices('type:author')
RETURN authors.name
LIMIT 3;
Looking good:
==> +------------------+
==> | authors.name     |
==> +------------------+
==> | "Dominik Papies" |
==> | "Felix Eggers"   |
==> | "Nils Wlömert"   |
==> +------------------+
I wonder who the most prolific author is?
START authors = node:vertices('type:author')
MATCH authors <-[:authored_by]- publication
RETURN authors.name, count(publication) AS cnt
ORDER BY cnt DESC
LIMIT 5;
“Timothy E Hewett” has authored the most publications in our sample data set.
==> +--------------------------+
==> | authors.name       | cnt |
==> +--------------------------+
==> | "Timothy E Hewett" | 339 |
==> | "Gregory D Myer"   | 226 |
==> | "Kevin R Ford"     | 202 |
==> | "Felix Gugerli"    | 144 |
==> | "K Darowicki"      | 143 |
==> +--------------------------+
I wonder how many co-authors he has?
START author = node:vertices('type:author AND name:"Timothy E Hewett"')
MATCH author <-[:authored_by]- publication -[:authored_by]-> co_authors
RETURN count(DISTINCT co_authors);
That’s a ton of co-authors.
==> +----------------------------+
==> | count(DISTINCT co_authors) |
==> +----------------------------+
==> | 280                        |
==> +----------------------------+
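To make it clearer what those two queries compute, here is a small in-memory Python sketch on made-up toy data (the `AUTHORED_BY` mapping and both function names are hypothetical, not part of the Mendeley data set). The first function groups publications by author; the second collects the distinct people who share at least one publication with a given author. The `a != name` check stands in for the fact that, as far as I understand Cypher, a single pattern match does not traverse the same relationship twice, so the starting author does not come back as their own co-author.

```python
from collections import Counter

# Hypothetical toy data: publication id -> list of author names.
AUTHORED_BY = {
    "p1": ["Ann", "Bob"],
    "p2": ["Ann", "Cy", "Bob"],
    "p3": ["Ann"],
}


def publication_counts(authored_by):
    """Count publications per author, like count(publication) grouped by author."""
    return Counter(a for authors in authored_by.values() for a in authors)


def co_authors(authored_by, name):
    """Distinct people sharing at least one publication with `name`."""
    found = set()
    for authors in authored_by.values():
        if name in authors:
            found.update(a for a in authors if a != name)
    return found


top_author = publication_counts(AUTHORED_BY).most_common(1)
ann_co_authors = co_authors(AUTHORED_BY, "Ann")
```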
Let’s pick one author from above and focus in on them.
START me = node:vertices('name:"Felix Eggers"')
RETURN me;
Looks like we have him as an author, and we have his profile as well.
==> +-------------------------------------------------------------------+
==> | me                                                                |
==> +-------------------------------------------------------------------+
==> | Node[17]{name:"Felix Eggers",type:"author",node_id:"17"}          |
==> | Node[400573]{name:"Felix Eggers",type:"profile",node_id:"400573"} |
==> +-------------------------------------------------------------------+
So let’s say that Felix is trying to hire someone like him or assemble a conference program on a research topic he is interested in. We can try to find people who are like Felix in a number of different ways:
By Contacts:
We can start with the simple case of who does Felix know?
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5;
5 out of the 7 authors Felix knows:
==> +-------------------+
==> | profiles.name     |
==> +-------------------+
==> | "Jens Hogreve"    |
==> | "Mathias Lin"     |
==> | "Fabian Eggers"   |
==> | "Tillmann Wagner" |
==> | "Andreas Neus"    |
==> +-------------------+
Felix doesn’t know a whole lot of other authors, so let’s expand his network one more level.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> () -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5;
5 out of the 16 contacts his contacts know:
==> +------------------------+
==> | profiles.name          |
==> +------------------------+
==> | "Victor Henning"       |
==> | "Jens Hogreve"         |
==> | "Charles Hofacker"     |
==> | "Stephanie Feiereisen" |
==> | "Alexander Stich"      |
==> +------------------------+
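Notice that “Jens Hogreve” shows up in both lists: the two-hop query returns everyone reachable in two `knows` steps, including people Felix already knows directly. Here is a minimal Python sketch of that behavior on made-up toy data (the `KNOWS` mapping, the `second_level` function and its `exclude_direct` flag are all hypothetical), with an option to drop Felix and his direct contacts from the result.

```python
# Hypothetical toy data: profile name -> set of profiles they know.
KNOWS = {
    "Felix": {"Jens", "Fabian"},
    "Jens": {"Victor", "Fabian"},
    "Fabian": {"Felix", "Charles"},
}


def second_level(knows, me, exclude_direct=False):
    """Profiles reachable in exactly two `knows` hops from `me`."""
    direct = knows.get(me, set())
    found = set()
    for contact in direct:
        found |= knows.get(contact, set())
    if exclude_direct:
        # Drop the starting profile and anyone already one hop away.
        found -= direct | {me}
    return found


everyone = second_level(KNOWS, "Felix")
new_faces = second_level(KNOWS, "Felix", exclude_direct=True)
```

With the toy data, `everyone` still contains Felix and Fabian, while `new_faces` keeps only the people who are genuinely new to him.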
Members of the same groups:
Let’s see what research groups Felix is in, and who else is in those groups.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
We find Jeremy and Michael are in the same group as Felix.
==> +-----------------------------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT group.name)                          |
==> +-----------------------------------------------------------------------------+
==> | "Jeremy Chen"       | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> | "Michael Waltinger" | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> +-----------------------------------------------------------------------------+
Are they in any other groups that can help us expand Felix’s network?
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members -[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
Some of those folks are in a ton of groups, so let’s just count the groups to make the output easier to display.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members -[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COUNT( DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
==> +-----------------------------------------------------------+
==> | other_profiles.name   | COUNT( DISTINCT other_group.name) |
==> +-----------------------------------------------------------+
==> | "ABDUL SALAM YUSSIF"  | 12                                |
==> | "Nicholas Overton"    | 9                                 |
==> | "Ashley Cooke"        | 7                                 |
==> | "Joe Reevy"           | 6                                 |
==> | "Moeez Khademhoseiny" | 6                                 |
==> +-----------------------------------------------------------+
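One subtlety in that last query: the column shown counts distinct groups, but the rows are ordered by `COUNT(*)`, which counts matching paths. The two numbers can differ when several teammates lead to the same person. Here is a rough Python sketch of the same walk on made-up toy data (the `MEMBERS` mapping and the `expand_by_groups` function are hypothetical) that returns both numbers side by side.

```python
from collections import defaultdict

# Hypothetical toy data: group name -> set of member profile names.
MEMBERS = {
    "Conjoint": {"Felix", "Jeremy", "Michael"},
    "Pricing": {"Jeremy", "Michael", "Zoe"},
    "Stats": {"Jeremy", "Zoe"},
}


def expand_by_groups(members, me):
    """Walk me -> group <- teammate -> other_group <- other profile.

    Returns {profile: (number_of_paths, number_of_distinct_other_groups)}.
    """
    paths = defaultdict(int)
    groups = defaultdict(set)
    for group, people in members.items():
        if me not in people:
            continue
        for mate in people - {me}:
            for other_group, others in members.items():
                # Skip the group we came from and groups the teammate is not in.
                if other_group == group or mate not in others:
                    continue
                for other in others - {mate}:
                    paths[other] += 1
                    groups[other].add(other_group)
    return {name: (paths[name], len(groups[name])) for name in paths}


result = expand_by_groups(MEMBERS, "Felix")
```

In the toy data Zoe is reached over three paths but through only two distinct groups, so sorting by paths and sorting by distinct groups are not guaranteed to agree.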
Co-Authors:
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- publication -[:authored_by]-> co_authors
RETURN DISTINCT co_authors.name;
These folks co-authored a publication with Felix, so they must like working together, and share similar research interests.
==> +--------------------------+
==> | co_authors.name          |
==> +--------------------------+
==> | "Victor Henning"         |
==> | "Thorsten Hennig-Thurau" |
==> | "Dominik Papies"         |
==> | "Fabian Eggers"          |
==> | "Henrik Sattler"         |
==> | "Mark B Houston"         |
==> | "Nils Wlömert"           |
==> +--------------------------+
That’s not a ton of people, let’s try his 2nd level co-author network:
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors <-[:authored_by]- their_publications -[:authored_by]-> their_co_authors
WHERE me <> their_co_authors
  AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> their_co_authors)
RETURN DISTINCT their_co_authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
We are excluding Felix and his co-authors from the result. I found 18, but here are the top 5:
==> +-----------------------------+
==> | their_co_authors.name | cnt |
==> +-----------------------------+
==> | "Jan Reichelt"        | 27  |
==> | "Jason J Hoyt"        | 21  |
==> | "James Hammerton"     | 15  |
==> | "Kris Jack"           | 15  |
==> | "Dan Harvey"          | 15  |
==> +-----------------------------+
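Here is a minimal Python sketch of that second-level co-author ranking on made-up toy data (the `AUTHORED_BY` mapping and the `second_level_co_authors` function are hypothetical). It counts one hit per path, the way `COUNT(*)` does, and skips Felix plus every direct co-author, which is what the description above says. I build the full set of direct co-authors up front; whether the `NOT(...)` clause filters exactly the same people depends on how the already-bound `my_publications` is treated there, which I have not verified.

```python
from collections import Counter

# Hypothetical toy data: publication id -> list of author names.
AUTHORED_BY = {
    "p1": ["Felix", "Victor"],
    "p2": ["Victor", "Jan"],
    "p3": ["Victor", "Jan", "Kris"],
    "p4": ["Felix", "Kris"],
}


def second_level_co_authors(authored_by, me):
    """Rank co-authors of my co-authors, skipping me and my direct co-authors."""
    direct = {a for authors in authored_by.values() if me in authors
              for a in authors if a != me}
    counts = Counter()
    for my_pub, my_authors in authored_by.items():
        if me not in my_authors:
            continue
        for co in my_authors:
            if co == me:
                continue
            for their_pub, their_authors in authored_by.items():
                # A different publication that my co-author also wrote.
                if their_pub == my_pub or co not in their_authors:
                    continue
                for other in their_authors:
                    if other != co and other != me and other not in direct:
                        counts[other] += 1  # one hit per path
    return counts.most_common()


ranking = second_level_co_authors(AUTHORED_BY, "Felix")
```

In the toy data Jan is reached three times (twice through Victor, once through Kris), while Victor and Kris are dropped because they are direct co-authors.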
In the same Journal:
We can also take a look at authors who appeared in the same journal as Felix, since journals are usually topic-specific and curated for high-quality content.
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:published_in]-> journal <-[:published_in]- other_publications -[:authored_by]-> authors
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
I found 35 authors who were published in the same journals, but here are the top 5:
==> +--------------------------------+
==> | authors.name             | cnt |
==> +--------------------------------+
==> | "Thorsten Hennig-Thurau" | 7   |
==> | "Victor Henning"         | 4   |
==> | "Henrik Sattler"         | 4   |
==> | "Tillmann Wagner"        | 4   |
==> | "Richard J Lutz"         | 2   |
==> +--------------------------------+
We can go to a 2nd level here by using his co-authors:
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors <-[:authored_by]- their_publications -[:published_in]-> journal <-[:published_in]- other_publications -[:authored_by]-> authors
WHERE me <> authors
  AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> authors)
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
That query returns 19.5k authors; here are the top 5:
==> +-------------------------+
==> | authors.name      | cnt |
==> +-------------------------+
==> | "Thomas Cochrane" | 444 |
==> | "Amanda Peters"   | 360 |
==> | "David Jones"     | 336 |
==> | "DJ Riddell"      | 312 |
==> | "J Lavoué"        | 300 |
==> +-------------------------+
This list represents authors who have appeared in the same journals as his co-authors, ordered by the number of paths that exist to them.
Interested in the same Disciplines:
We can actually go multiple ways here.
From his profile we can go to disciplines and find other profiles who are into the same disciplines.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:interested_in]-> disciplines <-[:interested_in]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT disciplines.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
That’s going to return a ton of people who are also into Business Administration; here are 5 of them:
==> +----------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT disciplines.name) |
==> +----------------------------------------------------------+
==> | "John Smith"        | ["Business Administration"]        |
==> | "Andreas Müller"    | ["Business Administration"]        |
==> | "abc abc"           | ["Business Administration"]        |
==> | "abc def"           | ["Business Administration"]        |
==> | "Luis Farinha"      | ["Business Administration"]        |
==> +----------------------------------------------------------+
Since we know Felix is interested in Business Administration, we can also go from disciplines to publications, to other authors who may not have a profile in the system.
START me = node:vertices('type:discipline AND name:"Business Administration"')
MATCH me -[:by_discipline]- publications -[:authored_by]- author
RETURN author.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
==> +---------------------------+
==> | author.name         | cnt |
==> +---------------------------+
==> | "Null Mancas Matei" | 11  |
==> | "Joanne Dyer"       | 8   |
==> | "Nicholas J Turro"  | 8   |
==> | "Steffen Jockusch"  | 4   |
==> | "Angel A Martí"     | 4   |
==> +---------------------------+
Anyway, that was just a bit of exploring the data with Neo4j and Cypher. I’ll try to build a website that makes use of these queries before the August 31st deadline. Leave a comment if you have any ideas or want to help.