Semantic Discovery in Knowledge Graphs

1. The Discovery Problem

Building a personal knowledge graph is easy. Making it useful is hard. The value of a knowledge graph isn't in the nodes — it's in the edges, the connections that reveal how ideas relate.

Manually creating every edge is tedious and limited by what you already know. The interesting connections are often ones you haven't thought of yet. MEMEX solves this by automatically discovering semantic relationships using structured data from Wikidata and link analysis from Wikipedia.

2. Data Sources

2.1 Wikidata

Wikidata is a free, structured knowledge base hosted by the Wikimedia Foundation and edited by volunteers. Every entity (person, concept, work) has a unique identifier (a Q-number), and its statements use properties (P-numbers) to link it to other entities or to literal values such as dates.

For intellectual history, the most valuable properties are:

Property	Name	Example
P737	influenced by	McLuhan → Innis
P738	influenced	Innis → McLuhan
P184	doctoral advisor	Arendt → Jaspers
P185	doctoral student	Jaspers → Arendt
P1066	student of	Žižek → Lacan (informal)
P135	movement	Derrida → Deconstruction
P101	field of work	Kittler → Media theory
P108	employer	McLuhan → U of Toronto
P800	notable work	McLuhan → Understanding Media
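
The properties above can be read out of the JSON that Wikidata returns for an entity. The following is a minimal sketch, assuming the claim layout of the `wbgetentities` API module (`claims[P][i].mainsnak.datavalue.value.id`). The names `RELATION_PROPERTIES` and `extractLinkedIds` are illustrative and are not taken from MEMEX itself.

```javascript
// Hypothetical map from property IDs in the table to readable keys.
const RELATION_PROPERTIES = {
  P737: 'influencedBy',
  P184: 'doctoralAdvisor',
  P1066: 'studentOf',
  P135: 'movements',
  P101: 'fieldsOfWork',
};

// Collect the Q-numbers an entity points to for each property of interest.
// `entity` is assumed to be one value from the `entities` object of a
// wbgetentities response.
function extractLinkedIds(entity, properties = RELATION_PROPERTIES) {
  const result = {};
  for (const [pid, key] of Object.entries(properties)) {
    const claims = entity.claims?.[pid] ?? [];
    result[key] = claims
      // "novalue" and "somevalue" snaks carry no datavalue, so skip them.
      .filter((claim) => claim.mainsnak?.snaktype === 'value')
      .map((claim) => claim.mainsnak.datavalue?.value?.id)
      // Drop anything that is not an entity reference.
      .filter(Boolean);
  }
  return result;
}
```

Properties that are absent from the entity come back as empty arrays, so callers can iterate without null checks.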

2.2 Wikipedia Link Graph

Wikipedia articles link to other articles. These links encode semantic relationships — when an editor writes that "McLuhan was influenced by Harold Innis," they create a hyperlink. Millions of such editorial decisions form a distributed ontology.

The Wikipedia API (the MediaWiki Action API at /w/api.php) provides two relevant query modules:

action=query&prop=links — Outgoing links from an article
action=query&list=backlinks — Incoming links to an article
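
As a rough illustration, the two requests could be built as follows. This is a sketch rather than MEMEX's actual code: the helper names are hypothetical, and restricting results to the article namespace (`plnamespace=0`, `blnamespace=0`) is an assumption about what a knowledge graph would want.

```javascript
const WIKIPEDIA_API = 'https://en.wikipedia.org/w/api.php';

// URL for the outgoing links of an article (prop=links).
function outgoingLinksUrl(title) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'links',
    titles: title,
    plnamespace: '0', // articles only (assumption)
    pllimit: 'max',
    format: 'json',
    origin: '*', // permits anonymous cross-origin requests from a browser
  });
  return `${WIKIPEDIA_API}?${params}`;
}

// URL for the incoming links to an article (list=backlinks).
function backlinksUrl(title) {
  const params = new URLSearchParams({
    action: 'query',
    list: 'backlinks',
    bltitle: title,
    blnamespace: '0', // articles only (assumption)
    bllimit: 'max',
    format: 'json',
    origin: '*',
  });
  return `${WIKIPEDIA_API}?${params}`;
}
```

Both modules page their results. When a response contains a `continue` object, its values need to be sent back with the next request to retrieve the remaining links.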

3. Connection Types

MEMEX discovers four types of connections:

3.1 Explicit Wikidata Relationships

Direct property links between entities. These are the highest-confidence connections, because someone explicitly stated the relationship in structured form.

# Query: Who influenced entity Q5878?
SELECT ?influencer ?influencerLabel WHERE {
  wd:Q5878 wdt:P737 ?influencer.
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en". 
  }
}
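
A query like the one above could be sent to the Wikidata Query Service from JavaScript roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and it relies on the standard SPARQL JSON results layout (`results.bindings`, with each cell carrying a `value`).

```javascript
const WDQS_ENDPOINT = 'https://query.wikidata.org/sparql';

// Build the GET URL for a SPARQL query.
function sparqlUrl(query) {
  return `${WDQS_ENDPOINT}?${new URLSearchParams({ query, format: 'json' })}`;
}

// Flatten SPARQL JSON results into plain { variable: value } objects.
function flattenBindings(json) {
  return json.results.bindings.map((row) =>
    Object.fromEntries(
      Object.entries(row).map(([name, cell]) => [name, cell.value])
    )
  );
}

// Entity values arrive as full URIs; keep only the trailing Q-number.
function qidFromUri(uri) {
  return uri.split('/').pop();
}

// Run a query and return flattened rows (requires a runtime with fetch).
async function runSparql(query) {
  const response = await fetch(sparqlUrl(query), {
    headers: { Accept: 'application/sparql-results+json' },
  });
  if (!response.ok) {
    throw new Error(`SPARQL request failed: ${response.status}`);
  }
  return flattenBindings(await response.json());
}
```

Keeping the parsing in `flattenBindings` separate from the network call makes it possible to test the result handling against a saved response.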

3.2 Shared Property Membership

Two entities share a categorical property value. If both McLuhan and Kittler have P101 = Media theory, that's a connection — they work in the same field.

// Find property values shared by two entities (e.g. a common movement or field)
async function findSharedProperties(entityA, entityB) {
  const propsA = await getEntityProperties(entityA);
  const propsB = await getEntityProperties(entityB);
  
  const shared = [];
  for (const [prop, valuesA] of Object.entries(propsA)) {
    const valuesB = propsB[prop] || [];
    const overlap = valuesA.filter(v => valuesB.includes(v));
    if (overlap.length > 0) {
      shared.push({ property: prop, values: overlap });
    }
  }
  return shared;
}

3.3 Wikipedia Cross-References

If Wikipedia article A links to Wikipedia article B, there's likely a semantic relationship. This catches connections that haven't been formalized in Wikidata.

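// Check if article A links to article B
async function checkWikipediaLink(titleA, titleB) {
  // URLSearchParams percent-encodes titles containing spaces, "&", etc.
  const params = new URLSearchParams({
    action: 'query',
    titles: titleA,
    prop: 'links',
    pltitles: titleB,
    format: 'json',
    origin: '*' // needed for anonymous cross-origin requests from a browser
  });
  const response = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await response.json();
  const pages = Object.values(data.query?.pages ?? {});
  // `links` is absent when A does not link to B (or A does not exist)
  return (pages[0]?.links?.length ?? 0) > 0;
}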

3.4 Manual Connections

User-created links with custom relationship types: influences, extends, contradicts, cites, applies-to, etc. These capture insights that aren't encoded in any external database.
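
One possible shape for a manual edge is sketched below. The field names (`origin`, `custom`, `note`) and the `createManualEdge` helper are assumptions made for illustration; the source does not specify how MEMEX stores these links.

```javascript
// Relationship types named in the text; users may also supply their own.
const SUGGESTED_TYPES = new Set([
  'influences',
  'extends',
  'contradicts',
  'cites',
  'applies-to',
]);

// Create a user-defined edge between two nodes in the graph.
function createManualEdge(sourceId, targetId, type, note = '') {
  if (sourceId === targetId) {
    throw new Error('An edge needs two distinct nodes');
  }
  const normalized = String(type).trim().toLowerCase();
  if (!normalized) {
    throw new Error('A relationship type is required');
  }
  return {
    source: sourceId,
    target: targetId,
    type: normalized,
    custom: !SUGGESTED_TYPES.has(normalized), // true for user-invented types
    origin: 'manual', // marks the edge as a personal annotation
    note,
  };
}
```

Normalizing the type to lower case keeps "Extends" and "extends" from becoming two separate relationship types.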

4. Discovery Algorithm

4.1 On Node Addition

When a user adds a new entity to their graph:

  1. Fetch the Wikidata entity by label search
  2. Retrieve all relevant properties (P737, P135, P101, etc.)
  3. For each property value, check if it matches an existing node
  4. Create edges for matches
  5. Optionally fetch Wikipedia links and check against existing nodes

async function discoverConnections(newNode, existingNodes) {
  const connections = [];
  
  // 1. Direct Wikidata relationships
  const wdProps = await fetchWikidataProperties(newNode.wikidataId);
  
  for (const node of existingNodes) {
    // Check if the new node is influenced by the existing node
    if (wdProps.influencedBy?.includes(node.wikidataId)) {
      connections.push({
        source: newNode.id,
        target: node.id,
        type: 'influenced_by',
        origin: 'wikidata' // provenance of the edge
      });
    }
    
    // Check shared movements (either list may be missing)
    const nodeMovements = node.properties?.movements ?? [];
    const sharedMovements = (wdProps.movements ?? [])
      .filter(m => nodeMovements.includes(m));
    if (sharedMovements.length > 0) {
      connections.push({
        source: newNode.id,
        target: node.id,
        type: 'shared_movement',
        origin: 'wikidata',
        data: sharedMovements
      });
    }
  }
  
  // 2. Wikipedia link analysis
  const wikiLinks = await fetchWikipediaLinks(newNode.wikipediaTitle);
  
  for (const node of existingNodes) {
    if (wikiLinks.includes(node.wikipediaTitle)) {
      connections.push({
        source: newNode.id,
        target: node.id,
        type: 'wikipedia_link',
        origin: 'wikipedia'
      });
    }
  }
  
  return connections;
}

4.2 Full Graph Scan

The "Discover Links" function runs pairwise comparison across all nodes, finding connections that might have been missed on initial addition (e.g., if node B was added before node A had its properties fetched).
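
The pairwise scan could look roughly like the sketch below. It assumes each node's properties have already been fetched and cached, so that the comparison itself is synchronous. The `compare` callback and the edge key format are hypothetical.

```javascript
// Visit every unordered pair of nodes once and return edges that are
// not already in the graph. `compare(a, b)` returns candidate edges.
function scanGraph(nodes, existingEdges, compare) {
  const keyOf = (edge) => `${edge.source}|${edge.target}|${edge.type}`;
  const seen = new Set(existingEdges.map(keyOf));
  const discovered = [];

  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      for (const edge of compare(nodes[i], nodes[j])) {
        const key = keyOf(edge);
        if (!seen.has(key)) {
          seen.add(key); // also prevents duplicates within this scan
          discovered.push(edge);
        }
      }
    }
  }
  return discovered;
}
```

A graph of n nodes requires n(n-1)/2 comparisons, so the scan grows quadratically. That is one reason to run it on demand rather than on every node addition.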

5. Edge Visualization

Different connection types render differently:

Type	Style	Color
Wikidata influence	Solid, directed arrow	Orange
Shared properties	Dashed	Yellow
Wikipedia links	Thin solid	Blue
Manual	Solid	Green

This visual hierarchy lets users quickly distinguish curated relationships (Wikidata) from inferred ones (Wikipedia links) from personal annotations (manual).
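
The table above maps naturally onto a lookup keyed by edge type. The sketch below is an assumption about how that might be expressed: the style field names and numeric widths are illustrative, and the edge type strings are borrowed from the discovery code in section 4.1.

```javascript
// Style table mirroring the edge types and colors listed above.
const EDGE_STYLES = {
  influenced_by: { line: 'solid', width: 2, arrow: true, color: 'orange' },
  shared_movement: { line: 'dashed', width: 2, arrow: false, color: 'yellow' },
  wikipedia_link: { line: 'solid', width: 1, arrow: false, color: 'blue' },
  manual: { line: 'solid', width: 2, arrow: false, color: 'green' },
};

// Manual edges carry user-defined types ("extends", "cites", ...),
// so any type without its own entry falls back to the manual style.
function styleForEdge(edge) {
  return EDGE_STYLES[edge.type] ?? EDGE_STYLES.manual;
}
```

Keeping styles in data rather than in rendering code means a new connection type only needs one extra table entry.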

6. SPARQL for Complex Queries

For advanced discovery, MEMEX can execute SPARQL queries against the Wikidata Query Service. Example: find all entities that influenced both McLuhan AND Kittler:

SELECT ?influencer ?influencerLabel WHERE {
  wd:Q193489 wdt:P737 ?influencer.  # McLuhan influenced by
  wd:Q77116 wdt:P737 ?influencer.   # Kittler influenced by
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en". 
  }
}

This returns Harold Innis — a node the user might not have thought to add but which connects two existing nodes in their graph.
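
The same pattern generalizes to any set of nodes already in the graph. The sketch below builds such a query from a list of Q-numbers. The function name is hypothetical, and validating each ID against `Q<digits>` is an added safeguard to keep arbitrary text out of the query string.

```javascript
// Build a SPARQL query for entities that influenced every given entity.
function commonInfluencersQuery(qids) {
  if (qids.length < 2) {
    throw new Error('Need at least two entity IDs');
  }
  for (const id of qids) {
    if (!/^Q\d+$/.test(id)) {
      throw new Error(`Not a Q-number: ${id}`);
    }
  }
  // One triple pattern per entity; sharing ?influencer makes it an AND.
  const patterns = qids.map((id) => `  wd:${id} wdt:P737 ?influencer.`);
  return [
    'SELECT ?influencer ?influencerLabel WHERE {',
    ...patterns,
    '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }',
    '}',
  ].join('\n');
}
```

Results that are not yet in the user's graph are candidates for new nodes, since each one would connect several existing nodes at once.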

7. The Emergent Ontology Thesis

Wikipedia's link structure isn't random. When thousands of editors independently decide which articles to link, they're making semantic judgments. The aggregate of these judgments forms an emergent ontology — a map of conceptual relationships that no single person designed but that reflects collective human knowledge organization.

Key Insight

Wikipedia links are editorial assertions. A link from article A to article B means an editor believed, at some point, that readers of A would benefit from knowing about B. This is semantic data hiding in hypertext.

MEMEX treats Wikipedia not as a source of text to read, but as a graph to traverse. The articles are nodes; the links are edges; the structure is the knowledge.

8. Limitations

8.1 Wikidata Coverage

Wikidata's "influenced by" property (P737) is unevenly populated. Major philosophers have extensive entries; obscure academics may have none. The system falls back to Wikipedia links when Wikidata is sparse.

8.2 Link Noise

Not every Wikipedia link is semantically meaningful. An article might link to "United States" or "1945" — these are navigational, not conceptual. MEMEX filters common low-signal targets.
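
The source does not describe the filter in detail, so the following is only one plausible approach: a small stoplist combined with patterns for years, decades, and non-article namespaces. The stoplist entries and function names are illustrative assumptions.

```javascript
// Illustrative stoplist; a real one would be longer and user-editable.
const LOW_SIGNAL_TITLES = new Set([
  'United States',
  'United Kingdom',
  'World War II',
]);

// Matches titles such as "1945", "1960s", or "44 BC".
const YEAR_OR_DECADE = /^\d{1,4}s?( BC)?$/;

// Matches pages outside the article namespace.
const NON_ARTICLE_PREFIX = /^(Category|Template|Help|Wikipedia|Portal|File):/;

function isLowSignal(title) {
  return (
    LOW_SIGNAL_TITLES.has(title) ||
    YEAR_OR_DECADE.test(title) ||
    NON_ARTICLE_PREFIX.test(title)
  );
}

// Keep only link targets likely to carry conceptual meaning.
function filterLinks(titles) {
  return titles.filter((title) => !isLowSignal(title));
}
```

A fixed stoplist is a blunt instrument. Since links are only matched against nodes the user has chosen to add, a title like "United States" becomes a problem mainly when it is itself a node in the graph.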

8.3 Directionality

Wikipedia links are one-way. If A links to B, we know A's article mentions B, but not vice versa. Wikidata's inverse properties (P737/P738) solve this for influence relationships, but not for all connection types.
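
Where the outgoing links of both articles are available, direction can at least be recorded per pair. The sketch below assumes a hypothetical `linksByTitle` map from each article title to the set of titles it links to.

```javascript
// Classify the link relationship between articles a and b.
// linksByTitle: Map<string, Set<string>> of outgoing links per article.
function classifyLink(linksByTitle, a, b) {
  const aToB = linksByTitle.get(a)?.has(b) ?? false;
  const bToA = linksByTitle.get(b)?.has(a) ?? false;
  if (aToB && bToA) return 'mutual';
  if (aToB) return 'a_to_b';
  if (bToA) return 'b_to_a';
  return 'none';
}
```

A mutual link is arguably a stronger signal than a one-way mention, though a link in either direction says nothing about the nature of the relationship.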

9. Future Work

OpenAlex integration — Citation networks from academic literature, showing who actually cited whom (not just who Wikipedia says influenced whom)
Path finding — SPARQL queries to find shortest path between two entities through the Wikidata graph
Clustering — Automatic grouping of nodes by shared properties (all Frankfurt School members, all media theorists)
Temporal visualization — Timeline view showing when influences could have occurred based on birth/death dates

10. Conclusion

Personal knowledge management tools typically treat connections as something users must create manually. MEMEX inverts this: connections are discovered automatically from the semantic structure already encoded in Wikidata and Wikipedia. The user's job shifts from link creation to link curation — reviewing discovered connections, adding manual refinements, and building a personalized map of how ideas relate.

The underlying thesis is that the web already contains a vast, implicit knowledge graph. MEMEX makes it explicit and personal.