Technical Report · October 2024

Semantic Discovery in Knowledge Graphs

How MEMEX uses Wikidata properties, Wikipedia hyperlink analysis, and graph traversal to surface non-obvious connections between thinkers, concepts, and texts in personal knowledge bases.

1. The Discovery Problem

Building a personal knowledge graph is easy. Making it useful is hard. The value of a knowledge graph isn't in the nodes — it's in the edges, the connections that reveal how ideas relate.

Manually creating every edge is tedious and limited by what you already know. The interesting connections are often ones you haven't thought of yet. MEMEX solves this by automatically discovering semantic relationships using structured data from Wikidata and link analysis from Wikipedia.

2. Data Sources

2.1 Wikidata

Wikidata is a free, structured knowledge base hosted by the Wikimedia Foundation and maintained by a volunteer community. Every item (person, concept, work) has a unique identifier (Q-number) and a set of statements whose properties (P-numbers) link it to other items.

For intellectual history, the most valuable properties are:

Property  Name              Example
P737      influenced by     McLuhan → Innis
P738      influenced        Innis → McLuhan
P184      doctoral advisor  Heidegger → Husserl
P185      doctoral student  Husserl → Heidegger
P1066     student of        Žižek → Lacan (informal)
P135      movement          Derrida → Deconstruction
P101      field of work     Kittler → Media theory
P108      employer          McLuhan → U of Toronto
P800      notable work      McLuhan → Understanding Media
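
To make the property table concrete, here is a minimal sketch of reading such properties from the JSON that Wikidata returns for an item. The helper name `extractLinkedEntityIds` and the sample data are illustrative assumptions, not MEMEX's actual code. The `claims`, `mainsnak`, `datavalue`, `value.id` path follows Wikidata's published JSON format.

```javascript
// Hypothetical helper: collect the Q-ids that a given property points to.
// `entity` is one entry of the `entities` object in Wikidata's JSON output.
function extractLinkedEntityIds(entity, propertyId) {
  const statements = entity.claims?.[propertyId] ?? [];
  return statements
    // Skip "no value" / "unknown value" snaks, which carry no datavalue.
    .filter(s => s.mainsnak?.snaktype === 'value')
    // Keep only values that are themselves items, not strings or dates.
    .filter(s => s.mainsnak.datavalue?.type === 'wikibase-entityid')
    .map(s => s.mainsnak.datavalue.value.id);
}

// Made-up sample in the same shape; the ids are placeholders, not real lookups.
const sampleEntity = {
  id: 'Q1',
  claims: {
    P737: [
      { mainsnak: { snaktype: 'value', datavalue: { type: 'wikibase-entityid', value: { id: 'Q2' } } } },
      { mainsnak: { snaktype: 'somevalue' } },
    ],
  },
};

console.log(extractLinkedEntityIds(sampleEntity, 'P737')); // [ 'Q2' ]
console.log(extractLinkedEntityIds(sampleEntity, 'P135')); // []
```

Returning an empty array for a missing property keeps the calling code free of null checks, which matters because many items have no P737 statements at all.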

2.2 Wikipedia Link Graph

Wikipedia articles link to other articles. These links encode semantic relationships — when an editor writes that "McLuhan was influenced by Harold Innis," they create a hyperlink. Millions of such editorial decisions form a distributed ontology.

The Wikipedia API provides two relevant endpoints:

3. Connection Types

MEMEX discovers four types of connections:

3.1 Explicit Wikidata Relationships

Direct property links between entities. These are the highest confidence connections — someone explicitly stated this relationship in structured form.

# Query: who influenced entity Q5878?
SELECT ?influencer ?influencerLabel WHERE {
  wd:Q5878 wdt:P737 ?influencer.
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
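
A query like this could be sent to the Wikidata Query Service from JavaScript roughly as follows. This is a sketch under assumptions: the function names `buildSparqlUrl`, `parseBindings`, and `runSparql` are hypothetical, and the report does not say how MEMEX actually issues its requests. The endpoint URL and the `results.bindings` layout are those of the public query service and the SPARQL JSON results format.

```javascript
const WDQS_ENDPOINT = 'https://query.wikidata.org/sparql';

// Build a GET URL for the query service; `format=json` requests JSON results.
function buildSparqlUrl(query) {
  const params = new URLSearchParams({ query, format: 'json' });
  return `${WDQS_ENDPOINT}?${params}`;
}

// Flatten SPARQL JSON results into plain objects: { varName: value, ... }.
function parseBindings(json) {
  return json.results.bindings.map(row =>
    Object.fromEntries(
      Object.entries(row).map(([name, cell]) => [name, cell.value])
    )
  );
}

// Hypothetical wrapper; needs network access and a runtime with global fetch.
async function runSparql(query) {
  const response = await fetch(buildSparqlUrl(query), {
    headers: { Accept: 'application/sparql-results+json' },
  });
  if (!response.ok) {
    throw new Error(`SPARQL request failed: ${response.status}`);
  }
  return parseBindings(await response.json());
}
```

Keeping URL construction and result parsing separate from the network call makes both easy to check without hitting the live service.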

3.2 Shared Property Membership

Two entities share a categorical property value. If both McLuhan and Kittler have P101 = Media theory, that's a connection — they work in the same field.

// Find property values shared by two entities (e.g. movements or fields of work)
async function findSharedProperties(entityA, entityB) {
  const propsA = await getEntityProperties(entityA);
  const propsB = await getEntityProperties(entityB);
  
  const shared = [];
  for (const [prop, valuesA] of Object.entries(propsA)) {
    const valuesB = propsB[prop] || [];
    const overlap = valuesA.filter(v => valuesB.includes(v));
    if (overlap.length > 0) {
      shared.push({ property: prop, values: overlap });
    }
  }
  return shared;
}

3.3 Wikipedia Cross-References

If Wikipedia article A links to Wikipedia article B, there's likely a semantic relationship. This catches connections that haven't been formalized in Wikidata.

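// Check if article A links to article B
async function checkWikipediaLink(titleA, titleB) {
  // URLSearchParams encodes titles containing spaces, "&", or non-ASCII characters
  const params = new URLSearchParams({
    action: 'query',
    titles: titleA,
    prop: 'links',
    pltitles: titleB,
    format: 'json',
    origin: '*' // allows anonymous cross-origin requests from a browser
  });
  const response = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await response.json();
  const pages = Object.values(data.query?.pages ?? {});
  // `links` is absent when A does not link to B (or A does not exist)
  return (pages[0]?.links?.length ?? 0) > 0;
}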
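
Checking one pair at a time costs one request per pair. An alternative is to fetch all outgoing links of an article once and compare locally. The sketch below assumes a hypothetical helper `fetchAllLinks` with an injected `fetchJson` function (so the paging logic can be exercised offline); it follows the MediaWiki Action API convention of echoing the returned `continue` object back as request parameters until no more batches remain.

```javascript
// Hypothetical helper. `fetchJson(url)` is injected; in a browser or Node 18+
// it could be: url => fetch(url).then(r => r.json())
async function fetchAllLinks(title, fetchJson) {
  const titles = [];
  let cont = {};
  do {
    const params = new URLSearchParams({
      action: 'query',
      prop: 'links',
      titles: title,
      pllimit: 'max',
      format: 'json',
      origin: '*',
      ...cont, // carries plcontinue from the previous batch, if any
    });
    const data = await fetchJson(`https://en.wikipedia.org/w/api.php?${params}`);
    for (const page of Object.values(data.query?.pages ?? {})) {
      for (const link of page.links ?? []) titles.push(link.title);
    }
    // The API omits `continue` once the last batch has been delivered.
    cont = data.continue ?? null;
  } while (cont);
  return titles;
}
```

The returned titles can then be matched against existing nodes with a single `Set` lookup per node.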

3.4 Manual Connections

User-created links with custom relationship types: influences, extends, contradicts, cites, applies-to, etc. These capture insights that aren't encoded in any external database.

4. Discovery Algorithm

4.1 On Node Addition

When a user adds a new entity to their graph:

  1. Fetch Wikidata entity by label search
  2. Retrieve all relevant properties (P737, P135, P101, etc.)
  3. For each property value, check if it matches an existing node
  4. Create edges for matches
  5. Optionally fetch Wikipedia links and check against existing nodes

async function discoverConnections(newNode, existingNodes) {
  const connections = [];
  
  // 1. Direct Wikidata relationships
  const wdProps = await fetchWikidataProperties(newNode.wikidataId);
  
  for (const node of existingNodes) {
    // Check if the new node is influenced by an existing node
    if (wdProps.influencedBy?.includes(node.wikidataId)) {
      connections.push({
        source: newNode.id,
        target: node.id,
        type: 'influenced_by',
        origin: 'wikidata' // data provenance, kept separate from the source node id
      });
    }
    
    // Check shared movements
    const sharedMovements = intersect(
      wdProps.movements, 
      node.properties?.movements
    );
    if (sharedMovements.length > 0) {
      connections.push({
        source: newNode.id,
        target: node.id,
        type: 'shared_movement',
        data: sharedMovements
      });
    }
  }
  
  // 2. Wikipedia link analysis
  const wikiLinks = await fetchWikipediaLinks(newNode.wikipediaTitle);
  
  for (const node of existingNodes) {
    if (wikiLinks.includes(node.wikipediaTitle)) {
      connections.push({
        source: newNode.id,
        target: node.id,
        type: 'wikipedia_link'
      });
    }
  }
  
  return connections;
}
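
The function above calls an `intersect` helper that this report does not define. One possible sketch, assuming it should return the values present in both lists and treat a missing list as empty:

```javascript
// Hypothetical helper assumed by discoverConnections.
// Returns the distinct values of listA that also appear in listB.
function intersect(listA, listB) {
  const inB = new Set(listB ?? []);
  return [...new Set(listA ?? [])].filter(value => inB.has(value));
}

console.log(intersect(['Q1', 'Q2', 'Q2'], ['Q2', 'Q3'])); // [ 'Q2' ]
console.log(intersect(undefined, ['Q2']));                // []
```

Tolerating `undefined` matters here because `node.properties?.movements` is undefined for any node whose properties have not been fetched yet.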

4.2 Full Graph Scan

The "Discover Links" function runs pairwise comparison across all nodes, finding connections that might have been missed on initial addition (e.g., if node B was added before node A had its properties fetched).
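
A pairwise scan of this kind might look like the sketch below. The names `scanAllPairs` and `dedupeConnections` are hypothetical, and the `compare` callback stands in for whatever per-pair checks MEMEX runs. Each unordered pair is visited once, so a graph of n nodes needs n(n-1)/2 comparisons.

```javascript
// Hypothetical sketch: `compare(a, b)` returns an array of connections for one
// pair (possibly empty). Because each unordered pair is visited only once,
// `compare` should check both directions itself.
function scanAllPairs(nodes, compare) {
  const found = [];
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      found.push(...compare(nodes[i], nodes[j]));
    }
  }
  return found;
}

// Drop connections that already exist, keyed by source, target, and type.
function dedupeConnections(existing, candidates) {
  const key = c => `${c.source}|${c.target}|${c.type}`;
  const seen = new Set(existing.map(key));
  return candidates.filter(c => {
    if (seen.has(key(c))) return false;
    seen.add(key(c));
    return true;
  });
}
```

Deduplicating against the existing edge list keeps a repeated scan from adding the same edge twice.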

5. Edge Visualization

Different connection types render differently:

Type                Style                  Color
Wikidata influence  Solid, directed arrow  Orange
Shared properties   Dashed                 Yellow
Wikipedia links     Thin solid             Blue
Manual              Solid                  Green

This visual hierarchy lets users quickly distinguish curated relationships (Wikidata) from inferred ones (Wikipedia links) from personal annotations (manual).

6. SPARQL for Complex Queries

For advanced discovery, MEMEX can execute SPARQL queries against the Wikidata Query Service. Example: find all entities that influenced both McLuhan AND Kittler:

SELECT ?influencer ?influencerLabel WHERE {
  wd:Q193489 wdt:P737 ?influencer.  # McLuhan influenced by
  wd:Q77116 wdt:P737 ?influencer.   # Kittler influenced by
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en". 
  }
}

This returns Harold Innis — a node the user might not have thought to add but which connects two existing nodes in their graph.
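
The same idea can be applied locally once P737 values have been fetched for every node. The following is a sketch, not MEMEX's documented behavior: `suggestBridgeNodes` is a hypothetical function, and its input shape (a map from node id to the ids it lists as influences) is an assumption.

```javascript
// Hypothetical sketch. `influencedBy` maps a node's Wikidata id to the ids it
// lists under P737. Returns ids cited by at least `minShared` nodes that are
// not themselves in the graph yet.
function suggestBridgeNodes(influencedBy, minShared = 2) {
  const inGraph = new Set(Object.keys(influencedBy));
  const citedBy = new Map();
  for (const [nodeId, influencers] of Object.entries(influencedBy)) {
    for (const id of new Set(influencers)) {
      if (inGraph.has(id)) continue; // already a node, nothing to suggest
      if (!citedBy.has(id)) citedBy.set(id, []);
      citedBy.get(id).push(nodeId);
    }
  }
  return [...citedBy]
    .filter(([, nodes]) => nodes.length >= minShared)
    .map(([id, nodes]) => ({ id, sharedBy: nodes }));
}

// Placeholder ids: X is cited by A and B, Y by A and C, and A is already a node.
console.log(suggestBridgeNodes({ A: ['X', 'Y'], B: ['X'], C: ['A', 'Y'] }));
// [ { id: 'X', sharedBy: [ 'A', 'B' ] }, { id: 'Y', sharedBy: [ 'A', 'C' ] } ]
```

Unlike the two-entity SPARQL query, this covers every pair in the graph in one pass and needs no further network requests.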

7. The Emergent Ontology Thesis

Wikipedia's link structure isn't random. When thousands of editors independently decide which articles to link, they're making semantic judgments. The aggregate of these judgments forms an emergent ontology — a map of conceptual relationships that no single person designed but that reflects collective human knowledge organization.

Key Insight
Wikipedia links are editorial assertions. A link from article A to article B means an editor believed, at some point, that readers of A would benefit from knowing about B. This is semantic data hiding in hypertext.

MEMEX treats Wikipedia not as a source of text to read, but as a graph to traverse. The articles are nodes; the links are edges; the structure is the knowledge.

8. Limitations

8.1 Wikidata Coverage

Wikidata's "influenced by" property (P737) is unevenly populated. Major philosophers have extensive entries; obscure academics may have none. The system falls back to Wikipedia links when Wikidata is sparse.

8.2 Link Noise

Not every Wikipedia link is semantically meaningful. An article might link to "United States" or "1945" — these are navigational, not conceptual. MEMEX filters common low-signal targets.
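
This report does not specify how the filter is implemented. One simple approach, shown here purely as an illustration, combines a stoplist of very common targets with a pattern for year and decade articles. The stoplist contents and the names `LOW_SIGNAL_TITLES`, `isLowSignal`, and `filterLinks` are assumptions.

```javascript
// Hypothetical stoplist; the real list of low-signal targets is not given here.
const LOW_SIGNAL_TITLES = new Set(['United States', 'United Kingdom', 'World War II']);

// Titles that are just a year ("1945") or a decade ("1960s").
const YEAR_OR_DECADE = /^\d{3,4}s?$/;

function isLowSignal(title) {
  return LOW_SIGNAL_TITLES.has(title) || YEAR_OR_DECADE.test(title);
}

function filterLinks(titles) {
  return titles.filter(title => !isLowSignal(title));
}

console.log(filterLinks(['Harold Innis', 'United States', '1945', '1960s']));
// [ 'Harold Innis' ]
```

A fixed stoplist is crude. A data-driven alternative would be to rank targets by how many articles link to them and drop the most-linked ones, at the cost of extra API calls.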

8.3 Directionality

Wikipedia links are one-way. If A links to B, we know A's article mentions B, but not vice versa. Wikidata's inverse properties (P737/P738) solve this for influence relationships, but not for all connection types.
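
Once outgoing links have been fetched for both articles, the direction of a Wikipedia connection can be recorded explicitly. A minimal sketch, assuming a hypothetical `linkDirection` function and an `outgoing` map from article title to the titles it links to:

```javascript
// Hypothetical sketch. Returns 'mutual', 'a_to_b', 'b_to_a', or 'none'.
function linkDirection(outgoing, titleA, titleB) {
  const aToB = (outgoing[titleA] ?? []).includes(titleB);
  const bToA = (outgoing[titleB] ?? []).includes(titleA);
  if (aToB && bToA) return 'mutual';
  if (aToB) return 'a_to_b';
  if (bToA) return 'b_to_a';
  return 'none';
}

// Made-up link lists for illustration only.
const outgoing = {
  'Marshall McLuhan': ['Harold Innis', 'Media theory'],
  'Harold Innis': ['Marshall McLuhan'],
  'Media theory': [],
};

console.log(linkDirection(outgoing, 'Marshall McLuhan', 'Harold Innis')); // mutual
console.log(linkDirection(outgoing, 'Marshall McLuhan', 'Media theory')); // a_to_b
```

Storing the direction lets the graph distinguish a mutual reference from a one-sided mention, which the plain `wikipedia_link` edge type does not.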

9. Future Work

10. Conclusion

Personal knowledge management tools typically treat connections as something users must create manually. MEMEX inverts this: connections are discovered automatically from the semantic structure already encoded in Wikidata and Wikipedia. The user's job shifts from link creation to link curation — reviewing discovered connections, adding manual refinements, and building a personalized map of how ideas relate.

The underlying thesis is that the web already contains a vast, implicit knowledge graph. MEMEX makes it explicit and personal.