diff --git a/01_index.markdown b/01_index.markdown
index 47946cc..d416c7d 100644
--- a/01_index.markdown
+++ b/01_index.markdown
@@ -13,97 +13,152 @@ background: home/bg.png
 ---
+

Knowledge Graph Construction

-
- Graphlet AI is a data engineering, data science and artificial intelligence consultancy specializing in knowledge graph construction, also known as property graph construction. We transform and refine raw data on your data lake to build large networks ranging in the millions, billions or even trillions of nodes and edges that model entire business domains to solve complex problems with global footprints. We use big data tools and go beyond simple ETL by using machine learning and artificial intelligence to construct a graph model of your business domain that maps closely to solutions to your business problems. Using a modern graph database, your data science and machine learning teams can then efficiently mine this refined graph to find solutions to your most pressing data science problems. +
+
+ Graphlet AI is a data engineering, data science and artificial intelligence consultancy specializing in knowledge graph construction, also known as property graph construction. We build data pipelines that take raw data and feed your graph database clean data. +
+
+
+ We transform and refine raw data on your data lake to build large networks ranging in the millions, billions or even trillions of nodes and edges that model entire business domains to solve complex problems with global footprints. +
+
+
+ We love big data and large networks. We use big data tools to scale data pipelines that go beyond traditional ETL and entity resolution, using artificial intelligence - graph machine learning - to construct a high-fidelity network model of your business domain that maps directly to solutions to your business problems. It lets you run the queries that answer the problems vexing you and drive the features your customers demand. Using a modern graph database, your data science and machine learning teams can then efficiently mine this refined graph to find solutions to your most pressing data science problems. +
-

We build knowledge graph factories

+
+

We build knowledge graph factories

- Knowledge Graph Construction Architecture + Knowledge Graph Construction Architecture
-

We build property graphs in 3 steps

-
1) Transform myriad datasets into a common ontology. This means we Extract, Transform, Load (ETL) [or ELT] multiple, large and small datasets from different sources with different formats into a common property graph schema using tools like Python, PySpark, Databricks or Snowflake. How much ETL varies by industry from minimal with cybersecurity applications to simplified graph model with fewer makes it easy to access, query, analyze and model in a graph database such as Neo4j, TigerGraph, ArangoDB or Neptune.
+
+

We build property graphs in 3 steps

+
+
+ 1) Transform myriad datasets into a common ontology. This means we Extract, Transform, Load (ETL) [or ELT] multiple large and small datasets from different sources and in different formats into a common property graph schema using tools like Python, PySpark, Databricks or Snowflake. How much ETL is needed varies by industry; cybersecurity applications, for example, need very little. A simplified graph model is easy to access, query, analyze and model in a graph database such as Neo4j, TigerGraph, ArangoDB or Neptune. +
+
- Raw Data in Bronze Tables + Raw Data in Bronze Tables
- Transformed, Cleaned Data in Silver Tables + Transformed, Cleaned Data in Silver Tables
+
+
+ 2) Extract a graph from text using Natural Language Processing (NLP) via a chain of operations: NER -> IE -> EL. Named entity recognition (NER) identifies the entities that correspond to nodes. Information extraction (IE) creates relationships [edges] between entities. Entity linking (EL) links the nodes and edges extracted from text documents into a single core graph established via ETL.
-
-
-
-
-

What is a property graph, knowledge graph and triple store?

+
- A property graph is a set of objects representing nodes [also known as vertex/vertices] and edges [also known as links]. + Initially, a process of exploratory data analysis (EDA) reveals patterns that help handle the combinatorial problem that arises in entity matching from comparing every node in the graph with every other node. The number of comparisons grows as n^2, where n is the number of nodes. This can quickly get out of hand with millions or billions of nodes! Blocking is a strategy that prunes the comparisons down to smaller, more manageable groups of nodes.
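As a rough illustration of blocking, the minimal Python sketch below groups records by a key and only compares records that share a block. The `blocking_key` rule (first three letters of the lowercased name) and the sample records are hypothetical, chosen only to show the reduction in comparisons; real blocking keys come out of the EDA described above.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three letters of the lowercased name."""
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Group records into blocks and return pairs only from within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    pairs = []
    for block in blocks.values():
        # Only records that share a block are ever compared.
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Globex"},
    {"id": 4, "name": "Globex LLC"},
    {"id": 5, "name": "Initech"},
]

pairs = candidate_pairs(records)
# Number of comparisons without blocking: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

A coarser key produces larger blocks and more comparisons but misses fewer true matches; a finer key does the opposite.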
-

Why should I build a knowledge graph for my business?

-
I'll let you in on a secret that is driving the popularity of enterprise knowledge graphs, property graphs, graph databases and Graph Neural Networks (GNNs): MOST DATA IS GRAPH DATA. To compose a single table to get the corresponding vectors, matrices and tensors we load into GPUs to drive machine learning algorithms, several tables have usually been combined [squashed] into one table. There's a problem with this... it is a lossy process. We threw away the relationships. Graph neural networks are able to learn better to build more powerful models because they have a greater potential by matching the structure of the data’s entities and their relationships. +
+
+
+
+ Raw Data in Bronze Tables +
-
-
-

What’s the real story with Graph Neural Networks (GNNs)?

-
I'll let you in on a secret that is driving the popularity of enterprise knowledge graphs, property graphs, graph databases and Graph Neural Networks (GNNs): MOST DATA IS GRAPH DATA. To compose a single table to get the corresponding vectors, matrices and tensors we load into GPUs to drive machine learning algorithms, several tables have usually been combined [squashed] into one table. There's a problem with this... it is a lossy process. We threw away the relationships. Graph neural networks are able to learn better to build more powerful models because they have a greater potential by matching the structure of the data’s entities and their relationships. +
+
+
+ Raw Data in Bronze Tables +
+
+
+ 3) Entity resolution using network topology and natural language processing. Recent developments in Large Language Models (LLMs) and Graph Neural Networks (GNNs) allow us to encode nodes and edges as XML-like text using a language model and then combine them based on semantic inferences made by the LLM, together with inferences a GNN makes about the network. LLMs have seen many documents on the World Wide Web that resemble the nodes’ text representation. +
+
+
+ Manual blocking and matching for numerous datasets is a cumbersome and expensive activity. Advances in AI - representation learning and an architecture from Google called Grale - make a generic entity resolution (ER) system possible. This system is configurable to work across multiple datasets by embedding records using large language models (LLMs) such as GPT-3 or ChatGPT, but tuned specifically for the entity matching task. +
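To make the idea of matching on embedded records concrete, here is a minimal sketch that treats two records as a match when the cosine similarity of their embeddings passes a threshold. It is not Grale or a fine-tuned LLM: the three-dimensional vectors are hand-written stand-ins for model embeddings, and the record names and the 0.9 threshold are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_records(embeddings, threshold=0.9):
    """Return pairs of record ids whose embeddings are at least `threshold` similar."""
    ids = sorted(embeddings)
    matches = []
    for index, left in enumerate(ids):
        for right in ids[index + 1:]:
            if cosine_similarity(embeddings[left], embeddings[right]) >= threshold:
                matches.append((left, right))
    return matches


# Toy vectors standing in for embeddings produced by a language model (assumption).
embeddings = {
    "acme_corp": [0.90, 0.10, 0.00],
    "acme_corporation": [0.88, 0.12, 0.02],
    "globex": [0.00, 0.20, 0.95],
}

matches = match_records(embeddings)
print(matches)
```

In practice the all-pairs loop would run only inside blocks, so the embedding comparison stays tractable on large graphs.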
+
+
+
+ Raw Data in Bronze Tables +
+
+
+
+
+ Raw Data in Bronze Tables +
+
+
+
+
+ Raw Data in Bronze Tables +
+
+
+
+
+ Raw Data in Bronze Tables +
+
+
+
+
- -
-
- GNN Problem Types -
-
-
-

I use a certain tool or platform. Can you help me?

- We can build knowledge graphs for any platform, but here are a few tools that are more up our alley to create business value using graphs and networks: +
+

I use a certain tool or platform. Can you help me?

+
+ We can build knowledge graphs for any platform, but here are a few tools that are more up our alley to create business value using graphs and networks: +
  • @@ -117,50 +172,22 @@ background: home/bg.png
  • -
    -

    Principal Consultant

    - My name is Russell Jurney. I work at the intersection of big data, large networks - property graphs or knowledge graphs, representation learning with Graph Neural Networks (GNNs), Natural Language Processing (NLP) and Understanding (NLU), model explainability using network visualization and vector search for information retrieval. - I am a startup product and engineering executive focused on building products driven by billion node+ networks. I have worked at cool places like Ning, LinkedIn and Hortonworks. I co-founded Deep Discovery to use networks, GNNs and visualizations to build an explainable risk score for KYC / AML. - - I am a four-time O'Reilly author with 120 citations on Google Scholar for being the first to write about “agile data science” - agile development as applied to data science and machine learning. I am an applied researcher and product manager with 17 years of experience building and shipping data-driven products. - I am currently fascinated by knowledge graph / property graph construction, graph representation learning, graph neural networks (GNNs), NLP/NLU techniques such as information extraction, named entity resolution (NER), coreference resolution, fact extraction, and entity linking. I do network science and machine learning - so I get stuff done :) - Check out my network science portfolio, my blog and my O’Reilly Radar posts. -
    -
    \ No newline at end of file
diff --git a/_config.yml b/_config.yml
index 594f0e8..196d4f0 100644
--- a/_config.yml
+++ b/_config.yml
@@ -25,8 +25,8 @@ description: >- # this means to ignore newlines until "baseurl:"
   out of myriad sources of structured and unstructured data using big data tools.
 baseurl: "" # the subpath of your site, e.g. /blog
 url: "" # the base hostname & protocol for your site, e.g. http://example.com
-twitter_username: jekyllrb
-github_username: jekyll
+twitter_username: rjurney
+github_username: rjurney
 phone: 570-758-5858
 
 # Build settings
diff --git a/assets/home/russell_jurney_headshot.jpg b/assets/home/russell_jurney_headshot.jpg
new file mode 100644
index 0000000..2fee9e8
Binary files /dev/null and b/assets/home/russell_jurney_headshot.jpg differ
diff --git a/assets/icons/favicon-16x16.png b/assets/icons/favicon-16x16.png
new file mode 100644
index 0000000..4b2e701
Binary files /dev/null and b/assets/icons/favicon-16x16.png differ
diff --git a/assets/icons/favicon-32x32.png b/assets/icons/favicon-32x32.png
new file mode 100644
index 0000000..9f2c854
Binary files /dev/null and b/assets/icons/favicon-32x32.png differ
diff --git a/assets/icons/favicon.ico b/assets/icons/favicon.ico
new file mode 100644
index 0000000..4351c24
Binary files /dev/null and b/assets/icons/favicon.ico differ
diff --git a/assets/slides/Entity-Resolution---Ditto-Encoding.jpg b/assets/slides/Entity-Resolution---Ditto-Encoding.jpg
new file mode 100644
index 0000000..98afcea
Binary files /dev/null and b/assets/slides/Entity-Resolution---Ditto-Encoding.jpg differ
diff --git a/assets/slides/Entity-Resolution-Phase-2---Blocking.jpg b/assets/slides/Entity-Resolution-Phase-2---Blocking.jpg
new file mode 100644
index 0000000..c87f6f5
Binary files /dev/null and b/assets/slides/Entity-Resolution-Phase-2---Blocking.jpg differ
diff --git a/assets/slides/Entity-Resolution-Phase-2---Manual-Matching.jpg b/assets/slides/Entity-Resolution-Phase-2---Manual-Matching.jpg
new file mode 100644
index 0000000..b16404b
Binary files /dev/null and b/assets/slides/Entity-Resolution-Phase-2---Manual-Matching.jpg differ
diff --git a/assets/slides/Entity-Resolution-Phase-3---Embedding-Distance.jpg b/assets/slides/Entity-Resolution-Phase-3---Embedding-Distance.jpg
new file mode 100644
index 0000000..8f3bf8a
Binary files /dev/null and b/assets/slides/Entity-Resolution-Phase-3---Embedding-Distance.jpg differ
diff --git a/assets/slides/Entity-Resolution-Phase-3---Fine-Tuned-Classifier.jpg b/assets/slides/Entity-Resolution-Phase-3---Fine-Tuned-Classifier.jpg
new file mode 100644
index 0000000..4db9d30
Binary files /dev/null and b/assets/slides/Entity-Resolution-Phase-3---Fine-Tuned-Classifier.jpg differ
diff --git a/assets/slides/Entity-Resolution-Phase-3---LSH-Blocking.jpg b/assets/slides/Entity-Resolution-Phase-3---LSH-Blocking.jpg
new file mode 100644
index 0000000..3f021f5
Binary files /dev/null and b/assets/slides/Entity-Resolution-Phase-3---LSH-Blocking.jpg differ
diff --git a/assets/slides/KG-Factory-System-Architecture-Diagram.jpg b/assets/slides/KG-Factory-System-Architecture-Diagram.jpg
index 30fc020..7f2dba6 100644
Binary files a/assets/slides/KG-Factory-System-Architecture-Diagram.jpg and b/assets/slides/KG-Factory-System-Architecture-Diagram.jpg differ
diff --git a/assets/slides/RDF-Triple-Stores-vs-Property-Graphs-Small.jpg b/assets/slides/RDF-Triple-Stores-vs-Property-Graphs-Small.jpg
new file mode 100644
index 0000000..10bf4ed
Binary files /dev/null and b/assets/slides/RDF-Triple-Stores-vs-Property-Graphs-Small.jpg differ
diff --git a/assets/slides/RDF-Triple-Stores-vs-Property-Graphs.jpg b/assets/slides/RDF-Triple-Stores-vs-Property-Graphs.jpg
new file mode 100644
index 0000000..84db4a3
Binary files /dev/null and b/assets/slides/RDF-Triple-Stores-vs-Property-Graphs.jpg differ