v0.3 Release Notes¶
Introducing Blue Brain Nexus Forge¶
This is an initial release of Nexus Forge and a first step towards our goal of making Knowledge Graphs easier to build and use.
Knowledge Graphs are often built from heterogeneous data and knowledge (i.e. data models such as ontologies and schemas) coming from different sources and in different formats (i.e. structured, unstructured). Building and using Knowledge Graphs from such data and knowledge requires many components: data transformations (e.g. ETL), data models (such as ontologies and schemas) defining the targeted domain, scalable stores for managing the resulting graph, and data and factual knowledge search and access. While many systems and tools implementing these components exist separately, they often come with a high level of complexity when dealing with data, ontologies and schemas.
Blue Brain Nexus Forge enables data scientists, data and knowledge engineers to uniquely combine these components under a consistent, single and generic Python Framework. With Blue Brain Nexus Forge, non-expert users can build and use Knowledge Graphs to:
Discover and reuse data models such as ontologies and schemas to shape, constrain, link and add semantics to datasets.
Build Knowledge Graphs from heterogeneous sources and formats by defining, executing and sharing data mappers that transform data from a source format to a target one conformant to schemas and ontologies
Integrate with various stores for Knowledge Graph storage, management, scaling capabilities, data search and access
Validate and register data and metadata
Search and download data and metadata from a Knowledge Graph without the complexity of underlying technologies such as JSON-LD/RDF (Resource Description Framework).
Find below the different Nexus Forge interfaces exposing its main features.
Forge¶
This is the main interface for configuring and using the main features of Nexus Forge.
The Forge initialisation signature is:
KnowledgeGraphForge(configuration: Union[str, Dict], **kwargs)
where the configuration accepts a YAML file or a JSON dictionary, and **kwargs can be used to override the configuration provided for the Store.
The YAML configuration has the following structure:
Model:
  name: <a class name of a Model>
  origin: <'directory', 'url', or 'store'>
  source: <a directory path, an URL, or the class name of a Store>
  bucket: <when 'origin' is 'store', a Store bucket, a section or segment in the Store>
  endpoint: <when 'origin' is 'store', a Store endpoint, default to Store:endpoint>
  token: <when 'origin' is 'store', a Store token, default to Store:token>
  context:
    iri: <an IRI>
    bucket: <when 'origin' is 'store', a Store bucket, default to Model:bucket>
    endpoint: <when 'origin' is 'store', a Store endpoint, default to Model:endpoint>
    token: <when 'origin' is 'store', a Store token, default to Model:token>
Store:
  name: <a class name of a Store>
  endpoint: <an URL>
  bucket: <a bucket as a string>
  token: <a token as a string>
  versioned_id_template: <a string template using 'x' to access resource fields>
  file_resource_mapping: <an Hjson string, a file path, or an URL>
Resolvers:
  <scope>:
    - resolver: <a class name of a Resolver>
      origin: <'directory', 'web_service', or 'store'>
      source: <a directory path, a web service endpoint, or the class name of a Store>
      targets:
        - identifier: <a name, or an IRI>
          bucket: <a file name, an URL path, or a Store bucket>
      result_resource_mapping: <an Hjson string, a file path, or an URL>
      endpoint: <when 'origin' is 'store', a Store endpoint, default to Store:endpoint>
      token: <when 'origin' is 'store', a Store token, default to Store:token>
Formatters:
  <identifier>: <a string template with replacement fields delimited by braces, i.e. '{}'>
The equivalent Python dictionary configuration has the following structure:
{
    "Model": {
        "name": <str>,
        "origin": <str>,
        "source": <str>,
        "bucket": <str>,
        "endpoint": <str>,
        "token": <str>,
        "context": {
            "iri": <str>,
            "bucket": <str>,
            "endpoint": <str>,
            "token": <str>,
        }
    },
    "Store": {
        "name": <str>,
        "endpoint": <str>,
        "bucket": <str>,
        "token": <str>,
        "versioned_id_template": <str>,
        "file_resource_mapping": <str>,
    },
    "Resolvers": {
        "<scope>": [
            {
                "resolver": <str>,
                "origin": <str>,
                "source": <str>,
                "targets": [
                    {
                        "identifier": <str>,
                        "bucket": <str>,
                    },
                    ...,
                ],
                "result_resource_mapping": <str>,
                "endpoint": <str>,
                "token": <str>,
            },
            ...,
        ],
    },
    "Formatters": {
        "<name>": <str>,
        ...,
    },
}
The required minimal configuration is:
name for Model and Store
origin and source for Model
See nexus-forge/examples/configurations/ for YAML examples.
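The minimal configuration rule above can be illustrated with a short, standalone check. This is only a sketch: missing_required_keys is a hypothetical helper that is not part of Nexus Forge, and the values in minimal_configuration (including the "DemoModel" class name and the directory path) are placeholders assumed for illustration.

```python
from typing import Dict, List

# Required keys per section, as listed above:
# 'name' for Model and Store, plus 'origin' and 'source' for Model.
REQUIRED_KEYS = {
    "Model": ["name", "origin", "source"],
    "Store": ["name"],
}


def missing_required_keys(configuration: Dict) -> List[str]:
    """Return the required 'Section:key' entries absent from the configuration.

    Hypothetical helper for illustration; it is not part of Nexus Forge.
    """
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_config = configuration.get(section) or {}
        for key in keys:
            if key not in section_config:
                missing.append(f"{section}:{key}")
    return missing


# Placeholder values: the Model class name and the path are assumptions.
minimal_configuration = {
    "Model": {
        "name": "DemoModel",
        "origin": "directory",
        "source": "path/to/models/",
    },
    "Store": {
        "name": "DemoStore",
    },
}

print(missing_required_keys(minimal_configuration))  # []
print(missing_required_keys({"Model": {"name": "DemoModel"}}))
# ['Model:origin', 'Model:source', 'Store:name']
```

A dictionary that passes this check has the shape expected for the configuration argument, although the library may apply further validation of its own.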
Create a forge instance:
forge = KnowledgeGraphForge("../path/to/configuration.yml")
Resource¶
A Resource is an identifiable data object with a set of properties. It is mainly identified by its Type, whose value is a concept such as Person, Contributor, Organisation or Experiment. The reserved properties of a Resource are: id, type and context.
Create a resource using keyword arguments, a JSON dictionary, or a dataframe:
# From keyword arguments
resource = Resource(name="Jane Doe", type="Person", email="jane.doe@example.org")

# From a JSON dictionary
data = {
    "name": "Jane Doe",
    "type": "Person",
    "email": "jane.doe@example.org"
}
resource = Resource(data)

# From a pandas dataframe
import pandas as pd
dataframe = pd.read_csv("data.csv")
resources = forge.from_dataframe(dataframe)
A resource can have files attached by assigning the output of the forge.attach method to a property of the resource:
resource.picture = forge.attach("path/to/file.jpg")
Dataset¶
A Dataset is a specialization of a Resource that provides users with operations to handle files and describe them with metadata. The metadata of a Dataset includes, but is not limited to:
provenance: contribution (people or organizations that contributed to the creation of the Dataset),
generation (links to resources used to generate this Dataset),
derivation (links to another resource from which the Dataset is derived),
invalidation (the data became invalid)
access: distribution (a downloadable form of this Dataset, at a specific location, in a specific format)
The Dataset class provides methods for adding files to a Dataset. To keep the user workflow fast and the uploads efficient, the added files are only uploaded to the Store when the forge.register function is called on the Dataset. This is implemented with the concept of a LazyAction, a class that holds an action to be executed when required.
After the registration of the Dataset, a DataDownload resource will be created with some other automatically extracted properties, such as content type, size, file name, etc.
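The deferred upload described above can be sketched in a few lines. This is a simplified illustration of the lazy-action idea, not the library's LazyAction class: LazyActionSketch, upload and the fake identifiers are all hypothetical names introduced for this example.

```python
from typing import Any, Callable


class LazyActionSketch:
    """Hold a callable and its arguments; run it only when execute() is called."""

    def __init__(self, operation: Callable[..., Any], *args: Any) -> None:
        self.operation = operation
        self.args = args

    def execute(self) -> Any:
        return self.operation(*self.args)


uploaded = []  # stands in for the Store's file storage


def upload(path: str) -> str:
    """Hypothetical upload function: records the path and returns a fake identifier."""
    uploaded.append(path)
    return f"file-{len(uploaded)}"


# Adding a file only records what to do; nothing is uploaded yet.
pending = LazyActionSketch(upload, "path/to/file.jpg")
assert uploaded == []

# At registration time the deferred action is finally executed.
file_id = pending.execute()
print(file_id, uploaded)  # file-1 ['path/to/file.jpg']
```

Deferring the call this way means a user can add many files to a Dataset without waiting for any network transfer until registration.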
Dataset(forge: KnowledgeGraphForge, type: str = "Dataset", **properties)
add_parts(resources: List[Resource], versioned: bool = True) -> None
add_distribution(path: str, content_type: str = None) -> None
add_contribution(agent: str, **kwargs) -> None
add_generation(**kwargs) -> None
add_derivation(resource: Resource, versioned: bool = True, **kwargs) -> None
add_invalidation(**kwargs) -> None
add_files(path: str, content_type: str = None) -> None
download(source: str, path: str) -> None
Storing¶
Storing allows users to persist and manage Resources in the configured Store. The currently supported stores are:
DemoStore: an in-memory Store (do not use it in production)
The Store interface can be extended to support other stores.
Resources contain additional information in hidden properties to allow users to recover from errors:
_synchronized, indicates that the last action succeeded
_last_action, contains information about the last action that took place in the resource (e.g. register, update, etc.)
_store_metadata, keeps additional resource metadata provided by the store such as version, creation date, etc.
register(data: Union[Resource, List[Resource]]) -> None
update(data: Union[Resource, List[Resource]]) -> None
deprecate(data: Union[Resource, List[Resource]]) -> None
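As an illustration of how the hidden properties can help recover from errors after a bulk operation, the sketch below filters out the resources whose last action did not succeed. The objects are stand-ins built with SimpleNamespace rather than real Resources, and the attributes of _last_action used here (operation, message) are assumptions made for this example.

```python
from types import SimpleNamespace

# Stand-ins for resources after a bulk register() call. The attribute layout of
# _last_action (operation, message) is an assumption made for this sketch.
resources = [
    SimpleNamespace(
        name="Jane Doe",
        _synchronized=True,
        _last_action=SimpleNamespace(operation="register", message=None),
    ),
    SimpleNamespace(
        name="John Smith",
        _synchronized=False,
        _last_action=SimpleNamespace(operation="register", message="validation failed"),
    ),
]

# Keep only the resources whose last action did not succeed.
failed = [r for r in resources if not r._synchronized]

for r in failed:
    print(f"{r.name}: {r._last_action.operation} failed ({r._last_action.message})")
```

The failed resources can then be corrected and submitted again without repeating the action for those that already succeeded.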
Querying¶
Resources can be searched for in the Store by (1) retrieving them by id, (2) specifying filters on properties and their values, or (3) using a SPARQL query.
retrieve(id: str, version: Optional[Union[int, str]] = None) -> Resource
paths(type: str) -> PathsWrapper
search(*filters, **params) -> List[Resource]
sparql(query: str) -> List[Resource]
download(data: Union[Resource, List[Resource]], follow: str, path: str) -> None
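The first two access routes can be pictured with a toy in-memory store. This sketch does not use the Nexus Forge API: plain dictionaries stand in for Resources, and retrieve_by_id and search_by_properties are hypothetical helpers that only mimic the idea of retrieval by id and of filtering on property values. SPARQL querying is not covered.

```python
from typing import Any, Dict, List, Optional

# Toy in-memory "store": plain dictionaries stand in for Resources.
STORE: List[Dict[str, Any]] = [
    {"id": "1", "type": "Person", "name": "Jane Doe"},
    {"id": "2", "type": "Person", "name": "John Smith"},
    {"id": "3", "type": "Organisation", "name": "Example Org"},
]


def retrieve_by_id(identifier: str) -> Optional[Dict[str, Any]]:
    """Return the single resource with the given id, or None if absent."""
    for resource in STORE:
        if resource["id"] == identifier:
            return resource
    return None


def search_by_properties(**filters: Any) -> List[Dict[str, Any]]:
    """Return the resources whose properties equal all the given values."""
    return [
        resource
        for resource in STORE
        if all(resource.get(key) == value for key, value in filters.items())
    ]


print(retrieve_by_id("3"))
print(search_by_properties(type="Person"))
```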
Versioning¶
The user can create versions of Resources, if the Store supports this feature.
tag(data: Union[Resource, List[Resource]], value: str) -> None
freeze(data: Union[Resource, List[Resource]]) -> None
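The Store configuration includes a versioned_id_template, described as a string template using 'x' to access resource fields. How the library applies this template is not detailed here. As a sketch of the general idea, Python's str.format can render such a template from an object's attributes. The template text, the version field and the URL below are assumptions chosen for illustration.

```python
from types import SimpleNamespace

# Hypothetical template: 'x' refers to the resource. The '?rev=' suffix and the
# 'version' field name are made up for this example.
template = "{x.id}?rev={x.version}"

# Stand-in for a resource with an identifier and a version number.
resource = SimpleNamespace(id="https://example.org/persons/jane", version=3)

versioned_id = template.format(x=resource)
print(versioned_id)  # https://example.org/persons/jane?rev=3
```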
Converting¶
To use Resources with other libraries such as pandas, different data conversion functions are available.
as_json(data: Union[Resource, List[Resource]], expanded: bool = False, store_metadata: bool = False) -> Union[Dict, List[Dict]]
as_jsonld(data: Union[Resource, List[Resource]], compacted: bool = True, store_metadata: bool = False) -> Union[Dict, List[Dict]]
as_triples(data: Union[Resource, List[Resource]], store_metadata: bool = False) -> List[Tuple[str, str, str]]
as_dataframe(data: List[Resource], na: Union[Any, List[Any]] = [None], nesting: str = ".", expanded: bool = False, store_metadata: bool = False) -> DataFrame
from_json(data: Union[Dict, List[Dict]], na: Union[Any, List[Any]] = None) -> Union[Resource, List[Resource]]
from_jsonld(data: Union[Dict, List[Dict]]) -> Union[Resource, List[Resource]]
from_triples(data: List[Tuple[str, str, str]]) -> Union[Resource, List[Resource]]
from_dataframe(data: DataFrame, na: Union[Any, List[Any]] = np.nan, nesting: str = ".") -> Union[Resource, List[Resource]]
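The nesting parameter of as_dataframe and from_dataframe defaults to ".". A plausible reading, assumed here rather than taken from the library, is that nested properties become column names joined by that separator. The sketch below shows this flattening on a plain dictionary; flatten is a hypothetical helper and does not call Nexus Forge or pandas.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], nesting: str = ".", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with `nesting`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{nesting}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten(value, nesting, column))
        else:
            flat[column] = value
    return flat


record = {
    "type": "Person",
    "name": "Jane Doe",
    "address": {"city": "Geneva", "country": "Switzerland"},
}

print(flatten(record))
# {'type': 'Person', 'name': 'Jane Doe', 'address.city': 'Geneva', 'address.country': 'Switzerland'}
```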
Formatting¶
A preconfigured set of string formats can be provided to ensure the consistency of data.
format(what: str, *args) -> str
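Since the Formatters configuration entries are string templates with replacement fields delimited by braces, the behaviour can be sketched with Python's built-in str.format. The formatter name, the template and the format_value helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical formatters, keyed by name as in the 'Formatters' configuration section.
formatters = {
    "identifier": "https://example.org/{}/{}",
}


def format_value(what: str, *args: str) -> str:
    """Fill the replacement fields of the formatter named `what` with `args`."""
    return formatters[what].format(*args)


print(format_value("identifier", "persons", "jane-doe"))
# https://example.org/persons/jane-doe
```

Centralising templates this way keeps identifiers and similar strings consistent across all the resources that use them.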
Resolving¶
Resolvers are helpers to find commonly used resources that one may want to link to. For instance, one could have a set of pre-defined identifiers of Authors, and to make several references to the same Authors, a resolver can be used.
resolve(text: str, scope: Optional[str] = None, resolver: Optional[str] = None, target: Optional[str] = None, type: Optional[str] = None, strategy: ResolvingStrategy = ResolvingStrategy.BEST_MATCH) -> Optional[Union[Resource, List[Resource]]]
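The idea of resolving a piece of text to a known resource with a best-match strategy can be sketched with the standard library's difflib. This is not how Nexus Forge resolvers are implemented: the AUTHORS table, its identifiers and resolve_best_match are hypothetical and stand in for a resolver target.

```python
import difflib
from typing import Dict, Optional

# Hypothetical set of pre-defined authors and their identifiers.
AUTHORS: Dict[str, str] = {
    "Jane Doe": "https://example.org/authors/1",
    "John Smith": "https://example.org/authors/2",
}


def resolve_best_match(text: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the identifier of the closest known author name, or None."""
    matches = difflib.get_close_matches(text, list(AUTHORS), n=1, cutoff=cutoff)
    return AUTHORS[matches[0]] if matches else None


print(resolve_best_match("Jane Do"))  # https://example.org/authors/1
print(resolve_best_match("zzz"))      # None
```

Every reference to the same author then points to the same identifier, even when the input text varies slightly.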
Reshaping¶
Reshaping allows trimming Resources by a specific set of properties.
reshape(data: Union[Resource, List[Resource]], keep: List[str], versioned: bool = False) -> Union[Resource, List[Resource]]
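Trimming by a set of properties can be pictured on a plain dictionary. The sketch below handles top-level properties only and uses a hypothetical reshape_dict helper. It does not reproduce the library's handling of Resources or of the versioned parameter.

```python
from typing import Any, Dict, List


def reshape_dict(data: Dict[str, Any], keep: List[str]) -> Dict[str, Any]:
    """Return a copy of `data` containing only the properties listed in `keep`."""
    return {key: data[key] for key in keep if key in data}


person = {
    "id": "https://example.org/persons/jane",
    "type": "Person",
    "name": "Jane Doe",
    "email": "jane.doe@example.org",
}

print(reshape_dict(person, keep=["id", "name"]))
# {'id': 'https://example.org/persons/jane', 'name': 'Jane Doe'}
```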
Modeling¶
Schemas describing the shapes and constraints that apply to Resources can be loaded and accessed through the Modeling functions. Users can explore predefined domain Types and the properties that describe them using Templates. Templates can be used to create resources with the specified properties. Those resources can then be validated.
context() -> Optional[Dict]
prefixes(pretty: bool = True) -> Optional[Dict[str, str]]
types(pretty: bool = True) -> Optional[List[str]]
template(type: str, only_required: bool = False) -> None
validate(data: Union[Resource, List[Resource]]) -> None
At the time of this release, the supported schema language is W3C SHACL. Nexus Forge can be extended to support other schema languages (e.g. JSON Schema).
Mapping¶
Mappings are pre-defined configuration files that encode the logic or rules for transforming data from a specific source into Resources of a targeted Type that conform to a targeted schema. Mapping rules are executed by a mapper.
sources(pretty: bool = True) -> Optional[List[str]]
mappings(source: str, pretty: bool = True) -> Optional[Dict[str, List[str]]]
mapping(entity: str, source: str, type: Callable = DictionaryMapping) -> Mapping
map(data: Any, mapping: Union[Mapping, List[Mapping]], mapper: Callable = DictionaryMapper, na: Union[Any, List[Any]] = None) -> Union[Resource, List[Resource]]
Nexus Forge comes with a DictionaryMapper that takes JSON-structured data as a source and transforms it into other JSON-structured data.
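The principle of mapping one JSON structure into another through a set of rules can be sketched as follows. The format of real Nexus Forge mapping files is not shown in these notes, so this example uses a hypothetical representation in which each target property is computed by a small function of the source record. The names person_mapping and apply_mapping and the source field names are assumptions.

```python
from typing import Any, Callable, Dict

Rule = Callable[[Dict[str, Any]], Any]

# Hypothetical mapping: each target property is derived from the source record.
person_mapping: Dict[str, Rule] = {
    "type": lambda x: "Person",
    "name": lambda x: f"{x['first_name']} {x['last_name']}",
    "email": lambda x: x["mail"].lower(),
}


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, Rule]) -> Dict[str, Any]:
    """Build the target dictionary by applying every rule to the source record."""
    return {target: rule(record) for target, rule in mapping.items()}


source_record = {"first_name": "Jane", "last_name": "Doe", "mail": "Jane.Doe@Example.org"}

print(apply_mapping(source_record, person_mapping))
# {'type': 'Person', 'name': 'Jane Doe', 'email': 'jane.doe@example.org'}
```

Keeping the rules separate from the code that applies them is what allows mappings to be shared and reused across data sources.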