DataHub Developer's Guide
Requirements
- Java 17 JDK
- Python 3.10
- Docker
- Docker Compose >=2.20
- Docker engine with at least 8GB of memory to run tests.
On macOS, these can be installed using Homebrew.
# Install Java
brew install openjdk@17
# Install Python
brew install python@3.10 # you may need to add this to your PATH
# alternatively, you can use pyenv to manage your python versions
# Install docker and docker compose
brew install --cask docker
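Before building, it can help to confirm that the tools listed above are on your PATH and that Docker Compose is recent enough. The following is a minimal sketch and not part of the DataHub repository: `missing_tools` and `version_ge` are hypothetical helper names, and the sketch assumes that `docker compose version --short` prints a plain `MAJOR.MINOR.PATCH` string.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not part of the DataHub repository.

# Print the name of every given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Succeed when version $1 is at least version $2, comparing MAJOR.MINOR only.
version_ge() {
  have="${1#v}"; want="${2#v}"
  have_major="${have%%.*}"; have_rest="${have#*.}"; have_minor="${have_rest%%[!0-9]*}"
  want_major="${want%%.*}"; want_rest="${want#*.}"; want_minor="${want_rest%%[!0-9]*}"
  [ "$have_major" -gt "$want_major" ] && return 0
  [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Report missing tools, then check the Docker Compose version if available.
for name in $(missing_tools java python3 docker); do
  echo "missing: $name"
done
if command -v docker >/dev/null 2>&1; then
  compose_version="$(docker compose version --short 2>/dev/null || true)"
  if [ -n "$compose_version" ] && ! version_ge "$compose_version" 2.20; then
    echo "Docker Compose $compose_version is older than 2.20"
  fi
fi
```

The script only reports problems; it does not install or change anything.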
Building the Project
Fork and clone the repository if you haven't done so already
git clone https://github.com/{username}/datahub.git
Change into the repository's root directory
cd datahub
Use gradle wrapper to build the project
./gradlew build
Note that the above will also run tests and a number of validations, which makes the process considerably slower.
We suggest partially compiling DataHub according to your needs:
Build DataHub's backend, GMS (Generalized Metadata Service):
./gradlew :metadata-service:war:build
Build DataHub's frontend:
./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint
Build DataHub's command line tool:
./gradlew :metadata-ingestion:installDev
Build DataHub's documentation:
./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript
# To preview the documentation
./gradlew :docs-website:serve
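If you switch between these partial builds often, a small lookup function can save retyping the task names. This is only a sketch: the function name `gradle_tasks_for` and the component labels (`gms`, `frontend`, `cli`, `docs`) are hypothetical and chosen for this example, while the Gradle tasks are the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper; the component labels are invented for this sketch,
# the Gradle tasks are copied from the commands above.
gradle_tasks_for() {
  case "$1" in
    gms)      echo ":metadata-service:war:build" ;;
    frontend) echo ":datahub-frontend:dist -x yarnTest -x yarnLint" ;;
    cli)      echo ":metadata-ingestion:installDev" ;;
    docs)     echo ":docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript" ;;
    *)        echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Example: print the command for the backend without running it.
echo "./gradlew $(gradle_tasks_for gms)"
```

To actually run a build, leave the substitution unquoted so the shell splits it into separate arguments, for example `./gradlew $(gradle_tasks_for frontend)`.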
Deploying Local Versions
Run the following once to install the local datahub CLI tool in your $PATH:
cd smoke-test/
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
cd ../
Once you have compiled & packaged the project or appropriate module you can deploy the entire system via docker-compose by running:
./gradlew quickstart
You can then replace any container in the existing deployment. For example, to replace DataHub's backend (GMS):
(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate --build datahub-gms)
To run the local version of the frontend:
(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate --build datahub-frontend-react)
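The two commands above differ only in the final service name, so the shared part can be factored out. The sketch below assumes nothing beyond the flags already shown; `redeploy_command` is a hypothetical helper that prints the command rather than running it, so you can inspect it first.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the compose command shown above
# for a given service name.
redeploy_command() {
  if [ -z "$1" ]; then
    echo "usage: redeploy_command SERVICE" >&2
    return 1
  fi
  echo "COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker compose -p datahub" \
    "-f docker-compose-without-neo4j.yml" \
    "-f docker-compose-without-neo4j.override.yml" \
    "-f docker-compose.dev.yml" \
    "up -d --no-deps --force-recreate --build $1"
}

# Example: show the command for the backend container.
redeploy_command datahub-gms
```

Once the printed command looks right, it can be executed from the docker directory with `(cd docker && eval "$(redeploy_command datahub-gms)")`.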
IDE Support
The recommended IDE for DataHub development is IntelliJ IDEA. You can run the following command to generate or update the IntelliJ project file.
./gradlew idea
Open datahub.ipr in IntelliJ to start developing!
For consistency, please import and auto-format the code using LinkedIn IntelliJ Java style.
Windows Compatibility
For optimal performance and compatibility, we strongly recommend building on a Mac or Linux system. Please note that we do not actively support Windows in a non-virtualized environment.
If you must use Windows, one workaround is to build within a virtualized environment, such as a VM (Virtual Machine) or WSL (Windows Subsystem for Linux). This approach can help ensure that your build environment remains isolated and stable, and that your code is compiled correctly.
Common Build Issues
Getting Unsupported class file major version 57
You're probably using a Java version that's too new for Gradle. Run the following command to check your Java version:
java --version
While it may be possible to build and run DataHub using newer versions of Java, we currently only support Java 17.
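The version check can also be scripted. The following is a minimal sketch, assuming the version string appears in double quotes in the output of `java -version` (as it does for common OpenJDK builds); `java_major` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reduce a Java version string to its major version.
# Handles modern strings ("17.0.9", "21-ea") and the legacy "1.x" scheme ("1.8.0_392").
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;
  esac
  echo "${v%%[!0-9]*}"
}

# Compare the installed JDK, if any, against the supported version.
if command -v java >/dev/null 2>&1; then
  installed="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  if [ "$(java_major "$installed")" != "17" ]; then
    echo "Found Java $installed, but DataHub expects Java 17"
  fi
fi
```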
Getting cannot find symbol error for javax.annotation.Generated
Similar to the previous issue, please use Java 17 to build the project.
You can install multiple versions of Java on a single machine and switch between them using the JAVA_HOME environment variable. See this document for more details.
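As an illustration of switching JDKs through JAVA_HOME, the sketch below validates a JDK directory and exports it for the current shell. `use_java_home` is a hypothetical helper, and the Homebrew path in the comment is an assumption about a typical macOS install that may not match your machine.

```shell
#!/bin/sh
# Hypothetical helper: point the current shell at a specific JDK directory.
use_java_home() {
  # Refuse directories that do not contain an executable bin/java.
  if [ ! -x "$1/bin/java" ]; then
    echo "no java executable under $1/bin" >&2
    return 1
  fi
  JAVA_HOME="$1"
  PATH="$JAVA_HOME/bin:$PATH"
  export JAVA_HOME PATH
}

# Example (the path is an assumption about a Homebrew install on macOS;
# adjust it for your machine):
#   use_java_home "$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home"
```

Because the function changes environment variables, it has to be defined in or sourced into the shell you build from; running it as a separate script would not affect that shell.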
:metadata-models:generateDataTemplate task fails with java.nio.file.InvalidPathException: Illegal char <:> at index XX or Caused by: java.lang.IllegalArgumentException: 'other' has different root error
This is a known issue when building the project on Windows due to a bug in the Pegasus plugin. Please refer to Windows Compatibility.
Various errors related to generateDataTemplate or other generate tasks
As we generate quite a few files from the models, it is possible that old generated files may conflict with new model changes. When this happens, a simple ./gradlew clean should resolve the issue.
Execution failed for task ':metadata-service:restli-servlet-impl:checkRestModel'
This generally means that an incompatible change was introduced to the rest.li API in GMS. You'll need to rebuild the snapshots/IDL by running the following command once:
./gradlew :metadata-service:restli-servlet-impl:build -Prest.model.compatibility=ignore
java.io.IOException: No space left on device
This means you're running out of space on your disk to build. Please free up some space or try a different disk.
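To see how much space is left before starting a long build, the available space can be read from `df`. This is a minimal sketch: `free_kb` is a hypothetical helper, and the 10 GB threshold is an arbitrary illustration rather than a documented DataHub requirement.

```shell
#!/bin/sh
# Hypothetical helper: print the free space, in kilobytes, on the filesystem
# that holds the given path (defaults to the current directory).
free_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 {print $4}'
}

# The 10 GB threshold is an arbitrary illustration, not a documented requirement.
threshold_kb=$((10 * 1024 * 1024))
if [ "$(free_kb .)" -lt "$threshold_kb" ]; then
  echo "Less than 10 GB free in $(pwd); the build may run out of space"
fi
```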
Build failed for task ./gradlew :datahub-frontend:dist -x yarnTest -x yarnLint
This could mean that you need to update your Yarn version.
:buildSrc:compileJava task fails with package com.linkedin.metadata.models.registry.config does not exist and cannot find symbol error for Entity
There are currently two symbolic links within the buildSrc directory for the com.linkedin.metadata.aspect.plugins.config and com.linkedin.metadata.models.registry.config packages, which point to the corresponding packages in the entity-registry subproject.
When the repository is checked out on Windows 10/11, the symbolic links might not be created correctly and may instead be checked out as plain files. This applies even if WSL is later used to build from the mounted Windows filesystem in /mnt/. Although it is technically possible to build in WSL from the mounted Windows filesystem in /mnt/, it is strongly recommended to check out the repository within the Linux filesystem (e.g., in /home/) and build it from there, because accessing the Windows filesystem from Linux is slow compared to the native Linux filesystem and slows down the whole build.
To be able to create symbolic links in Windows 10/11, Developer Mode has to be enabled first. Then the following commands can be used to enable symbolic links in Git and recreate the symbolic links:
# enable core.symlinks config
git config --global core.symlinks true
# check the current core.symlinks config and scope
git config --show-scope --show-origin core.symlinks
# in case the core.symlinks config is still set locally to false, remove the local config
git config --unset core.symlinks
# reset the current branch to recreate the missing symbolic links (alternatively it is also possible to switch branches away and back)
git reset --hard
See also here for more information on how to enable symbolic links on Windows 10/11 and Git.
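To check whether a checkout is affected before or after applying the commands above, you can compare what Git records as a symbolic link with what is actually on disk. The sketch below is not part of the repository: `list_unresolved_symlinks` is a hypothetical helper that relies on `git ls-files -s` reporting mode 120000 for symbolic links, and it does not handle paths containing tabs or newlines.

```shell
#!/bin/sh
# Hypothetical helper: list paths that Git tracks as symbolic links (index
# mode 120000) but that are not symbolic links in the working tree.
# Run it from the repository root.
list_unresolved_symlinks() {
  git ls-files -s | awk -F'\t' '$1 ~ /^120000 / {print $2}' |
    while IFS= read -r path; do
      [ -L "$path" ] || echo "$path"
    done
}

# Only run the check when the current directory is inside a Git work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  list_unresolved_symlinks
fi
```

An empty result means every tracked symbolic link was checked out as a real link; any path that is printed was checked out as a plain file.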