Quickstart Guide
To get started using NVIDIA-Ingest, you need to do a few things:
- Start supporting NIM microservices ποΈ
- Install the NVIDIA Ingest client dependencies in a Python environment π
- Submit ingestion job(s) π
- Inspect and consume results π
Step 1: Starting Containers
This example demonstrates how to use the provided docker-compose.yaml to start all needed services with a few commands.
IMPORTANT: NIM containers on their first startup can take 10-15 minutes to pull and fully load models.
If preferred, you can also start services one by one or run on Kubernetes via our Helm chart. Also, there are additional environment variables you want to configure.
- Git clone the repo:
git clone https://github.com/nvidia/nv-ingest -
Change the directory to the cloned repo
cd nv-ingest. -
Generate API keys and authenticate with NGC with the
docker logincommand:# This is required to access pre-built containers and NIM microservices $ docker login nvcr.io Username: $oauthtoken Password: <Your Key>
[!NOTE] During the early access (EA) phase, you must apply for early access here: https://developer.nvidia.com/nemo-microservices-early-access/join. When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key.
- Create a .env file containing your NGC API key and the following paths. For more information, refer to Environment Configuration Variables.
# Container images must access resources from NGC.
NGC_API_KEY=<key to download containers from NGC>
NIM_NGC_API_KEY=<key to download model files after containers start>
NVIDIA_BUILD_API_KEY=<key to use NIMs that are hosted on build.nvidia.com>
DATASET_ROOT=<PATH_TO_THIS_REPO>/data
NV_INGEST_ROOT=<PATH_TO_THIS_REPO>
NOTE: As configured by default in docker-compose.yaml, the DePlot NIM is on a dedicated GPU. All other NIMs and the NV-Ingest container itself share a second. This avoids DePlot and other NIMs competing for VRAM on the same device.
Change the
CUDA_VISIBLE_DEVICESpinnings as desired for your system within docker-compose.yaml.IMPORTANT: Make sure NVIDIA is set as your default container runtime before running the docker compose command with the command:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
- Start all services:
docker compose up
TIP: By default, we have [configured log levels to be verbose](docker-compose.yaml.
It's possible to observe service startup proceeding. You will notice a lot of log messages. Disable verbose logging by configuring
NIM_TRITON_LOG_VERBOSE=0for each NIM in docker-compose.yaml.
- When all services have fully started,
nvidia-smishould show processes like the following:# If it's taking > 1m for `nvidia-smi` to return, the bus will likely be busy setting up the models. +---------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=======================================================================================| | 0 N/A N/A 1352957 C tritonserver 762MiB | | 1 N/A N/A 1322081 C /opt/nim/llm/.venv/bin/python3 63916MiB | | 2 N/A N/A 1355175 C tritonserver 478MiB | | 2 N/A N/A 1367569 C ...s/python/triton_python_backend_stub 12MiB | | 3 N/A N/A 1321841 C python 414MiB | | 3 N/A N/A 1352331 C tritonserver 478MiB | | 3 N/A N/A 1355929 C ...s/python/triton_python_backend_stub 424MiB | | 3 N/A N/A 1373202 C tritonserver 414MiB | +---------------------------------------------------------------------------------------+
Observe the started containers with docker ps:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f2f86615ea5 nvcr.io/nvidia/nemo-microservices/nv-ingest:24.12 "/opt/conda/bin/tiniβ¦" 35 seconds ago Up 33 seconds 0.0.0.0:7670->7670/tcp, :::7670->7670/tcp nv-ingest-nv-ingest-ms-runtime-1
de44122c6ddc otel/opentelemetry-collector-contrib:0.91.0 "/otelcol-contrib --β¦" 14 hours ago Up 24 seconds 0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, :::8888-8889->8888-8889/tcp, 0.0.0.0:13133->13133/tcp, :::13133->13133/tcp, 55678/tcp, 0.0.0.0:32849->9411/tcp, :::32848->9411/tcp, 0.0.0.0:55680->55679/tcp, :::55680->55679/tcp nv-ingest-otel-collector-1
02c9ab8c6901 nvcr.io/nvidia/nemo-microservices/cached:0.2.0 "/opt/nvidia/nvidia_β¦" 14 hours ago Up 24 seconds 0.0.0.0:8006->8000/tcp, :::8006->8000/tcp, 0.0.0.0:8007->8001/tcp, :::8007->8001/tcp, 0.0.0.0:8008->8002/tcp, :::8008->8002/tcp nv-ingest-cached-1
d49369334398 nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.0 "/opt/nvidia/nvidia_β¦" 14 hours ago Up 33 seconds 0.0.0.0:8012->8000/tcp, :::8012->8000/tcp, 0.0.0.0:8013->8001/tcp, :::8013->8001/tcp, 0.0.0.0:8014->8002/tcp, :::8014->8002/tcp nv-ingest-embedding-1
508715a24998 nvcr.io/nvidia/nemo-microservices/nv-yolox-structured-images-v1:0.2.0 "/opt/nvidia/nvidia_β¦" 14 hours ago Up 33 seconds 0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp nv-ingest-yolox-1
5b7a174a0a85 nvcr.io/nvidia/nemo-microservices/deplot:1.0.0 "/opt/nvidia/nvidia_β¦" 14 hours ago Up 33 seconds 0.0.0.0:8003->8000/tcp, :::8003->8000/tcp, 0.0.0.0:8004->8001/tcp, :::8004->8001/tcp, 0.0.0.0:8005->8002/tcp, :::8005->8002/tcp nv-ingest-deplot-1
430045f98c02 nvcr.io/nvidia/nemo-microservices/paddleocr:0.2.0 "/opt/nvidia/nvidia_β¦" 14 hours ago Up 24 seconds 0.0.0.0:8009->8000/tcp, :::8009->8000/tcp, 0.0.0.0:8010->8001/tcp, :::8010->8001/tcp, 0.0.0.0:8011->8002/tcp, :::8011->8002/tcp nv-ingest-paddle-1
8e587b45821b grafana/grafana "/run.sh" 14 hours ago Up 33 seconds 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp grafana-service
aa2c0ec387e2 redis/redis-stack "/entrypoint.sh" 14 hours ago Up 33 seconds 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 8001/tcp nv-ingest-redis-1
bda9a2a9c8b5 openzipkin/zipkin "start-zipkin" 14 hours ago Up 33 seconds (healthy) 9410/tcp, 0.0.0.0:9411->9411/tcp, :::9411->9411/tcp nv-ingest-zipkin-1
ac27e5297d57 prom/prometheus:latest "/bin/prometheus --wβ¦" 14 hours ago Up 33 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp nv-ingest-prometheus-1
TIP: NV-Ingest is in early access (EA) mode, meaning the codebase gets frequent updates. To build an updated NV-Ingest service container with the latest changes, you can:
After the image is built, rundocker compose builddocker compose upper item 5 above.
Step 2: Installing Python Dependencies
You can interact with the NV-Ingest service from the host or by docker exec-ing into the NV-Ingest container.
To interact from the host, you'll need a Python environment and install the client dependencies:
# conda not required but makes it easy to create a fresh Python environment
conda create --name nv-ingest-dev python=3.10
conda activate nv-ingest-dev
cd client
pip install .
NOTE: Interacting from the host depends on the appropriate port being exposed from the nv-ingest container to the host as defined in docker-compose.yaml.
If you prefer, you can disable exposing that port and interact with the NV-Ingest service directly from within its container.
To interact within the container:
You'll be in thedocker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash/workspacedirectory withDATASET_ROOTfrom the .env file mounted at./data. The pre-activatedmorpheusconda environment has all the Python client libraries pre-installed:(morpheus) root@aba77e2a4bde:/workspace#From the bash prompt above, you can run the nv-ingest-cli and Python examples described below.
Step 3: Ingesting Documents
You can submit jobs programmatically in Python or using the nv-ingest-cli tool.
In the below examples, we are doing text, chart, table, and image extraction:
- extract_text, - uses PDFium to find and extract text from pages
- extract_images - uses PDFium to extract images
- extract_tables - uses YOLOX to find tables and charts. Uses PaddleOCR for table extraction, and Deplot and CACHED for chart extraction
- extract_charts - (optional) enables or disables the use of Deplot and CACHED for chart extraction.
IMPORTANT:
extract_tablescontrols extraction for both tables and charts. You can optionally disable chart extraction by settingextract_chartsto false.
In Python
You can find more documentation and examples here:
import logging, time
from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import JobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.util.file_processing.extract import extract_file_content
logger = logging.getLogger("nv_ingest_client")
file_name = "data/multimodal_test.pdf"
file_content, file_type = extract_file_content(file_name)
# A JobSpec is an object that defines a document and how it should
# be processed by the nv-ingest service.
job_spec = JobSpec(
document_type=file_type,
payload=file_content,
source_id=file_name,
source_name=file_name,
extended_options=
{
"tracing_options":
{
"trace": True,
"ts_send": time.time_ns()
}
}
)
# configure desired extraction modes here. Multiple extraction
# methods can be defined for a single JobSpec
extract_task = ExtractTask(
document_type=file_type,
extract_text=True,
extract_images=True,
extract_tables=True
)
job_spec.add_task(extract_task)
# Create the client and inform it about the JobSpec we want to process.
client = NvIngestClient(
message_client_hostname="localhost", # Host where nv-ingest-ms-runtime is running
message_client_port=7670 # REST port, defaults to 7670
)
job_id = client.add_job(job_spec)
client.submit_job(job_id, "morpheus_task_queue")
result = client.fetch_job_result(job_id, timeout=60)
print(f"Got {len(result)} results")
Using the nv-ingest-cli
You can find more nv-ingest-cli examples here:
nv-ingest-cli \
--doc ./data/multimodal_test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true"}' \
--client_host=localhost \
--client_port=7670
You should notice output indicating document processing status followed by a breakdown of time spent during job execution:
INFO:nv_ingest_client.nv_ingest_cli:Processing 1 documents.
INFO:nv_ingest_client.nv_ingest_cli:Output will be written to: ./processed_docs
Processing files: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:10<00:00, 10.47s/file, pages_per_sec=0.29]
INFO:nv_ingest_client.cli.util.processing:dedup_images: Avg: 1.02 ms, Median: 1.02 ms, Total Time: 1.02 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:dedup_images_channel_in: Avg: 1.44 ms, Median: 1.44 ms, Total Time: 1.44 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:docx_content_extractor: Avg: 0.66 ms, Median: 0.66 ms, Total Time: 0.66 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:docx_content_extractor_channel_in: Avg: 1.09 ms, Median: 1.09 ms, Total Time: 1.09 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:filter_images: Avg: 0.84 ms, Median: 0.84 ms, Total Time: 0.84 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:filter_images_channel_in: Avg: 7.75 ms, Median: 7.75 ms, Total Time: 7.75 ms, Total % of Trace Computation: 0.07%
INFO:nv_ingest_client.cli.util.processing:job_counter: Avg: 2.13 ms, Median: 2.13 ms, Total Time: 2.13 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:job_counter_channel_in: Avg: 2.05 ms, Median: 2.05 ms, Total Time: 2.05 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:metadata_injection: Avg: 14.48 ms, Median: 14.48 ms, Total Time: 14.48 ms, Total % of Trace Computation: 0.14%
INFO:nv_ingest_client.cli.util.processing:metadata_injection_channel_in: Avg: 0.22 ms, Median: 0.22 ms, Total Time: 0.22 ms, Total % of Trace Computation: 0.00%
INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor: Avg: 10332.97 ms, Median: 10332.97 ms, Total Time: 10332.97 ms, Total % of Trace Computation: 99.45%
INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor_channel_in: Avg: 0.44 ms, Median: 0.44 ms, Total Time: 0.44 ms, Total % of Trace Computation: 0.00%
INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor: Avg: 1.19 ms, Median: 1.19 ms, Total Time: 1.19 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor_channel_in: Avg: 0.98 ms, Median: 0.98 ms, Total Time: 0.98 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:redis_source_network_in: Avg: 12.27 ms, Median: 12.27 ms, Total Time: 12.27 ms, Total % of Trace Computation: 0.12%
INFO:nv_ingest_client.cli.util.processing:redis_task_sink_channel_in: Avg: 2.16 ms, Median: 2.16 ms, Total Time: 2.16 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:redis_task_source: Avg: 8.00 ms, Median: 8.00 ms, Total Time: 8.00 ms, Total % of Trace Computation: 0.08%
INFO:nv_ingest_client.cli.util.processing:Unresolved time: 82.82 ms, Percent of Total Elapsed: 0.79%
INFO:nv_ingest_client.cli.util.processing:Processed 1 files in 10.47 seconds.
INFO:nv_ingest_client.cli.util.processing:Total pages processed: 3
INFO:nv_ingest_client.cli.util.processing:Throughput (Pages/sec): 0.29
INFO:nv_ingest_client.cli.util.processing:Throughput (Files/sec): 0.10
Step 4: Inspecting and Consuming Results
After the ingestion steps above have been completed, you should be able to find the text and image subfolders inside your processed docs folder. Each will contain JSON-formatted extracted content and metadata.
When processing has completed, you'll have separate result files for text and image data:
ls -R processed_docs/
processed_docs/:
image structured text
processed_docs/image:
multimodal_test.pdf.metadata.json
processed_docs/structured:
multimodal_test.pdf.metadata.json
processed_docs/text:
multimodal_test.pdf.metadata.json
First, install tkinter by running the following commands depending on your OS.
- For Ubuntu/Debian Linux:
sudo apt-get update
sudo apt-get install python3-tk
sudo dnf install python3-tkinter
brew install python-tk
python src/util/image_viewer.py --file_path ./processed_docs/image/multimodal_test.pdf.metadata.json
TIP: Beyond inspecting the results, you can read them into things like llama-index or langchain retrieval pipelines.
Also, checkout our demo using a retrieval pipeline on build.nvidia.com to query over document content pre-extracted with NV-Ingest.
Repo Structure
Beyond the relevant documentation, examples, and other links above, below is a description of the contents in this repo's folders:
- .github: GitHub repo configuration files
- ci: Scripts used to build the NV-Ingest container and other packages
- client: Docs and source code for the nv-ingest-cli utility
- config: Various .yaml files defining configuration for OTEL, Prometheus
- data: Sample PDFs provided for testing convenience
- docker: Houses scripts used by the nv-ingest docker container
- docs: Various READMEs describing deployment, metadata schemas, auth and telemetry setup
- examples: Example notebooks, scripts, and longer-form tutorial content
- helm: Documentation for deploying NV-Ingest to a Kubernetes cluster via Helm chart
- skaffold: Skaffold configuration
- src: Source code for the NV-Ingest pipelines and service
- tests: Unit tests for NV-Ingest
Notices
Third-Party License Notice:
If configured to do so, this project will download and install additional third-party open-source software projects. Review the license terms of these open-source projects before use:
https://pypi.org/project/pdfservices-sdk/
INSTALL_ADOBE_SDK:- Description: If set to
true, the Adobe SDK will be installed in the container at launch time. This is required if you want to use the Adobe extraction service for PDF decomposition. Review the license agreement for the pdfservices-sdk before enabling this option.
Contributing
We require that all contributors "sign off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license or a compatible license.
Any contribution that contains commits that aren't signed off won't be accepted.
To sign off on a commit, use the --signoff (or -s) option when committing your changes:
$ git commit -s -m "Add cool feature."
This appends the following to your commit message:
Signed-off-by: Your Name <your@email.com>
Full text of the DCO:
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open-source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.