Commit 8996d61

docs: add Getting Started page (#2113)
* docs: add Getting Started page
  Signed-off-by: Panos Vagenas <[email protected]>
* refactor usage
  Signed-off-by: Panos Vagenas <[email protected]>
* minor renaming
  Signed-off-by: Panos Vagenas <[email protected]>

---------

Signed-off-by: Panos Vagenas <[email protected]>
1 parent 555506d commit 8996d61

5 files changed

+248
-248
lines changed

docs/getting_started/index.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
🐣 Ready to kick off your Docling journey? Let's dive right into it!
2+
3+
<div class="grid">
4+
<a href="../installation/" class="card"><b>⬇️ Installation</b><br />Quickly install Docling in your environment</a>
5+
<a href="../usage/" class="card"><b>▶️ Usage</b><br />Get a jumpstart on basic Docling usage</a>
6+
<a href="../concepts/" class="card"><b>🧩 Concepts</b><br />Learn Docling fundamentals and get a glimpse under the hood</a>
7+
<a href="../examples/" class="card"><b>🧑🏽‍🍳 Examples</b><br />Try out recipes for various use cases, including conversion, RAG, and more</a>
8+
<a href="../integrations/" class="card"><b>🤖 Integrations</b><br />Check out integrations with popular AI tools and frameworks</a>
9+
<a href="../reference/document_converter/" class="card"><b>📖 Reference</b><br />See more API details</a>
10+
</div>
11+
12+
## What's next
13+
14+
🚀 The journey has just begun! Join us and become a part of the growing Docling community!
15+
16+
- <a href="https://github.com/docling-project/docling">:fontawesome-brands-github: GitHub</a>
17+
- <a href="https://linkedin.com/company/docling/">:fontawesome-brands-linkedin: LinkedIn</a>

docs/index.md

Lines changed: 2 additions & 7 deletions
@@ -40,16 +40,11 @@ Docling simplifies document processing, parsing diverse formats — including ad
4040

4141
## Get started
4242

43-
<div class="grid">
44-
<a href="concepts/" class="card"><b>Concepts</b><br />Learn Docling fundamentals</a>
45-
<a href="examples/" class="card"><b>Examples</b><br />Try out recipes for various use cases, including conversion, RAG, and more</a>
46-
<a href="integrations/" class="card"><b>Integrations</b><br />Check out integrations with popular frameworks and tools</a>
47-
<a href="reference/document_converter/" class="card"><b>Reference</b><br />See more API details</a>
48-
</div>
43+
Check out our [getting started](./getting_started/index.md) page to get the ball rolling!
4944

5045
## Live assistant
5146

52-
Do you want to leverage the power of AI and get a live support on Docling?
47+
Do you want to leverage the power of AI and get live support on Docling?
5348
Try out the [Chat with Dosu](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github) functionalities provided by our friends at [Dosu](https://dosu.dev/).
5449

5550
[![Chat with Dosu](https://dosu.dev/dosu-chat-badge.svg)](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/ask?utm_source=github)

docs/usage/advanced_options.md

Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
1+
## Model prefetching and offline usage
2+
3+
By default, models are downloaded automatically upon first usage. If you would prefer
4+
to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do
5+
that as follows:
6+
7+
**Step 1: Prefetch the models**
8+
9+
Use the `docling-tools models download` utility:
10+
11+
```sh
12+
$ docling-tools models download
13+
Downloading layout model...
14+
Downloading tableformer model...
15+
Downloading picture classifier model...
16+
Downloading code formula model...
17+
Downloading easyocr models...
18+
Models downloaded into $HOME/.cache/docling/models.
19+
```
20+
21+
Alternatively, models can be programmatically downloaded using `docling.utils.model_downloader.download_models()`.
22+
23+
**Step 2: Use the prefetched models**
24+
25+
```python
26+
from docling.datamodel.base_models import InputFormat
27+
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
28+
from docling.document_converter import DocumentConverter, PdfFormatOption
29+
30+
artifacts_path = "/local/path/to/models"
31+
32+
pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
33+
doc_converter = DocumentConverter(
34+
format_options={
35+
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
36+
}
37+
)
38+
```
39+
40+
Or using the CLI:
41+
42+
```sh
43+
docling --artifacts-path="/local/path/to/models" FILE
44+
```
45+
46+
Or using the `DOCLING_ARTIFACTS_PATH` environment variable:
47+
48+
```sh
49+
export DOCLING_ARTIFACTS_PATH="/local/path/to/models"
50+
python my_docling_script.py
51+
```
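
The environment-variable route can also be handled from inside a Python script. The following is a minimal sketch, not part of Docling: `resolve_artifacts_path` is a hypothetical helper, and the precedence it applies (explicit argument first, then `DOCLING_ARTIFACTS_PATH`) is this sketch's own choice rather than documented Docling behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_artifacts_path(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Hypothetical helper (not a Docling API): pick the local model directory.

    Prefers the explicit argument, then the DOCLING_ARTIFACTS_PATH variable.
    Returns None when neither is set.
    """
    candidate = explicit or env.get("DOCLING_ARTIFACTS_PATH")
    return Path(candidate).expanduser() if candidate else None
```

The returned path could then be passed as `PdfPipelineOptions(artifacts_path=...)`, as in the Step 2 snippet above.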
52+
53+
## Using remote services
54+
55+
Docling's main purpose is to run models locally, without sharing any user data with remote services.
56+
However, there are valid use cases for running part of the pipeline on remote services, for example invoking OCR engines from cloud vendors or using hosted LLMs.
57+
58+
Docling allows such models, but requires the user to explicitly opt in to communicating with external services.
59+
60+
```py
61+
from docling.datamodel.base_models import InputFormat
62+
from docling.datamodel.pipeline_options import PdfPipelineOptions
63+
from docling.document_converter import DocumentConverter, PdfFormatOption
64+
65+
pipeline_options = PdfPipelineOptions(enable_remote_services=True)
66+
doc_converter = DocumentConverter(
67+
format_options={
68+
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
69+
}
70+
)
71+
```
72+
73+
If `enable_remote_services=True` is not set, the system raises an `OperationNotAllowed` exception.
74+
75+
_Note: This option only controls whether the system sends user data to remote services. Pulling data (e.g. model weights) follows the logic described in [Model prefetching and offline usage](#model-prefetching-and-offline-usage)._
76+
77+
### List of remote model services
78+
79+
The options in this list require setting `enable_remote_services=True` explicitly when processing documents.
80+
81+
- `PictureDescriptionApiOptions`: Using vision models via API calls.
82+
83+
84+
## Adjust pipeline features
85+
86+
The example file [custom_convert.py](../examples/custom_convert.py) contains multiple ways
87+
one can adjust the conversion pipeline and features.
88+
89+
### Control PDF table extraction options
90+
91+
You can control whether table structure recognition maps the recognized structure back to PDF cells (default) or uses text cells from the structure prediction itself.
92+
This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
93+
94+
95+
```python
96+
from docling.datamodel.base_models import InputFormat
97+
from docling.document_converter import DocumentConverter, PdfFormatOption
98+
from docling.datamodel.pipeline_options import PdfPipelineOptions
99+
100+
pipeline_options = PdfPipelineOptions(do_table_structure=True)
101+
pipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model
102+
103+
doc_converter = DocumentConverter(
104+
format_options={
105+
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
106+
}
107+
)
108+
```
109+
110+
Since Docling 1.16.0, you can choose which TableFormer mode to use: `TableFormerMode.FAST` (faster but less accurate) or `TableFormerMode.ACCURATE` (default), which gives better quality on difficult table structures.
111+
112+
```python
113+
from docling.datamodel.base_models import InputFormat
114+
from docling.document_converter import DocumentConverter, PdfFormatOption
115+
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
116+
117+
pipeline_options = PdfPipelineOptions(do_table_structure=True)
118+
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model
119+
120+
doc_converter = DocumentConverter(
121+
format_options={
122+
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
123+
}
124+
)
125+
```
126+
127+
128+
## Impose limits on the document size
129+
130+
You can limit the file size and the number of pages allowed per document:
131+
132+
```python
133+
from pathlib import Path
134+
from docling.document_converter import DocumentConverter
135+
136+
source = "https://arxiv.org/pdf/2408.09869"
137+
converter = DocumentConverter()
138+
result = converter.convert(source, max_num_pages=100, max_file_size=20971520)
139+
```
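
The literal `20971520` in the snippet above equals 20 × 1024 × 1024, which suggests the limit is expressed in bytes (20 MiB). A small sketch that makes the unit explicit; `mib`, `MAX_FILE_SIZE`, and `MAX_NUM_PAGES` are illustrative names, not Docling APIs:

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return n * 1024 * 1024


# Illustrative constants; assumes max_file_size is interpreted in bytes.
MAX_FILE_SIZE = mib(20)  # same value as the literal 20971520 above
MAX_NUM_PAGES = 100
```

These could then be passed as `converter.convert(source, max_num_pages=MAX_NUM_PAGES, max_file_size=MAX_FILE_SIZE)`.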
140+
141+
## Convert from binary PDF streams
142+
143+
You can convert PDFs from a binary stream instead of from the filesystem as follows:
144+
145+
```python
146+
from io import BytesIO
147+
from docling.datamodel.base_models import DocumentStream
148+
from docling.document_converter import DocumentConverter
149+
150+
buf = BytesIO(your_binary_stream)  # your_binary_stream: the raw PDF content as bytes
151+
source = DocumentStream(name="my_doc.pdf", stream=buf)
152+
converter = DocumentConverter()
153+
result = converter.convert(source)
154+
```
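
The snippet above leaves open where the bytes come from. As a rough sketch, assuming the PDF sits on disk and should be handed to Docling from memory: `load_as_stream` and `convert_from_memory` are hypothetical helpers, and Docling is imported lazily so that the first helper works without it installed.

```python
from io import BytesIO
from pathlib import Path


def load_as_stream(path: Path) -> BytesIO:
    """Read a whole file into an in-memory binary stream."""
    return BytesIO(path.read_bytes())


def convert_from_memory(path: Path):
    """Hypothetical wrapper: convert a document via DocumentStream.

    Uses only the calls shown in the snippet above; the imports are local
    so this module can be loaded without Docling.
    """
    from docling.datamodel.base_models import DocumentStream
    from docling.document_converter import DocumentConverter

    source = DocumentStream(name=path.name, stream=load_as_stream(path))
    return DocumentConverter().convert(source)
```

The same pattern should apply to bytes received from a network response or a database blob: wrap them in `BytesIO` and give the `DocumentStream` a file name with the right extension.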
155+
156+
## Limit resource usage
157+
158+
You can limit the number of CPU threads used by Docling by setting the `OMP_NUM_THREADS` environment variable. By default, Docling uses 4 CPU threads.
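
The variable can also be set from Python instead of the shell. A minimal sketch, assuming it has to be in place before Docling and its numerical dependencies are imported (typical for OpenMP-based libraries, but an assumption here); `limit_threads` is a hypothetical helper:

```python
import os


def limit_threads(n: int) -> None:
    """Set OMP_NUM_THREADS for the current process (hypothetical helper)."""
    if n < 1:
        raise ValueError("thread count must be >= 1")
    os.environ["OMP_NUM_THREADS"] = str(n)


limit_threads(2)
# Import docling only after this point so the setting is already in place.
```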
159+
160+
161+
## Use specific backend converters
162+
163+
!!! note
164+
165+
This section discusses directly invoking a [backend](../concepts/architecture.md),
166+
i.e. using a low-level API. This should only be done when necessary. For most cases,
167+
using a `DocumentConverter` (high-level API) as discussed in the sections above
168+
should suffice — and is the recommended way.
169+
170+
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)).
171+
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
172+
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
173+
174+
```python
175+
import urllib.request
176+
from io import BytesIO
177+
from docling.backend.html_backend import HTMLDocumentBackend
178+
from docling.datamodel.base_models import InputFormat
179+
from docling.datamodel.document import InputDocument
180+
181+
url = "https://en.wikipedia.org/wiki/Duck"
182+
text = urllib.request.urlopen(url).read()
183+
in_doc = InputDocument(
184+
path_or_stream=BytesIO(text),
185+
format=InputFormat.HTML,
186+
backend=HTMLDocumentBackend,
187+
filename="duck.html",
188+
)
189+
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
190+
dl_doc = backend.convert()
191+
print(dl_doc.export_to_markdown())
192+
```
