
Commit 4a93aad

documentation: Move explanation of the make method to the make.md file.
1 parent 66a3f64 commit 4a93aad

File tree

2 files changed

+191
-189
lines changed


docs/src/compute/make.md

Lines changed: 190 additions & 0 deletions
@@ -23,3 +23,193 @@ The `make` call of a master table first inserts the master entity and then inserts
the matching part entities in the part tables.
None of the entities become visible to other processes until the entire `make` call
completes, at which point they all become visible.

### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the
**three-part make** that separates the `make` method into three distinct phases.
This pattern is essential for maintaining database performance and data integrity
during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # All within one transaction
    data = (ParentTable & key).fetch1()     # Fetch
    result = expensive_computation(data)    # Compute (could take hours)
    self.insert1(dict(key, result=result))  # Insert
```

This approach has significant limitations:

- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may time out during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases,
allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from the parent tables"""
    fetched_data = ((ParentTable1 & key).fetch1(), (ParentTable2 & key).fetch1())
    return fetched_data  # must be a sequence, e.g. a tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform the expensive computation (outside the transaction)"""
    computed_result = expensive_computation(*fetched_data)
    return computed_result  # must be a sequence, e.g. a tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert the results into the current table"""
    self.insert1(dict(key, result=computed_result))
```

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern follows this execution sequence (in pseudocode):

```python
# Step 1: Fetch the data and compute outside of a transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: Begin a transaction and verify data consistency
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # the source data changed during the computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```
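
Rendered as ordinary Python, the same flow can be simulated end-to-end. The `ToyTable` class, its `transaction` context manager, and `populate_one` below are hypothetical stand-ins for DataJoint's internals, not its actual API; the deep comparison is approximated by comparing pickled bytes:

```python
import contextlib
import pickle

def populate_one(table, key):
    """Drive one key through fetch -> compute -> verify -> insert (sketch)."""
    fetched1 = table.make_fetch(key)               # outside any transaction
    computed = table.make_compute(key, *fetched1)  # the slow part
    with table.transaction():                      # short transaction
        fetched2 = table.make_fetch(key)
        # "deep comparison" approximated by comparing pickled bytes
        if pickle.dumps(fetched1) != pickle.dumps(fetched2):
            raise RuntimeError("source data changed during computation")
        table.make_insert(key, *computed)

class ToyTable:
    """Stand-in table: a dict-backed store instead of a real database."""
    def __init__(self):
        self.source = {1: 10}   # parent data
        self.results = {}       # this table's rows

    @contextlib.contextmanager
    def transaction(self):
        yield  # a real implementation would BEGIN/COMMIT here

    def make_fetch(self, key):
        return (self.source[key],)

    def make_compute(self, key, value):
        return (value * 2,)

    def make_insert(self, key, result):
        self.results[key] = result

t = ToyTable()
populate_one(t, 1)
print(t.results)  # {1: 20}
```

If a concurrent process modified the source data between the two fetches, the verification step would raise instead of inserting a stale result.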

#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are only used briefly, for data transfer
3. **Memory Management**: Fetched data can be processed and released during the computation
4. **Fault Tolerance**: Computation failures don't affect the database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**.
Before inserting the results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Proceeds with the insertion only if the data has not changed

This prevents the "phantom read" problem, in which the source data changes during a long computation,
ensuring that the results remain consistent with their inputs.
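
For illustration, the deep-hash comparison can be sketched with standard-library tools. `content_hash` is a hypothetical helper (DataJoint's internal hashing mechanism may differ) and assumes the fetched data is picklable:

```python
import hashlib
import pickle

def content_hash(fetched_data):
    """Hash the fetched data by value, not by object identity (sketch)."""
    return hashlib.sha256(pickle.dumps(fetched_data)).hexdigest()

a = ({"image_id": 1, "pixels": [0, 1, 2]},)
b = ({"image_id": 1, "pixels": [0, 1, 2]},)  # equal value, distinct objects
c = ({"image_id": 1, "pixels": [0, 1, 9]},)  # data changed during "computation"

assert content_hash(a) == content_hash(b)  # unchanged data passes verification
assert content_hash(a) != content_hash(c)  # changed data cancels the insert
```

Hashing by value means the verification works even though the re-fetched rows are different Python objects from the originals.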

#### Implementation Details

The pattern is implemented using Python generators in the `AutoPopulate` class:

```python
def make(self, key):
    # Step 1: Fetch data from the parent tables
    fetched_data = self.make_fetch(key)
    computed_result = yield fetched_data

    # Step 2: Compute if the result was not provided by the caller
    if computed_result is None:
        computed_result = self.make_compute(key, *fetched_data)
        yield computed_result

    # Step 3: Insert the computed result
    self.make_insert(key, *computed_result)
    yield
```

It is therefore possible to implement the three-part make pattern by overriding the `make` method as a generator function that uses `yield` to return the fetched data and the computed result, as above.
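
The framework drives such a generator in two passes: one outside the transaction to fetch and compute, and one inside it to re-fetch, verify, and insert. The following is a simplified, self-contained sketch of that protocol; `drive_make` and `Toy` are hypothetical illustrations, not DataJoint API, and the transaction and deep-comparison machinery are reduced to a bare `assert`:

```python
def drive_make(table, key):
    """Consume the yields of a generator-style make (simplified sketch)."""
    # Pass 1: fetch and compute outside a transaction
    gen = table.make(key)
    fetched1 = next(gen)       # runs make up to `yield fetched_data`
    computed = gen.send(None)  # make computes and yields the result
    # (this generator is then abandoned; no insert happens in pass 1)

    # Pass 2: inside a transaction, re-fetch, verify, and insert
    gen = table.make(key)      # fresh generator
    fetched2 = next(gen)
    assert fetched1 == fetched2, "source data changed during computation"
    gen.send(computed)         # skips compute, inserts, yields control back

class Toy:
    """Minimal stand-in with a generator-style make (dict instead of a database)."""
    def __init__(self):
        self.source = {1: 3}
        self.rows = {}

    def make(self, key):
        fetched = (self.source[key],)      # make_fetch
        computed = yield fetched
        if computed is None:
            computed = (fetched[0] ** 2,)  # make_compute
            yield computed
        self.rows[key] = computed[0]       # make_insert
        yield

t = Toy()
drive_make(t, 1)
print(t.rows)  # {1: 9}
```

Sending the previously computed result into the fresh generator is what lets pass 2 skip the expensive branch and go straight to the insert.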

#### Use Cases

This pattern is particularly valuable for:

- **Machine learning model training**: Hours-long training sessions
- **Image processing pipelines**: Large-scale image analysis
- **Statistical computations**: Complex statistical analyses
- **Data transformations**: ETL processes with heavy computation
- **Simulation runs**: Time-consuming simulations

#### Example: Long-Running Image Analysis

Here is an example of implementing the three-part make pattern for a
long-running image analysis task:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    -> Params
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make_fetch(self, key):
        """Fetch the image data and parameters needed for the analysis"""
        image_data = (Image & key).fetch1('image')
        params = (Params & key).fetch1('params')
        return (image_data, params)  # pack fetched_data

    def make_compute(self, key, image_data, params):
        """Perform the expensive image analysis outside the transaction"""
        import time
        start_time = time.time()

        # Expensive computation that could take hours
        result = complex_image_analysis(image_data, params)
        processing_time = time.time() - start_time
        return result, processing_time

    def make_insert(self, key, analysis_result, processing_time):
        """Insert the analysis results"""
        self.insert1(dict(key,
                          analysis_result=analysis_result,
                          processing_time=processing_time))
```

The exact same effect may be achieved by overriding the `make` method as a generator function that uses `yield` to return the fetched data and the computed result:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    -> Params
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make(self, key):
        image_data = (Image & key).fetch1('image')
        params = (Params & key).fetch1('params')
        computed_result = yield (image_data, params)  # pack fetched_data

        if computed_result is None:
            # Expensive computation that could take hours
            import time
            start_time = time.time()
            result = complex_image_analysis(image_data, params)
            processing_time = time.time() - start_time
            computed_result = result, processing_time  # pack
            yield computed_result

        result, processing_time = computed_result  # unpack
        self.insert1(dict(key,
                          analysis_result=result,
                          processing_time=processing_time))
        yield  # yield control back to the caller
```

We expect that most users will prefer the three-part implementation over the generator implementation, since the generator form is conceptually more complex.

docs/src/compute/populate.md

Lines changed: 1 addition & 189 deletions
@@ -62,195 +62,7 @@ The `make` callback does three things:
2. Computes and adds any missing attributes to the fields already in `key`.
3. Inserts the entire entity into `self`.

`make` may populate multiple entities in one call when `key` does not specify the entire primary key of the populated table.

## Populate
