The `make` call of a master table first inserts the master entity and then inserts
the matching part entities in the part tables.
None of the entities become visible to other processes until the entire `make` call
completes, at which point they all become visible.

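As an illustrative sketch of this all-or-nothing visibility, the following uses plain `sqlite3` (not DataJoint's actual machinery) with hypothetical `master` and `part` tables committed in a single transaction:

```python
import sqlite3

# Hypothetical master/part tables standing in for a DataJoint master table
# and its part tables; this is not DataJoint's implementation.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE master (id INTEGER PRIMARY KEY)')
conn.execute('CREATE TABLE part (master_id INTEGER, part_id INTEGER)')

with conn:  # one transaction: the master row and part rows commit together
    conn.execute('INSERT INTO master VALUES (1)')
    conn.executemany('INSERT INTO part VALUES (1, ?)', [(i,) for i in range(3)])

# Before the `with` block commits, another connection would see none of the rows;
# afterwards, the master row and all three part rows are visible at once.
print(conn.execute('SELECT COUNT(*) FROM part').fetchone()[0])  # 3
```
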
### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the
**three-part make** that separates the `make` method into three distinct phases.
This pattern is essential for maintaining database performance and data integrity
during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # All within one transaction
    data = (ParentTable & key).fetch1()     # fetch
    result = expensive_computation(data)    # compute (could take hours)
    self.insert1(dict(key, result=result))  # insert
```

This approach has significant limitations:

- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may time out during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases,
allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from parent tables"""
    fetched_data = ((ParentTable1 & key).fetch1(), (ParentTable2 & key).fetch1())
    return fetched_data  # must be a sequence, e.g. a tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform the expensive computation (outside any transaction)"""
    computed_result = expensive_computation(*fetched_data)
    return computed_result  # must be a sequence, e.g. a tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert the results into the current table"""
    result, = computed_result  # unpack the sequence returned by make_compute
    self.insert1(dict(key, result=result))
```

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern follows this execution sequence:

```python
# Step 1: Fetch data and compute outside any transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: Begin a transaction and verify data consistency (pseudocode)
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # data changed during the computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```

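The sequence above can be exercised as a minimal runnable sketch, with a plain dictionary standing in for the parent tables and no real transaction; all names here are hypothetical:

```python
# Toy version of the fetch/compute/verify/insert sequence (no database).
source = {'a': 2, 'b': 3}   # stands in for the parent tables
results = {}                # stands in for the current table

def make_fetch(key):
    return (source['a'], source['b'])

def make_compute(key, a, b):
    return (a * b,)  # stand-in for the expensive computation

def make_insert(key, product):
    results[key] = product

key = 'k1'
fetched1 = make_fetch(key)               # Step 1: fetch
computed = make_compute(key, *fetched1)  # compute outside any transaction

# Step 2: a transaction would begin here
fetched2 = make_fetch(key)               # re-fetch inside the transaction
if fetched1 == fetched2:                 # deep comparison (tuples compare by value)
    make_insert(key, *computed)          # safe to insert and commit
# else: the source changed; cancel the transaction and discard the result

print(results)  # {'k1': 6}
```
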
#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are only used briefly for data transfer
3. **Memory Management**: Fetched data can be processed and released during the computation
4. **Fault Tolerance**: Computation failures don't affect database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**.
Before inserting results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Only proceeds with the insertion if the data hasn't changed

This prevents the "phantom read" problem, where source data changes during a long computation,
ensuring that results remain consistent with their inputs.

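One way to realize such a deep comparison is to hash a serialized form of the fetched data. The following is only an illustrative sketch; DataJoint's internal hashing may be implemented differently:

```python
import hashlib
import pickle

def deep_hash(obj):
    # Hash a serialized form of the whole object graph (illustrative only).
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

before = ({'image_id': 1}, [0.1, 0.2, 0.3])     # data fetched before computing
unchanged = ({'image_id': 1}, [0.1, 0.2, 0.3])  # re-fetched, identical
modified = ({'image_id': 1}, [0.1, 0.2, 0.9])   # re-fetched, changed upstream

print(deep_hash(before) == deep_hash(unchanged))  # True  -> safe to insert
print(deep_hash(before) == deep_hash(modified))   # False -> cancel the transaction
```
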
#### Implementation Details

The pattern is implemented using Python generators in the `AutoPopulate` class:

```python
def make(self, key):
    # Step 1: Fetch data from parent tables
    fetched_data = self.make_fetch(key)
    computed_result = yield fetched_data

    # Step 2: Compute only if the result was not sent in by the caller
    if computed_result is None:
        computed_result = self.make_compute(key, *fetched_data)
        yield computed_result

    # Step 3: Insert the computed result
    self.make_insert(key, *computed_result)
    yield
```

Therefore, you can implement the three-part make pattern by overriding the `make` method as a generator that uses `yield` to hand back the fetched data and the computed result, as above.

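To make the generator protocol concrete, here is a database-free toy in which the caller drives the phases, performing the expensive step between the two `yield`s; all names are hypothetical:

```python
# Toy version of the generator protocol (no database).
store = {}  # stands in for the current table

def make(key):
    fetched = (2, 3)          # phase 1: "fetch"
    computed = yield fetched  # pause; the caller may send in the result
    if computed is None:      # caller just iterated: compute inline instead
        computed = (fetched[0] * fetched[1],)
        yield computed
    store[key] = computed     # phase 3: "insert"
    yield

gen = make('k1')
fetched = next(gen)                    # runs phase 1, receives the fetched data
computed = (fetched[0] * fetched[1],)  # expensive step, outside the generator
gen.send(computed)                     # resumes: skips phase 2, runs phase 3
print(store)  # {'k1': (6,)}
```
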
#### Use Cases

This pattern is particularly valuable for:

- **Machine learning model training**: Hours-long training sessions
- **Image processing pipelines**: Large-scale image analysis
- **Statistical computations**: Complex statistical analyses
- **Data transformations**: ETL processes with heavy computation
- **Simulation runs**: Time-consuming simulations

#### Example: Long-Running Image Analysis

Here's an example of how to implement the three-part make pattern for a
long-running image analysis task:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make_fetch(self, key):
        """Fetch the image data needed for the analysis"""
        image_data = (Image & key).fetch1('image')
        params = (Params & key).fetch1('params')
        return (image_data, params)  # pack fetched_data

    def make_compute(self, key, image_data, params):
        """Perform the expensive image analysis outside the transaction"""
        import time
        start_time = time.time()

        # Expensive computation that could take hours
        result = complex_image_analysis(image_data, params)
        processing_time = time.time() - start_time
        return result, processing_time

    def make_insert(self, key, analysis_result, processing_time):
        """Insert the analysis results"""
        self.insert1(dict(key,
                          analysis_result=analysis_result,
                          processing_time=processing_time))
```

The exact same effect may be achieved by overriding the `make` method as a generator function that uses `yield` to hand back the fetched data and the computed result:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make(self, key):
        image_data = (Image & key).fetch1('image')
        params = (Params & key).fetch1('params')
        computed_result = yield (image_data, params)  # pack fetched_data

        if computed_result is None:
            # Expensive computation that could take hours
            import time
            start_time = time.time()
            result = complex_image_analysis(image_data, params)
            processing_time = time.time() - start_time
            computed_result = result, processing_time  # pack
            yield computed_result

        result, processing_time = computed_result  # unpack
        self.insert1(dict(key,
                          analysis_result=result,
                          processing_time=processing_time))
        yield  # yield control back to the caller
```

We expect that most users will prefer the three-part implementation over the generator implementation, since the generator form is conceptually more complex.