
Commit 07030e7

Merge branch 'main' into dependabot/github_actions/codecov/codecov-action-4.5.0
2 parents e4783fd + 437889c commit 07030e7

23 files changed

+378
-337
lines changed

.github/workflows/benchmarks.yml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ jobs:

     steps:
       - name: Checkout repository
-        uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
+        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

       - name: Set up Go ${{ matrix.go-version }}
         uses: actions/setup-go@cdcb36043654635271a94b9a6d1392de5bb323a7

.github/workflows/codeql-analysis.yaml

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ jobs:

     steps:
       - name: Checkout repository
-        uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
+        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

       # Initializes the CodeQL tools for scanning.
       - name: Initialize CodeQL

.github/workflows/dep-review.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ jobs:
     timeout-minutes: 5
     steps:
       - name: Checkout repository
-        uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
+        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

       - name: Dependency Review
         uses: actions/dependency-review-action@0659a74c94536054bfa5aeb92241f70d680cc78e

.github/workflows/go-lint.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ jobs:

     steps:
       - name: Checkout repository
-        uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
+        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332
         with:
           fetch-depth: 1

.github/workflows/go-unit-tests.yaml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ jobs:

     steps:
       - name: Checkout repository
-        uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
+        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332

       - name: Set up Go ${{ matrix.go-version }}
         uses: actions/setup-go@cdcb36043654635271a94b9a6d1392de5bb323a7

.github/workflows/release.yaml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:

     steps:
       - name: Checkout repository
-        uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29
+        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332
         with:
           fetch-depth: 0
           ref: "main"

README.md

Lines changed: 25 additions & 19 deletions
@@ -14,7 +14,9 @@
 create an instance and add multiple **Patterns** to it,
 and then query data objects called **Events** to
 discover which of the Patterns match
-the fields in the Event.
+the fields in the Event. In typical cases, Quamina
+can match millions of Events per second, even with
+many Patterns added to the instance.

 Quamina has no run-time dependencies beyond built-in Go libraries.

@@ -292,33 +294,20 @@ Events through it as is practical.

 ### `AddPattern()` Performance

-In **most** cases, tens of thousands of Patterns per second can
+Tens of thousands of Patterns per second can
 be added to a Quamina instance; the in-memory data structure will
-become larger, but not unreasonably so. The amount of of
+become larger, but not unreasonably so. The amount of
 available memory is the only significant limit to the
 number of patterns an instance can carry.

-The exception is `shellstyle` Patterns. Adding many of these
-can rapidly lead to degradation in elapsed time and memory
-consumption, at a rate which is uneven but at worst
-O(2<sup>N</sup>) in the number of patterns. A fuzz test
-which adds random 5-letter words with a `*` at a random
-location slows to a crawl after 30 or so `AddPattern()`
-calls, with the Quamina instance having many millions of
-states. Note that such instances, once built, can still
-match Events at high speeds.
-
-This is after some optimization. It is possible there is a
-bug such that automaton-building is unduly wasteful but it
-may remain the case that adding this flavor of Pattern is
-simply not something that can be done at large scale.
-
 ### `MatchesForEvent()` Performance

 I used to say that the performance of
 `MatchesForEvent` was O(1) in the number of
 Patterns. That’s probably a reasonable way to think
-about it, because it’s *almost* right.
+about it, because it’s *almost* right, except in the
+case where a very large number of `shellstyle` patterns
+have been added; this is discussed in the next section.

 To be correct, the performance is a little worse than
 O(N) where N is the average number of unique fields in an
@@ -361,6 +350,23 @@ So, adding a new Pattern that only mentions fields which are
 already mentioned in previous Patterns is effectively free,
 i.e. O(1) in terms of run-time performance.

+### Quamina instances with large numbers of `shellstyle` Patterns
+
+A study of the theory of finite automata reveals that processing
+regular-expression constructs such as `*` increases the complexity of
+the automaton necessary to match it. It develops that when
+a large number of such automata are compiled together, the merged
+output can contain a high degree of nondeterminism which can result
+in a drastic slowdown.
+
+A fuzz test which adds a pattern for each of 12,959 5-letter words with
+one `*` embedded in each at a random offset slows matching speed down to
+below 10,000/second, in stark contrast to most Quamina instances, which
+can achieve millions of matches/second.
+
+This slowdown is under active investigation and it is possible that the
+situation will improve.
+
 ### Further documentation

 There is a series of blog posts entitled
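
The new README section on `shellstyle` Patterns attributes the slowdown to nondeterminism: a `*` lets the matcher be in several states at once. The sketch below is not Quamina's implementation (Quamina merges patterns into one automaton); it is a deliberately simple, unmerged NFA simulation, with hypothetical names (`nfaState`, `addState`, `matchAll`), that only illustrates how the set of simultaneously active states grows when several `*` patterns are matched together.

```go
package main

import "fmt"

// nfaState is a position inside one pattern: which pattern, and how many
// pattern bytes have been consumed so far. (Hypothetical type for this sketch.)
type nfaState struct{ pat, pos int }

// addState puts s into set. If s is sitting on a '*', the state just past the
// '*' is added too, because '*' may match zero bytes.
func addState(patterns []string, set map[nfaState]bool, s nfaState) {
	for !set[s] {
		set[s] = true
		p := patterns[s.pat]
		if s.pos < len(p) && p[s.pos] == '*' {
			s = nfaState{s.pat, s.pos + 1}
		}
	}
}

// matchAll runs every pattern against value in lock-step. It returns the
// indices of the patterns that match and the largest number of states that
// were active at the same time.
func matchAll(patterns []string, value string) (matched []int, peak int) {
	active := make(map[nfaState]bool)
	for i := range patterns {
		addState(patterns, active, nfaState{i, 0})
	}
	peak = len(active)

	for i := 0; i < len(value); i++ {
		next := make(map[nfaState]bool)
		for s := range active {
			p := patterns[s.pat]
			if s.pos >= len(p) {
				continue // pattern exhausted but input remains: dead end
			}
			switch {
			case p[s.pos] == '*':
				addState(patterns, next, s) // '*' consumes the byte and stays put
			case p[s.pos] == value[i]:
				addState(patterns, next, nfaState{s.pat, s.pos + 1})
			}
		}
		active = next
		if len(active) > peak {
			peak = len(active)
		}
	}

	// A pattern matches if its end-of-pattern state is active at end of input.
	for i, p := range patterns {
		if active[nfaState{i, len(p)}] {
			matched = append(matched, i)
		}
	}
	return matched, peak
}

func main() {
	patterns := []string{"a*e", "ab*", "*cde", "xyz"}
	matched, peak := matchAll(patterns, "abcde")
	fmt.Println("matching patterns:", matched)
	fmt.Println("peak active states:", peak)
}
```

Every `*` keeps its own state alive for the rest of the input, so the work per input byte grows with the number of wildcard patterns. That is the general effect the README describes, shown here at toy scale.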

anything_but.go

Lines changed: 8 additions & 9 deletions
@@ -73,20 +73,19 @@ func readAnythingButSpecial(pb *patternBuild, valsIn []typedVal) (pathVals []typ
 func makeMultiAnythingButFA(vals [][]byte) (*smallTable, *fieldMatcher) {
 	nextField := newFieldMatcher()
 	successStep := &faState{table: newSmallTable(), fieldTransitions: []*fieldMatcher{nextField}}
-	//DEBUG successStep.table.label = "(success)"
-	success := &faNext{steps: []*faState{successStep}}
+	success := &faNext{states: []*faState{successStep}}

-	ret, _ := oneMultiAnythingButStep(vals, 0, success), nextField
+	ret, _ := makeOneMultiAnythingButStep(vals, 0, success), nextField
 	return ret, nextField
 }

-// oneMultiAnythingButStep - spookeh. The idea is that there will be N smallTables in this FA, where N is
+// makeOneMultiAnythingButStep - spookeh. The idea is that there will be N smallTables in this FA, where N is
 // the longest among the vals. So for each value from 0 through N, we make a smallTable whose default is
 // success but transfers to the next step on whatever the current byte in each of the vals that have not
 // yet been exhausted. We notice when we get to the end of each val and put in a valueTerminator transition
 // to a step with no nextField entry, i.e. failure because we've exactly matched one of the anything-but
 // strings.
-func oneMultiAnythingButStep(vals [][]byte, index int, success *faNext) *smallTable {
+func makeOneMultiAnythingButStep(vals [][]byte, index int, success *faNext) *smallTable {
 	// this will be the default transition in all the anything-but tables.
 	var u unpackedTable
 	for i := range u {
@@ -115,18 +114,18 @@ func oneMultiAnythingButStep(vals [][]byte, index int, success *faNext) *smallTa

 	// for each val that still has bytes to process, recurse to process the next one
 	for utf8Byte, val := range valsWithBytesRemaining {
-		nextTable := oneMultiAnythingButStep(val, index+1, success)
+		nextTable := makeOneMultiAnythingButStep(val, index+1, success)
 		nextStep := &faState{table: nextTable}
-		u[utf8Byte] = &faNext{steps: []*faState{nextStep}}
+		u[utf8Byte] = &faNext{states: []*faState{nextStep}}
 	}

 	// for each val that ends at 'index', put a failure-transition for this anything-but
 	// if you hit the valueTerminator, success for everything else
 	for utf8Byte := range valsEndingHere {
 		failState := &faState{table: newSmallTable()} // note no transitions
-		lastStep := &faNext{steps: []*faState{failState}}
+		lastStep := &faNext{states: []*faState{failState}}
 		lastTable := makeSmallTable(success, []byte{valueTerminator}, []*faNext{lastStep})
-		u[utf8Byte] = &faNext{states: []*faState{{table: lastTable}}}
+		u[utf8Byte] = &faNext{states: []*faState{{table: lastTable}}}
 	}

 	table := newSmallTable()
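
The comment on `makeOneMultiAnythingButStep` describes an automaton whose default transition is success and which fails only when it has followed one of the excluded values byte for byte to its end. The sketch below shows that idea in isolation, assuming a plain map-based trie instead of Quamina's `smallTable`, `faState` and `valueTerminator` machinery. The names `abNode`, `buildAnythingBut` and `matches` are hypothetical.

```go
package main

import "fmt"

// abNode is one step of the sketch automaton. Any byte that has no entry in
// next takes the default transition, which is success.
type abNode struct {
	next     map[byte]*abNode
	endsHere bool // true if an excluded value ends exactly at this node
}

// buildAnythingBut builds a byte-level automaton that accepts every value
// except the ones listed in vals.
func buildAnythingBut(vals [][]byte) *abNode {
	root := &abNode{next: map[byte]*abNode{}}
	for _, val := range vals {
		n := root
		for _, b := range val {
			child, ok := n.next[b]
			if !ok {
				child = &abNode{next: map[byte]*abNode{}}
				n.next[b] = child
			}
			n = child
		}
		n.endsHere = true
	}
	return root
}

// matches reports whether val is anything but one of the excluded values.
func (n *abNode) matches(val []byte) bool {
	for _, b := range val {
		child, ok := n.next[b]
		if !ok {
			return true // left every excluded value: default transition, success
		}
		n = child
	}
	// The input ended; it fails only if an excluded value ends here too.
	return !n.endsHere
}

func main() {
	fa := buildAnythingBut([][]byte{[]byte("foo"), []byte("bar")})
	for _, v := range []string{"foo", "fo", "food", "bar", "baz"} {
		fmt.Printf("%q matches: %v\n", v, fa.matches([]byte(v)))
	}
}
```

The end-of-input check plays the role that the `valueTerminator` transition plays in the diff: reaching the end of the value while sitting at the end of an excluded string is the only way to fail.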

cl2_test.go

Lines changed: 8 additions & 6 deletions
@@ -187,20 +187,20 @@ func TestRulerCl2(t *testing.T) {

 	// initial run to stabilize memory
 	bm := newBenchmarker()
-	bm.addRules(exactRules, exactMatches)
+	bm.addRules(exactRules, exactMatches, false)

 	bm.run(t, lines)

 	bm = newBenchmarker()
-	bm.addRules(exactRules, exactMatches)
+	bm.addRules(exactRules, exactMatches, true)
 	fmt.Printf("EXACT events/sec: %.1f\n", bm.run(t, lines))

 	bm = newBenchmarker()
-	bm.addRules(prefixRules, prefixMatches)
+	bm.addRules(prefixRules, prefixMatches, true)
 	fmt.Printf("PREFIX events/sec: %.1f\n", bm.run(t, lines))

 	bm = newBenchmarker()
-	bm.addRules(anythingButRules, anythingButMatches)
+	bm.addRules(anythingButRules, anythingButMatches, true)
 	fmt.Printf("ANYTHING-BUT events/sec: %.1f\n", bm.run(t, lines))
 }

@@ -214,13 +214,15 @@ func newBenchmarker() *benchmarker {
 	return &benchmarker{q: q, wanted: make(map[X]int)}
 }

-func (bm *benchmarker) addRules(rules []string, wanted []int) {
+func (bm *benchmarker) addRules(rules []string, wanted []int, report bool) {
 	for i, rule := range rules {
 		rname := fmt.Sprintf("r%d", i)
 		_ = bm.q.AddPattern(rname, rule)
 		bm.wanted[rname] = wanted[i]
 	}
-	fmt.Println(matcherStats(bm.q.matcher.(*coreMatcher)))
+	if report {
+		fmt.Println(matcherStats(bm.q.matcher.(*coreMatcher)))
+	}
 }

 func (bm *benchmarker) run(t *testing.T, events [][]byte) float64 {

core_matcher.go

Lines changed: 18 additions & 11 deletions
@@ -129,7 +129,7 @@ func (m *coreMatcher) deletePatterns(_ X) error {
 // matchesForJSONEvent calls the flattener to pull the fields out of the event and
 // hands over to MatchesForFields
 // This is a leftover from previous times, is only used by tests, but it's used by a *lot*
-// so removing it would require a lot of tedious work
+// and it's a convenient API for testing.
 func (m *coreMatcher) matchesForJSONEvent(event []byte) ([]X, error) {
 	fields, err := newJSONFlattener().Flatten(event, m.getSegmentsTreeTracker())
 	if err != nil {
@@ -178,20 +178,27 @@ func (m *coreMatcher) matchesForFields(fields []Field) ([]X, error) {
 	}
 	matches := newMatchSet()

+	// pre-allocate a pair of buffers that will be used several levels down the call stack for efficiently
+	// transversing NFAs
+	bufs := &bufpair{
+		buf1: make([]*faState, 0),
+		buf2: make([]*faState, 0),
+	}
+
 	// for each of the fields, we'll try to match the automaton start state to that field - the tryToMatch
 	// routine will, in the case that there's a match, call itself to see if subsequent fields after the
 	// first matched will transition through the machine and eventually achieve a match
 	s := m.fields()
 	for i := 0; i < len(fields); i++ {
-		tryToMatch(fields, i, s.state, matches)
+		tryToMatch(fields, i, s.state, matches, bufs)
 	}
 	return matches.matches(), nil
 }

 // tryToMatch tries to match the field at fields[index] to the provided state. If it does match and generate
 // 1 or more transitions to other states, it calls itself recursively to see if any of the remaining fields
 // can continue the process by matching that state.
-func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSet) {
+func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSet, bufs *bufpair) {
 	stateFields := state.fields()

 	// transition on exists:true?
@@ -200,16 +207,16 @@ func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSe
 		matches = matches.addXSingleThreaded(existsTrans.fields().matches...)
 		for nextIndex := index + 1; nextIndex < len(fields); nextIndex++ {
 			if noArrayTrailConflict(fields[index].ArrayTrail, fields[nextIndex].ArrayTrail) {
-				tryToMatch(fields, nextIndex, existsTrans, matches)
+				tryToMatch(fields, nextIndex, existsTrans, matches, bufs)
 			}
 		}
 	}

 	// an exists:false transition is possible if there is no matching field in the event
-	checkExistsFalse(stateFields, fields, index, matches)
+	checkExistsFalse(stateFields, fields, index, matches, bufs)

 	// try to transition through the machine
-	nextStates := state.transitionOn(&fields[index])
+	nextStates := state.transitionOn(&fields[index], bufs)

 	// for each state in the possibly-empty list of transitions from this state on fields[index]
 	for _, nextState := range nextStates {
@@ -221,17 +228,17 @@ func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSe
 		// of the same array
 		for nextIndex := index + 1; nextIndex < len(fields); nextIndex++ {
 			if noArrayTrailConflict(fields[index].ArrayTrail, fields[nextIndex].ArrayTrail) {
-				tryToMatch(fields, nextIndex, nextState, matches)
+				tryToMatch(fields, nextIndex, nextState, matches, bufs)
 			}
 		}
 		// now we've run out of fields to match this state against. But suppose it has an exists:false
 		// transition, and it so happens that the exists:false pattern field is lexically larger than the other
 		// fields and that in fact such a field does not exist. That state would be left hanging. So…
-		checkExistsFalse(nextStateFields, fields, index, matches)
+		checkExistsFalse(nextStateFields, fields, index, matches, bufs)
 	}
 }

-func checkExistsFalse(stateFields *fmFields, fields []Field, index int, matches *matchSet) {
+func checkExistsFalse(stateFields *fmFields, fields []Field, index int, matches *matchSet, bufs *bufpair) {
 	for existsFalsePath, existsFalseTrans := range stateFields.existsFalse {
 		// it seems like there ought to be a more state-machine-idiomatic way to do this, but
 		// I thought of a few and none of them worked. Quite likely someone will figure it out eventually.
@@ -250,9 +257,9 @@ func checkExistsFalse(stateFields *fmFields, fields []Field, index int, matches
 		if i == len(fields) {
 			matches = matches.addXSingleThreaded(existsFalseTrans.fields().matches...)
 			if thisFieldIsAnExistsFalse {
-				tryToMatch(fields, index+1, existsFalseTrans, matches)
+				tryToMatch(fields, index+1, existsFalseTrans, matches, bufs)
 			} else {
-				tryToMatch(fields, index, existsFalseTrans, matches)
+				tryToMatch(fields, index, existsFalseTrans, matches, bufs)
 			}
 		}
 	}
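
The `bufpair` introduced in `matchesForFields` is allocated once and threaded down the call stack so that NFA traversal can reuse two state buffers. This diff does not show how `transitionOn` uses them, so the following is only a sketch of the general double-buffer technique, under that assumption: one slice holds the current state set, the other collects the next one, and the two swap roles after each input byte. The `state` type, the `bufPair` spelling, `traverse` and `buildEndsInAB` are all hypothetical.

```go
package main

import "fmt"

// state is a node in a toy NFA: a byte may lead to several next states.
type state struct {
	next      map[byte][]*state
	accepting bool
}

// bufPair holds two reusable slices, in the spirit of the diff's bufpair.
type bufPair struct{ buf1, buf2 []*state }

// traverse runs the NFA over input. It reuses the caller's buffers instead of
// allocating a fresh slice for every step.
func traverse(start *state, input []byte, bufs *bufPair) bool {
	current := append(bufs.buf1[:0], start)
	next := bufs.buf2[:0]
	for _, b := range input {
		next = next[:0] // reuse the backing array, discard old contents
		for _, s := range current {
			next = append(next, s.next[b]...)
		}
		current, next = next, current // swap roles for the following byte
	}
	// Hand the (possibly grown) buffers back so later calls keep the capacity.
	bufs.buf1, bufs.buf2 = current, next
	for _, s := range current {
		if s.accepting {
			return true
		}
	}
	return false
}

// buildEndsInAB builds an NFA over {a, b} that accepts strings ending in "ab".
func buildEndsInAB() *state {
	s0 := &state{next: map[byte][]*state{}}
	s1 := &state{next: map[byte][]*state{}}
	s2 := &state{accepting: true}
	s0.next['a'] = []*state{s0, s1} // nondeterministic: stay, or guess "ab" starts here
	s0.next['b'] = []*state{s0}
	s1.next['b'] = []*state{s2}
	return s0
}

func main() {
	start := buildEndsInAB()
	bufs := &bufPair{} // allocated once, shared by every traversal
	for _, in := range []string{"aab", "aba", "abab", ""} {
		fmt.Printf("%q accepted: %v\n", in, traverse(start, []byte(in), bufs))
	}
}
```

The sketch does not deduplicate states within a step, which a production matcher would probably want to do. The point is only that the per-byte state sets live in two long-lived slices rather than in freshly allocated ones.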
