Commit 30dc718

training.py: two tweaks to feature selection (#226)
1. Include posting amounts as a feature. This allows us to distinguish different classes of payments to the same payee (e.g. recurring membership fees, which often have a constant amount, from individual purchases).

2. For example key/value pairs, include the key by itself (with no substring of the value) as a feature. This is useful because different account types often have non-overlapping sets of example keys, and including the bare key as a value allows the decision tree to be effectively segmented by account type fairly close to the root.

These two very small changes significantly improve training accuracy on my journal, from 94.81% to 99.32% (an 86% reduction in error rate!).
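The error-rate figure in the message can be checked with a few lines of arithmetic. The helper below is only an illustrative sketch: the function name is hypothetical and is not part of beancount-import. It treats the error rate as 100% minus the reported accuracy.

```python
def error_rate_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction in error rate, given accuracies in percent."""
    err_before = 100.0 - acc_before  # 5.19% error before the change
    err_after = 100.0 - acc_after    # 0.68% error after the change
    return (err_before - err_after) / err_before


reduction = error_rate_reduction(94.81, 99.32)
print(f"{reduction:.1%}")
```

The result is a little under 87%, consistent with the roughly 86% reduction quoted in the commit message.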
1 parent f8fcb72 commit 30dc718

2 files changed: 8 additions, 3 deletions

beancount_import/training.py

Lines changed: 3 additions & 2 deletions
@@ -30,10 +30,11 @@
 def get_features(example: PredictionInput) -> Dict[str, bool]:
     features = collections.defaultdict(lambda: False) # type: Dict[str, bool]
     features['account:%s' % example.source_account] = True
-
-    # For now, skip amount and date.
+    features['amount:%s' % example.amount.currency] = example.amount.number
+    # For now, skip date.
 
     for key, values in example.key_value_pairs.items():
+        features[key] = True
         if isinstance(values, str):
             values = (values, )
         for value in values:
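To make the effect of the two added lines concrete, here is a self-contained sketch of the feature extraction. It is not the project's implementation: `Amount` and `Example` are hypothetical stand-ins for `beancount.core.data.Amount` and `PredictionInput`, a plain dict replaces the `defaultdict`, and the inner loop (which the hunk truncates) is assumed to emit contiguous word n-grams of each value, which is what the keys in the updated test suggest.

```python
from decimal import Decimal
from typing import Dict, NamedTuple, Tuple, Union


class Amount(NamedTuple):
    """Hypothetical stand-in for beancount.core.data.Amount."""
    number: Decimal
    currency: str


class Example(NamedTuple):
    """Hypothetical stand-in for PredictionInput."""
    source_account: str
    amount: Amount
    key_value_pairs: Dict[str, Union[str, Tuple[str, ...]]]


def sketch_get_features(example: Example) -> Dict[str, object]:
    features = {}  # type: Dict[str, object]
    features['account:%s' % example.source_account] = True
    # Tweak 1: the posting amount becomes a numeric feature keyed by currency.
    features['amount:%s' % example.amount.currency] = example.amount.number
    for key, values in example.key_value_pairs.items():
        # Tweak 2: the bare key is a feature regardless of its value.
        features[key] = True
        if isinstance(values, str):
            values = (values,)
        for value in values:
            # Assumption: one feature per contiguous run of words in the value.
            words = value.split()
            for start in range(len(words)):
                for end in range(start + 1, len(words) + 1):
                    features['%s:%s' % (key, ' '.join(words[start:end]))] = True
    return features


# Hypothetical input chosen to mirror the keys visible in the test diff.
example = Example(
    source_account='Assets:Checking',
    amount=Amount(Decimal(3), 'USD'),
    key_value_pairs={'a': 'hello', 'b': 'foo bar'},
)
features = sketch_get_features(example)
```

With this input, the bare keys `'a'` and `'b'` and the numeric `'amount:USD'` entry appear alongside the existing `key:value` features. Note that the amount feature holds a number rather than a boolean, so the `Dict[str, bool]` annotation in the original no longer describes every value.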

beancount_import/training_test.py

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,7 @@
 import datetime
 
 from beancount.core.data import Amount
+from beancount.core.number import D
 from . import test_util
 from . import training
 
@@ -21,7 +22,10 @@ def test_get_features():
         'a:hello': True,
         'b:foo': True,
         'b:bar': True,
-        'b:foo bar': True
+        'b:foo bar': True,
+        'a': True,
+        'b': True,
+        'amount:USD': D(3)
     }
 
 