Working Notes (Archived; Part A)
--------------------------------
This file contains a diary of random working notes, which I use to keep
track of what the heck it is that I'm doing. It is almost surely totally
useless to you, except maybe for some weird voyeuristic reasons.
Archived in 2021; this covers the time period from the start of the
project to Dec 2019. There was no activity on this project in 2020.
======================================================================
Jan-Feb 2014
Handy tools
-----------
Some handy SQL commands:
```
SELECT count(uuid) FROM Atoms;
select count(uuid) from atoms where type =123;
```
type 123 is `WordNode` for me; verify with
```
SELECT * FROM Typecodes;
```
The total count accumulated is
```
select sum(floatvalue[3]) from valuations where type=7;
```
where type 7 is `CountTruthValue`.
Pair-counting batch results
---------------------------
Example stats and performance:
current fr_pairs db has 16785 words and 177960 pairs.
This takes 17K + 2x 178K = 370K total atoms loaded.
These load up in 10-20 seconds-ish or so.
New fr_pairs has 225K words, 5M pairs (10.3M atoms):
Loading 10.3M atoms takes about 10 minutes of cpu time and
20-30 minutes wall-clock time (500K atoms per minute, 9K/second
on an overloaded server).
RSS for cogserver: 436MB, holding approx 370K atoms
So this is about 1.2KB per atom, all included. Atoms are a bit fat...
... loading all pairs is very manageable even for modest-sized machines.
RSS for cogserver: 10GB, holding 10.3M atoms
So this is just under 1KB per atom.
(By comparison, direct measurement of atom size i.e. class Atom:
typical atom size: 4820384 / 35444 = 136 Bytes/atom
this is NOT counting indexes, etc.)
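The per-atom figures above are just resident memory divided by atom count. A minimal Python sketch of that arithmetic, assuming decimal units (1MB = 10^6 bytes, 1GB = 10^9 bytes); the helper name `bytes_per_atom` is made up for this note:

```python
def bytes_per_atom(total_bytes, num_atoms):
    """Average memory footprint per atom, in bytes."""
    return total_bytes / num_atoms

# RSS figures quoted above; decimal MB/GB is an assumption.
small_db = bytes_per_atom(436e6, 370e3)     # 370K atoms in 436MB
large_db = bytes_per_atom(10e9, 10.3e6)     # 10.3M atoms in 10GB
bare_atom = bytes_per_atom(4820384, 35444)  # direct measurement of class Atom

# How much bigger the all-included footprint is than the bare object.
overhead_factor = small_db / bare_atom

print(round(small_db), round(large_db), bare_atom, round(overhead_factor, 1))
```

The all-included footprint comes out between 8 and 9 times the bare 136-byte object, which is presumably indexes, truth values and allocator overhead.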
For dataset (fr_pairs) with 225K words, 5M pairs:
Current rate is 150 words/sec or 9K words/min.
After the single-word counts complete, an all-pair count is done.
This is fast; it takes a couple of minutes.
Next: batch-logli takes 540 seconds for 225K words
Finally, an MI compute stage.
Current rate is 60 words/sec = 3.6K per minute.
This rate is per-word, not per word-pair.
Update Feb 2014: fr_pairs now contains 10.3M atoms
SELECT count(uuid) FROM Atoms; gives 10324863 (10.3M atoms)
select count(uuid) from atoms where type = 77; gives 226030 (226K words)
select count(uuid) from atoms where type = 8; gives 5050835 (5M pairs ListLink)
select count(uuid) from atoms where type = 27; gives 5050847 (5M pairs EvaluationLink)
Performance
-----------
Performance seems to suck:
-- two parsers, each takes maybe 4% cpu time total. Load avg of about 0.03
-- each parser runs 4 async write threads pushing atoms to postgres.
each one complains about it taking too long to flush the write queues.
-- postmaster is running 10 threads, load-avg of about 2.00 so about
2 cpu's at 100%
-- vmstat shows 500 blks per second written. This is low...
-- top shows maybe 0.2% wait state. So its not disk-bound.
-- what is taking so long?
So, take a tcpdump:
-- a typical tcpdump packet:
UPDATE Atoms SET tv_type = 2, stv_mean = 0 , stv_confidence = 0, stv_count = 54036 WHERE uuid = 367785;
its maybe 226 bytes long.
-- this gets one response from server, about 96 bytes long.
-- then one more req, one more response, seems to be a 'we're done' msg
or something ... which I guess is due to SQLFreeHandle(SQL_HANDLE_STMT ???
-- time delta in seconds, of tcpdump of traffic packets, between update, and
response from server:
0.0006 0.0002 0.0002 0.0002 0.028 (yow!!) 0.001 0.0002
-- so it looks like about every 8-10 packets are replied to fairly quick,
then there's one that takes 0.025 seconds to reply.... stair-steps in
response time like this all the way through the capture.
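To see how much the occasional slow reply costs, here is a small Python sketch over the seven latencies listed above. It assumes the updates are strictly serialized (one request, one reply, then the next), which is only a guess about how the write threads behave:

```python
from statistics import median

# Reply latencies in seconds, copied from the tcpdump capture above.
deltas = [0.0006, 0.0002, 0.0002, 0.0002, 0.028, 0.001, 0.0002]

total = sum(deltas)
slow_share = max(deltas) / total      # fraction of time spent in the one slow reply
observed_rate = len(deltas) / total   # updates per second, stalls included
typical = median(deltas)              # latency of a "fast" reply
stall_free_rate = 1.0 / typical       # rate if every reply were that fast

print(f"slow reply share: {slow_share:.0%}")
print(f"observed: {observed_rate:.0f}/sec, stall-free: {stall_free_rate:.0f}/sec")
```

On this tiny sample the single slow reply accounts for roughly nine tenths of the elapsed time, so the stair-steps, not the fast replies, set the overall rate.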
Wild guess:
-- Hmm ... this seems to be related to the commit delay in postgresql.conf
Change commit_delay to 1 second
change wal_buffers to 32MB since its mostly update traffic.
change checkpoint_segments to 32 (each one takes up 16MB of disk space.)
-- Making these changes has no obvious effect ... bummer.
I don't get it; performance sucks and I don't see why. Or rather: postmaster
is chewing up vast amounts of cpu time for no apparent reason...
select * from pg_stat_user_tables;
select * from pg_stat_all_tables;
select * from pg_statio_user_tables;
select * from pg_database;
pg_stat_user_indexes
pg_stat_all_indexes
select * from pg_catalog.pg_stat_activity;
select * from pg_catalog.pg_locks;
-- WOW!!! VACUUM ANALYZE; had a huge effect!!
-- vacuum tells me to do following:
change max_fsm_pages to 600K
change max_fsm_relations to 10K
Anyway ... performance measured as of 27 Dec 2013:
Takes about 105 millisecs to clear 90 eval-links from the write-back
queues. Each eval-link is 5 atoms (eval, defind, list, word, word)
so this works out to 5*90 atoms /0.105 seconds = 4.3KAtoms/sec
which is still pretty pathetic...
gdb:
---
handle SIGPWR nostop noprint
handle SIGXCPU nostop noprint
How about using a reader-writer lock?
----------------------------------
boost::shared_lock for reading,
unique_lock for writing ...
upgrade_lock<shared_mutex> lock(workerAccess);
upgrade_to_unique_lock<shared_mutex> uniqueLock(lock);
shared_mutex
writers use: unique_lock<shared_mutex>
readers use: shared_lock<shared_mutex>
writer does:
{
   // get upgradable access
   boost::upgrade_lock<boost::shared_mutex> lock(_access);
   // get exclusive access
   boost::upgrade_to_unique_lock<boost::shared_mutex> uniqueLock(lock);
   // now we have exclusive access
}
am using boost-1.49 on cray
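The reader-writer idea itself, many concurrent readers or one exclusive writer, can be sketched independently of boost. Below is a minimal Python analogue; the class `ReadWriteLock` and its method names are hypothetical, it is not the boost API, it has no upgrade path, and it does nothing to stop readers from starving writers:

```python
import threading

class ReadWriteLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of readers currently inside
        self._writer = False    # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake any waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()       # wake waiting readers and writers


rw = ReadWriteLock()
shared = {"count": 0}

def writer(n):
    for _ in range(n):
        rw.acquire_write()
        try:
            shared["count"] += 1
        finally:
            rw.release_write()

def reader(n, seen):
    for _ in range(n):
        rw.acquire_read()
        try:
            seen.append(shared["count"])
        finally:
            rw.release_read()

seen = []
threads = [threading.Thread(target=writer, args=(1000,)) for _ in range(4)]
threads.append(threading.Thread(target=reader, args=(1000, seen)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["count"])
```

Four writers each add 1000 under the exclusive lock, so no increments are lost; the reader only ever sees a count that never goes backwards.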
Some typical entropies for word-pairs
-------------------------------------
Five experiments:
1) Get H(de,*) H(*,de) H(en,*) H(*,en) and compare to
H(de+en,*) H(*, de+en)
2) H(vieux,*) H(*, vieux) H(nouveaux, *) H(*, nouveaux)
3) H(vieille, *) etc + vieux
4) H(le, *) H(la,*) vs. H(le+la)
5) H(le,*) H(famille,*) which should fail ...!?
Some typical entropies for word-pairs
-------------------------------------
The below is arithmetically correct, but theoretically garbage.
(WordNode "famille") entropy H=11.185
H(*, famille) = 11.195548
H(famille, *) = 11.174561
MI(et, famille) = -5.2777815
H(et, *) = 5.5696678
P(et, *) = 0.021055372363972875
thus:
H(et, famille) = -MI(et, famille) + H(famille, *) + H(et, *) = 22.0220103
P(et, famille) = 2.348087815164205e-7
MI(de, famille) = 2.1422486
H(de, *) = 4.3749881
P(de, *) = 0.04819448582223504
H(de, famille) = -2.1422486 + 4.3749881 + 11.195548 = 13.4282875
P(de, famille) = 9.071574511509601e-5
P(de+et, *) = 0.06924985818620791
H(de+et, *) = 3.8520450730427047
P(de+et, famille) = 9.095055389661243e-5
H(de+et, famille) = 13.424558050397735
MI(de+et, famille) = 1.6230350226449701
So MI(et, famille) < MI(de+et, famille) < MI(de, famille)
-5.2777815 < 1.6230350226 < 2.1422486
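The merged-word MI above follows from adding the probabilities and reusing the same identity as for the single words, MI(a, b) = H(a, *) + H(*, b) - H(a, b). A short Python sketch that reproduces the number; the variable names are invented for this note, and H is taken to be -log2 of the probability, as in the figures above:

```python
import math

def surprisal(p):
    """-log2(p), the quantity written H(...) in these notes."""
    return -math.log2(p)

# Values copied from the notes above.
p_de_any = 0.04819448582223504      # P(de, *)
p_et_any = 0.021055372363972875     # P(et, *)
p_de_fam = 9.071574511509601e-5     # P(de, famille)
p_et_fam = 2.348087815164205e-7     # P(et, famille)
h_any_fam = 11.195548               # H(*, famille)

# Merging de and et: the probabilities of the merged word just add.
p_merged_any = p_de_any + p_et_any
p_merged_fam = p_de_fam + p_et_fam

# MI(a, b) = H(a, *) + H(*, b) - H(a, b)
mi_merged = surprisal(p_merged_any) + h_any_fam - surprisal(p_merged_fam)
print(mi_merged)
```

The result lands between MI(et, famille) and MI(de, famille), much closer to the latter, because almost all of the merged pair's probability comes from (de, famille).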
By contrast, the probability-weighted arithmetic average is:
(MI(de, famille) * P(de, famille) + MI(et, famille) * P(et, famille)) /
(P(de, famille) + P(et, famille))
= 2.1230921666199825
Change in entropy:
MI(de, famille) * P(de, famille) + MI(et, famille) * P(et, famille) = 1.930964e-4
MI(de+et, famille) * P(de+et, famille) = 1.476159343e-4
Oh, wait ...
H(de, famille) * P(de, famille) + H(et, famille) * P(et, famille) = 0.001223328
H(de+et, famille) * P(de+et, famille) = 0.00122097099
Change in entropy = 0.00122097099 - 0.001223328 = -2.35701e-6
-------
H(de) = 4.3808608
H(et) = 5.5862331
P(de) = 0.04799870191172842
P(et) = 0.02081499323761464
P(de+et) = 0.06881369514934306
H(de+et) = 3.8611604742976153 = -log_2 (P(de)+P(et))
By contrast, the weighted average is
(P(de)*H(de) + P(et)*H(et)) /(P(de) + P(et)) = 4.745465784790553
Combinations:
P(de+et)*H(de+et) = 0.2657007
P(de)*H(de) + P(et)*H(et) = 0.32655303
The change in entropy, from forming a union, is:
P(de+et)*H(de+et) - P(de)*H(de) - P(et)*H(et) = -0.060852316
Recap: Delta(de+et) = -0.060852316
Delta(de+et, famille) = -2.35701e-6
Entropy increases (strongly) if the word-pairs are merged, the words are separated.
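The two Delta values in the recap can be reproduced with a few lines of Python. This is only a sketch of the arithmetic used above, with a made-up helper `union_delta`; it takes H(x) = -log2 P(x) and compares P*H for the union against the sum of P*H for the parts:

```python
import math

def surprisal(p):
    """-log2(p), written H(...) in these notes."""
    return -math.log2(p)

def union_delta(p_a, p_b):
    """Change in sum of P*H when two disjoint items are merged into one."""
    p_union = p_a + p_b
    return (p_union * surprisal(p_union)
            - p_a * surprisal(p_a)
            - p_b * surprisal(p_b))

# Single words: P(de), P(et)
delta_words = union_delta(0.04799870191172842, 0.02081499323761464)

# Word pairs: P(de, famille), P(et, famille)
delta_pairs = union_delta(9.071574511509601e-5, 2.348087815164205e-7)

print(delta_words, delta_pairs)
```

Both deltas are negative, and always will be for this formula, since -log2 of the summed probability is smaller than -log2 of either part. The word-level delta is several orders of magnitude larger in size than the pair-level one.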
-------
MI(d'une, famille) = 5.230504
H(d'une, *) = 9.792551
H(la) = 5.6536283
H(la, *) = 5.5858526
sa
est
de
H(d'une) = 9.7960119
H(un) = 7.1578913
H(et) = 5.5862331
-----
repeat, for vieille+nouveaux
H(nouveaux) = 14.28815
P(nouveaux) = 4.998483553100357e-5
H(vieille) = 16.16037
P(vieille) = 1.365349e-5
P(nouveaux+vieille) = 6.363833e-5
H(nouveaux+vieille) = 13.93974
P(nouveaux+vieille)*H(nouveaux+vieille) = 8.87102088e-4
P(nouveaux)*H(nouveaux) + P(vieille)*H(vieille) = 9.3483638e-4
Change in entropy is diff of the two: -4.7734297e-5
-----
repeat, for vieille+nouveaux
H(*, famille) = 11.195548
H(nouveaux, *) = 13.974219
P(nouveaux, *) = 6.213565989765264e-5
MI(nouveaux, famille) = 5.2966957
H(nouveaux, famille) = 19.8730713
P(nouveaux, famille) = 1.0413804797188067e-6
H(vieille, *) = 15.998603
P(vieille, *) = 1.5273571710064995e-5
MI(vieille, famille) = 10.195547
H(vieille, famille) = 16.998604
P(vieille, famille) = 7.636780561617735e-6
P(vieille+nouveaux, famille) = 8.678161041336542e-6
H(vieille+nouveaux, famille) = 16.814179210712517
P(vieille+nouveaux, *) = 7.740923160771763e-5
H(vieille+nouveaux, *) = 13.657134846045357
MI(vieille+nouveaux, famille) = 8.038503635332841
so MI(nouveaux, famille) < MI(vieille+nouveaux, famille) < MI(vieille, famille)
5.2966957 < 8.038503635332841 < 10.195547
Change in entropy:
P(nouveaux, famille)*H(nouveaux, famille) + P(vieille, famille)*H(vieille, famille)
= 1.505100371e-4
P(vieille+nouveaux, famille) * H(vieille+nouveaux, famille) = 1.459161549e-4
Change = 1.459161549e-4 - 1.505100371e-4 = -4.5938821e-6
To recap: Delta(vieille+nouveaux) = -4.7734297e-5
reduces the entropy more than
Delta(vieille+nouveaux, famille) = -4.5938821e-6
i.e. entropy increases if the word-pairs are merged, the words are separated.
======================================================================
Minimal morphology output
;; Let's say that there was one word in the sentence, it was 'foobar'
;; and the splitter split it into foo and bar
;; then the following should be generated:
;; for each sentence, create one of these, each with a distinct uuid:
(ParseLink (stv 1 1)
(ParseNode "sentence@fc98a97a-4753-45d9-be5b-1c752b5b21d9_parse_0")
(SentenceNode "sentence@fc98a97a-4753-45d9-be5b-1c752b5b21d9")
)
;; For each pair of morphemes, create the below:
(EvaluationLink (stv 1.0 1.0)
(LinkGrammarRelationshipNode "MOR")
(ListLink
(WordInstanceNode "foo@5e179119-3966-4bb9-8a38-ef2014b48f12")
(WordInstanceNode "bar@cb2443bb-fbec-472c-baee-36b822579861")
)
)
;; For each "word" aka morpheme, create these two clauses:
;; note that the UUID's match up exactly with the above.
;; the below shows only "foo", another pair is needed for "bar".
(ReferenceLink (stv 1.0 1.0)
(WordInstanceNode "foo@5e179119-3966-4bb9-8a38-ef2014b48f12")
(WordNode "foo")
)
(WordInstanceLink (stv 1.0 1.0)
(WordInstanceNode "foo@5e179119-3966-4bb9-8a38-ef2014b48f12")
(ParseNode "sentence@fc98a97a-4753-45d9-be5b-1c752b5b21d9_parse_0")
)
;; finally, at the very end:
;; again, the UUID must match with what was given above.
(ListLink (stv 1 1)
(AnchorNode "# New Parsed Sentence")
(SentenceNode "sentence@fc98a97a-4753-45d9-be5b-1c752b5b21d9")
)
======================================================================
Tagalog status:
31 July 2014
4519631 = 4.5M morpheme pairs
204K morphemes
======================================================================
Setup, July 2015
----------------
LXC container on gnucash.org
AtomSpace.cc line 303
LXC container on backlot
------------------------
nlp-base and nlp-server (currently used by rohit)
morf-server (currently used by ainish)
LXC container on fanny
----------------------
cd src/learn
./run-all-servers.sh
tmux attach
psql en_pairs
======================================================================
======================================================================
31 Dec 2016
-----------
psql en_pairs
\dt
select count(*) from atoms;
18487291
loadmodule libPersistModule.so
sql-open learn-pairs linas asdf
sql-open en-pairs linas asdf
password authentication failed for user "linas"
sql-open opencog_test opencog_tester cheese
/etc/postgresql/9.6/main/pg_hba.conf looks OK...
So: I have en_pairs and
opencog_test | linas | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
but the owner of the tables is opencog_tester
/var/log/postgresql/postgresql-9.6-main.log
2017-01-03 16:15:50 CST [2856-1] linas@en_pairs FATAL: password authentication
failed for user "linas"
2017-01-03 16:15:50 CST [2856-2] linas@en_pairs DETAIL: User "linas" has no password
assigned.
2017-01-01 17:20:54 CST [4829-1] opencog_tester@opencog_test ERROR: insert or update
on table "atoms" violates foreign key constraint "atoms_space_fkey"
2017-01-01 17:20:54 CST [4829-2] opencog_tester@opencog_test DETAIL: Key (space)=(2)
is not present in table "spaces".
(use-modules (opencog persist-sql))
(sql-open "en-pairs" "learner" "asdf")
\du
alter user learner password 'asdf';
grant CONNECT ON DATABASE en_pairs to learner;
grant SELECT,INSERT,UPDATE on table atoms to learner;
sql-open en-pairs learner asdf
it worked!
(sql-load)
18 million atoms
Loaded 9045489 atoms at height 2
Finished loading 18487291 atoms in total
12:35 to load .. !? 18487291 atoms/755 secs = 24.5K atoms/sec
psql -h localhost -U ubuntu lt_pairs
ALTER USER ubuntu PASSWORD 'asdf';
========================================================
-----------------------------------------
lxc -- create an all-updated opencog-base
lxc-start -n opencog-base --daemon
time lxc-copy -n opencog-learn -N learn-lt
------------------------------------
4 Jan 2017
----------
https://dumps.wikimedia.org/zhwiki/20170101/
https://dumps.wikimedia.org/zh_yuewiki/20170101/
https://dumps.wikimedia.org/frwiki/20170101/
lynx https://dumps.wikimedia.org/ltwiki/20170101/ltwiki-20170101-pages-articles-multistream.xml.bz2
time cat ltwiki-20170101-pages-articles-multistream.xml.bz2 |bunzip2 |/home/ubuntu/src/relex/src/perl/wiki-scrub.pl
real 4m58.871s
user 5m35.652s
sys 0m11.700s
find |wc gives 209011 total articles
find |wc gives 178514 after cat/template removal
createdb lt_pairs
createdb lt_morph
cat opencog/persist/sql/odbc/atom.sql | psql lt_pairs
cat opencog/persist/sql/odbc/atom.sql | psql lt_morph
=============================================================
time cat zh_yuewiki-20170101-pages-articles.xml.bz2 |bunzip2 |/home/ubuntu/src/relex/src/perl/wiki-scrub.pl
about 48 seconds
find |wc gives 67363 total articles
find |wc gives 49170 after cat/template removal
apt-get install fonts-arphic-ukai fonts-arphic-uming fonts-babelstone-han
fonts-wqy-zenhei fonts-hanazono
fonts-arphic-bkai00mp
fonts-arphic-bsmi00lp
fonts-arphic-gbsn00lp
fonts-arphic-gkai00mp
Arghh. None of the above provide the Kangxi radicals for the terminal.
Which I think are coming from fonts-wqy-microhei
U+2F13 Kangxi Radicals,
U+42AA U+4401 CJK_Ext_A CJK-Ext.A
createdb yue_pairs
cd ~/src/atomspace
cat opencog/persist/sql/odbc/atom.sql | psql yue_pairs
。。
\p{Block: CJK}
\p{Block=CJK_Symbols_And_Punctuation}
\p{Punct}
\p{InCJK}
\p{Close_Punctuation} aka \p{Pe} (close paren)
\p{Final_Punctuation} aka \p{Pf} more quote-close or open.. things
\p{Ps} open quote
full stop
relex-server-port relex-server-host
; -- count-all -- Return the total number of atoms in the atomspace.
; -- cog-get-atoms -- Return a list of all atoms of type 'atom-type'
; -- cog-prt-atomspace -- Prints all atoms in the atomspace
; -- cog-count-atoms -- Count of the number of atoms of given type.
; -- cog-report-counts -- Return an association list of counts.
wtf
(define (foo atom) (display "duude\n")(display atom) (newline) #f)
WARNING: No known abbreviations for language 'yue', attempting fall-back
to English version.. FIXED
odbc is still logging! FIXED
CommLog = No in /etc/odbcinst.ini
don't use "foo", it prints a warning .. better yet, don't warn! FIXED
below is due to bug opencog/relex#248 and is now fixed.
It needed a new link-grammar version
link-grammar: Error: EMPTY-WORD.zzz must be defined!
link-grammar: Error: Word 'EMPTY-WORDzzz': Internal error: NULL X_node
link-grammar: Error: sentence_split(): Internal error detected
Warning: No parses found for:
港 區 全 國 人 大 代 表 係 代 表 香 港 居 民 響 中 華 人 民 共 和 國 全 國 人 民 代 表 大 會 行 使 國 家 立 法 權 嘅 代 表 , 名 額 36 人 (1997 年 香 港 主 權 移 交 之 後 )。
link-grammar: Error: EMPTY-WORD.zzz must be defined!
link-grammar: Error: Word 'EMPTY-WORDzzz': Internal error: NULL X_node
link-grammar: Error: sentence_split(): Internal error detected
Warning: No parses found for:
深 圳 習 慣 叫 特 區 範 圍 做 「 關 內 」, 而 特 區 範 圍 之 外 嘅 ,
包 括 寶 安 區 、 龍 崗 區 同 光 明 新 區 、 坪 山 新 區 就 叫 「 關 外
」; 由 「 關 外 」 入 特 區 叫 「 入 關 」, 反 之 係 「 出 關 」。
<title>永利街</title> contains junk yes it does...
Started 5 Jan 2017 16:00 exactly.
ten minutes later: 5048 atoms -- so 500 atoms per minute...
after some halts and hiccups:
2211 articles after 1 hour = 37 articles/minute
18529 atoms after about 1 hour ...
or about 8.4 atoms per article...
There are only about 48K articles, so it should conclude in 24 hours
...!?
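The back-of-envelope estimate above, written out as a Python sketch. It uses the figures from these notes and naively assumes the parse rate stays constant, which the "halts and hiccups" make doubtful:

```python
# Figures from the notes above; the article total is the count after
# category/template removal.
articles_total = 49170
articles_done = 2211
atoms_so_far = 18529
elapsed_minutes = 60

articles_per_minute = articles_done / elapsed_minutes
atoms_per_article = atoms_so_far / articles_done

# Naive projection: assumes the parse rate stays constant.
remaining = articles_total - articles_done
eta_hours = remaining / articles_per_minute / 60

print(f"{articles_per_minute:.0f} articles/min, "
      f"{atoms_per_article:.1f} atoms/article, "
      f"about {eta_hours:.0f} hours to go")
```

This gives a bit over 21 hours for the remaining articles, in line with the "24 hours" guess.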
hours later... 219127 atoms 6345 articles done ...
Now its about 34.5 atoms per article.. whoa ...
java claims to have parsed 12424 sentences
11459 articles processed.
457009 atoms
29291 articles processed 19919 remaining
668042 atoms ...
; -- cog-report-counts -- Return an association list of counts.
(count-all)
(cog-report-counts)
(gc-stats)
... ram usage slowly increasing...
(gc-stats)
((gc-time-taken . 315428672862) (heap-size . 3479842816) (heap-free-size
. 1406132224) (heap-total-allocated . 132540491040)
(heap-allocated-since-gc . 906048240) (protected-objects . 500)
(gc-times . 414))
(gc-stats)
((gc-time-taken . 327615529234) (heap-size . 3582787584) (heap-free-size
. 1584795648) (heap-total-allocated . 138942040624)
(heap-allocated-since-gc . 1123554048) (protected-objects . 500)
(gc-times . 422))
((gc-time-taken . 491224665617) (heap-size . 4601581568) (heap-free-size
. 2257489920) (heap-total-allocated . 211782122608)
(heap-allocated-since-gc . 399939472) (protected-objects . 500)
(gc-times . 485))
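Comparing `(gc-stats)` snapshots by eye is tedious. Here is a small Python sketch that parses two of the printed association lists above and reports the differences. The helper names `parse_gc_stats` and `in_use` are invented for this note, and the sketch makes no claim about the units of each field:

```python
import re

def parse_gc_stats(text):
    """Turn a printed (gc-stats) alist into a dict of integers."""
    pairs = re.findall(r"\(([a-z-]+)\s+\.\s+(\d+)\)", text)
    return {key: int(value) for key, value in pairs}

# First and last snapshots, pasted verbatim from the notes above.
first = parse_gc_stats("""
((gc-time-taken . 315428672862) (heap-size . 3479842816) (heap-free-size
. 1406132224) (heap-total-allocated . 132540491040)
(heap-allocated-since-gc . 906048240) (protected-objects . 500)
(gc-times . 414))
""")
last = parse_gc_stats("""
((gc-time-taken . 491224665617) (heap-size . 4601581568) (heap-free-size
. 2257489920) (heap-total-allocated . 211782122608)
(heap-allocated-since-gc . 399939472) (protected-objects . 500)
(gc-times . 485))
""")

def in_use(stats):
    """Heap size minus the part reported as free."""
    return stats["heap-size"] - stats["heap-free-size"]

heap_growth = last["heap-size"] - first["heap-size"]
in_use_growth = in_use(last) - in_use(first)
collections = last["gc-times"] - first["gc-times"]

print(heap_growth, in_use_growth, collections)
```

Between these two snapshots the heap grew by far more than the in-use part did, over 71 collections.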
if (number-of-cells-collected-recently < GUILE_MIN_YIELD_X)
then
allocate-new-heap
else
run-a-collection
`scm_i_gc_grow_heap_p ()' and `scm_gc_for_newcell ()'.)
(WordSequenceLink lots of these ...
gcprof procedure in the statprof library
https://www.gnu.org/software/guile/manual/html_node/Statprof.html
guile-yue> (statprof-display)
% cumulative self self total
time seconds seconds calls ms/call ms/call name
49.18 11506.88 11506.88 13516 851.35 851.35 inc
4.92 23397.33 1150.69 27 42618.08 866567.64 catch
4.92 12657.57 1150.69 215 5352.04 58872.42 cog-map-type
3.28 767.13 767.13 1420 540.23 540.23 char=?
3.28 767.13 767.13 619 1239.30 1239.30 write-char
3.28 767.13 767.13 372 2062.17 2062.17 memq
3.28 767.13 767.13 182 4214.97 4214.97 call-with-output-string
1.64 2684.94 383.56 182 2107.49 14752.41 tilde-dispatch
1.64 383.56 383.56 29 13226.30 13226.30 close-port
1.64 383.56 383.56 240 1598.18 1598.18 assv-ref
...
0.00 12657.57 0.00 215 0.00 58872.42 cog-count-atoms
above over about 24K seconds total, so accurate... ish
(use-modules (statprof))
(statprof-reset 0 50000 #t) ;
(statprof-start)
(do-something)
(statprof-stop)
(statprof-display)
(gcprof (λ () (observe-text "1769 年 : 伊 萬 克 雷 洛 夫 , 俄 國 寓 言 作 家 1910 年 : 威 廉 肖 克 利 (William Shockley), 美 國 物 理 學 家 , 有 份 發 明 半 導 體 ,1956 年 諾 貝 爾 物 理 獎 得 主 1915 年 : 昂 山 , 緬 甸 國 父 1921 年 : 趙 無 極 , 法 國 華 裔 畫 家 1974 年 :Robbie Williams, 英 國 歌 手 1974 年 : 馬 國 明 , 香 港 無 綫 電 視 演 員 1981 年 : 何 紫 綸 , 香 港 模 特 兒 1990 年 : 西 藏 第 十 一 世 班 禪 額 爾 德 尼 金 瑞 瑤 , 台 灣 音 樂 經 理 人 1993 年 : 宋 希 濂 , 抗 日 戰 爭 同 國 共 內 戰 時 期 中 國 國 民 黨 將 軍 2006 年 : 王 選 , 中 國 計 算 機 學 者 , 發 明 漢 字 激 光 照 排 技 術")))
Maybe use it in "observe-text"? ...
total time is correct...
... its a thread thing. statprof with threads is borked.
See comments in ./module/statprof.scm ~ Implementation notes ~
compute-mi.scm: (for-each inc atom-list)
compute-mi.scm: (define (inc atom) (set! cnt (+ cnt (tv-count
(cog-tv atom)))))
Maybe lots and lots of threads ... ? Seems to get very backed-up.
No .. only 14 threads
(hash-map->list cons (module-obarray (current-module)))
(module-map (λ (sym var) sym) (resolve-interface '(guile)))
(module-map (λ (sym var) sym) (resolve-interface '(opencog)))
(module-map (λ (sym var) sym) (resolve-interface '(opencog learn)))
who is using a module?
(module-uses (resolve-module '(guile-user)))
>>>> excellent for modules!
http://git.net/ml/guile-user-gnu/2016-06/msg00040.html
101000 26627 256 11.9 15129736 11795708 pts/6 Sl+ 18:41 516:33 guile
-l pair-count-yue.scm
not being split: FIXED.
fix is
$text =~ s/([\.?!]) *(\p{InCJK})/$1\n$2/g;
呢度啲路順序係從南去到北嚟排列嘅,其中加粗咗嘅字係主幹道:美華北路.新河浦二橫路.新河浦五橫路.新慶路.煙墩路、寺右新馬路.寺貝通津.共和大街.松崗東.共和西路.中山一路.
2012年,《向前走向愛走》.第四十五屆金鐘獎個人獎戲劇節目女主角獎.郭采潔官方網站.
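For checking the splitting rule outside of Perl, here is a rough Python equivalent of the substitution above. Python's `re` module has no `\p{InCJK}`, so this sketch substitutes the explicit range U+4E00 to U+9FFF (CJK Unified Ideographs). That range is an assumption about what the Perl property covers, and the function name `split_after_stop` is made up:

```python
import re

# Stand-in for Perl's \p{InCJK}; the exact range is an assumption.
CJK = "\u4e00-\u9fff"

def split_after_stop(text):
    """Insert a newline after . ? or ! when a CJK character follows."""
    return re.sub(r"([.?!]) *([" + CJK + r"])", r"\1\n\2", text)

sample = "美華北路.新河浦二橫路.新河浦五橫路."
result = split_after_stop(sample)
print(result)
```

ASCII text such as "Hello. World" is left alone, since the character after the stop is not in the CJK range.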
No database persistant storage configured! Use the STORAGE config
keyword to define.
Java gets slower and slower
=========================
replace call to scm_gc_register_collectable_memory by call to
scm_gc_register_allocation(size)
whoa ---
GC Warning: Repeated allocation of very large block (appr. size 27369472):
May lead to memory leak and poor performance.
GC Warning: Repeated allocation of very large block (appr. size 28766208):
May lead to memory leak and poor performance.
Loaded 280000 atoms.
GC Warning: Repeated allocation of very large block (appr. size 28766208):
May lead to memory leak and poor performance.
GC Warning: Repeated allocation of very large block (appr. size 28766208):
May lead to memory leak and poor performance.
Loaded 270000 atoms.
GC Warning: Repeated allocation of very large block (appr. size 28766208):
May lead to memory leak and poor performance.
GC Warning: Repeated allocation of very large block (appr. size 14385152):
May lead to memory leak and poor performance.
Loaded 260000 atoms.
================================
fresh, guile-2.0
(gc-stats)
$6 = ((gc-time-taken . 114568428) (heap-size . 14409728) (heap-free-size .
2711552) (heap-total-allocated . 18881904) (heap-allocated-since-gc .
1054528) (protected-objects . 137) (gc-times . 14))
after half-minute:
(gc-stats)
((gc-time-taken . 7534031063) (heap-size . 19734528) (heap-free-size
. 5259264) (heap-total-allocated . 939889168) (heap-allocated-since-gc .
1657120) (protected-objects . 143) (gc-times . 326))
guile> foo
(ConceptNode "foo" (ctv 0 0 2520410))
6861 101000 20 0 735404 42952 16680 R 109.9 0.0 14:33.40 guile
6861 101000 20 0 735404 42952 16680 R 114.5 0.0 44:35.68 guile
(gc-stats)
((gc-time-taken . 148069551756) (heap-size . 19734528) (heap-free-size .
2740224) (heap-total-allocated . 28219212944) (heap-allocated-since-gc .
790384) (protected-objects . 143) (gc-times . 4930))
guile> foo
(ConceptNode "foo" (ctv 0 0 54310643))
replace call to scm_gc_register_collectable_memory by call to
scm_gc_register_allocation(size)
static std::atomic<size_t> _tv_pend_cnt;
static std::atomic<size_t> _tv_total_cnt;
static std::atomic<size_t> _tv_pend_sz;
static std::atomic<size_t> _tv_total_sz;
(define (inc atom) (cog-set-tv! atom (cog-new-ctv 0 0 (+ 1 (tv-count (cog-tv
atom))))))
scheme@(guile-user)>
scheme@(guile-user)> (define foo (Concept "foo"))
scheme@(guile-user)> (define (loo) (inc foo) (loo))
scheme@(guile-user)> (loo)
duuude its pend cnt=11425 (274200) tot=1400000 (33600000)
duuude its pend cnt=14379 (345096) tot=1500000 (36000000)
duuude its pend cnt=15600 (374400) tot=2200000 (52800000)
duuude its pend cnt=8047 (193128) tot=48300000 (1159200000)
duuude its pend cnt=1230 (29520) tot=49300000 (1183200000)
duuude its pend cnt=12432 (298368) tot=49400000 (1185600000)
duuude its pend cnt=19857 (476568) tot=50000000 (1200000000)
(gc-stats)
((gc-time-taken . 54370538078) (heap-size . 10948608) (heap-free-size . 1925120)
(heap-total-allocated . 8532174704) (heap-allocated-since-gc . 10336)
(protected-objects . 7) (gc-times . 2051))
guile-yue> foo
(ConceptNode "foo" (ctv 0 0 26737610))
4617 linas 20 0 847596 50284 27072 R 133.2 0.1 5:26.89 guile
(ConceptNode "foo" (ctv 0 0 46416463))
duuude its pend cnt=20565 (493560) tot=94300000 (2263200000)
so -- 46M incrs but 94M take-tvs -- so two takes for each incr.
-- one to get the value, one to set the value.
4617 linas 20 0 847584 50612 27156 R 135.2 0.1 23:46.94 guile
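(define (rate)
   ; Increment the count on one atom forever, printing the observed
   ; rate (increments per second) every 100K increments.
   (define shu (Concept "shu"))
   (define cnt 0)
   ; Back-date the start by 0.1 sec, to avoid a divide-by-zero on the
   ; first print.
   (define start (- (current-time) 0.1))
   (define (finc atom)
      (if (= 0 (modulo cnt 100000))
         (begin (display "rate=")
            (display (/ cnt (- (current-time) start))) (newline)))
      (set! cnt (+ cnt 1))
      (cog-set-tv! atom (cog-new-ctv 0 0 (+ 1 (tv-count (cog-tv atom))))))
   (define (floo) (finc shu) (floo))
   (floo)
)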
(define (inc atom) (cog-set-tv! atom (cog-new-ctv 0 0 (+ 1 (tv-count (cog-tv atom))))))
(define foo (Concept "foo"))
(define (loo) (inc foo) (loo))
(statprof-stop)
(statprof-display)
with the atomics: rate == about 145.5K/sec
without the atomics: about 103.2K/sec !!
again with atomics: rate == 125K/sec !! wtf .. why not as high as before?
stop restart, rate=130K ...
stop restart - rate= 107K ... wtf
stop, restart = 109K
stop restart = 107K glargle
again --- without atomics:
rate = 147K dafuq
stop restart = 151K
stop restart = 79K crazy shit
stop restart = 141K this is so not making sense, except as a
crazy cache-line issue.
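One way to judge whether the atomics matter at all is to compare the gap between the two averages with the run-to-run scatter. A minimal Python sketch over the rates recorded above (in K increments per second); grouping the runs this way is my reading of the notes:

```python
from statistics import mean, stdev

# Rates in K increments/sec, copied from the runs above.
with_atomics = [145.5, 125, 130, 107, 109, 107]
without_atomics = [103.2, 147, 151, 79, 141]

gap = abs(mean(with_atomics) - mean(without_atomics))
noise = min(stdev(with_atomics), stdev(without_atomics))

print(f"with:    mean={mean(with_atomics):.1f} stdev={stdev(with_atomics):.1f}")
print(f"without: mean={mean(without_atomics):.1f} stdev={stdev(without_atomics):.1f}")
print("gap smaller than run-to-run noise:", gap < noise)
```

The gap between the two means is well inside the scatter of either set of runs, so these measurements cannot tell the two builds apart.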
clean start: without atomics
(gc-stats)
((gc-time-taken . 845338507) (heap-size . 5963776) (heap-free-size . 421888)
(heap-total-allocated . 73153424) (heap-allocated-since-gc . 902704)
(protected-objects . 7) (gc-times . 65))
(gc-stats)
((gc-time-taken . 605534396784) (heap-size . 6025216) (heap-free-size . 356352)
(heap-total-allocated . 29673578224) (heap-allocated-since-gc . 361904)
(protected-objects . 7) (gc-times . 32428))
no growth at all.
OK, so ... a leak in sql?
a leak in TLB!! ... no because that doesn't explain guile heap...
unless guile heap is confused...
on startup:
(gc-stats)
$1 = ((gc-time-taken . 170774772) (heap-size . 15364096) (heap-free-size .
3166208) (heap-total-allocated . 18421440) (heap-allocated-since-gc .
770768) (protected-objects . 149) (gc-times . 15))
wtf .. why no printing?
_tv_pend_cnt++;
_tv_pend_sz += sizeof(*tv);
// _tv_total_cnt++;
_tv_total_sz += sizeof(*tv);
if (0 == ((size_t) (_tv_total_cnt.fetch_add(1))) % 100000) {
printf("duuude its pend cnt=%lu (%lu) tot=%lu (%lu)\n",
(size_t) _tv_pend_cnt, (size_t) _tv_pend_sz, (size_t) _tv_total_cnt,
(size_t) _tv_total_sz);
logger().info("duuude its pend cnt=%lu (%lu) tot=%lu (%lu)",
(size_t) _tv_pend_cnt, (size_t) _tv_pend_sz, (size_t) _tv_total_cnt,
(size_t) _tv_total_sz);
}
OK, so its not the TV ... (not the TV in guile gc)
So maybe its prim environ?? Nooo not that either
Maybe handles?? (in guile) no its not that. (not in guile gc)
well, its not the TLB...
and not the atoms ... TLB has 400K entries, with 18MB of pairs
atoms allocated are 466286 for 63414896 = 63MBytes but guile is
1.6GB resident, 9.3GB virt.... wtf...
each atom is 136 bytes ex tv.
2.9 gb resident, but 1.6M atoms for 215MB size, and 64MB of tlb contents
what about atomspace? only 19922 atoms in atomspace...
heap size is 2GB ...
Maybe stub out capture-stack? it was the culprit before...
Nope seems to make no difference.
are we leaking SCM values somewhere? How?
misc_to_string ? no, code audit.
scm_to_utf8_string no, code audit...
--------------------------------------------------------------
try guile-2.2 from git
Great. that seg-faults... maybe some other version doesn't ...
try 2.1.5 ? 2.1.4 ? No, because even though it segfaulted
it did seem to also grow.
Seg-faults twice in a row, within 10 minutes wall-clock time
(about 36 mins cpu time).
---------------------------------------------------------------
Try below. ... It does not leak.
(use-modules (opencog) (opencog cogserver))
(start-cogserver)
(define (slu)
   ; Create atoms in an endless loop; every 100K iterations, print the
   ; creation rate and extract all ListLinks and ConceptNodes, so that
   ; the atomspace stays small.
   (define cnt 0)
   (define start (- (current-time) 0.1))
   (define (mka)
      (if (= 0 (modulo cnt 100000))
         (begin (display "rate=")
            (display (/ cnt (- (current-time) start))) (newline)
            (cog-map-type (lambda (ato) (cog-extract ato) #f) 'ListLink)
            (cog-map-type (lambda (ato) (cog-extract ato) #f) 'ConceptNode)
         ))
      (set! cnt (+ cnt 1))
      (ListLink
         (ConceptNode (string-append "concepto " (number->string cnt )))
         (ConceptNode (string-append "glorg " (number->string cnt )))))
   (define (aloo) (mka) (aloo))
   (aloo)
)
(count-all)
(cog-report-counts)
(gc-stats)
((gc-time-taken . 6660684289) (heap-size . 15646720) (heap-free-size . 3055616)
(heap-total-allocated . 861940880) (heap-allocated-since-gc . 5203440)
(protected-objects . 7) (gc-times . 298))
((gc-time-taken . 15414762477) (heap-size . 16101376) (heap-free-size . 2859008)
(heap-total-allocated . 2665977008) (heap-allocated-since-gc . 5064624)
(protected-objects . 7) (gc-times . 562))
rate=47.3K (concept only)
rate=15.7K (listlinks+concepts)
---------------------------------------------------------------
/tmp/bang.sh
#!/bin/bash
i=0
while true ; do
let i=$i+1
if [ "$(($i % 2000))" -eq "0" ] ; then
echo loop $i
fi
echo '(display ctr)' | nc localhost 17001
# echo '(NumberNode ctr)' | nc 10.0.3.239 17001
# echo '(NumberNode' $i ')' | nc 10.0.3.239 17001
# echo '(NumberNode 42)' | nc localhost 17001
echo '(ConceptNode "fooo ' $i $$ ' you too")' | nc localhost 17001 >> /dev/null
done
run 10 copies of above.
--- no leak ... and no crash... so this is very stable. wtf.
---------------------------------------------------------------
OK, so lets try the full pipeline.
but without updates
Whoops. Its blowing up
((gc-time-taken . 8772615322) (heap-size . 820801536) (heap-free-size . 90624000)