PyTorch version: 2.7.0+cpu ZenTorch version: 5.1.0 Transformers version: 4.54.1 Loading pre-quantized model... Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.80it/s] ✓ Pre-quantized model loaded successfully. Loading tokenizer... Applying ZenTorch 5.1 optimizations... [WARNING zentorch.llm._checks - essential_checks:68] The supported datatype for the most optimal performance with zentorch is torch.bfloat16. ✓ ZenTorch optimization completed successfully Processing question: What is 2+2? Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation. [API:I][0.000011] CPU Engine create [CORE:V0][0.000001] CPU Engine created [engine] [CORE:I][0.000008] CPU Engine created [cpu/engine] [API:I][0.000001] Memory create [CORE:V0][0.000001] Memory desc init by tag [memory] [CORE:I][0.000026] Memory created [memory] [API:I][0.000048] Memory create [CORE:V0][0.000033] Memory desc init by tag [memory] [CORE:I][0.000038] Memory created [memory] [API:I][0.000060] Memory create [CORE:V0][0.000045] Memory desc init by tag [memory] [CORE:I][0.000051] Memory created [memory] [API:I][0.000001] matmul desc create - no bias [CORE:I][0.000007] matmul desc init [matmul] [CORE:I][0.000000] CPU Engine: primitive_cache_capacity: 1024 [CORE:V0][0.000001] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.000144] Memory desc init by tag [memory] [CORE:V0][0.000148] Memory desc init by tag [memory] [CORE:V0][0.000152] Memory desc init by tag [memory] ZenDNN Info: Execution has entered the ZenDNN library. 
Optimized deep learning kernels are now active for high-performance inference on AMD CPUs. [CORE:V0][0.000001] ZenDNN Ref gemm_f32_matmul_t::pd_t::init() [CORE:V0][0.000014] ZenDNN Ref gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.000198] matmul primitive_desc create - attr [PROF:I][0.000014] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_bmm,matmul,gemm:jit,undef,src_f32::blocked:abc:f0 wei_f32::blocked:abc:f0 dst_f32::blocked:abc:f0,,,1x64x1:1x1x8:1x64x8,0.020681,ms [API:I][0.000248] matmul primitive create [API:I][0.000250] CPU Stream create [CORE:I][0.000000] CPU Stream created [stream] [CORE:V0][0.000000] CPU Stream created [cpu/stream] [CORE:I][0.000115] ZenDNN Ref gemm_f32_matmul_t::execute_ref [PROF:I][0.014200] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_bmm,matmul,gemm:jit,undef,src_f32::blocked:abc:f0 wei_f32::blocked:abc:f0 dst_f32::blocked:abc:f0,,,1x64x1:1x1x8:1x64x8,14.1607,ms [API:I][0.014649] Memory create [CORE:V0][0.014640] Memory desc init by tag [memory] [CORE:I][0.014646] Memory created [memory] [API:I][0.014668] Memory create - strides [CORE:I][0.014655] Memory desc init by Stride [memory] [CORE:I][0.014659] Memory created [memory] [API:I][0.014681] Memory create [CORE:V0][0.014668] Memory desc init by tag [memory] [CORE:I][0.014671] Memory created [memory] [API:I][0.014694] Memory create [CORE:V0][0.014679] Memory desc init by tag [memory] [CORE:I][0.014683] Memory created [memory] [API:I][0.014706] Memory create - strides [CORE:I][0.014692] Memory desc init by Stride [memory] [CORE:I][0.014696] Memory created [memory] [API:I][0.014718] Memory create [CORE:V0][0.014703] Memory desc init by tag [memory] [CORE:I][0.014709] Memory created [memory] [API:I][0.014731] Memory create [CORE:V0][0.014717] Memory desc init by tag [memory] [CORE:I][0.014721] Memory created [memory] [API:I][0.014742] Memory create - strides [CORE:I][0.014728] Memory desc init by Stride [memory] [CORE:I][0.014731] 
Memory created [memory] [API:I][0.014753] Memory create [CORE:V0][0.014738] Memory desc init by tag [memory] [CORE:I][0.014743] Memory created [memory] [API:I][0.000003] CPU Engine create [CORE:V0][0.014824] CPU Engine created [engine] [CORE:I][0.014829] CPU Engine created [cpu/engine] [API:I][0.000014] CPU Stream create [CORE:I][0.014436] CPU Stream created [stream] [CORE:V0][0.014435] CPU Stream created [cpu/stream] [API:I][0.000028] matmul desc create - no bias [CORE:I][0.014699] matmul desc init [matmul] [CORE:V0][0.014661] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.014793] Memory desc init by tag [memory] [CORE:V0][0.014797] Memory desc init by tag [memory] [CORE:V0][0.014801] Memory desc init by tag [memory] [CORE:V0][0.014804] Memory desc init by tag [memory] [CORE:V0][0.014682] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.000067] matmul primitive_desc create - attr [PROF:I][0.014534] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00664,ms [API:I][0.000094] matmul primitive create [CORE:I][0.014722] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.014725] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.000001] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.001080] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.081ms graph_exe_count=-1 weight_address=0x70de81ffb040 [PROF:I][0.015672] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,1.12648,ms [API:I][0.001238] matmul desc create - no bias 
[CORE:I][0.015909] matmul desc init [matmul] [CORE:V0][0.015871] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.016005] Memory desc init by tag [memory] [CORE:V0][0.016008] Memory desc init by tag [memory] [CORE:V0][0.016012] Memory desc init by tag [memory] [CORE:V0][0.016024] Memory desc init by tag [memory] [CORE:V0][0.015900] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.001285] matmul primitive_desc create - attr [PROF:I][0.015749] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.006011,ms [API:I][0.001309] matmul primitive create [CORE:I][0.015936] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.015939] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.001210] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.001394] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.184ms graph_exe_count=-1 weight_address=0x1514d7c0 [PROF:I][0.015964] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.204907,ms [API:I][0.001527] matmul desc create - no bias [CORE:I][0.016197] matmul desc init [matmul] [API:I][0.001536] matmul primitive_desc create - attr [PROF:I][0.015996] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00204,ms [API:I][0.001554] matmul primitive create [CORE:I][0.016179] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.016182] M: 8 N: 1024 K: 4096 transA: N 
transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.001452] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.001618] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.167ms graph_exe_count=-1 weight_address=0x1614d800 [PROF:I][0.016188] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.184286,ms [PROF:V0][0.001747] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.74805,ms [CORE:I][0.016171] CPU Stream deleted [stream] [CORE:I][0.016575] CPU Engine deleted [engine] [API:I][0.017141] Memory create [CORE:V0][0.017131] Memory desc init by tag [memory] [CORE:I][0.017137] Memory created [memory] [API:I][0.017160] Memory create - strides [CORE:I][0.017146] Memory desc init by Stride [memory] [CORE:I][0.017150] Memory created [memory] [API:I][0.017171] Memory create [CORE:V0][0.017157] Memory desc init by tag [memory] [CORE:I][0.017161] Memory created [memory] [API:I][0.017090] matmul desc create - no bias [CORE:I][0.017088] matmul desc init [matmul] [CORE:V0][0.017050] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.017182] Memory desc init by tag [memory] [CORE:V0][0.017186] Memory desc init by tag [memory] [CORE:V0][0.017189] Memory desc init by tag [memory] [CORE:V0][0.017194] Memory desc init by tag [memory] [CORE:V0][0.017071] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.017127] matmul primitive_desc create - attr [PROF:I][0.016918] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00576,ms 
[API:I][0.017150] matmul primitive create [CORE:I][0.017104] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.017108] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.002379] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.002922] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.544ms graph_exe_count=-1 weight_address=0x70de85ffc040 [PROF:I][0.017492] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.564158,ms [API:I][0.017909] Memory create [CORE:V0][0.017896] Memory desc init by tag [memory] [CORE:I][0.017901] Memory created [memory] [API:I][0.017922] Memory create - strides [CORE:I][0.017907] Memory desc init by Stride [memory] [CORE:I][0.017913] Memory created [memory] [API:I][0.017934] Memory create [CORE:V0][0.017920] Memory desc init by tag [memory] [CORE:I][0.017924] Memory created [memory] [API:I][0.017853] matmul desc create - no bias [CORE:I][0.017851] matmul desc init [matmul] [CORE:V0][0.017811] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.017943] Memory desc init by tag [memory] [CORE:V0][0.017957] Memory desc init by tag [memory] [CORE:V0][0.017961] Memory desc init by tag [memory] [CORE:V0][0.017963] Memory desc init by tag [memory] [CORE:V0][0.017839] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.017896] matmul primitive_desc create - attr [PROF:I][0.017687] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.005371,ms [API:I][0.017919] matmul primitive create [CORE:I][0.017871] zendnn_f32_matmul_t::execute_ref 
[CORE:V0][0.017874] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.003145] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.005262] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=2.117ms graph_exe_count=-1 weight_address=0x70de89ffd040 [PROF:I][0.019832] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,2.1364,ms [API:I][0.020191] Memory create [CORE:V0][0.020178] Memory desc init by tag [memory] [CORE:I][0.020183] Memory created [memory] [API:I][0.020204] Memory create - strides [CORE:I][0.020190] Memory desc init by Stride [memory] [CORE:I][0.020195] Memory created [memory] [API:I][0.020218] Memory create [CORE:V0][0.020203] Memory desc init by tag [memory] [CORE:I][0.020207] Memory created [memory] [API:I][0.020230] Memory create [CORE:V0][0.020215] Memory desc init by tag [memory] [CORE:I][0.020219] Memory created [memory] [API:I][0.020242] Memory create [CORE:V0][0.020228] Memory desc init by tag [memory] [CORE:I][0.020233] Memory created [memory] [API:I][0.020164] matmul desc create - no bias [CORE:I][0.020162] matmul desc init [matmul] [CORE:V0][0.020126] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.020259] Memory desc init by tag [memory] [CORE:V0][0.020262] Memory desc init by tag [memory] [CORE:V0][0.020266] Memory desc init by tag [memory] [CORE:V0][0.020269] Memory desc init by tag [memory] [CORE:V0][0.020145] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.020204] matmul primitive_desc create - attr [PROF:I][0.020355] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.359442,ms [API:I][0.020589] matmul primitive create [CORE:I][0.020545] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.020548] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.005822] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.008636] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=2.816ms graph_exe_count=-1 weight_address=0x70de97ffe040 [PROF:I][0.023208] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,2.83956,ms [API:I][0.023582] Memory create [CORE:V0][0.023569] Memory desc init by tag [memory] [CORE:I][0.023575] Memory created [memory] [API:I][0.023596] Memory create - strides [CORE:I][0.023583] Memory desc init by Stride [memory] [CORE:I][0.023587] Memory created [memory] [API:I][0.023609] Memory create [CORE:V0][0.023595] Memory desc init by tag [memory] [CORE:I][0.023599] Memory created [memory] [API:I][0.023529] matmul desc create - no bias [CORE:I][0.023527] matmul desc init [matmul] [CORE:V0][0.023489] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.023622] Memory desc init by tag [memory] [CORE:V0][0.023625] Memory desc init by tag [memory] [CORE:V0][0.023629] Memory desc init by tag [memory] [CORE:V0][0.023632] Memory desc init by tag [memory] [CORE:V0][0.023510] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.023568] matmul primitive_desc create - attr [PROF:I][0.023358] 
zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00561,ms [API:I][0.023590] matmul primitive create [CORE:I][0.023543] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.023546] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.008818] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.010538] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.72ms graph_exe_count=-1 weight_address=0x70dea5fff040 [PROF:I][0.025109] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.74151,ms [API:I][0.025579] Memory create [CORE:V0][0.025568] Memory desc init by tag [memory] [CORE:I][0.025573] Memory created [memory] [API:I][0.025595] Memory create - strides [CORE:I][0.025581] Memory desc init by Stride [memory] [CORE:I][0.025585] Memory created [memory] [API:I][0.025606] Memory create [CORE:V0][0.025593] Memory desc init by tag [memory] [CORE:I][0.025597] Memory created [memory] [API:I][0.025619] Memory create [CORE:V0][0.025605] Memory desc init by tag [memory] [CORE:I][0.025609] Memory created [memory] [API:I][0.025631] Memory create - strides [CORE:I][0.025616] Memory desc init by Stride [memory] [CORE:I][0.025621] Memory created [memory] [API:I][0.025641] Memory create [CORE:V0][0.025627] Memory desc init by tag [memory] [CORE:I][0.025633] Memory created [memory] [API:I][0.025655] Memory create [CORE:V0][0.025641] Memory desc init by tag [memory] [CORE:I][0.025645] Memory created [memory] [API:I][0.025667] Memory create - strides [CORE:I][0.025652] Memory desc init by 
Stride [memory] [CORE:I][0.025657] Memory created [memory] [API:I][0.025680] Memory create [CORE:V0][0.025667] Memory desc init by tag [memory] [CORE:I][0.025670] Memory created [memory] [API:I][0.010926] CPU Engine create [CORE:V0][0.025748] CPU Engine created [engine] [CORE:I][0.025754] CPU Engine created [cpu/engine] [API:I][0.010941] CPU Stream create [CORE:I][0.025364] CPU Stream created [stream] [CORE:V0][0.025363] CPU Stream created [cpu/stream] [API:I][0.010958] matmul desc create - no bias [CORE:I][0.025628] matmul desc init [matmul] [API:I][0.010970] matmul primitive_desc create - attr [PROF:I][0.025424] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00231,ms [API:I][0.010983] matmul primitive create [CORE:I][0.025610] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.025613] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.010886] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.011433] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.547ms graph_exe_count=-1 weight_address=0x70de4fff6040 [PROF:I][0.026003] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.568939,ms [API:I][0.011564] matmul desc create - no bias [CORE:I][0.026234] matmul desc init [matmul] [API:I][0.011575] matmul primitive_desc create - attr [PROF:I][0.026028] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00188,ms [API:I][0.011586] matmul primitive create [CORE:I][0.026211] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.026214] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.011485] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.011637] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.153ms graph_exe_count=-1 weight_address=0x171558c0 [PROF:I][0.026206] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.169935,ms [API:I][0.011766] matmul desc create - no bias [CORE:I][0.026436] matmul desc init [matmul] [API:I][0.011775] matmul primitive_desc create - attr [PROF:I][0.026226] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0009,ms [API:I][0.011785] matmul primitive create [CORE:I][0.026411] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.026414] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.011684] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.011824] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.14ms graph_exe_count=-1 weight_address=0x18155900 [PROF:I][0.026393] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.157456,ms [PROF:V0][0.011953] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.02686,ms [CORE:I][0.026376] CPU Stream deleted [stream] [CORE:I][0.026781] CPU Engine deleted [engine] [API:I][0.027075] Memory create [CORE:V0][0.027064] Memory desc init by tag [memory] [CORE:I][0.027069] Memory created [memory] [API:I][0.027091] Memory create - strides [CORE:I][0.027076] Memory desc init by Stride [memory] [CORE:I][0.027080] Memory created [memory] [API:I][0.027103] Memory create [CORE:V0][0.027088] Memory desc init by tag [memory] [CORE:I][0.027092] Memory created [memory] [API:I][0.027022] matmul desc create - no bias [CORE:I][0.027020] matmul desc init [matmul] [API:I][0.027033] matmul primitive_desc create - attr [PROF:I][0.026815] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00203,ms [API:I][0.027046] matmul primitive create [CORE:I][0.026999] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.027003] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.012274] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.012859] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.585ms graph_exe_count=-1 weight_address=0x70de53ff7040 [PROF:I][0.027429] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.60572,ms [API:I][0.027857] Memory create [CORE:V0][0.027844] Memory desc init by tag [memory] [CORE:I][0.027849] Memory created [memory] [API:I][0.027870] Memory create 
- strides [CORE:I][0.027856] Memory desc init by Stride [memory] [CORE:I][0.027860] Memory created [memory] [API:I][0.027881] Memory create [CORE:V0][0.027866] Memory desc init by tag [memory] [CORE:I][0.027870] Memory created [memory] [API:I][0.027798] matmul desc create - no bias [CORE:I][0.027795] matmul desc init [matmul] [API:I][0.027808] matmul primitive_desc create - attr [PROF:I][0.027588] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00143,ms [API:I][0.027819] matmul primitive create [CORE:I][0.027772] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.027776] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.013046] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.014818] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.772ms graph_exe_count=-1 weight_address=0x70de57ff8040 [PROF:I][0.029389] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.79358,ms [API:I][0.029747] Memory create [CORE:V0][0.029734] Memory desc init by tag [memory] [CORE:I][0.029738] Memory created [memory] [API:I][0.029759] Memory create - strides [CORE:I][0.029745] Memory desc init by Stride [memory] [CORE:I][0.029750] Memory created [memory] [API:I][0.029771] Memory create [CORE:V0][0.029757] Memory desc init by tag [memory] [CORE:I][0.029760] Memory created [memory] [API:I][0.029783] Memory create [CORE:V0][0.029768] Memory desc init by tag [memory] [CORE:I][0.029772] Memory created [memory] [API:I][0.029796] Memory create [CORE:V0][0.029781] Memory desc init by tag 
[memory] [CORE:I][0.029786] Memory created [memory] [API:I][0.029716] matmul desc create - no bias [CORE:I][0.029714] matmul desc init [matmul] [API:I][0.029728] matmul primitive_desc create - attr [PROF:I][0.029508] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00209,ms [API:I][0.029740] matmul primitive create [CORE:I][0.029693] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.029697] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.014969] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.016763] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.795ms graph_exe_count=-1 weight_address=0x70de65ff9040 [PROF:I][0.031334] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.81606,ms [API:I][0.031697] Memory create [CORE:V0][0.031684] Memory desc init by tag [memory] [CORE:I][0.031689] Memory created [memory] [API:I][0.031710] Memory create - strides [CORE:I][0.031696] Memory desc init by Stride [memory] [CORE:I][0.031701] Memory created [memory] [API:I][0.031722] Memory create [CORE:V0][0.031708] Memory desc init by tag [memory] [CORE:I][0.031712] Memory created [memory] [API:I][0.031643] matmul desc create - no bias [CORE:I][0.031641] matmul desc init [matmul] [API:I][0.031657] matmul primitive_desc create - attr [PROF:I][0.031438] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00206,ms [API:I][0.031670] matmul primitive create [CORE:I][0.031622] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.031625] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.016897] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.018472] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.576ms graph_exe_count=-1 weight_address=0x70de73ffa040 [PROF:I][0.033043] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59631,ms [API:I][0.033505] Memory create [CORE:V0][0.033494] Memory desc init by tag [memory] [CORE:I][0.033499] Memory created [memory] [API:I][0.033520] Memory create - strides [CORE:I][0.033507] Memory desc init by Stride [memory] [CORE:I][0.033511] Memory created [memory] [API:I][0.033532] Memory create [CORE:V0][0.033518] Memory desc init by tag [memory] [CORE:I][0.033522] Memory created [memory] [API:I][0.033544] Memory create [CORE:V0][0.033530] Memory desc init by tag [memory] [CORE:I][0.033534] Memory created [memory] [API:I][0.033555] Memory create - strides [CORE:I][0.033540] Memory desc init by Stride [memory] [CORE:I][0.033544] Memory created [memory] [API:I][0.033567] Memory create [CORE:V0][0.033554] Memory desc init by tag [memory] [CORE:I][0.033558] Memory created [memory] [API:I][0.033580] Memory create [CORE:V0][0.033565] Memory desc init by tag [memory] [CORE:I][0.033569] Memory created [memory] [API:I][0.033590] Memory create - strides [CORE:I][0.033575] Memory desc init by 
Stride [memory] [CORE:I][0.033580] Memory created [memory] [API:I][0.033601] Memory create [CORE:V0][0.033586] Memory desc init by tag [memory] [CORE:I][0.033590] Memory created [memory] [API:I][0.018845] CPU Engine create [CORE:V0][0.033668] CPU Engine created [engine] [CORE:I][0.033671] CPU Engine created [cpu/engine] [API:I][0.018858] CPU Stream create [CORE:I][0.033281] CPU Stream created [stream] [CORE:V0][0.033281] CPU Stream created [cpu/stream] [API:I][0.018874] matmul desc create - no bias [CORE:I][0.033545] matmul desc init [matmul] [API:I][0.018888] matmul primitive_desc create - attr [PROF:I][0.033342] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0024,ms [API:I][0.018901] matmul primitive create [CORE:I][0.033527] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.033531] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.018805] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.019296] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.492ms graph_exe_count=-1 weight_address=0x70ddf9fed040 [PROF:I][0.033867] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.515577,ms [API:I][0.019428] matmul desc create - no bias [CORE:I][0.034098] matmul desc init [matmul] [API:I][0.019439] matmul primitive_desc create - attr [PROF:I][0.033891] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00136,ms [API:I][0.019450] matmul primitive create [CORE:I][0.034076] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.034079] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.019349] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.019491] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.142ms graph_exe_count=-1 weight_address=0x1b15da40 [PROF:I][0.034060] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.160166,ms [API:I][0.019621] matmul desc create - no bias [CORE:I][0.034290] matmul desc init [matmul] [API:I][0.019628] matmul primitive_desc create - attr [PROF:I][0.034080] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00139,ms [API:I][0.019639] matmul primitive create [CORE:I][0.034263] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.034266] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.019537] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.019680] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.144ms graph_exe_count=-1 weight_address=0x1c15da80 [PROF:I][0.034250] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.162465,ms [PROF:V0][0.019810] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,0.964111,ms [CORE:I][0.034234] CPU Stream deleted [stream] [CORE:I][0.034638] CPU Engine deleted [engine] [API:I][0.034905] Memory create [CORE:V0][0.034894] Memory desc init by tag [memory] [CORE:I][0.034899] Memory created [memory] [API:I][0.034921] Memory create - strides [CORE:I][0.034907] Memory desc init by Stride [memory] [CORE:I][0.034910] Memory created [memory] [API:I][0.034931] Memory create [CORE:V0][0.034916] Memory desc init by tag [memory] [CORE:I][0.034919] Memory created [memory] [API:I][0.034849] matmul desc create - no bias [CORE:I][0.034846] matmul desc init [matmul] [API:I][0.034860] matmul primitive_desc create - attr [PROF:I][0.034640] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00175,ms [API:I][0.034882] matmul primitive create [CORE:I][0.034837] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.034840] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.020111] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.020617] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.505ms graph_exe_count=-1 weight_address=0x70ddfdfee040 [PROF:I][0.035187] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.526268,ms [API:I][0.035588] Memory create [CORE:V0][0.035575] Memory desc init by tag [memory] [CORE:I][0.035579] Memory created [memory] [API:I][0.035600] Memory 
create - strides [CORE:I][0.035587] Memory desc init by Stride [memory] [CORE:I][0.035591] Memory created [memory] [API:I][0.035612] Memory create [CORE:V0][0.035597] Memory desc init by tag [memory] [CORE:I][0.035602] Memory created [memory] [API:I][0.035533] matmul desc create - no bias [CORE:I][0.035530] matmul desc init [matmul] [API:I][0.035544] matmul primitive_desc create - attr [PROF:I][0.035323] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00136,ms [API:I][0.035554] matmul primitive create [CORE:I][0.035509] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.035513] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.020783] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.022501] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.718ms graph_exe_count=-1 weight_address=0x70de01fef040 [PROF:I][0.037072] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.73928,ms [API:I][0.037431] Memory create [CORE:V0][0.037418] Memory desc init by tag [memory] [CORE:I][0.037422] Memory created [memory] [API:I][0.037443] Memory create - strides [CORE:I][0.037428] Memory desc init by Stride [memory] [CORE:I][0.037433] Memory created [memory] [API:I][0.037455] Memory create [CORE:V0][0.037442] Memory desc init by tag [memory] [CORE:I][0.037446] Memory created [memory] [API:I][0.037468] Memory create [CORE:V0][0.037453] Memory desc init by tag [memory] [CORE:I][0.037457] Memory created [memory] [API:I][0.037481] Memory create [CORE:V0][0.037466] Memory desc init by 
tag [memory] [CORE:I][0.037470] Memory created [memory] [API:I][0.037401] matmul desc create - no bias [CORE:I][0.037398] matmul desc init [matmul] [API:I][0.037413] matmul primitive_desc create - attr [PROF:I][0.037194] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00165,ms [API:I][0.037427] matmul primitive create [CORE:I][0.037381] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.037384] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.022656] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.024453] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.797ms graph_exe_count=-1 weight_address=0x70de0fff0040 [PROF:I][0.039023] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.81764,ms [API:I][0.039385] Memory create [CORE:V0][0.039371] Memory desc init by tag [memory] [CORE:I][0.039376] Memory created [memory] [API:I][0.039397] Memory create - strides [CORE:I][0.039382] Memory desc init by Stride [memory] [CORE:I][0.039386] Memory created [memory] [API:I][0.039407] Memory create [CORE:V0][0.039392] Memory desc init by tag [memory] [CORE:I][0.039396] Memory created [memory] [API:I][0.039325] matmul desc create - no bias [CORE:I][0.039322] matmul desc init [matmul] [API:I][0.039337] matmul primitive_desc create - attr [PROF:I][0.039117] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00162,ms [API:I][0.039348] matmul primitive create [CORE:I][0.039303] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.039307] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.024577] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.026214] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.637ms graph_exe_count=-1 weight_address=0x70de1dff1040 [PROF:I][0.040785] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.65814,ms [API:I][0.041267] Memory create [CORE:V0][0.041256] Memory desc init by tag [memory] [CORE:I][0.041261] Memory created [memory] [API:I][0.041282] Memory create - strides [CORE:I][0.041267] Memory desc init by Stride [memory] [CORE:I][0.041272] Memory created [memory] [API:I][0.041293] Memory create [CORE:V0][0.041279] Memory desc init by tag [memory] [CORE:I][0.041283] Memory created [memory] [API:I][0.041305] Memory create [CORE:V0][0.041290] Memory desc init by tag [memory] [CORE:I][0.041294] Memory created [memory] [API:I][0.041315] Memory create - strides [CORE:I][0.041300] Memory desc init by Stride [memory] [CORE:I][0.041303] Memory created [memory] [API:I][0.041325] Memory create [CORE:V0][0.041310] Memory desc init by tag [memory] [CORE:I][0.041313] Memory created [memory] [API:I][0.041335] Memory create [CORE:V0][0.041321] Memory desc init by tag [memory] [CORE:I][0.041325] Memory created [memory] [API:I][0.041345] Memory create - strides [CORE:I][0.041331] Memory desc init by 
Stride [memory] [CORE:I][0.041335] Memory created [memory] [API:I][0.041355] Memory create [CORE:V0][0.041340] Memory desc init by tag [memory] [CORE:I][0.041344] Memory created [memory] [API:I][0.026600] CPU Engine create [CORE:V0][0.041422] CPU Engine created [engine] [CORE:I][0.041426] CPU Engine created [cpu/engine] [API:I][0.026611] CPU Stream create [CORE:I][0.041033] CPU Stream created [stream] [CORE:V0][0.041032] CPU Stream created [cpu/stream] [API:I][0.026626] matmul desc create - no bias [CORE:I][0.041296] matmul desc init [matmul] [API:I][0.026639] matmul primitive_desc create - attr [PROF:I][0.041092] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00181,ms [API:I][0.026651] matmul primitive create [CORE:I][0.041278] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.041282] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.026555] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.027049] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.494ms graph_exe_count=-1 weight_address=0x70ddc7fe8040 [PROF:I][0.041620] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.518197,ms [API:I][0.027181] matmul desc create - no bias [CORE:I][0.041852] matmul desc init [matmul] [API:I][0.027191] matmul primitive_desc create - attr [PROF:I][0.041655] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00128,ms [API:I][0.027214] matmul primitive create [CORE:I][0.041838] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.041843] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.027114] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.027257] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.144ms graph_exe_count=-1 weight_address=0x1d165b40 [PROF:I][0.041826] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.163535,ms [API:I][0.027388] matmul desc create - no bias [CORE:I][0.042057] matmul desc init [matmul] [API:I][0.027397] matmul primitive_desc create - attr [PROF:I][0.041848] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00098,ms [API:I][0.027407] matmul primitive create [CORE:I][0.042032] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.042036] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.027306] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.027454] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.148ms graph_exe_count=-1 weight_address=0x1e165b80 [PROF:I][0.042024] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.167056,ms [PROF:V0][0.027584] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,0.98291,ms [CORE:I][0.042008] CPU Stream deleted [stream] [CORE:I][0.042412] CPU Engine deleted [engine] [API:I][0.042690] Memory create [CORE:V0][0.042679] Memory desc init by tag [memory] [CORE:I][0.042684] Memory created [memory] [API:I][0.042705] Memory create - strides [CORE:I][0.042691] Memory desc init by Stride [memory] [CORE:I][0.042694] Memory created [memory] [API:I][0.042715] Memory create [CORE:V0][0.042702] Memory desc init by tag [memory] [CORE:I][0.042706] Memory created [memory] [API:I][0.042636] matmul desc create - no bias [CORE:I][0.042633] matmul desc init [matmul] [API:I][0.042648] matmul primitive_desc create - attr [PROF:I][0.042428] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00172,ms [API:I][0.042660] matmul primitive create [CORE:I][0.042613] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.042617] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.027888] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.028442] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.554ms graph_exe_count=-1 weight_address=0x70ddcbfe9040 [PROF:I][0.043012] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.575519,ms [API:I][0.043432] Memory create [CORE:V0][0.043419] Memory desc init by tag [memory] [CORE:I][0.043424] Memory created [memory] [API:I][0.043445] Memory create 
- strides [CORE:I][0.043430] Memory desc init by Stride [memory] [CORE:I][0.043434] Memory created [memory] [API:I][0.043455] Memory create [CORE:V0][0.043440] Memory desc init by tag [memory] [CORE:I][0.043444] Memory created [memory] [API:I][0.043372] matmul desc create - no bias [CORE:I][0.043370] matmul desc init [matmul] [API:I][0.043383] matmul primitive_desc create - attr [PROF:I][0.043163] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00157,ms [API:I][0.043394] matmul primitive create [CORE:I][0.043348] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.043351] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.028622] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.030435] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.813ms graph_exe_count=-1 weight_address=0x70ddcffea040 [PROF:I][0.045006] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.83425,ms [API:I][0.045364] Memory create [CORE:V0][0.045351] Memory desc init by tag [memory] [CORE:I][0.045355] Memory created [memory] [API:I][0.045376] Memory create - strides [CORE:I][0.045362] Memory desc init by Stride [memory] [CORE:I][0.045368] Memory created [memory] [API:I][0.045389] Memory create [CORE:V0][0.045374] Memory desc init by tag [memory] [CORE:I][0.045377] Memory created [memory] [API:I][0.045400] Memory create [CORE:V0][0.045385] Memory desc init by tag [memory] [CORE:I][0.045389] Memory created [memory] [API:I][0.045412] Memory create [CORE:V0][0.045397] Memory desc init by tag 
[memory] [CORE:I][0.045401] Memory created [memory] [API:I][0.045332] matmul desc create - no bias [CORE:I][0.045329] matmul desc init [matmul] [API:I][0.045344] matmul primitive_desc create - attr [PROF:I][0.045124] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00153,ms [API:I][0.045357] matmul primitive create [CORE:I][0.045311] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.045314] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.030586] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.032459] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.875ms graph_exe_count=-1 weight_address=0x70ddddfeb040 [PROF:I][0.047029] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.89426,ms [API:I][0.047391] Memory create [CORE:V0][0.047378] Memory desc init by tag [memory] [CORE:I][0.047383] Memory created [memory] [API:I][0.047404] Memory create - strides [CORE:I][0.047390] Memory desc init by Stride [memory] [CORE:I][0.047396] Memory created [memory] [API:I][0.047417] Memory create [CORE:V0][0.047402] Memory desc init by tag [memory] [CORE:I][0.047407] Memory created [memory] [API:I][0.047336] matmul desc create - no bias [CORE:I][0.047333] matmul desc init [matmul] [API:I][0.047347] matmul primitive_desc create - attr [PROF:I][0.047126] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00128,ms [API:I][0.047358] matmul primitive create [CORE:I][0.047311] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.047315] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.032586] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.034159] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.573ms graph_exe_count=-1 weight_address=0x70ddebfec040 [PROF:I][0.048729] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.5933,ms [API:I][0.049179] Memory create [CORE:V0][0.049167] Memory desc init by tag [memory] [CORE:I][0.049172] Memory created [memory] [API:I][0.049194] Memory create - strides [CORE:I][0.049179] Memory desc init by Stride [memory] [CORE:I][0.049184] Memory created [memory] [API:I][0.049205] Memory create [CORE:V0][0.049192] Memory desc init by tag [memory] [CORE:I][0.049196] Memory created [memory] [API:I][0.049218] Memory create [CORE:V0][0.049204] Memory desc init by tag [memory] [CORE:I][0.049208] Memory created [memory] [API:I][0.049230] Memory create - strides [CORE:I][0.049215] Memory desc init by Stride [memory] [CORE:I][0.049219] Memory created [memory] [API:I][0.049240] Memory create [CORE:V0][0.049228] Memory desc init by tag [memory] [CORE:I][0.049231] Memory created [memory] [API:I][0.049254] Memory create [CORE:V0][0.049240] Memory desc init by tag [memory] [CORE:I][0.049243] Memory created [memory] [API:I][0.049264] Memory create - strides [CORE:I][0.049249] Memory desc init by 
Stride [memory] [CORE:I][0.049253] Memory created [memory] [API:I][0.049274] Memory create [CORE:V0][0.049260] Memory desc init by tag [memory] [CORE:I][0.049264] Memory created [memory] [API:I][0.034519] CPU Engine create [CORE:V0][0.049341] CPU Engine created [engine] [CORE:I][0.049345] CPU Engine created [cpu/engine] [API:I][0.034531] CPU Stream create [CORE:I][0.048952] CPU Stream created [stream] [CORE:V0][0.048951] CPU Stream created [cpu/stream] [API:I][0.034545] matmul desc create - no bias [CORE:I][0.049215] matmul desc init [matmul] [API:I][0.034558] matmul primitive_desc create - attr [PROF:I][0.049011] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00256,ms [API:I][0.034571] matmul primitive create [CORE:I][0.049196] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.049199] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.034472] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.035016] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.543ms graph_exe_count=-1 weight_address=0x70dd95fe3040 [PROF:I][0.049586] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.565508,ms [API:I][0.035148] matmul desc create - no bias [CORE:I][0.049818] matmul desc init [matmul] [API:I][0.035158] matmul primitive_desc create - attr [PROF:I][0.049610] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00144,ms [API:I][0.035169] matmul primitive create [CORE:I][0.049794] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.049798] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.035068] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.035219] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.151ms graph_exe_count=-1 weight_address=0x1f16dc40 [PROF:I][0.049789] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.170276,ms [API:I][0.035350] matmul desc create - no bias [CORE:I][0.050020] matmul desc init [matmul] [API:I][0.035360] matmul primitive_desc create - attr [PROF:I][0.049811] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00084,ms [API:I][0.035370] matmul primitive create [CORE:I][0.049996] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.049999] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.035269] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.035411] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.142ms graph_exe_count=-1 weight_address=0x2016dc80 [PROF:I][0.049981] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.160835,ms [PROF:V0][0.035541] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.021,ms [CORE:I][0.049965] CPU Stream deleted [stream] [CORE:I][0.050369] CPU Engine deleted [engine] [API:I][0.050643] Memory create [CORE:V0][0.050631] Memory desc init by tag [memory] [CORE:I][0.050636] Memory created [memory] [API:I][0.050658] Memory create - strides [CORE:I][0.050644] Memory desc init by Stride [memory] [CORE:I][0.050649] Memory created [memory] [API:I][0.050670] Memory create [CORE:V0][0.050655] Memory desc init by tag [memory] [CORE:I][0.050660] Memory created [memory] [API:I][0.050589] matmul desc create - no bias [CORE:I][0.050587] matmul desc init [matmul] [API:I][0.050601] matmul primitive_desc create - attr [PROF:I][0.050382] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.001841,ms [API:I][0.050613] matmul primitive create [CORE:I][0.050567] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.050571] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.035842] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.036363] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.522ms graph_exe_count=-1 weight_address=0x70dd99fe4040 [PROF:I][0.050933] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.542838,ms [API:I][0.051345] Memory create [CORE:V0][0.051332] Memory desc init by tag [memory] [CORE:I][0.051336] Memory created [memory] [API:I][0.051357] Memory create 
- strides [CORE:I][0.051343] Memory desc init by Stride [memory] [CORE:I][0.051348] Memory created [memory] [API:I][0.051369] Memory create [CORE:V0][0.051355] Memory desc init by tag [memory] [CORE:I][0.051359] Memory created [memory] [API:I][0.051287] matmul desc create - no bias [CORE:I][0.051285] matmul desc init [matmul] [API:I][0.051298] matmul primitive_desc create - attr [PROF:I][0.051078] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.001731,ms [API:I][0.051309] matmul primitive create [CORE:I][0.051263] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.051266] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.036537] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.038240] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.703ms graph_exe_count=-1 weight_address=0x70dd9dfe5040 [PROF:I][0.052811] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.7242,ms [API:I][0.053172] Memory create [CORE:V0][0.053159] Memory desc init by tag [memory] [CORE:I][0.053163] Memory created [memory] [API:I][0.053184] Memory create - strides [CORE:I][0.053170] Memory desc init by Stride [memory] [CORE:I][0.053175] Memory created [memory] [API:I][0.053196] Memory create [CORE:V0][0.053183] Memory desc init by tag [memory] [CORE:I][0.053187] Memory created [memory] [API:I][0.053210] Memory create [CORE:V0][0.053195] Memory desc init by tag [memory] [CORE:I][0.053199] Memory created [memory] [API:I][0.053222] Memory create [CORE:V0][0.053208] Memory desc init by tag 
[memory] [CORE:I][0.053212] Memory created [memory] [API:I][0.053143] matmul desc create - no bias [CORE:I][0.053141] matmul desc init [matmul] [API:I][0.053155] matmul primitive_desc create - attr [PROF:I][0.052936] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00219,ms [API:I][0.053168] matmul primitive create [CORE:I][0.053122] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.053125] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.038397] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.040208] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.813ms graph_exe_count=-1 weight_address=0x70ddabfe6040 [PROF:I][0.054778] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.83243,ms [API:I][0.055142] Memory create [CORE:V0][0.055129] Memory desc init by tag [memory] [CORE:I][0.055134] Memory created [memory] [API:I][0.055155] Memory create - strides [CORE:I][0.055140] Memory desc init by Stride [memory] [CORE:I][0.055144] Memory created [memory] [API:I][0.055165] Memory create [CORE:V0][0.055151] Memory desc init by tag [memory] [CORE:I][0.055155] Memory created [memory] [API:I][0.055084] matmul desc create - no bias [CORE:I][0.055081] matmul desc init [matmul] [API:I][0.055095] matmul primitive_desc create - attr [PROF:I][0.054876] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00185,ms [API:I][0.055107] matmul primitive create [CORE:I][0.055060] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.055063] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.040334] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.041920] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.586ms graph_exe_count=-1 weight_address=0x70ddb9fe7040 [PROF:I][0.056490] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.60507,ms [API:I][0.056976] Memory create [CORE:V0][0.056965] Memory desc init by tag [memory] [CORE:I][0.056970] Memory created [memory] [API:I][0.056992] Memory create - strides [CORE:I][0.056977] Memory desc init by Stride [memory] [CORE:I][0.056981] Memory created [memory] [API:I][0.057002] Memory create [CORE:V0][0.056988] Memory desc init by tag [memory] [CORE:I][0.056993] Memory created [memory] [API:I][0.057015] Memory create [CORE:V0][0.057002] Memory desc init by tag [memory] [CORE:I][0.057006] Memory created [memory] [API:I][0.057027] Memory create - strides [CORE:I][0.057013] Memory desc init by Stride [memory] [CORE:I][0.057017] Memory created [memory] [API:I][0.057038] Memory create [CORE:V0][0.057023] Memory desc init by tag [memory] [CORE:I][0.057028] Memory created [memory] [API:I][0.057050] Memory create [CORE:V0][0.057035] Memory desc init by tag [memory] [CORE:I][0.057039] Memory created [memory] [API:I][0.057060] Memory create - strides [CORE:I][0.057046] Memory desc init by 
Stride [memory] [CORE:I][0.057050] Memory created [memory] [API:I][0.057071] Memory create [CORE:V0][0.057057] Memory desc init by tag [memory] [CORE:I][0.057061] Memory created [memory] [API:I][0.042316] CPU Engine create [CORE:V0][0.057138] CPU Engine created [engine] [CORE:I][0.057143] CPU Engine created [cpu/engine] [API:I][0.042329] CPU Stream create [CORE:I][0.056750] CPU Stream created [stream] [CORE:V0][0.056749] CPU Stream created [cpu/stream] [API:I][0.042343] matmul desc create - no bias [CORE:I][0.057014] matmul desc init [matmul] [API:I][0.042355] matmul primitive_desc create - attr [PROF:I][0.056808] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00188,ms [API:I][0.042368] matmul primitive create [CORE:I][0.056993] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.056997] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.042269] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.042781] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.511ms graph_exe_count=-1 weight_address=0x70dd63fde040 [PROF:I][0.057351] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.533938,ms [API:I][0.042913] matmul desc create - no bias [CORE:I][0.057583] matmul desc init [matmul] [API:I][0.042922] matmul primitive_desc create - attr [PROF:I][0.057374] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0011,ms [API:I][0.042933] matmul primitive create [CORE:I][0.057558] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.057561] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.042832] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.042967] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.136ms graph_exe_count=-1 weight_address=0x21175d40 [PROF:I][0.057537] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.154055,ms [API:I][0.043097] matmul desc create - no bias [CORE:I][0.057767] matmul desc init [matmul] [API:I][0.043105] matmul primitive_desc create - attr [PROF:I][0.057557] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0008,ms [API:I][0.043116] matmul primitive create [CORE:I][0.057741] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.057744] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.043015] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.043161] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.147ms graph_exe_count=-1 weight_address=0x22175d80 [PROF:I][0.057730] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.164305,ms [PROF:V0][0.043290] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,0.974121,ms [CORE:I][0.057714] CPU Stream deleted [stream] [CORE:I][0.058118] CPU Engine deleted [engine] [API:I][0.058390] Memory create [CORE:V0][0.058378] Memory desc init by tag [memory] [CORE:I][0.058383] Memory created [memory] [API:I][0.058404] Memory create - strides [CORE:I][0.058390] Memory desc init by Stride [memory] [CORE:I][0.058395] Memory created [memory] [API:I][0.058416] Memory create [CORE:V0][0.058403] Memory desc init by tag [memory] [CORE:I][0.058406] Memory created [memory] [API:I][0.058336] matmul desc create - no bias [CORE:I][0.058333] matmul desc init [matmul] [API:I][0.058347] matmul primitive_desc create - attr [PROF:I][0.058127] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00172,ms [API:I][0.058359] matmul primitive create [CORE:I][0.058312] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.058315] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.043586] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.044098] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.511ms graph_exe_count=-1 weight_address=0x70dd67fdf040 [PROF:I][0.058667] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.531058,ms [API:I][0.059076] Memory create [CORE:V0][0.059063] Memory desc init by tag [memory] [CORE:I][0.059067] Memory created [memory] [API:I][0.059088] Memory 
create - strides [CORE:I][0.059074] Memory desc init by Stride [memory] [CORE:I][0.059077] Memory created [memory] [API:I][0.059098] Memory create [CORE:V0][0.059085] Memory desc init by tag [memory] [CORE:I][0.059089] Memory created [memory] [API:I][0.059017] matmul desc create - no bias [CORE:I][0.059014] matmul desc init [matmul] [API:I][0.059027] matmul primitive_desc create - attr [PROF:I][0.058807] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00149,ms [API:I][0.059038] matmul primitive create [CORE:I][0.058991] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.058995] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.044265] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.046053] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.788ms graph_exe_count=-1 weight_address=0x70dd6bfe0040 [PROF:I][0.060625] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.80911,ms [API:I][0.060990] Memory create [CORE:V0][0.060977] Memory desc init by tag [memory] [CORE:I][0.060981] Memory created [memory] [API:I][0.061002] Memory create - strides [CORE:I][0.060987] Memory desc init by Stride [memory] [CORE:I][0.060991] Memory created [memory] [API:I][0.061013] Memory create [CORE:V0][0.060998] Memory desc init by tag [memory] [CORE:I][0.061003] Memory created [memory] [API:I][0.061025] Memory create [CORE:V0][0.061009] Memory desc init by tag [memory] [CORE:I][0.061013] Memory created [memory] [API:I][0.061036] Memory create [CORE:V0][0.061022] Memory desc init by 
tag [memory] [CORE:I][0.061026] Memory created [memory] [API:I][0.060956] matmul desc create - no bias [CORE:I][0.060953] matmul desc init [matmul] [API:I][0.060967] matmul primitive_desc create - attr [PROF:I][0.060748] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00181,ms [API:I][0.060980] matmul primitive create [CORE:I][0.060934] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.060938] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.046210] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.047925] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.716ms graph_exe_count=-1 weight_address=0x70dd79fe1040 [PROF:I][0.062496] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.73859,ms [API:I][0.062860] Memory create [CORE:V0][0.062846] Memory desc init by tag [memory] [CORE:I][0.062851] Memory created [memory] [API:I][0.062872] Memory create - strides [CORE:I][0.062857] Memory desc init by Stride [memory] [CORE:I][0.062862] Memory created [memory] [API:I][0.062883] Memory create [CORE:V0][0.062869] Memory desc init by tag [memory] [CORE:I][0.062873] Memory created [memory] [API:I][0.062802] matmul desc create - no bias [CORE:I][0.062799] matmul desc init [matmul] [API:I][0.062814] matmul primitive_desc create - attr [PROF:I][0.062593] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00165,ms [API:I][0.062825] matmul primitive create [CORE:I][0.062779] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.062783] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.048054] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.049605] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.552ms graph_exe_count=-1 weight_address=0x70dd87fe2040 [PROF:I][0.064175] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.57137,ms [API:I][0.064632] Memory create [CORE:V0][0.064621] Memory desc init by tag [memory] [CORE:I][0.064626] Memory created [memory] [API:I][0.064647] Memory create - strides [CORE:I][0.064633] Memory desc init by Stride [memory] [CORE:I][0.064638] Memory created [memory] [API:I][0.064659] Memory create [CORE:V0][0.064645] Memory desc init by tag [memory] [CORE:I][0.064649] Memory created [memory] [API:I][0.064671] Memory create [CORE:V0][0.064657] Memory desc init by tag [memory] [CORE:I][0.064660] Memory created [memory] [API:I][0.064681] Memory create - strides [CORE:I][0.064666] Memory desc init by Stride [memory] [CORE:I][0.064670] Memory created [memory] [API:I][0.064692] Memory create [CORE:V0][0.064678] Memory desc init by tag [memory] [CORE:I][0.064682] Memory created [memory] [API:I][0.064704] Memory create [CORE:V0][0.064689] Memory desc init by tag [memory] [CORE:I][0.064693] Memory created [memory] [API:I][0.064715] Memory create - strides [CORE:I][0.064700] Memory desc init by 
Stride [memory] [CORE:I][0.064705] Memory created [memory] [API:I][0.064726] Memory create [CORE:V0][0.064712] Memory desc init by tag [memory] [CORE:I][0.064715] Memory created [memory] [API:I][0.049971] CPU Engine create [CORE:V0][0.064793] CPU Engine created [engine] [CORE:I][0.064797] CPU Engine created [cpu/engine] [API:I][0.049982] CPU Stream create [CORE:I][0.064404] CPU Stream created [stream] [CORE:V0][0.064403] CPU Stream created [cpu/stream] [API:I][0.049997] matmul desc create - no bias [CORE:I][0.064667] matmul desc init [matmul] [API:I][0.050010] matmul primitive_desc create - attr [PROF:I][0.064463] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00199,ms [API:I][0.050022] matmul primitive create [CORE:I][0.064649] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.064652] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.049925] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.050485] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.561ms graph_exe_count=-1 weight_address=0x70dd31fd9040 [PROF:I][0.065055] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.582029,ms [API:I][0.050616] matmul desc create - no bias [CORE:I][0.065286] matmul desc init [matmul] [API:I][0.050626] matmul primitive_desc create - attr [PROF:I][0.065077] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00111,ms [API:I][0.050636] matmul primitive create [CORE:I][0.065262] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.065266] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.050536] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.050684] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.148ms graph_exe_count=-1 weight_address=0x2317de40 [PROF:I][0.065254] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.167396,ms [API:I][0.050815] matmul desc create - no bias [CORE:I][0.065485] matmul desc init [matmul] [API:I][0.050824] matmul primitive_desc create - attr [PROF:I][0.065274] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00078,ms [API:I][0.050833] matmul primitive create [CORE:I][0.065459] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.065464] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.050734] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.050870] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.136ms graph_exe_count=-1 weight_address=0x2417de80 [PROF:I][0.065440] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.155816,ms [PROF:V0][0.051000] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.02905,ms [CORE:I][0.065424] CPU Stream deleted [stream] [CORE:I][0.065828] CPU Engine deleted [engine] [API:I][0.066094] Memory create [CORE:V0][0.066083] Memory desc init by tag [memory] [CORE:I][0.066088] Memory created [memory] [API:I][0.066110] Memory create - strides [CORE:I][0.066096] Memory desc init by Stride [memory] [CORE:I][0.066099] Memory created [memory] [API:I][0.066121] Memory create [CORE:V0][0.066107] Memory desc init by tag [memory] [CORE:I][0.066111] Memory created [memory] [API:I][0.066040] matmul desc create - no bias [CORE:I][0.066037] matmul desc init [matmul] [API:I][0.066051] matmul primitive_desc create - attr [PROF:I][0.065831] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00173,ms [API:I][0.066064] matmul primitive create [CORE:I][0.066017] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.066021] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.051292] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.051852] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.56ms graph_exe_count=-1 weight_address=0x70dd35fda040 [PROF:I][0.066424] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.582339,ms [API:I][0.066826] Memory create [CORE:V0][0.066813] Memory desc init by tag [memory] [CORE:I][0.066817] Memory created [memory] [API:I][0.066838] Memory create 
- strides [CORE:I][0.066823] Memory desc init by Stride [memory] [CORE:I][0.066829] Memory created [memory] [API:I][0.066850] Memory create [CORE:V0][0.066836] Memory desc init by tag [memory] [CORE:I][0.066840] Memory created [memory] [API:I][0.066769] matmul desc create - no bias [CORE:I][0.066766] matmul desc init [matmul] [API:I][0.066779] matmul primitive_desc create - attr [PROF:I][0.066558] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00122,ms [API:I][0.066789] matmul primitive create [CORE:I][0.066742] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.066745] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.052016] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.053919] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.903ms graph_exe_count=-1 weight_address=0x70dd39fdb040 [PROF:I][0.068490] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.92422,ms [API:I][0.068847] Memory create [CORE:V0][0.068834] Memory desc init by tag [memory] [CORE:I][0.068839] Memory created [memory] [API:I][0.068860] Memory create - strides [CORE:I][0.068845] Memory desc init by Stride [memory] [CORE:I][0.068850] Memory created [memory] [API:I][0.068871] Memory create [CORE:V0][0.068857] Memory desc init by tag [memory] [CORE:I][0.068862] Memory created [memory] [API:I][0.068884] Memory create [CORE:V0][0.068869] Memory desc init by tag [memory] [CORE:I][0.068873] Memory created [memory] [API:I][0.068897] Memory create [CORE:V0][0.068882] Memory desc init by tag 
[memory] [CORE:I][0.068886] Memory created [memory] [API:I][0.068816] matmul desc create - no bias [CORE:I][0.068813] matmul desc init [matmul] [API:I][0.068829] matmul primitive_desc create - attr [PROF:I][0.068609] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00166,ms [API:I][0.068842] matmul primitive create [CORE:I][0.068796] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.068800] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.054072] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.055966] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.896ms graph_exe_count=-1 weight_address=0x70dd47fdc040 [PROF:I][0.070538] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.9179,ms [API:I][0.070899] Memory create [CORE:V0][0.070886] Memory desc init by tag [memory] [CORE:I][0.070891] Memory created [memory] [API:I][0.070912] Memory create - strides [CORE:I][0.070897] Memory desc init by Stride [memory] [CORE:I][0.070902] Memory created [memory] [API:I][0.070923] Memory create [CORE:V0][0.070910] Memory desc init by tag [memory] [CORE:I][0.070914] Memory created [memory] [API:I][0.070842] matmul desc create - no bias [CORE:I][0.070840] matmul desc init [matmul] [API:I][0.070853] matmul primitive_desc create - attr [PROF:I][0.070633] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00149,ms [API:I][0.070864] matmul primitive create [CORE:I][0.070818] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.070831] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.056104] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.057677] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.574ms graph_exe_count=-1 weight_address=0x70dd55fdd040 [PROF:I][0.072248] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.60642,ms [API:I][0.072714] Memory create [CORE:V0][0.072703] Memory desc init by tag [memory] [CORE:I][0.072708] Memory created [memory] [API:I][0.072730] Memory create - strides [CORE:I][0.072715] Memory desc init by Stride [memory] [CORE:I][0.072720] Memory created [memory] [API:I][0.072743] Memory create [CORE:V0][0.072728] Memory desc init by tag [memory] [CORE:I][0.072732] Memory created [memory] [API:I][0.072754] Memory create [CORE:V0][0.072741] Memory desc init by tag [memory] [CORE:I][0.072746] Memory created [memory] [API:I][0.072769] Memory create - strides [CORE:I][0.072754] Memory desc init by Stride [memory] [CORE:I][0.072759] Memory created [memory] [API:I][0.072781] Memory create [CORE:V0][0.072766] Memory desc init by tag [memory] [CORE:I][0.072770] Memory created [memory] [API:I][0.072792] Memory create [CORE:V0][0.072777] Memory desc init by tag [memory] [CORE:I][0.072781] Memory created [memory] [API:I][0.072802] Memory create - strides [CORE:I][0.072788] Memory desc init by 
Stride [memory] [CORE:I][0.072793] Memory created [memory] [API:I][0.072814] Memory create [CORE:V0][0.072799] Memory desc init by tag [memory] [CORE:I][0.072804] Memory created [memory] [API:I][0.058059] CPU Engine create [CORE:V0][0.072881] CPU Engine created [engine] [CORE:I][0.072885] CPU Engine created [cpu/engine] [API:I][0.058070] CPU Stream create [CORE:I][0.072492] CPU Stream created [stream] [CORE:V0][0.072491] CPU Stream created [cpu/stream] [API:I][0.058086] matmul desc create - no bias [CORE:I][0.072756] matmul desc init [matmul] [API:I][0.058098] matmul primitive_desc create - attr [PROF:I][0.072551] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00182,ms [API:I][0.058111] matmul primitive create [CORE:I][0.072737] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.072740] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.058013] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.058505] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.493ms graph_exe_count=-1 weight_address=0x70dcfffd4040 [PROF:I][0.073075] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.514267,ms [API:I][0.058636] matmul desc create - no bias [CORE:I][0.073306] matmul desc init [matmul] [API:I][0.058646] matmul primitive_desc create - attr [PROF:I][0.073098] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00156,ms [API:I][0.058657] matmul primitive create [CORE:I][0.073282] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.073284] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.058555] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.058689] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.134ms graph_exe_count=-1 weight_address=0x25185f40 [PROF:I][0.073259] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.152535,ms [API:I][0.058819] matmul desc create - no bias [CORE:I][0.073489] matmul desc init [matmul] [API:I][0.058827] matmul primitive_desc create - attr [PROF:I][0.073278] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00087,ms [API:I][0.058836] matmul primitive create [CORE:I][0.073462] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.073465] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.058735] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.058883] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.148ms graph_exe_count=-1 weight_address=0x26185f80 [PROF:I][0.073453] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.165745,ms [PROF:V0][0.059012] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,0.951904,ms [CORE:I][0.073436] CPU Stream deleted [stream] [CORE:I][0.073840] CPU Engine deleted [engine] [API:I][0.074121] Memory create [CORE:V0][0.074109] Memory desc init by tag [memory] [CORE:I][0.074114] Memory created [memory] [API:I][0.074135] Memory create - strides [CORE:I][0.074121] Memory desc init by Stride [memory] [CORE:I][0.074125] Memory created [memory] [API:I][0.074146] Memory create [CORE:V0][0.074133] Memory desc init by tag [memory] [CORE:I][0.074136] Memory created [memory] [API:I][0.074066] matmul desc create - no bias [CORE:I][0.074063] matmul desc init [matmul] [API:I][0.074077] matmul primitive_desc create - attr [PROF:I][0.073857] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0017,ms [API:I][0.074090] matmul primitive create [CORE:I][0.074043] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.074047] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.059318] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.059808] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.489ms graph_exe_count=-1 weight_address=0x70dd03fd5040 [PROF:I][0.074378] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.510317,ms [API:I][0.074782] Memory create [CORE:V0][0.074769] Memory desc init by tag [memory] [CORE:I][0.074773] Memory created [memory] [API:I][0.074794] Memory create 
- strides [CORE:I][0.074779] Memory desc init by Stride [memory] [CORE:I][0.074782] Memory created [memory] [API:I][0.074805] Memory create [CORE:V0][0.074789] Memory desc init by tag [memory] [CORE:I][0.074793] Memory created [memory] [API:I][0.074723] matmul desc create - no bias [CORE:I][0.074720] matmul desc init [matmul] [API:I][0.074733] matmul primitive_desc create - attr [PROF:I][0.074513] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00121,ms [API:I][0.074745] matmul primitive create [CORE:I][0.074698] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.074702] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.059972] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.061745] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.773ms graph_exe_count=-1 weight_address=0x70dd07fd6040 [PROF:I][0.076316] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.7938,ms [API:I][0.076672] Memory create [CORE:V0][0.076659] Memory desc init by tag [memory] [CORE:I][0.076664] Memory created [memory] [API:I][0.076685] Memory create - strides [CORE:I][0.076669] Memory desc init by Stride [memory] [CORE:I][0.076674] Memory created [memory] [API:I][0.076696] Memory create [CORE:V0][0.076682] Memory desc init by tag [memory] [CORE:I][0.076686] Memory created [memory] [API:I][0.076708] Memory create [CORE:V0][0.076693] Memory desc init by tag [memory] [CORE:I][0.076697] Memory created [memory] [API:I][0.076722] Memory create [CORE:V0][0.076708] Memory desc init by tag 
[memory] [CORE:I][0.076711] Memory created [memory] [API:I][0.076642] matmul desc create - no bias [CORE:I][0.076640] matmul desc init [matmul] [API:I][0.076654] matmul primitive_desc create - attr [PROF:I][0.076434] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00151,ms [API:I][0.076666] matmul primitive create [CORE:I][0.076619] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.076622] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.061894] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.063592] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.699ms graph_exe_count=-1 weight_address=0x70dd15fd7040 [PROF:I][0.078162] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.71814,ms [API:I][0.078521] Memory create [CORE:V0][0.078509] Memory desc init by tag [memory] [CORE:I][0.078513] Memory created [memory] [API:I][0.078534] Memory create - strides [CORE:I][0.078520] Memory desc init by Stride [memory] [CORE:I][0.078523] Memory created [memory] [API:I][0.078544] Memory create [CORE:V0][0.078530] Memory desc init by tag [memory] [CORE:I][0.078534] Memory created [memory] [API:I][0.078462] matmul desc create - no bias [CORE:I][0.078461] matmul desc init [matmul] [API:I][0.078475] matmul primitive_desc create - attr [PROF:I][0.078255] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00144,ms [API:I][0.078486] matmul primitive create [CORE:I][0.078439] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.078442] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.063713] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.065341] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.629ms graph_exe_count=-1 weight_address=0x70dd23fd8040 [PROF:I][0.079911] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.64837,ms [API:I][0.080364] Memory create [CORE:V0][0.080352] Memory desc init by tag [memory] [CORE:I][0.080357] Memory created [memory] [API:I][0.080378] Memory create - strides [CORE:I][0.080364] Memory desc init by Stride [memory] [CORE:I][0.080369] Memory created [memory] [API:I][0.080389] Memory create [CORE:V0][0.080376] Memory desc init by tag [memory] [CORE:I][0.080379] Memory created [memory] [API:I][0.080401] Memory create [CORE:V0][0.080386] Memory desc init by tag [memory] [CORE:I][0.080391] Memory created [memory] [API:I][0.080412] Memory create - strides [CORE:I][0.080397] Memory desc init by Stride [memory] [CORE:I][0.080400] Memory created [memory] [API:I][0.080421] Memory create [CORE:V0][0.080407] Memory desc init by tag [memory] [CORE:I][0.080411] Memory created [memory] [API:I][0.080434] Memory create [CORE:V0][0.080419] Memory desc init by tag [memory] [CORE:I][0.080423] Memory created [memory] [API:I][0.080445] Memory create - strides [CORE:I][0.080431] Memory desc init by 
Stride [memory] [CORE:I][0.080433] Memory created [memory] [API:I][0.080454] Memory create [CORE:V0][0.080440] Memory desc init by tag [memory] [CORE:I][0.080445] Memory created [memory] [API:I][0.065701] CPU Engine create [CORE:V0][0.080524] CPU Engine created [engine] [CORE:I][0.080527] CPU Engine created [cpu/engine] [API:I][0.065713] CPU Stream create [CORE:I][0.080135] CPU Stream created [stream] [CORE:V0][0.080134] CPU Stream created [cpu/stream] [API:I][0.065728] matmul desc create - no bias [CORE:I][0.080398] matmul desc init [matmul] [API:I][0.065739] matmul primitive_desc create - attr [PROF:I][0.080193] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00202,ms [API:I][0.065752] matmul primitive create [CORE:I][0.080378] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.080382] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.065653] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.066176] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.523ms graph_exe_count=-1 weight_address=0x70dccdfcf040 [PROF:I][0.080746] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.543858,ms [API:I][0.066308] matmul desc create - no bias [CORE:I][0.080978] matmul desc init [matmul] [API:I][0.066317] matmul primitive_desc create - attr [PROF:I][0.080769] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00108,ms [API:I][0.066328] matmul primitive create [CORE:I][0.080952] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.080955] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.066226] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.066369] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.144ms graph_exe_count=-1 weight_address=0x2718e040 [PROF:I][0.080938] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.161595,ms [API:I][0.066500] matmul desc create - no bias [CORE:I][0.081170] matmul desc init [matmul] [API:I][0.066508] matmul primitive_desc create - attr [PROF:I][0.080959] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00087,ms [API:I][0.066518] matmul primitive create [CORE:I][0.081143] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.081146] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.066416] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.066563] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.147ms graph_exe_count=-1 weight_address=0x2818e080 [PROF:I][0.081132] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.164246,ms [PROF:V0][0.066693] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,0.990967,ms [CORE:I][0.081117] CPU Stream deleted [stream] [CORE:I][0.081521] CPU Engine deleted [engine] [API:I][0.081778] Memory create [CORE:V0][0.081767] Memory desc init by tag [memory] [CORE:I][0.081772] Memory created [memory] [API:I][0.081794] Memory create - strides [CORE:I][0.081780] Memory desc init by Stride [memory] [CORE:I][0.081784] Memory created [memory] [API:I][0.081805] Memory create [CORE:V0][0.081790] Memory desc init by tag [memory] [CORE:I][0.081794] Memory created [memory] [API:I][0.081723] matmul desc create - no bias [CORE:I][0.081720] matmul desc init [matmul] [API:I][0.081734] matmul primitive_desc create - attr [PROF:I][0.081514] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00166,ms [API:I][0.081746] matmul primitive create [CORE:I][0.081698] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.081702] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.066973] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.067456] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.482ms graph_exe_count=-1 weight_address=0x70dcd1fd0040 [PROF:I][0.082026] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.502996,ms [API:I][0.082427] Memory create [CORE:V0][0.082414] Memory desc init by tag [memory] [CORE:I][0.082418] Memory created [memory] [API:I][0.082440] Memory 
create - strides [CORE:I][0.082425] Memory desc init by Stride [memory] [CORE:I][0.082429] Memory created [memory] [API:I][0.082450] Memory create [CORE:V0][0.082435] Memory desc init by tag [memory] [CORE:I][0.082440] Memory created [memory] [API:I][0.082367] matmul desc create - no bias [CORE:I][0.082366] matmul desc init [matmul] [API:I][0.082378] matmul primitive_desc create - attr [PROF:I][0.082158] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00139,ms [API:I][0.082389] matmul primitive create [CORE:I][0.082342] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.082345] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.067616] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.069355] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.74ms graph_exe_count=-1 weight_address=0x70dcd5fd1040 [PROF:I][0.083925] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.75885,ms [API:I][0.084282] Memory create [CORE:V0][0.084269] Memory desc init by tag [memory] [CORE:I][0.084273] Memory created [memory] [API:I][0.084294] Memory create - strides [CORE:I][0.084281] Memory desc init by Stride [memory] [CORE:I][0.084285] Memory created [memory] [API:I][0.084308] Memory create [CORE:V0][0.084293] Memory desc init by tag [memory] [CORE:I][0.084296] Memory created [memory] [API:I][0.084319] Memory create [CORE:V0][0.084304] Memory desc init by tag [memory] [CORE:I][0.084309] Memory created [memory] [API:I][0.084334] Memory create [CORE:V0][0.084319] Memory desc init by tag 
[memory] [CORE:I][0.084323] Memory created [memory] [API:I][0.084255] matmul desc create - no bias [CORE:I][0.084252] matmul desc init [matmul] [API:I][0.084268] matmul primitive_desc create - attr [PROF:I][0.084048] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00184,ms [API:I][0.084279] matmul primitive create [CORE:I][0.084233] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.084237] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.069508] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.071261] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.754ms graph_exe_count=-1 weight_address=0x70dce3fd2040 [PROF:I][0.085834] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.77667,ms [API:I][0.086199] Memory create [CORE:V0][0.086185] Memory desc init by tag [memory] [CORE:I][0.086190] Memory created [memory] [API:I][0.086212] Memory create - strides [CORE:I][0.086197] Memory desc init by Stride [memory] [CORE:I][0.086202] Memory created [memory] [API:I][0.086223] Memory create [CORE:V0][0.086209] Memory desc init by tag [memory] [CORE:I][0.086213] Memory created [memory] [API:I][0.086141] matmul desc create - no bias [CORE:I][0.086139] matmul desc init [matmul] [API:I][0.086154] matmul primitive_desc create - attr [PROF:I][0.085934] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00174,ms [API:I][0.086166] matmul primitive create [CORE:I][0.086120] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.086123] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.071395] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.073034] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.64ms graph_exe_count=-1 weight_address=0x70dcf1fd3040 [PROF:I][0.087605] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.6603,ms [API:I][0.088094] Memory create [CORE:V0][0.088084] Memory desc init by tag [memory] [CORE:I][0.088089] Memory created [memory] [API:I][0.088110] Memory create - strides [CORE:I][0.088095] Memory desc init by Stride [memory] [CORE:I][0.088100] Memory created [memory] [API:I][0.088121] Memory create [CORE:V0][0.088107] Memory desc init by tag [memory] [CORE:I][0.088111] Memory created [memory] [API:I][0.088133] Memory create [CORE:V0][0.088118] Memory desc init by tag [memory] [CORE:I][0.088122] Memory created [memory] [API:I][0.088143] Memory create - strides [CORE:I][0.088128] Memory desc init by Stride [memory] [CORE:I][0.088134] Memory created [memory] [API:I][0.088155] Memory create [CORE:V0][0.088140] Memory desc init by tag [memory] [CORE:I][0.088144] Memory created [memory] [API:I][0.088166] Memory create [CORE:V0][0.088152] Memory desc init by tag [memory] [CORE:I][0.088155] Memory created [memory] [API:I][0.088176] Memory create - strides [CORE:I][0.088161] Memory desc init by 
Stride [memory] [CORE:I][0.088165] Memory created [memory] [API:I][0.088186] Memory create [CORE:V0][0.088172] Memory desc init by tag [memory] [CORE:I][0.088175] Memory created [memory] [API:I][0.073431] CPU Engine create [CORE:V0][0.088252] CPU Engine created [engine] [CORE:I][0.088257] CPU Engine created [cpu/engine] [API:I][0.073442] CPU Stream create [CORE:I][0.087865] CPU Stream created [stream] [CORE:V0][0.087864] CPU Stream created [cpu/stream] [API:I][0.073458] matmul desc create - no bias [CORE:I][0.088129] matmul desc init [matmul] [API:I][0.073471] matmul primitive_desc create - attr [PROF:I][0.087924] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00197,ms [API:I][0.073483] matmul primitive create [CORE:I][0.088108] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.088112] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.073385] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.073924] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.539ms graph_exe_count=-1 weight_address=0x70dc9bfca040 [PROF:I][0.088495] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.562229,ms [API:I][0.074055] matmul desc create - no bias [CORE:I][0.088725] matmul desc init [matmul] [API:I][0.074064] matmul primitive_desc create - attr [PROF:I][0.088515] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.001111,ms [API:I][0.074074] matmul primitive create [CORE:I][0.088700] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.088703] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.073973] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.074134] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.161ms graph_exe_count=-1 weight_address=0x29196140 [PROF:I][0.088704] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.179296,ms [API:I][0.074264] matmul desc create - no bias [CORE:I][0.088934] matmul desc init [matmul] [API:I][0.074273] matmul primitive_desc create - attr [PROF:I][0.088725] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0014,ms [API:I][0.074284] matmul primitive create [CORE:I][0.088909] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.088912] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.074183] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.074337] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.154ms graph_exe_count=-1 weight_address=0x2a196180 [PROF:I][0.088907] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.172955,ms [PROF:V0][0.074467] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.03711,ms [CORE:I][0.088891] CPU Stream deleted [stream] [CORE:I][0.089295] CPU Engine deleted [engine] [API:I][0.089565] Memory create [CORE:V0][0.089554] Memory desc init by tag [memory] [CORE:I][0.089558] Memory created [memory] [API:I][0.089580] Memory create - strides [CORE:I][0.089565] Memory desc init by Stride [memory] [CORE:I][0.089570] Memory created [memory] [API:I][0.089592] Memory create [CORE:V0][0.089578] Memory desc init by tag [memory] [CORE:I][0.089582] Memory created [memory] [API:I][0.089511] matmul desc create - no bias [CORE:I][0.089508] matmul desc init [matmul] [API:I][0.089522] matmul primitive_desc create - attr [PROF:I][0.089302] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00158,ms [API:I][0.089533] matmul primitive create [CORE:I][0.089487] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.089491] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.074762] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.075302] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.54ms graph_exe_count=-1 weight_address=0x70dc9ffcb040 [PROF:I][0.089872] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.561288,ms [API:I][0.090279] Memory create [CORE:V0][0.090267] Memory desc init by tag [memory] [CORE:I][0.090271] Memory created [memory] [API:I][0.090293] Memory create 
- strides [CORE:I][0.090277] Memory desc init by Stride [memory] [CORE:I][0.090281] Memory created [memory] [API:I][0.090302] Memory create [CORE:V0][0.090287] Memory desc init by tag [memory] [CORE:I][0.090291] Memory created [memory] [API:I][0.090219] matmul desc create - no bias [CORE:I][0.090216] matmul desc init [matmul] [API:I][0.090230] matmul primitive_desc create - attr [PROF:I][0.090009] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00158,ms [API:I][0.090241] matmul primitive create [CORE:I][0.090195] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.090199] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.075470] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.077196] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.726ms graph_exe_count=-1 weight_address=0x70dca3fcc040 [PROF:I][0.091768] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.74892,ms [API:I][0.092125] Memory create [CORE:V0][0.092112] Memory desc init by tag [memory] [CORE:I][0.092117] Memory created [memory] [API:I][0.092138] Memory create - strides [CORE:I][0.092124] Memory desc init by Stride [memory] [CORE:I][0.092127] Memory created [memory] [API:I][0.092148] Memory create [CORE:V0][0.092133] Memory desc init by tag [memory] [CORE:I][0.092138] Memory created [memory] [API:I][0.092160] Memory create [CORE:V0][0.092146] Memory desc init by tag [memory] [CORE:I][0.092151] Memory created [memory] [API:I][0.092176] Memory create [CORE:V0][0.092161] Memory desc init by tag 
[memory] [CORE:I][0.092165] Memory created [memory] [API:I][0.092096] matmul desc create - no bias [CORE:I][0.092093] matmul desc init [matmul] [API:I][0.092107] matmul primitive_desc create - attr [PROF:I][0.091887] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00149,ms [API:I][0.092119] matmul primitive create [CORE:I][0.092074] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.092077] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.077349] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.079128] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.78ms graph_exe_count=-1 weight_address=0x70dcb1fcd040 [PROF:I][0.093700] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.80301,ms [API:I][0.094087] Memory create [CORE:V0][0.094074] Memory desc init by tag [memory] [CORE:I][0.094079] Memory created [memory] [API:I][0.094103] Memory create - strides [CORE:I][0.094091] Memory desc init by Stride [memory] [CORE:I][0.094096] Memory created [memory] [API:I][0.094118] Memory create [CORE:V0][0.094105] Memory desc init by tag [memory] [CORE:I][0.094112] Memory created [memory] [API:I][0.094042] matmul desc create - no bias [CORE:I][0.094043] matmul desc init [matmul] [API:I][0.094060] matmul primitive_desc create - attr [PROF:I][0.093843] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00272,ms [API:I][0.094076] matmul primitive create [CORE:I][0.094031] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.094036] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.079308] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.080855] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.548ms graph_exe_count=-1 weight_address=0x70dcbffce040 [PROF:I][0.095425] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.56992,ms [API:I][0.095887] Memory create [CORE:V0][0.095875] Memory desc init by tag [memory] [CORE:I][0.095882] Memory created [memory] [API:I][0.095905] Memory create - strides [CORE:I][0.095893] Memory desc init by Stride [memory] [CORE:I][0.095899] Memory created [memory] [API:I][0.095922] Memory create [CORE:V0][0.095909] Memory desc init by tag [memory] [CORE:I][0.095915] Memory created [memory] [API:I][0.095940] Memory create [CORE:V0][0.095927] Memory desc init by tag [memory] [CORE:I][0.095933] Memory created [memory] [API:I][0.095957] Memory create - strides [CORE:I][0.095944] Memory desc init by Stride [memory] [CORE:I][0.095959] Memory created [memory] [API:I][0.095983] Memory create [CORE:V0][0.095968] Memory desc init by tag [memory] [CORE:I][0.095975] Memory created [memory] [API:I][0.096000] Memory create [CORE:V0][0.095988] Memory desc init by tag [memory] [CORE:I][0.095993] Memory created [memory] [API:I][0.096016] Memory create - strides [CORE:I][0.096003] Memory desc init by 
Stride [memory] [CORE:I][0.096010] Memory created [memory] [API:I][0.096033] Memory create [CORE:V0][0.096020] Memory desc init by tag [memory] [CORE:I][0.096028] Memory created [memory] [API:I][0.081286] CPU Engine create [CORE:V0][0.096110] CPU Engine created [engine] [CORE:I][0.096115] CPU Engine created [cpu/engine] [API:I][0.081303] CPU Stream create [CORE:I][0.095725] CPU Stream created [stream] [CORE:V0][0.095725] CPU Stream created [cpu/stream] [API:I][0.081320] matmul desc create - no bias [CORE:I][0.095990] matmul desc init [matmul] [API:I][0.081334] matmul primitive_desc create - attr [PROF:I][0.095789] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002961,ms [API:I][0.081350] matmul primitive create [CORE:I][0.095978] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.095982] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.081255] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.081753] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.498ms graph_exe_count=-1 weight_address=0x70de2bff2040 [PROF:I][0.096323] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.521217,ms [API:I][0.081887] matmul desc create - no bias [CORE:I][0.096559] matmul desc init [matmul] [API:I][0.081904] matmul primitive_desc create - attr [PROF:I][0.096357] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00164,ms [API:I][0.081917] matmul primitive create [CORE:I][0.096544] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.096548] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.081820] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.081970] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.15ms graph_exe_count=-1 weight_address=0x19155940 [PROF:I][0.096540] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.170965,ms [API:I][0.082105] matmul desc create - no bias [CORE:I][0.096777] matmul desc init [matmul] [API:I][0.082119] matmul primitive_desc create - attr [PROF:I][0.096573] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00159,ms [API:I][0.082133] matmul primitive create [CORE:I][0.096762] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.096767] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.082041] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.082197] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.156ms graph_exe_count=-1 weight_address=0x1a155980 [PROF:I][0.096767] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.180636,ms [PROF:V0][0.082331] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.04517,ms [CORE:I][0.096757] CPU Stream deleted [stream] [CORE:I][0.097163] CPU Engine deleted [engine] [API:I][0.097420] Memory create [CORE:V0][0.097410] Memory desc init by tag [memory] [CORE:I][0.097418] Memory created [memory] [API:I][0.097441] Memory create - strides [CORE:I][0.097427] Memory desc init by Stride [memory] [CORE:I][0.097433] Memory created [memory] [API:I][0.097456] Memory create [CORE:V0][0.097443] Memory desc init by tag [memory] [CORE:I][0.097448] Memory created [memory] [API:I][0.097379] matmul desc create - no bias [CORE:I][0.097378] matmul desc init [matmul] [API:I][0.097395] matmul primitive_desc create - attr [PROF:I][0.097179] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002371,ms [API:I][0.097412] matmul primitive create [CORE:I][0.097368] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.097372] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.082644] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.083183] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.539ms graph_exe_count=-1 weight_address=0x70de2fff3040 [PROF:I][0.097754] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.561378,ms [API:I][0.098153] Memory create [CORE:V0][0.098141] Memory desc init by tag [memory] [CORE:I][0.098148] Memory created [memory] [API:I][0.098170] Memory 
create - strides [CORE:I][0.098159] Memory desc init by Stride [memory] [CORE:I][0.098164] Memory created [memory] [API:I][0.098186] Memory create [CORE:V0][0.098172] Memory desc init by tag [memory] [CORE:I][0.098178] Memory created [memory] [API:I][0.098109] matmul desc create - no bias [CORE:I][0.098108] matmul desc init [matmul] [API:I][0.098125] matmul primitive_desc create - attr [PROF:I][0.097907] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.002151,ms [API:I][0.098140] matmul primitive create [CORE:I][0.098096] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.098100] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.083371] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.085098] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.726ms graph_exe_count=-1 weight_address=0x70de33ff4040 [PROF:I][0.099669] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.74863,ms [API:I][0.100030] Memory create [CORE:V0][0.100019] Memory desc init by tag [memory] [CORE:I][0.100026] Memory created [memory] [API:I][0.100048] Memory create - strides [CORE:I][0.100036] Memory desc init by Stride [memory] [CORE:I][0.100041] Memory created [memory] [API:I][0.100065] Memory create [CORE:V0][0.100051] Memory desc init by tag [memory] [CORE:I][0.100058] Memory created [memory] [API:I][0.100082] Memory create [CORE:V0][0.100068] Memory desc init by tag [memory] [CORE:I][0.100073] Memory created [memory] [API:I][0.100099] Memory create [CORE:V0][0.100087] Memory desc init by 
tag [memory] [CORE:I][0.100093] Memory created [memory] [API:I][0.100027] matmul desc create - no bias [CORE:I][0.100025] matmul desc init [matmul] [API:I][0.100043] matmul primitive_desc create - attr [PROF:I][0.099824] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00203,ms [API:I][0.100056] matmul primitive create [CORE:I][0.100011] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.100016] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.085292] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.087046] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.756ms graph_exe_count=-1 weight_address=0x70de41ff5040 [PROF:I][0.101617] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.78139,ms [API:I][0.101997] Memory create [CORE:V0][0.101985] Memory desc init by tag [memory] [CORE:I][0.101994] Memory created [memory] [API:I][0.102018] Memory create - strides [CORE:I][0.102004] Memory desc init by Stride [memory] [CORE:I][0.102011] Memory created [memory] [API:I][0.102033] Memory create [CORE:V0][0.102022] Memory desc init by tag [memory] [CORE:I][0.102027] Memory created [memory] [API:I][0.101958] matmul desc create - no bias [CORE:I][0.101957] matmul desc init [matmul] [API:I][0.101976] matmul primitive_desc create - attr [PROF:I][0.101758] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00253,ms [API:I][0.101992] matmul primitive create [CORE:I][0.101948] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.101953] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.087228] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.088775] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.546ms graph_exe_count=-1 weight_address=0x70e0ddfff040 [PROF:I][0.103346] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.57346,ms [API:I][0.103834] Memory create [CORE:V0][0.103824] Memory desc init by tag [memory] [CORE:I][0.103831] Memory created [memory] [API:I][0.103854] Memory create - strides [CORE:I][0.103842] Memory desc init by Stride [memory] [CORE:I][0.103848] Memory created [memory] [API:I][0.103871] Memory create [CORE:V0][0.103857] Memory desc init by tag [memory] [CORE:I][0.103865] Memory created [memory] [API:I][0.103890] Memory create [CORE:V0][0.103877] Memory desc init by tag [memory] [CORE:I][0.103883] Memory created [memory] [API:I][0.103905] Memory create - strides [CORE:I][0.103892] Memory desc init by Stride [memory] [CORE:I][0.103899] Memory created [memory] [API:I][0.103922] Memory create [CORE:V0][0.103910] Memory desc init by tag [memory] [CORE:I][0.103916] Memory created [memory] [API:I][0.103941] Memory create [CORE:V0][0.103928] Memory desc init by tag [memory] [CORE:I][0.103934] Memory created [memory] [API:I][0.103956] Memory create - strides [CORE:I][0.103945] Memory desc init by 
Stride [memory] [CORE:I][0.103959] Memory created [memory] [API:I][0.103980] Memory create [CORE:V0][0.103968] Memory desc init by tag [memory] [CORE:I][0.103974] Memory created [memory] [API:I][0.089231] CPU Engine create [CORE:V0][0.104056] CPU Engine created [engine] [CORE:I][0.104061] CPU Engine created [cpu/engine] [API:I][0.089251] CPU Stream create [CORE:I][0.103674] CPU Stream created [stream] [CORE:V0][0.103674] CPU Stream created [cpu/stream] [API:I][0.089270] matmul desc create - no bias [CORE:I][0.103942] matmul desc init [matmul] [API:I][0.089287] matmul primitive_desc create - attr [PROF:I][0.103741] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00203,ms [API:I][0.089301] matmul primitive create [CORE:I][0.103929] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.103934] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.089210] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.089713] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.504ms graph_exe_count=-1 weight_address=0x70e0abffa040 [PROF:I][0.104284] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.531637,ms [API:I][0.089848] matmul desc create - no bias [CORE:I][0.104519] matmul desc init [matmul] [API:I][0.089861] matmul primitive_desc create - attr [PROF:I][0.104315] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00192,ms [API:I][0.089875] matmul primitive create [CORE:I][0.104504] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.104510] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.089785] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.089923] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.139ms graph_exe_count=-1 weight_address=0x2b1a62c0 [PROF:I][0.104493] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.164085,ms [API:I][0.090057] matmul desc create - no bias [CORE:I][0.104730] matmul desc init [matmul] [API:I][0.090072] matmul primitive_desc create - attr [PROF:I][0.104526] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00158,ms [API:I][0.090086] matmul primitive create [CORE:I][0.104715] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.104720] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.089993] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.090151] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.157ms graph_exe_count=-1 weight_address=0x2c1a6300 [PROF:I][0.104721] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.181266,ms [PROF:V0][0.090284] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05298,ms [CORE:I][0.104711] CPU Stream deleted [stream] [CORE:I][0.105116] CPU Engine deleted [engine] [API:I][0.105384] Memory create [CORE:V0][0.105373] Memory desc init by tag [memory] [CORE:I][0.105380] Memory created [memory] [API:I][0.105403] Memory create - strides [CORE:I][0.105391] Memory desc init by Stride [memory] [CORE:I][0.105396] Memory created [memory] [API:I][0.105419] Memory create [CORE:V0][0.105407] Memory desc init by tag [memory] [CORE:I][0.105412] Memory created [memory] [API:I][0.105344] matmul desc create - no bias [CORE:I][0.105344] matmul desc init [matmul] [API:I][0.105360] matmul primitive_desc create - attr [PROF:I][0.105143] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002411,ms [API:I][0.105377] matmul primitive create [CORE:I][0.105334] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.105339] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.090614] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.091187] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.574ms graph_exe_count=-1 weight_address=0x70e0afffb040 [PROF:I][0.105758] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.59951,ms [API:I][0.106174] Memory create [CORE:V0][0.106162] Memory desc init by tag [memory] [CORE:I][0.106169] Memory created [memory] [API:I][0.106191] Memory create 
- strides [CORE:I][0.106177] Memory desc init by Stride [memory] [CORE:I][0.106182] Memory created [memory] [API:I][0.106205] Memory create [CORE:V0][0.106193] Memory desc init by tag [memory] [CORE:I][0.106199] Memory created [memory] [API:I][0.106130] matmul desc create - no bias [CORE:I][0.106129] matmul desc init [matmul] [API:I][0.106147] matmul primitive_desc create - attr [PROF:I][0.105929] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00223,ms [API:I][0.106163] matmul primitive create [CORE:I][0.106120] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.106125] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.091399] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.093209] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.811ms graph_exe_count=-1 weight_address=0x70e0b3ffc040 [PROF:I][0.107782] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.83794,ms [API:I][0.108147] Memory create [CORE:V0][0.108134] Memory desc init by tag [memory] [CORE:I][0.108141] Memory created [memory] [API:I][0.108164] Memory create - strides [CORE:I][0.108150] Memory desc init by Stride [memory] [CORE:I][0.108155] Memory created [memory] [API:I][0.108179] Memory create [CORE:V0][0.108165] Memory desc init by tag [memory] [CORE:I][0.108172] Memory created [memory] [API:I][0.108196] Memory create [CORE:V0][0.108183] Memory desc init by tag [memory] [CORE:I][0.108189] Memory created [memory] [API:I][0.108214] Memory create [CORE:V0][0.108199] Memory desc init by tag 
[memory] [CORE:I][0.108204] Memory created [memory] [API:I][0.108137] matmul desc create - no bias [CORE:I][0.108136] matmul desc init [matmul] [API:I][0.108153] matmul primitive_desc create - attr [PROF:I][0.107936] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00228,ms [API:I][0.108170] matmul primitive create [CORE:I][0.108128] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.108133] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.093408] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.095199] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.793ms graph_exe_count=-1 weight_address=0x70e0c1ffd040 [PROF:I][0.109770] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.81784,ms [API:I][0.110138] Memory create [CORE:V0][0.110126] Memory desc init by tag [memory] [CORE:I][0.110133] Memory created [memory] [API:I][0.110155] Memory create - strides [CORE:I][0.110142] Memory desc init by Stride [memory] [CORE:I][0.110147] Memory created [memory] [API:I][0.110170] Memory create [CORE:V0][0.110156] Memory desc init by tag [memory] [CORE:I][0.110160] Memory created [memory] [API:I][0.110089] matmul desc create - no bias [CORE:I][0.110087] matmul desc init [matmul] [API:I][0.110103] matmul primitive_desc create - attr [PROF:I][0.109885] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00244,ms [API:I][0.110119] matmul primitive create [CORE:I][0.110074] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.110078] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.095350] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.096917] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.566ms graph_exe_count=-1 weight_address=0x70e0cfffe040 [PROF:I][0.111488] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.58937,ms [API:I][0.111955] Memory create [CORE:V0][0.111945] Memory desc init by tag [memory] [CORE:I][0.111962] Memory created [memory] [API:I][0.111985] Memory create - strides [CORE:I][0.111972] Memory desc init by Stride [memory] [CORE:I][0.111978] Memory created [memory] [API:I][0.112001] Memory create [CORE:V0][0.111990] Memory desc init by tag [memory] [CORE:I][0.111996] Memory created [memory] [API:I][0.112020] Memory create [CORE:V0][0.112007] Memory desc init by tag [memory] [CORE:I][0.112013] Memory created [memory] [API:I][0.112036] Memory create - strides [CORE:I][0.112022] Memory desc init by Stride [memory] [CORE:I][0.112028] Memory created [memory] [API:I][0.112051] Memory create [CORE:V0][0.112037] Memory desc init by tag [memory] [CORE:I][0.112042] Memory created [memory] [API:I][0.112067] Memory create [CORE:V0][0.112055] Memory desc init by tag [memory] [CORE:I][0.112061] Memory created [memory] [API:I][0.112084] Memory create - strides [CORE:I][0.112071] Memory desc init by 
Stride [memory] [CORE:I][0.112077] Memory created [memory] [API:I][0.112099] Memory create [CORE:V0][0.112087] Memory desc init by tag [memory] [CORE:I][0.112093] Memory created [memory] [API:I][0.097351] CPU Engine create [CORE:V0][0.112174] CPU Engine created [engine] [CORE:I][0.112178] CPU Engine created [cpu/engine] [API:I][0.097364] CPU Stream create [CORE:I][0.111787] CPU Stream created [stream] [CORE:V0][0.111788] CPU Stream created [cpu/stream] [API:I][0.097382] matmul desc create - no bias [CORE:I][0.112052] matmul desc init [matmul] [API:I][0.097397] matmul primitive_desc create - attr [PROF:I][0.111852] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00234,ms [API:I][0.097411] matmul primitive create [CORE:I][0.112040] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.112045] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.097321] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.097851] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.53ms graph_exe_count=-1 weight_address=0x70e079ff5040 [PROF:I][0.112422] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.557949,ms [API:I][0.097990] matmul desc create - no bias [CORE:I][0.112661] matmul desc init [matmul] [API:I][0.098005] matmul primitive_desc create - attr [PROF:I][0.112459] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00169,ms [API:I][0.098020] matmul primitive create [CORE:I][0.112648] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.112653] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.097927] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.098073] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.146ms graph_exe_count=-1 weight_address=0x2d1ae3c0 [PROF:I][0.112643] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.170896,ms [API:I][0.098219] matmul desc create - no bias [CORE:I][0.112889] matmul desc init [matmul] [API:I][0.098231] matmul primitive_desc create - attr [PROF:I][0.112685] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00153,ms [API:I][0.098245] matmul primitive create [CORE:I][0.112873] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.112878] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.098151] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.098286] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.135ms graph_exe_count=-1 weight_address=0x2e1ae400 [PROF:I][0.112856] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.158805,ms [PROF:V0][0.098419] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.06787,ms [CORE:I][0.112845] CPU Stream deleted [stream] [CORE:I][0.113251] CPU Engine deleted [engine] [API:I][0.113515] Memory create [CORE:V0][0.113504] Memory desc init by tag [memory] [CORE:I][0.113512] Memory created [memory] [API:I][0.113535] Memory create - strides [CORE:I][0.113523] Memory desc init by Stride [memory] [CORE:I][0.113528] Memory created [memory] [API:I][0.113549] Memory create [CORE:V0][0.113534] Memory desc init by tag [memory] [CORE:I][0.113540] Memory created [memory] [API:I][0.113471] matmul desc create - no bias [CORE:I][0.113468] matmul desc init [matmul] [API:I][0.113485] matmul primitive_desc create - attr [PROF:I][0.113268] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00252,ms [API:I][0.113501] matmul primitive create [CORE:I][0.113457] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.113461] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.098733] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.099289] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.557ms graph_exe_count=-1 weight_address=0x70e07dff6040 [PROF:I][0.113859] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.577989,ms [API:I][0.114261] Memory create [CORE:V0][0.114249] Memory desc init by tag [memory] [CORE:I][0.114255] Memory created [memory] [API:I][0.114278] Memory create 
- strides [CORE:I][0.114265] Memory desc init by Stride [memory] [CORE:I][0.114270] Memory created [memory] [API:I][0.114293] Memory create [CORE:V0][0.114282] Memory desc init by tag [memory] [CORE:I][0.114287] Memory created [memory] [API:I][0.114218] matmul desc create - no bias [CORE:I][0.114218] matmul desc init [matmul] [API:I][0.114235] matmul primitive_desc create - attr [PROF:I][0.114018] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00208,ms [API:I][0.114251] matmul primitive create [CORE:I][0.114206] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.114210] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.099481] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.101233] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.751ms graph_exe_count=-1 weight_address=0x70e081ff7040 [PROF:I][0.115804] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.77388,ms [API:I][0.116171] Memory create [CORE:V0][0.116159] Memory desc init by tag [memory] [CORE:I][0.116166] Memory created [memory] [API:I][0.116188] Memory create - strides [CORE:I][0.116175] Memory desc init by Stride [memory] [CORE:I][0.116181] Memory created [memory] [API:I][0.116204] Memory create [CORE:V0][0.116190] Memory desc init by tag [memory] [CORE:I][0.116196] Memory created [memory] [API:I][0.116220] Memory create [CORE:V0][0.116206] Memory desc init by tag [memory] [CORE:I][0.116213] Memory created [memory] [API:I][0.116240] Memory create [CORE:V0][0.116227] Memory desc init by tag 
[memory] [CORE:I][0.116233] Memory created [memory] [API:I][0.116166] matmul desc create - no bias [CORE:I][0.116165] matmul desc init [matmul] [API:I][0.116184] matmul primitive_desc create - attr [PROF:I][0.115967] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.0025,ms [API:I][0.116201] matmul primitive create [CORE:I][0.116160] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.116165] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.101440] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.103210] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.771ms graph_exe_count=-1 weight_address=0x70e08fff8040 [PROF:I][0.117782] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.79773,ms [API:I][0.118154] Memory create [CORE:V0][0.118142] Memory desc init by tag [memory] [CORE:I][0.118148] Memory created [memory] [API:I][0.118171] Memory create - strides [CORE:I][0.118157] Memory desc init by Stride [memory] [CORE:I][0.118163] Memory created [memory] [API:I][0.118185] Memory create [CORE:V0][0.118172] Memory desc init by tag [memory] [CORE:I][0.118178] Memory created [memory] [API:I][0.118109] matmul desc create - no bias [CORE:I][0.118107] matmul desc init [matmul] [API:I][0.118123] matmul primitive_desc create - attr [PROF:I][0.117906] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00242,ms [API:I][0.118140] matmul primitive create [CORE:I][0.118096] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.118100] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.103371] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.104970] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.598ms graph_exe_count=-1 weight_address=0x70e09dff9040 [PROF:I][0.119541] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.62137,ms [API:I][0.120029] Memory create [CORE:V0][0.120020] Memory desc init by tag [memory] [CORE:I][0.120027] Memory created [memory] [API:I][0.120049] Memory create - strides [CORE:I][0.120036] Memory desc init by Stride [memory] [CORE:I][0.120042] Memory created [memory] [API:I][0.120065] Memory create [CORE:V0][0.120051] Memory desc init by tag [memory] [CORE:I][0.120055] Memory created [memory] [API:I][0.120080] Memory create [CORE:V0][0.120065] Memory desc init by tag [memory] [CORE:I][0.120071] Memory created [memory] [API:I][0.120095] Memory create - strides [CORE:I][0.120084] Memory desc init by Stride [memory] [CORE:I][0.120090] Memory created [memory] [API:I][0.120113] Memory create [CORE:V0][0.120100] Memory desc init by tag [memory] [CORE:I][0.120106] Memory created [memory] [API:I][0.120130] Memory create [CORE:V0][0.120117] Memory desc init by tag [memory] [CORE:I][0.120123] Memory created [memory] [API:I][0.120145] Memory create - strides [CORE:I][0.120131] Memory desc init by 
Stride [memory] [CORE:I][0.120137] Memory created [memory] [API:I][0.120159] Memory create [CORE:V0][0.120145] Memory desc init by tag [memory] [CORE:I][0.120150] Memory created [memory] [API:I][0.105406] CPU Engine create [CORE:V0][0.120230] CPU Engine created [engine] [CORE:I][0.120236] CPU Engine created [cpu/engine] [API:I][0.105423] CPU Stream create [CORE:I][0.119847] CPU Stream created [stream] [CORE:V0][0.119847] CPU Stream created [cpu/stream] [API:I][0.105443] matmul desc create - no bias [CORE:I][0.120114] matmul desc init [matmul] [API:I][0.105458] matmul primitive_desc create - attr [PROF:I][0.119913] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00252,ms [API:I][0.105474] matmul primitive create [CORE:I][0.120102] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.120107] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.105380] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.105895] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.515ms graph_exe_count=-1 weight_address=0x70e047ff0040 [PROF:I][0.120465] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.538817,ms [API:I][0.106028] matmul desc create - no bias [CORE:I][0.120700] matmul desc init [matmul] [API:I][0.106045] matmul primitive_desc create - attr [PROF:I][0.120498] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00191,ms [API:I][0.106058] matmul primitive create [CORE:I][0.120686] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.120690] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.105962] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.106117] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.155ms graph_exe_count=-1 weight_address=0x2f1b64c0 [PROF:I][0.120688] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.177316,ms [API:I][0.106252] matmul desc create - no bias [CORE:I][0.120924] matmul desc init [matmul] [API:I][0.106265] matmul primitive_desc create - attr [PROF:I][0.120720] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00166,ms [API:I][0.106280] matmul primitive create [CORE:I][0.120908] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.120913] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.106187] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.106328] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.142ms graph_exe_count=-1 weight_address=0x301b6500 [PROF:I][0.120897] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.164716,ms [PROF:V0][0.106457] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05005,ms [CORE:I][0.120881] CPU Stream deleted [stream] [CORE:I][0.121287] CPU Engine deleted [engine] [API:I][0.121556] Memory create [CORE:V0][0.121545] Memory desc init by tag [memory] [CORE:I][0.121552] Memory created [memory] [API:I][0.121575] Memory create - strides [CORE:I][0.121563] Memory desc init by Stride [memory] [CORE:I][0.121568] Memory created [memory] [API:I][0.121591] Memory create [CORE:V0][0.121578] Memory desc init by tag [memory] [CORE:I][0.121581] Memory created [memory] [API:I][0.121511] matmul desc create - no bias [CORE:I][0.121508] matmul desc init [matmul] [API:I][0.121524] matmul primitive_desc create - attr [PROF:I][0.121307] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00255,ms [API:I][0.121540] matmul primitive create [CORE:I][0.121496] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.121500] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.106772] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.107253] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.482ms graph_exe_count=-1 weight_address=0x70e04bff1040 [PROF:I][0.121823] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.503276,ms [API:I][0.122230] Memory create [CORE:V0][0.122218] Memory desc init by tag [memory] [CORE:I][0.122224] Memory created [memory] [API:I][0.122247] Memory create 
- strides [CORE:I][0.122235] Memory desc init by Stride [memory] [CORE:I][0.122240] Memory created [memory] [API:I][0.122264] Memory create [CORE:V0][0.122250] Memory desc init by tag [memory] [CORE:I][0.122257] Memory created [memory] [API:I][0.122187] matmul desc create - no bias [CORE:I][0.122184] matmul desc init [matmul] [API:I][0.122197] matmul primitive_desc create - attr [PROF:I][0.121977] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00157,ms [API:I][0.122209] matmul primitive create [CORE:I][0.122164] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.122169] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.107443] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.109139] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.696ms graph_exe_count=-1 weight_address=0x70e04fff2040 [PROF:I][0.123711] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.72248,ms [API:I][0.124074] Memory create [CORE:V0][0.124061] Memory desc init by tag [memory] [CORE:I][0.124068] Memory created [memory] [API:I][0.124090] Memory create - strides [CORE:I][0.124077] Memory desc init by Stride [memory] [CORE:I][0.124082] Memory created [memory] [API:I][0.124106] Memory create [CORE:V0][0.124094] Memory desc init by tag [memory] [CORE:I][0.124099] Memory created [memory] [API:I][0.124124] Memory create [CORE:V0][0.124111] Memory desc init by tag [memory] [CORE:I][0.124117] Memory created [memory] [API:I][0.124141] Memory create [CORE:V0][0.124127] Memory desc init by tag 
[memory] [CORE:I][0.124134] Memory created [memory] [API:I][0.124066] matmul desc create - no bias [CORE:I][0.124063] matmul desc init [matmul] [API:I][0.124080] matmul primitive_desc create - attr [PROF:I][0.123862] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.002231,ms [API:I][0.124096] matmul primitive create [CORE:I][0.124055] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.124060] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.109336] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.111189] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.855ms graph_exe_count=-1 weight_address=0x70e05dff3040 [PROF:I][0.125761] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.88147,ms [API:I][0.126132] Memory create [CORE:V0][0.126120] Memory desc init by tag [memory] [CORE:I][0.126127] Memory created [memory] [API:I][0.126150] Memory create - strides [CORE:I][0.126135] Memory desc init by Stride [memory] [CORE:I][0.126142] Memory created [memory] [API:I][0.126164] Memory create [CORE:V0][0.126150] Memory desc init by tag [memory] [CORE:I][0.126155] Memory created [memory] [API:I][0.126086] matmul desc create - no bias [CORE:I][0.126085] matmul desc init [matmul] [API:I][0.126104] matmul primitive_desc create - attr [PROF:I][0.125887] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00232,ms [API:I][0.126120] matmul primitive create [CORE:I][0.126076] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.126080] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.111351] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.112921] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.569ms graph_exe_count=-1 weight_address=0x70e06bff4040 [PROF:I][0.127494] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59382,ms [API:I][0.127956] Memory create [CORE:V0][0.127945] Memory desc init by tag [memory] [CORE:I][0.127963] Memory created [memory] [API:I][0.127985] Memory create - strides [CORE:I][0.127974] Memory desc init by Stride [memory] [CORE:I][0.127979] Memory created [memory] [API:I][0.128001] Memory create [CORE:V0][0.127987] Memory desc init by tag [memory] [CORE:I][0.127993] Memory created [memory] [API:I][0.128019] Memory create [CORE:V0][0.128006] Memory desc init by tag [memory] [CORE:I][0.128012] Memory created [memory] [API:I][0.128036] Memory create - strides [CORE:I][0.128022] Memory desc init by Stride [memory] [CORE:I][0.128028] Memory created [memory] [API:I][0.128051] Memory create [CORE:V0][0.128038] Memory desc init by tag [memory] [CORE:I][0.128044] Memory created [memory] [API:I][0.128068] Memory create [CORE:V0][0.128055] Memory desc init by tag [memory] [CORE:I][0.128060] Memory created [memory] [API:I][0.128082] Memory create - strides [CORE:I][0.128071] Memory desc init by 
Stride [memory] [CORE:I][0.128076] Memory created [memory] [API:I][0.128098] Memory create [CORE:V0][0.128084] Memory desc init by tag [memory] [CORE:I][0.128090] Memory created [memory] [API:I][0.113347] CPU Engine create [CORE:V0][0.128170] CPU Engine created [engine] [CORE:I][0.128177] CPU Engine created [cpu/engine] [API:I][0.113364] CPU Stream create [CORE:I][0.127788] CPU Stream created [stream] [CORE:V0][0.127789] CPU Stream created [cpu/stream] [API:I][0.113383] matmul desc create - no bias [CORE:I][0.128053] matmul desc init [matmul] [API:I][0.113397] matmul primitive_desc create - attr [PROF:I][0.127853] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00282,ms [API:I][0.113414] matmul primitive create [CORE:I][0.128042] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.128047] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.113320] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.113823] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.503ms graph_exe_count=-1 weight_address=0x70e015feb040 [PROF:I][0.128394] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.527977,ms [API:I][0.113957] matmul desc create - no bias [CORE:I][0.128629] matmul desc init [matmul] [API:I][0.113972] matmul primitive_desc create - attr [PROF:I][0.128425] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00174,ms [API:I][0.113986] matmul primitive create [CORE:I][0.128615] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.128620] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.113894] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.114038] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.144ms graph_exe_count=-1 weight_address=0x311be5c0 [PROF:I][0.128608] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.168475,ms [API:I][0.114173] matmul desc create - no bias [CORE:I][0.128844] matmul desc init [matmul] [API:I][0.114187] matmul primitive_desc create - attr [PROF:I][0.128640] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00158,ms [API:I][0.114211] matmul primitive create [CORE:I][0.128839] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.128844] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.114118] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.114252] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.134ms graph_exe_count=-1 weight_address=0x321be600 [PROF:I][0.128822] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.158765,ms [PROF:V0][0.114383] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.03613,ms [CORE:I][0.128809] CPU Stream deleted [stream] [CORE:I][0.129215] CPU Engine deleted [engine] [API:I][0.129475] Memory create [CORE:V0][0.129464] Memory desc init by tag [memory] [CORE:I][0.129471] Memory created [memory] [API:I][0.129494] Memory create - strides [CORE:I][0.129482] Memory desc init by Stride [memory] [CORE:I][0.129488] Memory created [memory] [API:I][0.129510] Memory create [CORE:V0][0.129496] Memory desc init by tag [memory] [CORE:I][0.129502] Memory created [memory] [API:I][0.129433] matmul desc create - no bias [CORE:I][0.129433] matmul desc init [matmul] [API:I][0.129449] matmul primitive_desc create - attr [PROF:I][0.129232] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00236,ms [API:I][0.129466] matmul primitive create [CORE:I][0.129423] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.129428] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.114702] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.115251] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.549ms graph_exe_count=-1 weight_address=0x70e019fec040 [PROF:I][0.129822] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.574609,ms [API:I][0.130224] Memory create [CORE:V0][0.130211] Memory desc init by tag [memory] [CORE:I][0.130218] Memory created [memory] [API:I][0.130241] Memory create 
- strides [CORE:I][0.130227] Memory desc init by Stride [memory] [CORE:I][0.130232] Memory created [memory] [API:I][0.130254] Memory create [CORE:V0][0.130241] Memory desc init by tag [memory] [CORE:I][0.130246] Memory created [memory] [API:I][0.130177] matmul desc create - no bias [CORE:I][0.130175] matmul desc init [matmul] [API:I][0.130192] matmul primitive_desc create - attr [PROF:I][0.129974] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00211,ms [API:I][0.130207] matmul primitive create [CORE:I][0.130163] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.130167] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.115439] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.117148] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.709ms graph_exe_count=-1 weight_address=0x70e01dfed040 [PROF:I][0.131719] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.73185,ms [API:I][0.132080] Memory create [CORE:V0][0.132068] Memory desc init by tag [memory] [CORE:I][0.132075] Memory created [memory] [API:I][0.132097] Memory create - strides [CORE:I][0.132083] Memory desc init by Stride [memory] [CORE:I][0.132089] Memory created [memory] [API:I][0.132111] Memory create [CORE:V0][0.132097] Memory desc init by tag [memory] [CORE:I][0.132105] Memory created [memory] [API:I][0.132128] Memory create [CORE:V0][0.132113] Memory desc init by tag [memory] [CORE:I][0.132118] Memory created [memory] [API:I][0.132144] Memory create [CORE:V0][0.132131] Memory desc init by tag 
[memory] [CORE:I][0.132137] Memory created [memory] [API:I][0.132070] matmul desc create - no bias [CORE:I][0.132069] matmul desc init [matmul] [API:I][0.132088] matmul primitive_desc create - attr [PROF:I][0.131870] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.0022,ms [API:I][0.132105] matmul primitive create [CORE:I][0.132062] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.132067] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.117342] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.119068] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.727ms graph_exe_count=-1 weight_address=0x70e02bfee040 [PROF:I][0.133640] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.75364,ms [API:I][0.134015] Memory create [CORE:V0][0.134002] Memory desc init by tag [memory] [CORE:I][0.134007] Memory created [memory] [API:I][0.134030] Memory create - strides [CORE:I][0.134019] Memory desc init by Stride [memory] [CORE:I][0.134026] Memory created [memory] [API:I][0.134048] Memory create [CORE:V0][0.134034] Memory desc init by tag [memory] [CORE:I][0.134040] Memory created [memory] [API:I][0.133970] matmul desc create - no bias [CORE:I][0.133969] matmul desc init [matmul] [API:I][0.133986] matmul primitive_desc create - attr [PROF:I][0.133767] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00181,ms [API:I][0.133999] matmul primitive create [CORE:I][0.133954] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.133959] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.119233] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.120794] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.562ms graph_exe_count=-1 weight_address=0x70e039fef040 [PROF:I][0.135366] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.58786,ms [API:I][0.135853] Memory create [CORE:V0][0.135842] Memory desc init by tag [memory] [CORE:I][0.135848] Memory created [memory] [API:I][0.135871] Memory create - strides [CORE:I][0.135858] Memory desc init by Stride [memory] [CORE:I][0.135864] Memory created [memory] [API:I][0.135886] Memory create [CORE:V0][0.135872] Memory desc init by tag [memory] [CORE:I][0.135877] Memory created [memory] [API:I][0.135903] Memory create [CORE:V0][0.135891] Memory desc init by tag [memory] [CORE:I][0.135896] Memory created [memory] [API:I][0.135919] Memory create - strides [CORE:I][0.135907] Memory desc init by Stride [memory] [CORE:I][0.135915] Memory created [memory] [API:I][0.135938] Memory create [CORE:V0][0.135925] Memory desc init by tag [memory] [CORE:I][0.135931] Memory created [memory] [API:I][0.135955] Memory create [CORE:V0][0.135943] Memory desc init by tag [memory] [CORE:I][0.135959] Memory created [memory] [API:I][0.135982] Memory create - strides [CORE:I][0.135969] Memory desc init by 
Stride [memory] [CORE:I][0.135975] Memory created [memory] [API:I][0.135998] Memory create [CORE:V0][0.135985] Memory desc init by tag [memory] [CORE:I][0.135991] Memory created [memory] [API:I][0.121249] CPU Engine create [CORE:V0][0.136073] CPU Engine created [engine] [CORE:I][0.136080] CPU Engine created [cpu/engine] [API:I][0.121268] CPU Stream create [CORE:I][0.135693] CPU Stream created [stream] [CORE:V0][0.135694] CPU Stream created [cpu/stream] [API:I][0.121289] matmul desc create - no bias [CORE:I][0.135960] matmul desc init [matmul] [API:I][0.121308] matmul primitive_desc create - attr [PROF:I][0.135763] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0025,ms [API:I][0.121324] matmul primitive create [CORE:I][0.135954] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.135959] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.121235] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.121745] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.51ms graph_exe_count=-1 weight_address=0x70dfe3fe6040 [PROF:I][0.136316] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.537968,ms [API:I][0.121881] matmul desc create - no bias [CORE:I][0.136552] matmul desc init [matmul] [API:I][0.121896] matmul primitive_desc create - attr [PROF:I][0.136350] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00233,ms [API:I][0.121910] matmul primitive create [CORE:I][0.136539] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.136544] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.121818] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.121965] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.147ms graph_exe_count=-1 weight_address=0x331c66c0 [PROF:I][0.136535] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.171286,ms [API:I][0.122098] matmul desc create - no bias [CORE:I][0.136769] matmul desc init [matmul] [API:I][0.122110] matmul primitive_desc create - attr [PROF:I][0.136563] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00148,ms [API:I][0.122124] matmul primitive create [CORE:I][0.136751] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.136756] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.122029] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.122178] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.149ms graph_exe_count=-1 weight_address=0x341c6700 [PROF:I][0.136748] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.172036,ms [PROF:V0][0.122311] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.06104,ms [CORE:I][0.136736] CPU Stream deleted [stream] [CORE:I][0.137143] CPU Engine deleted [engine] [API:I][0.137407] Memory create [CORE:V0][0.137395] Memory desc init by tag [memory] [CORE:I][0.137400] Memory created [memory] [API:I][0.137425] Memory create - strides [CORE:I][0.137412] Memory desc init by Stride [memory] [CORE:I][0.137417] Memory created [memory] [API:I][0.137440] Memory create [CORE:V0][0.137426] Memory desc init by tag [memory] [CORE:I][0.137433] Memory created [memory] [API:I][0.137365] matmul desc create - no bias [CORE:I][0.137365] matmul desc init [matmul] [API:I][0.137381] matmul primitive_desc create - attr [PROF:I][0.137164] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00246,ms [API:I][0.137397] matmul primitive create [CORE:I][0.137353] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.137357] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.122629] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.123135] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.507ms graph_exe_count=-1 weight_address=0x70dfe7fe7040 [PROF:I][0.137705] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.527637,ms [API:I][0.138116] Memory create [CORE:V0][0.138103] Memory desc init by tag [memory] [CORE:I][0.138110] Memory created [memory] [API:I][0.138132] Memory create 
- strides [CORE:I][0.138120] Memory desc init by Stride [memory] [CORE:I][0.138125] Memory created [memory] [API:I][0.138147] Memory create [CORE:V0][0.138133] Memory desc init by tag [memory] [CORE:I][0.138139] Memory created [memory] [API:I][0.138069] matmul desc create - no bias [CORE:I][0.138069] matmul desc init [matmul] [API:I][0.138085] matmul primitive_desc create - attr [PROF:I][0.137867] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.0022,ms [API:I][0.138101] matmul primitive create [CORE:I][0.138058] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.138064] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.123338] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.125152] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.814ms graph_exe_count=-1 weight_address=0x70dfebfe8040 [PROF:I][0.139723] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.84041,ms [API:I][0.140084] Memory create [CORE:V0][0.140072] Memory desc init by tag [memory] [CORE:I][0.140079] Memory created [memory] [API:I][0.140101] Memory create - strides [CORE:I][0.140087] Memory desc init by Stride [memory] [CORE:I][0.140093] Memory created [memory] [API:I][0.140116] Memory create [CORE:V0][0.140104] Memory desc init by tag [memory] [CORE:I][0.140110] Memory created [memory] [API:I][0.140134] Memory create [CORE:V0][0.140122] Memory desc init by tag [memory] [CORE:I][0.140128] Memory created [memory] [API:I][0.140154] Memory create [CORE:V0][0.140141] Memory desc init by tag 
[memory] [CORE:I][0.140146] Memory created [memory] [API:I][0.140080] matmul desc create - no bias [CORE:I][0.140079] matmul desc init [matmul] [API:I][0.140098] matmul primitive_desc create - attr [PROF:I][0.139880] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00237,ms [API:I][0.140114] matmul primitive create [CORE:I][0.140071] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.140076] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.125350] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.127124] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.775ms graph_exe_count=-1 weight_address=0x70dff9fe9040 [PROF:I][0.141696] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.80098,ms [API:I][0.142064] Memory create [CORE:V0][0.142052] Memory desc init by tag [memory] [CORE:I][0.142059] Memory created [memory] [API:I][0.142082] Memory create - strides [CORE:I][0.142068] Memory desc init by Stride [memory] [CORE:I][0.142073] Memory created [memory] [API:I][0.142096] Memory create [CORE:V0][0.142082] Memory desc init by tag [memory] [CORE:I][0.142089] Memory created [memory] [API:I][0.142019] matmul desc create - no bias [CORE:I][0.142017] matmul desc init [matmul] [API:I][0.142035] matmul primitive_desc create - attr [PROF:I][0.141815] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00155,ms [API:I][0.142047] matmul primitive create [CORE:I][0.142003] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.142008] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.127283] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.128851] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.569ms graph_exe_count=-1 weight_address=0x70e007fea040 [PROF:I][0.143423] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59593,ms [API:I][0.143892] Memory create [CORE:V0][0.143883] Memory desc init by tag [memory] [CORE:I][0.143890] Memory created [memory] [API:I][0.143913] Memory create - strides [CORE:I][0.143898] Memory desc init by Stride [memory] [CORE:I][0.143901] Memory created [memory] [API:I][0.143923] Memory create [CORE:V0][0.143911] Memory desc init by tag [memory] [CORE:I][0.143917] Memory created [memory] [API:I][0.143940] Memory create [CORE:V0][0.143927] Memory desc init by tag [memory] [CORE:I][0.143933] Memory created [memory] [API:I][0.143955] Memory create - strides [CORE:I][0.143941] Memory desc init by Stride [memory] [CORE:I][0.143957] Memory created [memory] [API:I][0.143980] Memory create [CORE:V0][0.143967] Memory desc init by tag [memory] [CORE:I][0.143973] Memory created [memory] [API:I][0.143997] Memory create [CORE:V0][0.143984] Memory desc init by tag [memory] [CORE:I][0.143991] Memory created [memory] [API:I][0.144014] Memory create - strides [CORE:I][0.144001] Memory desc init by 
Stride [memory] [CORE:I][0.144009] Memory created [memory] [API:I][0.144032] Memory create [CORE:V0][0.144019] Memory desc init by tag [memory] [CORE:I][0.144026] Memory created [memory] [API:I][0.129283] CPU Engine create [CORE:V0][0.144107] CPU Engine created [engine] [CORE:I][0.144113] CPU Engine created [cpu/engine] [API:I][0.129301] CPU Stream create [CORE:I][0.143725] CPU Stream created [stream] [CORE:V0][0.143726] CPU Stream created [cpu/stream] [API:I][0.129323] matmul desc create - no bias [CORE:I][0.143995] matmul desc init [matmul] [API:I][0.129340] matmul primitive_desc create - attr [PROF:I][0.143795] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002601,ms [API:I][0.129356] matmul primitive create [CORE:I][0.143986] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.143991] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.129267] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.129781] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.514ms graph_exe_count=-1 weight_address=0x70db5fdfe040 [PROF:I][0.144353] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.543578,ms [API:I][0.129918] matmul desc create - no bias [CORE:I][0.144589] matmul desc init [matmul] [API:I][0.129931] matmul primitive_desc create - attr [PROF:I][0.144384] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00172,ms [API:I][0.129945] matmul primitive create [CORE:I][0.144573] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.144577] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.129849] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.129985] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.137ms graph_exe_count=-1 weight_address=0x351ce7c0 [PROF:I][0.144555] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.158006,ms [API:I][0.130121] matmul desc create - no bias [CORE:I][0.144792] matmul desc init [matmul] [API:I][0.130134] matmul primitive_desc create - attr [PROF:I][0.144588] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00153,ms [API:I][0.130148] matmul primitive create [CORE:I][0.144777] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.144782] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.130056] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.130206] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.15ms graph_exe_count=-1 weight_address=0x361ce800 [PROF:I][0.144777] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.175546,ms [PROF:V0][0.130341] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05713,ms [CORE:I][0.144766] CPU Stream deleted [stream] [CORE:I][0.145172] CPU Engine deleted [engine] [API:I][0.145432] Memory create [CORE:V0][0.145421] Memory desc init by tag [memory] [CORE:I][0.145427] Memory created [memory] [API:I][0.145451] Memory create - strides [CORE:I][0.145437] Memory desc init by Stride [memory] [CORE:I][0.145443] Memory created [memory] [API:I][0.145465] Memory create [CORE:V0][0.145452] Memory desc init by tag [memory] [CORE:I][0.145457] Memory created [memory] [API:I][0.145389] matmul desc create - no bias [CORE:I][0.145386] matmul desc init [matmul] [API:I][0.145404] matmul primitive_desc create - attr [PROF:I][0.145185] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00196,ms [API:I][0.145417] matmul primitive create [CORE:I][0.145373] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.145378] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.130652] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.131165] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.514ms graph_exe_count=-1 weight_address=0x70dfc3fe3040 [PROF:I][0.145736] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.539548,ms [API:I][0.146155] Memory create [CORE:V0][0.146143] Memory desc init by tag [memory] [CORE:I][0.146149] Memory created [memory] [API:I][0.146171] Memory create 
- strides [CORE:I][0.146158] Memory desc init by Stride [memory] [CORE:I][0.146163] Memory created [memory] [API:I][0.146186] Memory create [CORE:V0][0.146172] Memory desc init by tag [memory] [CORE:I][0.146179] Memory created [memory] [API:I][0.146109] matmul desc create - no bias [CORE:I][0.146107] matmul desc init [matmul] [API:I][0.146123] matmul primitive_desc create - attr [PROF:I][0.145904] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.0021,ms [API:I][0.146137] matmul primitive create [CORE:I][0.146092] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.146097] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.131369] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.133164] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.796ms graph_exe_count=-1 weight_address=0x70db63dff040 [PROF:I][0.147735] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.81875,ms [API:I][0.148098] Memory create [CORE:V0][0.148085] Memory desc init by tag [memory] [CORE:I][0.148092] Memory created [memory] [API:I][0.148114] Memory create - strides [CORE:I][0.148102] Memory desc init by Stride [memory] [CORE:I][0.148107] Memory created [memory] [API:I][0.148129] Memory create [CORE:V0][0.148116] Memory desc init by tag [memory] [CORE:I][0.148122] Memory created [memory] [API:I][0.148146] Memory create [CORE:V0][0.148134] Memory desc init by tag [memory] [CORE:I][0.148139] Memory created [memory] [API:I][0.148165] Memory create [CORE:V0][0.148153] Memory desc init by tag 
[memory] [CORE:I][0.148160] Memory created [memory] [API:I][0.148095] matmul desc create - no bias [CORE:I][0.148093] matmul desc init [matmul] [API:I][0.148110] matmul primitive_desc create - attr [PROF:I][0.147891] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00191,ms [API:I][0.148124] matmul primitive create [CORE:I][0.148079] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.148084] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.133358] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.135186] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.828ms graph_exe_count=-1 weight_address=0x70dfc7fe4040 [PROF:I][0.149757] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.85401,ms [API:I][0.150127] Memory create [CORE:V0][0.150114] Memory desc init by tag [memory] [CORE:I][0.150121] Memory created [memory] [API:I][0.150144] Memory create - strides [CORE:I][0.150130] Memory desc init by Stride [memory] [CORE:I][0.150135] Memory created [memory] [API:I][0.150159] Memory create [CORE:V0][0.150147] Memory desc init by tag [memory] [CORE:I][0.150153] Memory created [memory] [API:I][0.150084] matmul desc create - no bias [CORE:I][0.150083] matmul desc init [matmul] [API:I][0.150100] matmul primitive_desc create - attr [PROF:I][0.149880] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00162,ms [API:I][0.150112] matmul primitive create [CORE:I][0.150066] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.150070] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.135341] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.136912] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.57ms graph_exe_count=-1 weight_address=0x70dfd5fe5040 [PROF:I][0.151482] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59172,ms [API:I][0.151964] Memory create [CORE:V0][0.151963] Memory desc init by tag [memory] [CORE:I][0.151970] Memory created [memory] [API:I][0.151993] Memory create - strides [CORE:I][0.151983] Memory desc init by Stride [memory] [CORE:I][0.151988] Memory created [memory] [API:I][0.152011] Memory create [CORE:V0][0.151996] Memory desc init by tag [memory] [CORE:I][0.152001] Memory created [memory] [API:I][0.152026] Memory create [CORE:V0][0.152014] Memory desc init by tag [memory] [CORE:I][0.152019] Memory created [memory] [API:I][0.152042] Memory create - strides [CORE:I][0.152031] Memory desc init by Stride [memory] [CORE:I][0.152037] Memory created [memory] [API:I][0.152061] Memory create [CORE:V0][0.152047] Memory desc init by tag [memory] [CORE:I][0.152053] Memory created [memory] [API:I][0.152077] Memory create [CORE:V0][0.152064] Memory desc init by tag [memory] [CORE:I][0.152070] Memory created [memory] [API:I][0.152093] Memory create - strides [CORE:I][0.152079] Memory desc init by 
Stride [memory] [CORE:I][0.152086] Memory created [memory] [API:I][0.152108] Memory create [CORE:V0][0.152097] Memory desc init by tag [memory] [CORE:I][0.152103] Memory created [memory] [API:I][0.137360] CPU Engine create [CORE:V0][0.152183] CPU Engine created [engine] [CORE:I][0.152189] CPU Engine created [cpu/engine] [API:I][0.137376] CPU Stream create [CORE:I][0.151801] CPU Stream created [stream] [CORE:V0][0.151802] CPU Stream created [cpu/stream] [API:I][0.137397] matmul desc create - no bias [CORE:I][0.152069] matmul desc init [matmul] [API:I][0.137413] matmul primitive_desc create - attr [PROF:I][0.151870] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00277,ms [API:I][0.137430] matmul primitive create [CORE:I][0.152059] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.152063] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.137337] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.137871] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.535ms graph_exe_count=-1 weight_address=0x70db2ddf9040 [PROF:I][0.152442] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.559208,ms [API:I][0.138006] matmul desc create - no bias [CORE:I][0.152677] matmul desc init [matmul] [API:I][0.138022] matmul primitive_desc create - attr [PROF:I][0.152475] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00199,ms [API:I][0.138035] matmul primitive create [CORE:I][0.152665] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.152671] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.137946] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.138104] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.158ms graph_exe_count=-1 weight_address=0x371d68c0 [PROF:I][0.152674] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.184416,ms [API:I][0.138239] matmul desc create - no bias [CORE:I][0.152910] matmul desc init [matmul] [API:I][0.138253] matmul primitive_desc create - attr [PROF:I][0.152707] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00168,ms [API:I][0.138267] matmul primitive create [CORE:I][0.152895] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.152899] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.138169] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.138326] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.156ms graph_exe_count=-1 weight_address=0x381d6900 [PROF:I][0.152895] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.175976,ms [PROF:V0][0.138456] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.09717,ms [CORE:I][0.152882] CPU Stream deleted [stream] [CORE:I][0.153290] CPU Engine deleted [engine] [API:I][0.153565] Memory create [CORE:V0][0.153555] Memory desc init by tag [memory] [CORE:I][0.153562] Memory created [memory] [API:I][0.153585] Memory create - strides [CORE:I][0.153573] Memory desc init by Stride [memory] [CORE:I][0.153579] Memory created [memory] [API:I][0.153601] Memory create [CORE:V0][0.153587] Memory desc init by tag [memory] [CORE:I][0.153593] Memory created [memory] [API:I][0.153524] matmul desc create - no bias [CORE:I][0.153524] matmul desc init [matmul] [API:I][0.153541] matmul primitive_desc create - attr [PROF:I][0.153325] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0026,ms [API:I][0.153558] matmul primitive create [CORE:I][0.153516] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.153521] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.138795] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.139322] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.527ms graph_exe_count=-1 weight_address=0x70db31dfa040 [PROF:I][0.153893] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.552858,ms [API:I][0.154309] Memory create [CORE:V0][0.154297] Memory desc init by tag [memory] [CORE:I][0.154303] Memory created [memory] [API:I][0.154326] Memory create 
- strides [CORE:I][0.154312] Memory desc init by Stride [memory] [CORE:I][0.154318] Memory created [memory] [API:I][0.154340] Memory create [CORE:V0][0.154326] Memory desc init by tag [memory] [CORE:I][0.154332] Memory created [memory] [API:I][0.154263] matmul desc create - no bias [CORE:I][0.154262] matmul desc init [matmul] [API:I][0.154280] matmul primitive_desc create - attr [PROF:I][0.154063] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.002491,ms [API:I][0.154296] matmul primitive create [CORE:I][0.154252] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.154257] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.139528] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.141279] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.751ms graph_exe_count=-1 weight_address=0x70db35dfb040 [PROF:I][0.155851] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.77452,ms [API:I][0.156211] Memory create [CORE:V0][0.156199] Memory desc init by tag [memory] [CORE:I][0.156205] Memory created [memory] [API:I][0.156228] Memory create - strides [CORE:I][0.156214] Memory desc init by Stride [memory] [CORE:I][0.156220] Memory created [memory] [API:I][0.156242] Memory create [CORE:V0][0.156229] Memory desc init by tag [memory] [CORE:I][0.156237] Memory created [memory] [API:I][0.156260] Memory create [CORE:V0][0.156245] Memory desc init by tag [memory] [CORE:I][0.156252] Memory created [memory] [API:I][0.156278] Memory create [CORE:V0][0.156263] Memory desc init by tag 
[memory] [CORE:I][0.156268] Memory created [memory] [API:I][0.156202] matmul desc create - no bias [CORE:I][0.156202] matmul desc init [matmul] [API:I][0.156220] matmul primitive_desc create - attr [PROF:I][0.156002] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.0022,ms [API:I][0.156236] matmul primitive create [CORE:I][0.156193] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.156198] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.141473] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.143241] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.77ms graph_exe_count=-1 weight_address=0x70db43dfc040 [PROF:I][0.157812] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.79556,ms [API:I][0.158182] Memory create [CORE:V0][0.158170] Memory desc init by tag [memory] [CORE:I][0.158177] Memory created [memory] [API:I][0.158199] Memory create - strides [CORE:I][0.158186] Memory desc init by Stride [memory] [CORE:I][0.158191] Memory created [memory] [API:I][0.158213] Memory create [CORE:V0][0.158200] Memory desc init by tag [memory] [CORE:I][0.158207] Memory created [memory] [API:I][0.158137] matmul desc create - no bias [CORE:I][0.158135] matmul desc init [matmul] [API:I][0.158151] matmul primitive_desc create - attr [PROF:I][0.157933] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00232,ms [API:I][0.158167] matmul primitive create [CORE:I][0.158124] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.158129] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.143400] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.144972] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.573ms graph_exe_count=-1 weight_address=0x70db51dfd040 [PROF:I][0.159544] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59546,ms [API:I][0.160020] Memory create [CORE:V0][0.160010] Memory desc init by tag [memory] [CORE:I][0.160017] Memory created [memory] [API:I][0.160040] Memory create - strides [CORE:I][0.160027] Memory desc init by Stride [memory] [CORE:I][0.160032] Memory created [memory] [API:I][0.160055] Memory create [CORE:V0][0.160041] Memory desc init by tag [memory] [CORE:I][0.160046] Memory created [memory] [API:I][0.160071] Memory create [CORE:V0][0.160058] Memory desc init by tag [memory] [CORE:I][0.160064] Memory created [memory] [API:I][0.160089] Memory create - strides [CORE:I][0.160076] Memory desc init by Stride [memory] [CORE:I][0.160082] Memory created [memory] [API:I][0.160106] Memory create [CORE:V0][0.160093] Memory desc init by tag [memory] [CORE:I][0.160098] Memory created [memory] [API:I][0.160123] Memory create [CORE:V0][0.160111] Memory desc init by tag [memory] [CORE:I][0.160117] Memory created [memory] [API:I][0.160140] Memory create - strides [CORE:I][0.160127] Memory desc init by Stride [memory] [CORE:I][0.160133] Memory created [memory] [API:I][0.160155] Memory create [CORE:V0][0.160143] 
Memory desc init by tag [memory] [CORE:I][0.160152] Memory created [memory] [API:I][0.145408] CPU Engine create [CORE:V0][0.160231] CPU Engine created [engine] [CORE:I][0.160235] CPU Engine created [cpu/engine] [API:I][0.145423] CPU Stream create [CORE:I][0.159846] CPU Stream created [stream] [CORE:V0][0.159846] CPU Stream created [cpu/stream] [API:I][0.145442] matmul desc create - no bias [CORE:I][0.160114] matmul desc init [matmul] [API:I][0.145458] matmul primitive_desc create - attr [PROF:I][0.159913] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00226,ms [API:I][0.145472] matmul primitive create [CORE:I][0.160101] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.160106] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.145382] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.145894] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.513ms graph_exe_count=-1 weight_address=0x70dafbdf4040 [PROF:I][0.160466] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.541018,ms [API:I][0.146032] matmul desc create - no bias [CORE:I][0.160704] matmul desc init [matmul] [API:I][0.146047] matmul primitive_desc create - attr [PROF:I][0.160500] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0017,ms [API:I][0.146061] matmul primitive create [CORE:I][0.160689] zendnn_f32_matmul_t::execute_ref 
[CORE:V0][0.160694] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.145968] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.146125] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.157ms graph_exe_count=-1 weight_address=0x391de9c0 [PROF:I][0.160695] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.181956,ms [API:I][0.146256] matmul desc create - no bias [CORE:I][0.160928] matmul desc init [matmul] [API:I][0.146272] matmul primitive_desc create - attr [PROF:I][0.160726] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00158,ms [API:I][0.146286] matmul primitive create [CORE:I][0.160915] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.160919] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.146193] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.146333] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.14ms graph_exe_count=-1 weight_address=0x3a1dea00 [PROF:I][0.160903] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.163545,ms [PROF:V0][0.146463] 
zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05493,ms [CORE:I][0.160890] CPU Stream deleted [stream] [CORE:I][0.161295] CPU Engine deleted [engine] [API:I][0.161546] Memory create [CORE:V0][0.161535] Memory desc init by tag [memory] [CORE:I][0.161542] Memory created [memory] [API:I][0.161565] Memory create - strides [CORE:I][0.161550] Memory desc init by Stride [memory] [CORE:I][0.161554] Memory created [memory] [API:I][0.161576] Memory create [CORE:V0][0.161564] Memory desc init by tag [memory] [CORE:I][0.161569] Memory created [memory] [API:I][0.161501] matmul desc create - no bias [CORE:I][0.161499] matmul desc init [matmul] [API:I][0.161516] matmul primitive_desc create - attr [PROF:I][0.161299] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002581,ms [API:I][0.161532] matmul primitive create [CORE:I][0.161488] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.161493] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.146767] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.147314] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.548ms graph_exe_count=-1 weight_address=0x70daffdf5040 [PROF:I][0.161886] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.573549,ms [API:I][0.162290] Memory create [CORE:V0][0.162278] Memory desc init by tag [memory] [CORE:I][0.162284] Memory created [memory] [API:I][0.162306] Memory create - strides [CORE:I][0.162294] Memory desc init by Stride [memory] 
[CORE:I][0.162301] Memory created [memory] [API:I][0.162325] Memory create [CORE:V0][0.162312] Memory desc init by tag [memory] [CORE:I][0.162317] Memory created [memory] [API:I][0.162247] matmul desc create - no bias [CORE:I][0.162246] matmul desc init [matmul] [API:I][0.162263] matmul primitive_desc create - attr [PROF:I][0.162045] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00228,ms [API:I][0.162279] matmul primitive create [CORE:I][0.162235] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.162240] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.147515] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.149318] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.802ms graph_exe_count=-1 weight_address=0x70db03df6040 [PROF:I][0.163890] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.83087,ms [API:I][0.164254] Memory create [CORE:V0][0.164241] Memory desc init by tag [memory] [CORE:I][0.164248] Memory created [memory] [API:I][0.164270] Memory create - strides [CORE:I][0.164256] Memory desc init by Stride [memory] [CORE:I][0.164262] Memory created [memory] [API:I][0.164285] Memory create [CORE:V0][0.164271] Memory desc init by tag [memory] [CORE:I][0.164278] Memory created [memory] [API:I][0.164301] Memory create [CORE:V0][0.164288] Memory desc init by tag [memory] [CORE:I][0.164293] Memory created [memory] [API:I][0.164319] Memory create [CORE:V0][0.164305] Memory desc init by tag [memory] [CORE:I][0.164311] Memory created [memory] [API:I][0.164245] 
matmul desc create - no bias [CORE:I][0.164243] matmul desc init [matmul] [API:I][0.164260] matmul primitive_desc create - attr [PROF:I][0.164043] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00285,ms [API:I][0.164277] matmul primitive create [CORE:I][0.164233] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.164237] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.149509] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.151248] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.74ms graph_exe_count=-1 weight_address=0x70db11df7040 [PROF:I][0.165819] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.7619,ms [API:I][0.166185] Memory create [CORE:V0][0.166173] Memory desc init by tag [memory] [CORE:I][0.166179] Memory created [memory] [API:I][0.166202] Memory create - strides [CORE:I][0.166188] Memory desc init by Stride [memory] [CORE:I][0.166194] Memory created [memory] [API:I][0.166216] Memory create [CORE:V0][0.166203] Memory desc init by tag [memory] [CORE:I][0.166209] Memory created [memory] [API:I][0.166139] matmul desc create - no bias [CORE:I][0.166139] matmul desc init [matmul] [API:I][0.166156] matmul primitive_desc create - attr [PROF:I][0.165939] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00232,ms [API:I][0.166172] matmul primitive create [CORE:I][0.166129] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.166134] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.151408] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.152979] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.572ms graph_exe_count=-1 weight_address=0x70db1fdf8040 [PROF:I][0.167550] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59719,ms [API:I][0.168042] Memory create [CORE:V0][0.168031] Memory desc init by tag [memory] [CORE:I][0.168036] Memory created [memory] [API:I][0.168059] Memory create - strides [CORE:I][0.168049] Memory desc init by Stride [memory] [CORE:I][0.168055] Memory created [memory] [API:I][0.168078] Memory create [CORE:V0][0.168065] Memory desc init by tag [memory] [CORE:I][0.168070] Memory created [memory] [API:I][0.168092] Memory create [CORE:V0][0.168077] Memory desc init by tag [memory] [CORE:I][0.168083] Memory created [memory] [API:I][0.168105] Memory create - strides [CORE:I][0.168092] Memory desc init by Stride [memory] [CORE:I][0.168099] Memory created [memory] [API:I][0.168124] Memory create [CORE:V0][0.168111] Memory desc init by tag [memory] [CORE:I][0.168117] Memory created [memory] [API:I][0.168143] Memory create [CORE:V0][0.168130] Memory desc init by tag [memory] [CORE:I][0.168135] Memory created [memory] [API:I][0.168157] Memory create - strides [CORE:I][0.168144] Memory desc init by Stride [memory] [CORE:I][0.168149] Memory created [memory] [API:I][0.168172] Memory create [CORE:V0][0.168159] Memory desc init by tag 
[memory] [CORE:I][0.168164] Memory created [memory] [API:I][0.153422] CPU Engine create [CORE:V0][0.168246] CPU Engine created [engine] [CORE:I][0.168251] CPU Engine created [cpu/engine] [API:I][0.153438] CPU Stream create [CORE:I][0.167861] CPU Stream created [stream] [CORE:V0][0.167862] CPU Stream created [cpu/stream] [API:I][0.153458] matmul desc create - no bias [CORE:I][0.168130] matmul desc init [matmul] [API:I][0.153474] matmul primitive_desc create - attr [PROF:I][0.167928] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00213,ms [API:I][0.153488] matmul primitive create [CORE:I][0.168115] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.168120] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.153396] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.153890] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.494ms graph_exe_count=-1 weight_address=0x70dac9def040 [PROF:I][0.168460] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.521137,ms [API:I][0.154024] matmul desc create - no bias [CORE:I][0.168694] matmul desc init [matmul] [API:I][0.154037] matmul primitive_desc create - attr [PROF:I][0.168490] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00175,ms [API:I][0.154051] matmul primitive create [CORE:I][0.168680] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.168686] M: 8 N: 1024 
K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.153960] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.154117] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.156ms graph_exe_count=-1 weight_address=0x3b1e6ac0 [PROF:I][0.168686] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.181676,ms [API:I][0.154251] matmul desc create - no bias [CORE:I][0.168922] matmul desc init [matmul] [API:I][0.154265] matmul primitive_desc create - attr [PROF:I][0.168719] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00157,ms [API:I][0.154279] matmul primitive create [CORE:I][0.168908] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.168913] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.154187] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.154330] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.143ms graph_exe_count=-1 weight_address=0x3c1e6b00 [PROF:I][0.168900] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.167365,ms [PROF:V0][0.154464] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.04077,ms 
[CORE:I][0.168890] CPU Stream deleted [stream] [CORE:I][0.169295] CPU Engine deleted [engine] [API:I][0.169567] Memory create [CORE:V0][0.169557] Memory desc init by tag [memory] [CORE:I][0.169564] Memory created [memory] [API:I][0.169586] Memory create - strides [CORE:I][0.169573] Memory desc init by Stride [memory] [CORE:I][0.169580] Memory created [memory] [API:I][0.169604] Memory create [CORE:V0][0.169590] Memory desc init by tag [memory] [CORE:I][0.169596] Memory created [memory] [API:I][0.169526] matmul desc create - no bias [CORE:I][0.169526] matmul desc init [matmul] [API:I][0.169542] matmul primitive_desc create - attr [PROF:I][0.169325] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0026,ms [API:I][0.169558] matmul primitive create [CORE:I][0.169515] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.169519] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.154791] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.155326] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.536ms graph_exe_count=-1 weight_address=0x70dacddf0040 [PROF:I][0.169896] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.557518,ms [API:I][0.170310] Memory create [CORE:V0][0.170298] Memory desc init by tag [memory] [CORE:I][0.170305] Memory created [memory] [API:I][0.170327] Memory create - strides [CORE:I][0.170313] Memory desc init by Stride [memory] [CORE:I][0.170318] Memory created [memory] [API:I][0.170341] Memory create [CORE:V0][0.170327] Memory desc init by tag [memory] 
[CORE:I][0.170333] Memory created [memory] [API:I][0.170264] matmul desc create - no bias [CORE:I][0.170263] matmul desc init [matmul] [API:I][0.170281] matmul primitive_desc create - attr [PROF:I][0.170063] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.0022,ms [API:I][0.170297] matmul primitive create [CORE:I][0.170254] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.170259] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.155533] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.157301] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.768ms graph_exe_count=-1 weight_address=0x70dad1df1040 [PROF:I][0.171872] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.79457,ms [API:I][0.172236] Memory create [CORE:V0][0.172224] Memory desc init by tag [memory] [CORE:I][0.172231] Memory created [memory] [API:I][0.172253] Memory create - strides [CORE:I][0.172241] Memory desc init by Stride [memory] [CORE:I][0.172246] Memory created [memory] [API:I][0.172270] Memory create [CORE:V0][0.172257] Memory desc init by tag [memory] [CORE:I][0.172263] Memory created [memory] [API:I][0.172287] Memory create [CORE:V0][0.172274] Memory desc init by tag [memory] [CORE:I][0.172281] Memory created [memory] [API:I][0.172307] Memory create [CORE:V0][0.172294] Memory desc init by tag [memory] [CORE:I][0.172300] Memory created [memory] [API:I][0.172234] matmul desc create - no bias [CORE:I][0.172232] matmul desc init [matmul] [API:I][0.172249] matmul primitive_desc create - attr 
[PROF:I][0.172030] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00203,ms [API:I][0.172262] matmul primitive create [CORE:I][0.172217] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.172223] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.157499] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.159279] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.782ms graph_exe_count=-1 weight_address=0x70dadfdf2040 [PROF:I][0.173850] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.8089,ms [API:I][0.174220] Memory create [CORE:V0][0.174208] Memory desc init by tag [memory] [CORE:I][0.174215] Memory created [memory] [API:I][0.174237] Memory create - strides [CORE:I][0.174224] Memory desc init by Stride [memory] [CORE:I][0.174228] Memory created [memory] [API:I][0.174250] Memory create [CORE:V0][0.174237] Memory desc init by tag [memory] [CORE:I][0.174243] Memory created [memory] [API:I][0.174173] matmul desc create - no bias [CORE:I][0.174172] matmul desc init [matmul] [API:I][0.174189] matmul primitive_desc create - attr [PROF:I][0.173971] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00227,ms [API:I][0.174205] matmul primitive create [CORE:I][0.174163] zendnn_f32_matmul_t::execute_ref 
[CORE:V0][0.174169] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.159443] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.160990] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.547ms graph_exe_count=-1 weight_address=0x70daeddf3040 [PROF:I][0.175561] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.57429,ms [API:I][0.176036] Memory create [CORE:V0][0.176025] Memory desc init by tag [memory] [CORE:I][0.176032] Memory created [memory] [API:I][0.176055] Memory create - strides [CORE:I][0.176043] Memory desc init by Stride [memory] [CORE:I][0.176048] Memory created [memory] [API:I][0.176071] Memory create [CORE:V0][0.176056] Memory desc init by tag [memory] [CORE:I][0.176062] Memory created [memory] [API:I][0.176090] Memory create [CORE:V0][0.176076] Memory desc init by tag [memory] [CORE:I][0.176083] Memory created [memory] [API:I][0.176106] Memory create - strides [CORE:I][0.176095] Memory desc init by Stride [memory] [CORE:I][0.176100] Memory created [memory] [API:I][0.176125] Memory create [CORE:V0][0.176113] Memory desc init by tag [memory] [CORE:I][0.176119] Memory created [memory] [API:I][0.176143] Memory create [CORE:V0][0.176130] Memory desc init by tag [memory] [CORE:I][0.176136] Memory created [memory] [API:I][0.176158] Memory create - strides [CORE:I][0.176146] Memory desc init by Stride [memory] [CORE:I][0.176151] Memory created [memory] [API:I][0.176175] Memory create [CORE:V0][0.176162] Memory desc init by tag [memory] [CORE:I][0.176169] Memory created [memory] [API:I][0.161425] CPU Engine create [CORE:V0][0.176248] CPU Engine created [engine] [CORE:I][0.176255] CPU 
Engine created [cpu/engine] [API:I][0.161441] CPU Stream create [CORE:I][0.175865] CPU Stream created [stream] [CORE:V0][0.175866] CPU Stream created [cpu/stream] [API:I][0.161462] matmul desc create - no bias [CORE:I][0.176134] matmul desc init [matmul] [API:I][0.161479] matmul primitive_desc create - attr [PROF:I][0.175935] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.003,ms [API:I][0.161495] matmul primitive create [CORE:I][0.176123] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.176128] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.161404] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.161922] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.519ms graph_exe_count=-1 weight_address=0x70da97dea040 [PROF:I][0.176493] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.545378,ms [API:I][0.162058] matmul desc create - no bias [CORE:I][0.176729] matmul desc init [matmul] [API:I][0.162074] matmul primitive_desc create - attr [PROF:I][0.176527] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00169,ms [API:I][0.162087] matmul primitive create [CORE:I][0.176717] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.176721] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.161996] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][0.162140] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.145ms graph_exe_count=-1 weight_address=0x3d1eebc0 [PROF:I][0.176710] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.169385,ms [API:I][0.162275] matmul desc create - no bias [CORE:I][0.176947] matmul desc init [matmul] [API:I][0.162290] matmul primitive_desc create - attr [PROF:I][0.176743] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00153,ms [API:I][0.162303] matmul primitive create [CORE:I][0.176931] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.176936] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.162210] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.162359] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.149ms graph_exe_count=-1 weight_address=0x3e1eec00 [PROF:I][0.176928] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.172725,ms [PROF:V0][0.162489] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.06396,ms [CORE:I][0.176915] CPU Stream deleted [stream] [CORE:I][0.177320] CPU Engine deleted [engine] [API:I][0.177571] Memory create [CORE:V0][0.177560] Memory desc init by 
tag [memory] [CORE:I][0.177567] Memory created [memory] [API:I][0.177589] Memory create - strides [CORE:I][0.177578] Memory desc init by Stride [memory] [CORE:I][0.177583] Memory created [memory] [API:I][0.177606] Memory create [CORE:V0][0.177593] Memory desc init by tag [memory] [CORE:I][0.177599] Memory created [memory] [API:I][0.177529] matmul desc create - no bias [CORE:I][0.177527] matmul desc init [matmul] [API:I][0.177543] matmul primitive_desc create - attr [PROF:I][0.177327] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00242,ms [API:I][0.177559] matmul primitive create [CORE:I][0.177515] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.177519] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.162791] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.163310] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.52ms graph_exe_count=-1 weight_address=0x70da9bdeb040 [PROF:I][0.177880] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.541418,ms [API:I][0.178277] Memory create [CORE:V0][0.178265] Memory desc init by tag [memory] [CORE:I][0.178271] Memory created [memory] [API:I][0.178294] Memory create - strides [CORE:I][0.178282] Memory desc init by Stride [memory] [CORE:I][0.178287] Memory created [memory] [API:I][0.178311] Memory create [CORE:V0][0.178299] Memory desc init by tag [memory] [CORE:I][0.178306] Memory created [memory] [API:I][0.178236] matmul desc create - no bias [CORE:I][0.178233] matmul desc init [matmul] [API:I][0.178248] matmul 
primitive_desc create - attr [PROF:I][0.178027] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00152,ms [API:I][0.178259] matmul primitive create [CORE:I][0.178214] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.178219] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.163493] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.165360] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.867ms graph_exe_count=-1 weight_address=0x70da9fdec040 [PROF:I][0.179932] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.8937,ms [API:I][0.180295] Memory create [CORE:V0][0.180283] Memory desc init by tag [memory] [CORE:I][0.180289] Memory created [memory] [API:I][0.180311] Memory create - strides [CORE:I][0.180298] Memory desc init by Stride [memory] [CORE:I][0.180303] Memory created [memory] [API:I][0.180326] Memory create [CORE:V0][0.180312] Memory desc init by tag [memory] [CORE:I][0.180317] Memory created [memory] [API:I][0.180342] Memory create [CORE:V0][0.180329] Memory desc init by tag [memory] [CORE:I][0.180335] Memory created [memory] [API:I][0.180361] Memory create [CORE:V0][0.180348] Memory desc init by tag [memory] [CORE:I][0.180353] Memory created [memory] [API:I][0.180287] matmul desc create - no bias [CORE:I][0.180287] matmul desc init [matmul] [API:I][0.180307] matmul primitive_desc create - attr [PROF:I][0.180089] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00251,ms [API:I][0.180324] matmul primitive create [CORE:I][0.180279] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.180285] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.165559] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.167299] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.741ms graph_exe_count=-1 weight_address=0x70daadded040 [PROF:I][0.181871] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.76785,ms [API:I][0.182239] Memory create [CORE:V0][0.182227] Memory desc init by tag [memory] [CORE:I][0.182233] Memory created [memory] [API:I][0.182255] Memory create - strides [CORE:I][0.182244] Memory desc init by Stride [memory] [CORE:I][0.182250] Memory created [memory] [API:I][0.182272] Memory create [CORE:V0][0.182259] Memory desc init by tag [memory] [CORE:I][0.182266] Memory created [memory] [API:I][0.182196] matmul desc create - no bias [CORE:I][0.182195] matmul desc init [matmul] [API:I][0.182213] matmul primitive_desc create - attr [PROF:I][0.181995] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00201,ms [API:I][0.182229] matmul primitive create [CORE:I][0.182184] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.182189] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) 
[PROF:V0][0.167464] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.169045] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.582ms graph_exe_count=-1 weight_address=0x70dabbdee040 [PROF:I][0.183616] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.60765,ms [API:I][0.184111] Memory create [CORE:V0][0.184100] Memory desc init by tag [memory] [CORE:I][0.184106] Memory created [memory] [API:I][0.184129] Memory create - strides [CORE:I][0.184116] Memory desc init by Stride [memory] [CORE:I][0.184121] Memory created [memory] [API:I][0.184143] Memory create [CORE:V0][0.184130] Memory desc init by tag [memory] [CORE:I][0.184135] Memory created [memory] [API:I][0.184161] Memory create [CORE:V0][0.184148] Memory desc init by tag [memory] [CORE:I][0.184154] Memory created [memory] [API:I][0.184177] Memory create - strides [CORE:I][0.184164] Memory desc init by Stride [memory] [CORE:I][0.184167] Memory created [memory] [API:I][0.184189] Memory create [CORE:V0][0.184176] Memory desc init by tag [memory] [CORE:I][0.184182] Memory created [memory] [API:I][0.184207] Memory create [CORE:V0][0.184193] Memory desc init by tag [memory] [CORE:I][0.184198] Memory created [memory] [API:I][0.184221] Memory create - strides [CORE:I][0.184207] Memory desc init by Stride [memory] [CORE:I][0.184212] Memory created [memory] [API:I][0.184235] Memory create [CORE:V0][0.184221] Memory desc init by tag [memory] [CORE:I][0.184228] Memory created [memory] [API:I][0.169485] CPU Engine create [CORE:V0][0.184307] CPU Engine created [engine] [CORE:I][0.184311] CPU Engine created [cpu/engine] [API:I][0.169498] CPU Stream create [CORE:I][0.183921] CPU Stream created [stream] [CORE:V0][0.183922] CPU Stream 
created [cpu/stream] [API:I][0.169519] matmul desc create - no bias [CORE:I][0.184190] matmul desc init [matmul] [API:I][0.169535] matmul primitive_desc create - attr [PROF:I][0.183990] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00255,ms [API:I][0.169552] matmul primitive create [CORE:I][0.184180] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.184184] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.169458] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.169969] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.512ms graph_exe_count=-1 weight_address=0x70da65de5040 [PROF:I][0.184538] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.534818,ms [API:I][0.170101] matmul desc create - no bias [CORE:I][0.184773] matmul desc init [matmul] [API:I][0.170116] matmul primitive_desc create - attr [PROF:I][0.184570] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00201,ms [API:I][0.170130] matmul primitive create [CORE:I][0.184759] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.184763] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.170037] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.170196] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 
n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.159ms graph_exe_count=-1 weight_address=0x3f1f6cc0 [PROF:I][0.184766] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.183026,ms [API:I][0.170330] matmul desc create - no bias [CORE:I][0.185000] matmul desc init [matmul] [API:I][0.170339] matmul primitive_desc create - attr [PROF:I][0.184790] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00101,ms [API:I][0.170349] matmul primitive create [CORE:I][0.184975] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.184980] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.170254] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.170400] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.146ms graph_exe_count=-1 weight_address=0x401f6d00 [PROF:I][0.184970] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.170676,ms [PROF:V0][0.170533] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.04785,ms [CORE:I][0.184959] CPU Stream deleted [stream] [CORE:I][0.185364] CPU Engine deleted [engine] [API:I][0.185638] Memory create [CORE:V0][0.185627] Memory desc init by tag [memory] [CORE:I][0.185634] Memory created [memory] [API:I][0.185656] Memory create - strides [CORE:I][0.185641] Memory desc init by Stride 
[memory] [CORE:I][0.185646] Memory created [memory] [API:I][0.185668] Memory create [CORE:V0][0.185655] Memory desc init by tag [memory] [CORE:I][0.185662] Memory created [memory] [API:I][0.185593] matmul desc create - no bias [CORE:I][0.185593] matmul desc init [matmul] [API:I][0.185610] matmul primitive_desc create - attr [PROF:I][0.185392] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00238,ms [API:I][0.185626] matmul primitive create [CORE:I][0.185581] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.185585] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.170858] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.171400] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.543ms graph_exe_count=-1 weight_address=0x70da69de6040 [PROF:I][0.185970] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.565099,ms [API:I][0.186381] Memory create [CORE:V0][0.186369] Memory desc init by tag [memory] [CORE:I][0.186375] Memory created [memory] [API:I][0.186398] Memory create - strides [CORE:I][0.186384] Memory desc init by Stride [memory] [CORE:I][0.186390] Memory created [memory] [API:I][0.186412] Memory create [CORE:V0][0.186399] Memory desc init by tag [memory] [CORE:I][0.186402] Memory created [memory] [API:I][0.186331] matmul desc create - no bias [CORE:I][0.186329] matmul desc init [matmul] [API:I][0.186346] matmul primitive_desc create - attr [PROF:I][0.186126] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00182,ms [API:I][0.186359] matmul primitive create [CORE:I][0.186313] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.186318] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.171592] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.173354] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.763ms graph_exe_count=-1 weight_address=0x70da6dde7040 [PROF:I][0.187926] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.78873,ms [API:I][0.188291] Memory create [CORE:V0][0.188279] Memory desc init by tag [memory] [CORE:I][0.188285] Memory created [memory] [API:I][0.188308] Memory create - strides [CORE:I][0.188294] Memory desc init by Stride [memory] [CORE:I][0.188300] Memory created [memory] [API:I][0.188322] Memory create [CORE:V0][0.188309] Memory desc init by tag [memory] [CORE:I][0.188316] Memory created [memory] [API:I][0.188340] Memory create [CORE:V0][0.188325] Memory desc init by tag [memory] [CORE:I][0.188330] Memory created [memory] [API:I][0.188355] Memory create [CORE:V0][0.188343] Memory desc init by tag [memory] [CORE:I][0.188350] Memory created [memory] [API:I][0.188284] matmul desc create - no bias [CORE:I][0.188284] matmul desc init [matmul] [API:I][0.188302] matmul primitive_desc create - attr [PROF:I][0.188084] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00229,ms [API:I][0.188319] matmul primitive create [CORE:I][0.188274] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.188279] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.173554] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.175302] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.75ms graph_exe_count=-1 weight_address=0x70da7bde8040 [PROF:I][0.189873] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.77523,ms [API:I][0.190244] Memory create [CORE:V0][0.190232] Memory desc init by tag [memory] [CORE:I][0.190238] Memory created [memory] [API:I][0.190261] Memory create - strides [CORE:I][0.190247] Memory desc init by Stride [memory] [CORE:I][0.190253] Memory created [memory] [API:I][0.190275] Memory create [CORE:V0][0.190261] Memory desc init by tag [memory] [CORE:I][0.190268] Memory created [memory] [API:I][0.190198] matmul desc create - no bias [CORE:I][0.190197] matmul desc init [matmul] [API:I][0.190211] matmul primitive_desc create - attr [PROF:I][0.189992] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00207,ms [API:I][0.190223] matmul primitive create [CORE:I][0.190176] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.190180] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.175451] Using AOCL GEMM 
API: aocl_gemm_f32f32f32of32 [PROF:V0][0.176995] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.544ms graph_exe_count=-1 weight_address=0x70da89de9040 [PROF:I][0.191565] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.56493,ms [API:I][0.192041] Memory create [CORE:V0][0.192029] Memory desc init by tag [memory] [CORE:I][0.192035] Memory created [memory] [API:I][0.192061] Memory create - strides [CORE:I][0.192048] Memory desc init by Stride [memory] [CORE:I][0.192055] Memory created [memory] [API:I][0.192077] Memory create [CORE:V0][0.192066] Memory desc init by tag [memory] [CORE:I][0.192072] Memory created [memory] [API:I][0.192095] Memory create [CORE:V0][0.192081] Memory desc init by tag [memory] [CORE:I][0.192088] Memory created [memory] [API:I][0.192110] Memory create - strides [CORE:I][0.192097] Memory desc init by Stride [memory] [CORE:I][0.192102] Memory created [memory] [API:I][0.192125] Memory create [CORE:V0][0.192113] Memory desc init by tag [memory] [CORE:I][0.192118] Memory created [memory] [API:I][0.192143] Memory create [CORE:V0][0.192130] Memory desc init by tag [memory] [CORE:I][0.192136] Memory created [memory] [API:I][0.192158] Memory create - strides [CORE:I][0.192146] Memory desc init by Stride [memory] [CORE:I][0.192152] Memory created [memory] [API:I][0.192175] Memory create [CORE:V0][0.192163] Memory desc init by tag [memory] [CORE:I][0.192167] Memory created [memory] [API:I][0.177423] CPU Engine create [CORE:V0][0.192247] CPU Engine created [engine] [CORE:I][0.192252] CPU Engine created [cpu/engine] [API:I][0.177439] CPU Stream create [CORE:I][0.191863] CPU Stream created [stream] [CORE:V0][0.191863] CPU Stream created [cpu/stream] [API:I][0.177459] 
matmul desc create - no bias [CORE:I][0.192131] matmul desc init [matmul] [API:I][0.177476] matmul primitive_desc create - attr [PROF:I][0.191930] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0023,ms [API:I][0.177490] matmul primitive create [CORE:I][0.192118] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.192122] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.177398] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.177941] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.543ms graph_exe_count=-1 weight_address=0x70da5dde3040 [PROF:I][0.192511] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.569489,ms [API:I][0.178075] matmul desc create - no bias [CORE:I][0.192745] matmul desc init [matmul] [API:I][0.178088] matmul primitive_desc create - attr [PROF:I][0.192541] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00165,ms [API:I][0.178101] matmul primitive create [CORE:I][0.192730] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.192735] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.178008] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.178156] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 
beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.147ms graph_exe_count=-1 weight_address=0x411f6d40 [PROF:I][0.192726] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.172116,ms [API:I][0.178290] matmul desc create - no bias [CORE:I][0.192961] matmul desc init [matmul] [API:I][0.178305] matmul primitive_desc create - attr [PROF:I][0.192759] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00165,ms [API:I][0.178319] matmul primitive create [CORE:I][0.192946] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.192950] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.178224] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.178367] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.143ms graph_exe_count=-1 weight_address=0x421f6d80 [PROF:I][0.192937] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.166255,ms [PROF:V0][0.178497] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.07397,ms [CORE:I][0.192923] CPU Stream deleted [stream] [CORE:I][0.193329] CPU Engine deleted [engine] [API:I][0.193585] Memory create [CORE:V0][0.193575] Memory desc init by tag [memory] [CORE:I][0.193581] Memory created [memory] [API:I][0.193604] Memory create - strides [CORE:I][0.193591] Memory desc init by Stride [memory] [CORE:I][0.193597] Memory created 
[memory] [API:I][0.193619] Memory create [CORE:V0][0.193606] Memory desc init by tag [memory] [CORE:I][0.193613] Memory created [memory] [API:I][0.193543] matmul desc create - no bias [CORE:I][0.193540] matmul desc init [matmul] [API:I][0.193556] matmul primitive_desc create - attr [PROF:I][0.193338] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00222,ms [API:I][0.193572] matmul primitive create [CORE:I][0.193528] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.193533] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.178807] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.179327] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.52ms graph_exe_count=-1 weight_address=0x70da61de4040 [PROF:I][0.193898] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.545718,ms [API:I][0.194302] Memory create [CORE:V0][0.194290] Memory desc init by tag [memory] [CORE:I][0.194296] Memory created [memory] [API:I][0.194318] Memory create - strides [CORE:I][0.194304] Memory desc init by Stride [memory] [CORE:I][0.194309] Memory created [memory] [API:I][0.194332] Memory create [CORE:V0][0.194318] Memory desc init by tag [memory] [CORE:I][0.194325] Memory created [memory] [API:I][0.194255] matmul desc create - no bias [CORE:I][0.194254] matmul desc init [matmul] [API:I][0.194271] matmul primitive_desc create - attr [PROF:I][0.194053] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00224,ms [API:I][0.194287] matmul primitive create [CORE:I][0.194243] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.194248] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.179523] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.181318] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.796ms graph_exe_count=-1 weight_address=0x70dc52bc6040 [PROF:I][0.195891] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.82304,ms [API:I][0.196255] Memory create [CORE:V0][0.196242] Memory desc init by tag [memory] [CORE:I][0.196249] Memory created [memory] [API:I][0.196272] Memory create - strides [CORE:I][0.196257] Memory desc init by Stride [memory] [CORE:I][0.196263] Memory created [memory] [API:I][0.196285] Memory create [CORE:V0][0.196273] Memory desc init by tag [memory] [CORE:I][0.196279] Memory created [memory] [API:I][0.196303] Memory create [CORE:V0][0.196291] Memory desc init by tag [memory] [CORE:I][0.196297] Memory created [memory] [API:I][0.196322] Memory create [CORE:V0][0.196309] Memory desc init by tag [memory] [CORE:I][0.196315] Memory created [memory] [API:I][0.196249] matmul desc create - no bias [CORE:I][0.196248] matmul desc init [matmul] [API:I][0.196266] matmul primitive_desc create - attr [PROF:I][0.196047] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00191,ms [API:I][0.196281] matmul primitive create [CORE:I][0.196235] 
zendnn_f32_matmul_t::execute_ref [CORE:V0][0.196239] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.181512] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.183268] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.758ms graph_exe_count=-1 weight_address=0x70dc60bc7040 [PROF:I][0.197838] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.77941,ms [API:I][0.198205] Memory create [CORE:V0][0.198193] Memory desc init by tag [memory] [CORE:I][0.198200] Memory created [memory] [API:I][0.198222] Memory create - strides [CORE:I][0.198209] Memory desc init by Stride [memory] [CORE:I][0.198215] Memory created [memory] [API:I][0.198237] Memory create [CORE:V0][0.198223] Memory desc init by tag [memory] [CORE:I][0.198231] Memory created [memory] [API:I][0.198160] matmul desc create - no bias [CORE:I][0.198158] matmul desc init [matmul] [API:I][0.198176] matmul primitive_desc create - attr [PROF:I][0.197958] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00251,ms [API:I][0.198192] matmul primitive create [CORE:I][0.198149] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.198154] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.183429] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.184974] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 
ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.545ms graph_exe_count=-1 weight_address=0x70dc6ebc8040 [PROF:I][0.199545] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.57198,ms [API:I][0.200037] Memory create [CORE:V0][0.200026] Memory desc init by tag [memory] [CORE:I][0.200032] Memory created [memory] [API:I][0.200055] Memory create - strides [CORE:I][0.200043] Memory desc init by Stride [memory] [CORE:I][0.200048] Memory created [memory] [API:I][0.200071] Memory create [CORE:V0][0.200057] Memory desc init by tag [memory] [CORE:I][0.200063] Memory created [memory] [API:I][0.200088] Memory create [CORE:V0][0.200076] Memory desc init by tag [memory] [CORE:I][0.200081] Memory created [memory] [API:I][0.200103] Memory create - strides [CORE:I][0.200092] Memory desc init by Stride [memory] [CORE:I][0.200096] Memory created [memory] [API:I][0.200121] Memory create [CORE:V0][0.200108] Memory desc init by tag [memory] [CORE:I][0.200114] Memory created [memory] [API:I][0.200138] Memory create [CORE:V0][0.200125] Memory desc init by tag [memory] [CORE:I][0.200131] Memory created [memory] [API:I][0.200154] Memory create - strides [CORE:I][0.200141] Memory desc init by Stride [memory] [CORE:I][0.200148] Memory created [memory] [API:I][0.200170] Memory create [CORE:V0][0.200156] Memory desc init by tag [memory] [CORE:I][0.200164] Memory created [memory] [API:I][0.185421] CPU Engine create [CORE:V0][0.200244] CPU Engine created [engine] [CORE:I][0.200251] CPU Engine created [cpu/engine] [API:I][0.185438] CPU Stream create [CORE:I][0.199861] CPU Stream created [stream] [CORE:V0][0.199862] CPU Stream created [cpu/stream] [API:I][0.185458] matmul desc create - no bias [CORE:I][0.200131] matmul desc init [matmul] [API:I][0.185476] matmul primitive_desc create - attr [PROF:I][0.199932] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0028,ms [API:I][0.185493] matmul primitive create [CORE:I][0.200123] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.200128] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.185404] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.185929] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.525ms graph_exe_count=-1 weight_address=0x70dc20bc1040 [PROF:I][0.200500] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.552729,ms [API:I][0.186064] matmul desc create - no bias [CORE:I][0.200734] matmul desc init [matmul] [API:I][0.186077] matmul primitive_desc create - attr [PROF:I][0.200531] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00197,ms [API:I][0.186091] matmul primitive create [CORE:I][0.200720] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.200725] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.185999] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.186167] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.168ms graph_exe_count=-1 weight_address=0x43206ec0 [PROF:I][0.200737] 
zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.192347,ms [API:I][0.186301] matmul desc create - no bias [CORE:I][0.200973] matmul desc init [matmul] [API:I][0.186317] matmul primitive_desc create - attr [PROF:I][0.200770] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0015,ms [API:I][0.186330] matmul primitive create [CORE:I][0.200959] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.200964] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.186238] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.186382] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.145ms graph_exe_count=-1 weight_address=0x44206f00 [PROF:I][0.200952] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.167856,ms [PROF:V0][0.186515] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.09497,ms [CORE:I][0.200942] CPU Stream deleted [stream] [CORE:I][0.201347] CPU Engine deleted [engine] [API:I][0.201630] Memory create [CORE:V0][0.201619] Memory desc init by tag [memory] [CORE:I][0.201626] Memory created [memory] [API:I][0.201649] Memory create - strides [CORE:I][0.201634] Memory desc init by Stride [memory] [CORE:I][0.201639] Memory created [memory] [API:I][0.201660] Memory create [CORE:V0][0.201647] Memory desc init by tag [memory] [CORE:I][0.201654] Memory created 
[memory] [API:I][0.201584] matmul desc create - no bias [CORE:I][0.201583] matmul desc init [matmul] [API:I][0.201599] matmul primitive_desc create - attr [PROF:I][0.201383] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00243,ms [API:I][0.201616] matmul primitive create [CORE:I][0.201572] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.201576] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.186848] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.187382] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.534ms graph_exe_count=-1 weight_address=0x70dc24bc2040 [PROF:I][0.201952] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.556119,ms [API:I][0.202364] Memory create [CORE:V0][0.202352] Memory desc init by tag [memory] [CORE:I][0.202359] Memory created [memory] [API:I][0.202381] Memory create - strides [CORE:I][0.202367] Memory desc init by Stride [memory] [CORE:I][0.202373] Memory created [memory] [API:I][0.202395] Memory create [CORE:V0][0.202382] Memory desc init by tag [memory] [CORE:I][0.202388] Memory created [memory] [API:I][0.202319] matmul desc create - no bias [CORE:I][0.202318] matmul desc init [matmul] [API:I][0.202336] matmul primitive_desc create - attr [PROF:I][0.202119] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00229,ms [API:I][0.202352] matmul primitive create [CORE:I][0.202308] 
zendnn_f32_matmul_t::execute_ref [CORE:V0][0.202312] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.187583] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.189297] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.714ms graph_exe_count=-1 weight_address=0x70dc28bc3040 [PROF:I][0.203868] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.73643,ms [API:I][0.204231] Memory create [CORE:V0][0.204218] Memory desc init by tag [memory] [CORE:I][0.204225] Memory created [memory] [API:I][0.204248] Memory create - strides [CORE:I][0.204234] Memory desc init by Stride [memory] [CORE:I][0.204240] Memory created [memory] [API:I][0.204262] Memory create [CORE:V0][0.204249] Memory desc init by tag [memory] [CORE:I][0.204256] Memory created [memory] [API:I][0.204279] Memory create [CORE:V0][0.204264] Memory desc init by tag [memory] [CORE:I][0.204269] Memory created [memory] [API:I][0.204295] Memory create [CORE:V0][0.204282] Memory desc init by tag [memory] [CORE:I][0.204289] Memory created [memory] [API:I][0.204222] matmul desc create - no bias [CORE:I][0.204221] matmul desc init [matmul] [API:I][0.204240] matmul primitive_desc create - attr [PROF:I][0.204023] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.0026,ms [API:I][0.204257] matmul primitive create [CORE:I][0.204213] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.204218] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 
alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.189493] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.191259] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.766ms graph_exe_count=-1 weight_address=0x70dc36bc4040 [PROF:I][0.205830] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.7928,ms [API:I][0.206198] Memory create [CORE:V0][0.206185] Memory desc init by tag [memory] [CORE:I][0.206192] Memory created [memory] [API:I][0.206214] Memory create - strides [CORE:I][0.206200] Memory desc init by Stride [memory] [CORE:I][0.206206] Memory created [memory] [API:I][0.206228] Memory create [CORE:V0][0.206214] Memory desc init by tag [memory] [CORE:I][0.206219] Memory created [memory] [API:I][0.206151] matmul desc create - no bias [CORE:I][0.206149] matmul desc init [matmul] [API:I][0.206167] matmul primitive_desc create - attr [PROF:I][0.205949] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00248,ms [API:I][0.206183] matmul primitive create [CORE:I][0.206139] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.206143] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.191414] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.192968] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.554ms graph_exe_count=-1 
weight_address=0x70dc44bc5040 [PROF:I][0.207539] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.57599,ms [API:I][0.208011] Memory create [CORE:V0][0.208000] Memory desc init by tag [memory] [CORE:I][0.208007] Memory created [memory] [API:I][0.208030] Memory create - strides [CORE:I][0.208017] Memory desc init by Stride [memory] [CORE:I][0.208024] Memory created [memory] [API:I][0.208047] Memory create [CORE:V0][0.208032] Memory desc init by tag [memory] [CORE:I][0.208038] Memory created [memory] [API:I][0.208063] Memory create [CORE:V0][0.208051] Memory desc init by tag [memory] [CORE:I][0.208057] Memory created [memory] [API:I][0.208080] Memory create - strides [CORE:I][0.208067] Memory desc init by Stride [memory] [CORE:I][0.208071] Memory created [memory] [API:I][0.208095] Memory create [CORE:V0][0.208082] Memory desc init by tag [memory] [CORE:I][0.208088] Memory created [memory] [API:I][0.208111] Memory create [CORE:V0][0.208097] Memory desc init by tag [memory] [CORE:I][0.208103] Memory created [memory] [API:I][0.208127] Memory create - strides [CORE:I][0.208113] Memory desc init by Stride [memory] [CORE:I][0.208119] Memory created [memory] [API:I][0.208144] Memory create [CORE:V0][0.208132] Memory desc init by tag [memory] [CORE:I][0.208137] Memory created [memory] [API:I][0.193396] CPU Engine create [CORE:V0][0.208219] CPU Engine created [engine] [CORE:I][0.208224] CPU Engine created [cpu/engine] [API:I][0.193411] CPU Stream create [CORE:I][0.207834] CPU Stream created [stream] [CORE:V0][0.207836] CPU Stream created [cpu/stream] [API:I][0.193431] matmul desc create - no bias [CORE:I][0.208101] matmul desc init [matmul] [API:I][0.193447] matmul primitive_desc create - attr [PROF:I][0.207904] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00338,ms [API:I][0.193465] matmul primitive create [CORE:I][0.208094] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.208100] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.193377] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.193975] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.599ms graph_exe_count=-1 weight_address=0x70dbeebbc040 [PROF:I][0.208546] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.62711,ms [API:I][0.194111] matmul desc create - no bias [CORE:I][0.208783] matmul desc init [matmul] [API:I][0.194129] matmul primitive_desc create - attr [PROF:I][0.208581] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.0017,ms [API:I][0.194143] matmul primitive create [CORE:I][0.208770] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.208775] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.194049] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.194200] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.152ms graph_exe_count=-1 weight_address=0x4520efc0 [PROF:I][0.208769] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.175326,ms [API:I][0.194332] matmul desc create - no bias [CORE:I][0.209003] matmul desc init [matmul] [API:I][0.194345] matmul primitive_desc create - attr [PROF:I][0.208799] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00156,ms [API:I][0.194359] matmul primitive create [CORE:I][0.208988] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.208992] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.194265] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.194414] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.149ms graph_exe_count=-1 weight_address=0x4620f000 [PROF:I][0.208983] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.171336,ms [PROF:V0][0.194544] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.14697,ms [CORE:I][0.208970] CPU Stream deleted [stream] [CORE:I][0.209377] CPU Engine deleted [engine] [API:I][0.209646] Memory create [CORE:V0][0.209635] Memory desc init by tag [memory] [CORE:I][0.209642] Memory created [memory] [API:I][0.209665] Memory create - strides [CORE:I][0.209653] Memory desc init by Stride [memory] [CORE:I][0.209659] Memory created [memory] [API:I][0.209682] Memory create [CORE:V0][0.209669] Memory desc init by tag [memory] [CORE:I][0.209674] Memory created [memory] [API:I][0.209605] matmul desc create - no bias [CORE:I][0.209604] matmul desc init [matmul] [API:I][0.209622] matmul primitive_desc 
create - attr [PROF:I][0.209405] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00251,ms [API:I][0.209638] matmul primitive create [CORE:I][0.209594] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.209598] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.194871] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.195407] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.537ms graph_exe_count=-1 weight_address=0x70dbf2bbd040 [PROF:I][0.209977] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.559159,ms [API:I][0.210381] Memory create [CORE:V0][0.210368] Memory desc init by tag [memory] [CORE:I][0.210374] Memory created [memory] [API:I][0.210398] Memory create - strides [CORE:I][0.210386] Memory desc init by Stride [memory] [CORE:I][0.210392] Memory created [memory] [API:I][0.210414] Memory create [CORE:V0][0.210401] Memory desc init by tag [memory] [CORE:I][0.210406] Memory created [memory] [API:I][0.210337] matmul desc create - no bias [CORE:I][0.210337] matmul desc init [matmul] [API:I][0.210353] matmul primitive_desc create - attr [PROF:I][0.210135] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00232,ms [API:I][0.210368] matmul primitive create [CORE:I][0.210324] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.210328] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: 
CblasRowMajor(1) [PROF:V0][0.195600] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.197352] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.752ms graph_exe_count=-1 weight_address=0x70dbf6bbe040 [PROF:I][0.211922] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.77414,ms [API:I][0.212282] Memory create [CORE:V0][0.212270] Memory desc init by tag [memory] [CORE:I][0.212277] Memory created [memory] [API:I][0.212300] Memory create - strides [CORE:I][0.212286] Memory desc init by Stride [memory] [CORE:I][0.212291] Memory created [memory] [API:I][0.212313] Memory create [CORE:V0][0.212300] Memory desc init by tag [memory] [CORE:I][0.212306] Memory created [memory] [API:I][0.212330] Memory create [CORE:V0][0.212316] Memory desc init by tag [memory] [CORE:I][0.212320] Memory created [memory] [API:I][0.212345] Memory create [CORE:V0][0.212332] Memory desc init by tag [memory] [CORE:I][0.212338] Memory created [memory] [API:I][0.212270] matmul desc create - no bias [CORE:I][0.212268] matmul desc init [matmul] [API:I][0.212285] matmul primitive_desc create - attr [PROF:I][0.212067] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00186,ms [API:I][0.212299] matmul primitive create [CORE:I][0.212253] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.212256] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.197528] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.199315] zenMatMul_gemm 
auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.787ms graph_exe_count=-1 weight_address=0x70dc04bbf040 [PROF:I][0.213885] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.80843,ms [API:I][0.214253] Memory create [CORE:V0][0.214240] Memory desc init by tag [memory] [CORE:I][0.214247] Memory created [memory] [API:I][0.214269] Memory create - strides [CORE:I][0.214256] Memory desc init by Stride [memory] [CORE:I][0.214262] Memory created [memory] [API:I][0.214284] Memory create [CORE:V0][0.214270] Memory desc init by tag [memory] [CORE:I][0.214278] Memory created [memory] [API:I][0.214208] matmul desc create - no bias [CORE:I][0.214205] matmul desc init [matmul] [API:I][0.214222] matmul primitive_desc create - attr [PROF:I][0.214004] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00225,ms [API:I][0.214239] matmul primitive create [CORE:I][0.214195] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.214199] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.199471] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.201059] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.588ms graph_exe_count=-1 weight_address=0x70dc12bc0040 [PROF:I][0.215630] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.61089,ms [API:I][0.216105] Memory create [CORE:V0][0.216095] Memory desc init by tag [memory] [CORE:I][0.216102] Memory created [memory] [API:I][0.216125] Memory create - strides [CORE:I][0.216114] Memory desc init by Stride [memory] [CORE:I][0.216119] Memory created [memory] [API:I][0.216143] Memory create [CORE:V0][0.216129] Memory desc init by tag [memory] [CORE:I][0.216137] Memory created [memory] [API:I][0.216160] Memory create [CORE:V0][0.216147] Memory desc init by tag [memory] [CORE:I][0.216151] Memory created [memory] [API:I][0.216174] Memory create - strides [CORE:I][0.216162] Memory desc init by Stride [memory] [CORE:I][0.216167] Memory created [memory] [API:I][0.216190] Memory create [CORE:V0][0.216176] Memory desc init by tag [memory] [CORE:I][0.216181] Memory created [memory] [API:I][0.216205] Memory create [CORE:V0][0.216192] Memory desc init by tag [memory] [CORE:I][0.216197] Memory created [memory] [API:I][0.216219] Memory create - strides [CORE:I][0.216206] Memory desc init by Stride [memory] [CORE:I][0.216213] Memory created [memory] [API:I][0.216235] Memory create [CORE:V0][0.216222] Memory desc init by tag [memory] [CORE:I][0.216228] Memory created [memory] [API:I][0.201484] CPU Engine create [CORE:V0][0.216308] CPU Engine created [engine] [CORE:I][0.216311] CPU Engine created [cpu/engine] [API:I][0.201497] CPU Stream create [CORE:I][0.215920] CPU Stream created [stream] [CORE:V0][0.215921] CPU Stream created [cpu/stream] [API:I][0.201517] matmul desc create - no bias [CORE:I][0.216189] matmul desc init [matmul] [API:I][0.201533] matmul primitive_desc create - attr [PROF:I][0.215988] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00202,ms [API:I][0.201548] matmul primitive create [CORE:I][0.216175] 
zendnn_f32_matmul_t::execute_ref [CORE:V0][0.216180] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.201457] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.201959] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.503ms graph_exe_count=-1 weight_address=0x70dbbcbb7040 [PROF:I][0.216529] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.530388,ms [API:I][0.202095] matmul desc create - no bias [CORE:I][0.216765] matmul desc init [matmul] [API:I][0.202107] matmul primitive_desc create - attr [PROF:I][0.216561] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00201,ms [API:I][0.202121] matmul primitive create [CORE:I][0.216749] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.216752] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.202024] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.202186] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.162ms graph_exe_count=-1 weight_address=0x472170c0 [PROF:I][0.216756] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.183476,ms [API:I][0.202322] matmul desc create - no bias [CORE:I][0.216993] matmul desc init 
[matmul] [API:I][0.202336] matmul primitive_desc create - attr [PROF:I][0.216790] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00165,ms [API:I][0.202350] matmul primitive create [CORE:I][0.216980] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.216985] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.202258] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.202401] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.142ms graph_exe_count=-1 weight_address=0x48217100 [PROF:I][0.216971] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.166575,ms [PROF:V0][0.202535] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05005,ms [CORE:I][0.216961] CPU Stream deleted [stream] [CORE:I][0.217366] CPU Engine deleted [engine] [API:I][0.217641] Memory create [CORE:V0][0.217630] Memory desc init by tag [memory] [CORE:I][0.217637] Memory created [memory] [API:I][0.217660] Memory create - strides [CORE:I][0.217646] Memory desc init by Stride [memory] [CORE:I][0.217652] Memory created [memory] [API:I][0.217674] Memory create [CORE:V0][0.217661] Memory desc init by tag [memory] [CORE:I][0.217666] Memory created [memory] [API:I][0.217596] matmul desc create - no bias [CORE:I][0.217595] matmul desc init [matmul] [API:I][0.217612] matmul primitive_desc create - attr [PROF:I][0.217394] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00215,ms [API:I][0.217628] matmul primitive create [CORE:I][0.217585] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.217589] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.202863] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.203414] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.551ms graph_exe_count=-1 weight_address=0x70dbc0bb8040 [PROF:I][0.217985] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.576469,ms [API:I][0.218405] Memory create [CORE:V0][0.218393] Memory desc init by tag [memory] [CORE:I][0.218399] Memory created [memory] [API:I][0.218422] Memory create - strides [CORE:I][0.218411] Memory desc init by Stride [memory] [CORE:I][0.218417] Memory created [memory] [API:I][0.218438] Memory create [CORE:V0][0.218426] Memory desc init by tag [memory] [CORE:I][0.218432] Memory created [memory] [API:I][0.218361] matmul desc create - no bias [CORE:I][0.218360] matmul desc init [matmul] [API:I][0.218375] matmul primitive_desc create - attr [PROF:I][0.218155] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00162,ms [API:I][0.218387] matmul primitive create [CORE:I][0.218342] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.218346] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.203620] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.205455] zenMatMul_gemm auto_tuner=False 
Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.835ms graph_exe_count=-1 weight_address=0x70dbc4bb9040 [PROF:I][0.220026] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.86063,ms [API:I][0.220386] Memory create [CORE:V0][0.220374] Memory desc init by tag [memory] [CORE:I][0.220381] Memory created [memory] [API:I][0.220403] Memory create - strides [CORE:I][0.220390] Memory desc init by Stride [memory] [CORE:I][0.220395] Memory created [memory] [API:I][0.220418] Memory create [CORE:V0][0.220404] Memory desc init by tag [memory] [CORE:I][0.220411] Memory created [memory] [API:I][0.220434] Memory create [CORE:V0][0.220420] Memory desc init by tag [memory] [CORE:I][0.220424] Memory created [memory] [API:I][0.220450] Memory create [CORE:V0][0.220438] Memory desc init by tag [memory] [CORE:I][0.220444] Memory created [memory] [API:I][0.220378] matmul desc create - no bias [CORE:I][0.220376] matmul desc init [matmul] [API:I][0.220394] matmul primitive_desc create - attr [PROF:I][0.220177] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00265,ms [API:I][0.220411] matmul primitive create [CORE:I][0.220367] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.220371] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.205646] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.207399] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 
gelu=0 algo_type=3 weight_caching=True Time=1.754ms graph_exe_count=-1 weight_address=0x70dbd2bba040 [PROF:I][0.221972] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.78068,ms [API:I][0.222341] Memory create [CORE:V0][0.222328] Memory desc init by tag [memory] [CORE:I][0.222335] Memory created [memory] [API:I][0.222358] Memory create - strides [CORE:I][0.222344] Memory desc init by Stride [memory] [CORE:I][0.222349] Memory created [memory] [API:I][0.222372] Memory create [CORE:V0][0.222358] Memory desc init by tag [memory] [CORE:I][0.222363] Memory created [memory] [API:I][0.222293] matmul desc create - no bias [CORE:I][0.222293] matmul desc init [matmul] [API:I][0.222311] matmul primitive_desc create - attr [PROF:I][0.222093] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.002621,ms [API:I][0.222326] matmul primitive create [CORE:I][0.222284] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.222289] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.207563] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.209124] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.562ms graph_exe_count=-1 weight_address=0x70dbe0bbb040 [PROF:I][0.223696] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.58814,ms [API:I][0.224166] Memory create [CORE:V0][0.224156] Memory desc init by 
tag [memory] [CORE:I][0.224162] Memory created [memory] [API:I][0.224185] Memory create - strides [CORE:I][0.224173] Memory desc init by Stride [memory] [CORE:I][0.224178] Memory created [memory] [API:I][0.224202] Memory create [CORE:V0][0.224189] Memory desc init by tag [memory] [CORE:I][0.224194] Memory created [memory] [API:I][0.224220] Memory create [CORE:V0][0.224206] Memory desc init by tag [memory] [CORE:I][0.224211] Memory created [memory] [API:I][0.224233] Memory create - strides [CORE:I][0.224220] Memory desc init by Stride [memory] [CORE:I][0.224227] Memory created [memory] [API:I][0.224252] Memory create [CORE:V0][0.224240] Memory desc init by tag [memory] [CORE:I][0.224246] Memory created [memory] [API:I][0.224271] Memory create [CORE:V0][0.224258] Memory desc init by tag [memory] [CORE:I][0.224265] Memory created [memory] [API:I][0.224287] Memory create - strides [CORE:I][0.224275] Memory desc init by Stride [memory] [CORE:I][0.224281] Memory created [memory] [API:I][0.224303] Memory create [CORE:V0][0.224292] Memory desc init by tag [memory] [CORE:I][0.224298] Memory created [memory] [API:I][0.209554] CPU Engine create [CORE:V0][0.224378] CPU Engine created [engine] [CORE:I][0.224383] CPU Engine created [cpu/engine] [API:I][0.209571] CPU Stream create [CORE:I][0.223995] CPU Stream created [stream] [CORE:V0][0.223996] CPU Stream created [cpu/stream] [API:I][0.209591] matmul desc create - no bias [CORE:I][0.224262] matmul desc init [matmul] [API:I][0.209606] matmul primitive_desc create - attr [PROF:I][0.224061] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002711,ms [API:I][0.209623] matmul primitive create [CORE:I][0.224253] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.224258] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) 
[PROF:V0][0.209534] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.210059] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.526ms graph_exe_count=-1 weight_address=0x70db8abb2040 [PROF:I][0.224629] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.552458,ms [API:I][0.210193] matmul desc create - no bias [CORE:I][0.224863] matmul desc init [matmul] [API:I][0.210214] matmul primitive_desc create - attr [PROF:I][0.224666] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00133,ms [API:I][0.210224] matmul primitive create [CORE:I][0.224851] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.224855] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.210129] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.210276] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.146ms graph_exe_count=-1 weight_address=0x4921f1c0 [PROF:I][0.224846] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.170505,ms [API:I][0.210411] matmul desc create - no bias [CORE:I][0.225082] matmul desc init [matmul] [API:I][0.210424] matmul primitive_desc create - attr [PROF:I][0.224878] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00156,ms [API:I][0.210438] matmul primitive create [CORE:I][0.225066] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.225071] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.210344] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.210494] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.15ms graph_exe_count=-1 weight_address=0x4a21f200 [PROF:I][0.225065] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.174636,ms [PROF:V0][0.210629] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.07397,ms [CORE:I][0.225055] CPU Stream deleted [stream] [CORE:I][0.225459] CPU Engine deleted [engine] [API:I][0.225716] Memory create [CORE:V0][0.225706] Memory desc init by tag [memory] [CORE:I][0.225712] Memory created [memory] [API:I][0.225735] Memory create - strides [CORE:I][0.225723] Memory desc init by Stride [memory] [CORE:I][0.225728] Memory created [memory] [API:I][0.225752] Memory create [CORE:V0][0.225739] Memory desc init by tag [memory] [CORE:I][0.225746] Memory created [memory] [API:I][0.225676] matmul desc create - no bias [CORE:I][0.225674] matmul desc init [matmul] [API:I][0.225691] matmul primitive_desc create - attr [PROF:I][0.225472] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00187,ms 
[API:I][0.225704] matmul primitive create [CORE:I][0.225659] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.225664] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.210938] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.211492] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.554ms graph_exe_count=-1 weight_address=0x70db8ebb3040 [PROF:I][0.226062] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.578719,ms [API:I][0.226476] Memory create [CORE:V0][0.226464] Memory desc init by tag [memory] [CORE:I][0.226470] Memory created [memory] [API:I][0.226493] Memory create - strides [CORE:I][0.226482] Memory desc init by Stride [memory] [CORE:I][0.226487] Memory created [memory] [API:I][0.226510] Memory create [CORE:V0][0.226497] Memory desc init by tag [memory] [CORE:I][0.226504] Memory created [memory] [API:I][0.226434] matmul desc create - no bias [CORE:I][0.226431] matmul desc init [matmul] [API:I][0.226447] matmul primitive_desc create - attr [PROF:I][0.226229] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00202,ms [API:I][0.226463] matmul primitive create [CORE:I][0.226419] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.226422] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.211694] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.213467] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 
ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.774ms graph_exe_count=-1 weight_address=0x70db92bb4040 [PROF:I][0.228039] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.79594,ms [API:I][0.228403] Memory create [CORE:V0][0.228391] Memory desc init by tag [memory] [CORE:I][0.228398] Memory created [memory] [API:I][0.228420] Memory create - strides [CORE:I][0.228406] Memory desc init by Stride [memory] [CORE:I][0.228412] Memory created [memory] [API:I][0.228434] Memory create [CORE:V0][0.228420] Memory desc init by tag [memory] [CORE:I][0.228427] Memory created [memory] [API:I][0.228450] Memory create [CORE:V0][0.228437] Memory desc init by tag [memory] [CORE:I][0.228444] Memory created [memory] [API:I][0.228469] Memory create [CORE:V0][0.228456] Memory desc init by tag [memory] [CORE:I][0.228462] Memory created [memory] [API:I][0.228396] matmul desc create - no bias [CORE:I][0.228394] matmul desc init [matmul] [API:I][0.228413] matmul primitive_desc create - attr [PROF:I][0.228195] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00202,ms [API:I][0.228427] matmul primitive create [CORE:I][0.228383] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.228388] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.213663] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.215378] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.717ms graph_exe_count=-1 
weight_address=0x70dba0bb5040 [PROF:I][0.229950] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.74317,ms [API:I][0.230319] Memory create [CORE:V0][0.230307] Memory desc init by tag [memory] [CORE:I][0.230314] Memory created [memory] [API:I][0.230337] Memory create - strides [CORE:I][0.230323] Memory desc init by Stride [memory] [CORE:I][0.230328] Memory created [memory] [API:I][0.230351] Memory create [CORE:V0][0.230340] Memory desc init by tag [memory] [CORE:I][0.230346] Memory created [memory] [API:I][0.230276] matmul desc create - no bias [CORE:I][0.230275] matmul desc init [matmul] [API:I][0.230292] matmul primitive_desc create - attr [PROF:I][0.230073] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00179,ms [API:I][0.230305] matmul primitive create [CORE:I][0.230259] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.230262] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.215534] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.217140] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.606ms graph_exe_count=-1 weight_address=0x70dbaebb6040 [PROF:I][0.231710] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.62723,ms [API:I][0.232196] Memory create [CORE:V0][0.232186] Memory desc init by tag [memory] [CORE:I][0.232192] Memory created [memory] [API:I][0.232215] 
Memory create - strides [CORE:I][0.232200] Memory desc init by Stride [memory] [CORE:I][0.232205] Memory created [memory] [API:I][0.232227] Memory create [CORE:V0][0.232213] Memory desc init by tag [memory] [CORE:I][0.232219] Memory created [memory] [API:I][0.232245] Memory create [CORE:V0][0.232232] Memory desc init by tag [memory] [CORE:I][0.232237] Memory created [memory] [API:I][0.232260] Memory create - strides [CORE:I][0.232247] Memory desc init by Stride [memory] [CORE:I][0.232252] Memory created [memory] [API:I][0.232275] Memory create [CORE:V0][0.232260] Memory desc init by tag [memory] [CORE:I][0.232263] Memory created [memory] [API:I][0.232285] Memory create [CORE:V0][0.232270] Memory desc init by tag [memory] [CORE:I][0.232275] Memory created [memory] [API:I][0.232298] Memory create - strides [CORE:I][0.232286] Memory desc init by Stride [memory] [CORE:I][0.232291] Memory created [memory] [API:I][0.232313] Memory create [CORE:V0][0.232300] Memory desc init by tag [memory] [CORE:I][0.232307] Memory created [memory] [API:I][0.217564] CPU Engine create [CORE:V0][0.232386] CPU Engine created [engine] [CORE:I][0.232389] CPU Engine created [cpu/engine] [API:I][0.217575] CPU Stream create [CORE:I][0.231998] CPU Stream created [stream] [CORE:V0][0.232000] CPU Stream created [cpu/stream] [API:I][0.217597] matmul desc create - no bias [CORE:I][0.232268] matmul desc init [matmul] [API:I][0.217612] matmul primitive_desc create - attr [PROF:I][0.232066] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00211,ms [API:I][0.217626] matmul primitive create [CORE:I][0.232254] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.232259] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.217535] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 
[PROF:V0][0.218025] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.49ms graph_exe_count=-1 weight_address=0x70db74baf040 [PROF:I][0.232596] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.518287,ms [API:I][0.218162] matmul desc create - no bias [CORE:I][0.232834] matmul desc init [matmul] [API:I][0.218177] matmul primitive_desc create - attr [PROF:I][0.232632] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00205,ms [API:I][0.218192] matmul primitive create [CORE:I][0.232828] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.232833] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.218104] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.218247] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.143ms graph_exe_count=-1 weight_address=0x4b2272c0 [PROF:I][0.232818] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.164756,ms [API:I][0.218381] matmul desc create - no bias [CORE:I][0.233052] matmul desc init [matmul] [API:I][0.218396] matmul primitive_desc create - attr [PROF:I][0.232850] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00189,ms [API:I][0.218410] matmul primitive create [CORE:I][0.233040] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.233045] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.218320] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.218472] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.152ms graph_exe_count=-1 weight_address=0x4c227300 [PROF:I][0.233042] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.178096,ms [PROF:V0][0.218605] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.04102,ms [CORE:I][0.233030] CPU Stream deleted [stream] [CORE:I][0.233436] CPU Engine deleted [engine] [API:I][0.233718] Memory create [CORE:V0][0.233707] Memory desc init by tag [memory] [CORE:I][0.233714] Memory created [memory] [API:I][0.233737] Memory create - strides [CORE:I][0.233722] Memory desc init by Stride [memory] [CORE:I][0.233726] Memory created [memory] [API:I][0.233748] Memory create [CORE:V0][0.233736] Memory desc init by tag [memory] [CORE:I][0.233742] Memory created [memory] [API:I][0.233673] matmul desc create - no bias [CORE:I][0.233673] matmul desc init [matmul] [API:I][0.233691] matmul primitive_desc create - attr [PROF:I][0.233475] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00278,ms [API:I][0.233707] matmul primitive create [CORE:I][0.233663] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.233667] M: 8 N: 4096 K: 4096 transA: N 
transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.218939] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.219474] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.534ms graph_exe_count=-1 weight_address=0x70db78bb0040 [PROF:I][0.234044] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.556878,ms [API:I][0.234469] Memory create [CORE:V0][0.234457] Memory desc init by tag [memory] [CORE:I][0.234464] Memory created [memory] [API:I][0.234486] Memory create - strides [CORE:I][0.234473] Memory desc init by Stride [memory] [CORE:I][0.234478] Memory created [memory] [API:I][0.234500] Memory create [CORE:V0][0.234487] Memory desc init by tag [memory] [CORE:I][0.234494] Memory created [memory] [API:I][0.234424] matmul desc create - no bias [CORE:I][0.234421] matmul desc init [matmul] [API:I][0.234438] matmul primitive_desc create - attr [PROF:I][0.234221] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00249,ms [API:I][0.234454] matmul primitive create [CORE:I][0.234410] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.234415] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.219690] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.221474] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.785ms graph_exe_count=-1 weight_address=0x70d9331fe040 
[PROF:I][0.236046] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.81155,ms [API:I][0.236412] Memory create [CORE:V0][0.236400] Memory desc init by tag [memory] [CORE:I][0.236407] Memory created [memory] [API:I][0.236429] Memory create - strides [CORE:I][0.236415] Memory desc init by Stride [memory] [CORE:I][0.236421] Memory created [memory] [API:I][0.236444] Memory create [CORE:V0][0.236430] Memory desc init by tag [memory] [CORE:I][0.236438] Memory created [memory] [API:I][0.236462] Memory create [CORE:V0][0.236447] Memory desc init by tag [memory] [CORE:I][0.236452] Memory created [memory] [API:I][0.236478] Memory create [CORE:V0][0.236465] Memory desc init by tag [memory] [CORE:I][0.236471] Memory created [memory] [API:I][0.236405] matmul desc create - no bias [CORE:I][0.236403] matmul desc init [matmul] [API:I][0.236421] matmul primitive_desc create - attr [PROF:I][0.236202] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00225,ms [API:I][0.236435] matmul primitive create [CORE:I][0.236390] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.236395] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.221670] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.223446] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.777ms graph_exe_count=-1 weight_address=0x70d9411ff040 [PROF:I][0.238018] 
zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.80344,ms [API:I][0.238387] Memory create [CORE:V0][0.238374] Memory desc init by tag [memory] [CORE:I][0.238381] Memory created [memory] [API:I][0.238404] Memory create - strides [CORE:I][0.238390] Memory desc init by Stride [memory] [CORE:I][0.238395] Memory created [memory] [API:I][0.238419] Memory create [CORE:V0][0.238407] Memory desc init by tag [memory] [CORE:I][0.238414] Memory created [memory] [API:I][0.238344] matmul desc create - no bias [CORE:I][0.238344] matmul desc init [matmul] [API:I][0.238364] matmul primitive_desc create - attr [PROF:I][0.238145] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00224,ms [API:I][0.238379] matmul primitive create [CORE:I][0.238336] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.238339] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.223611] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.225179] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.569ms graph_exe_count=-1 weight_address=0x70db7cbb1040 [PROF:I][0.239749] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.58962,ms [API:I][0.240216] Memory create [CORE:V0][0.240206] Memory desc init by tag [memory] [CORE:I][0.240213] Memory created [memory] [API:I][0.240235] Memory create - strides [CORE:I][0.240222] Memory 
desc init by Stride [memory] [CORE:I][0.240227] Memory created [memory] [API:I][0.240250] Memory create [CORE:V0][0.240236] Memory desc init by tag [memory] [CORE:I][0.240243] Memory created [memory] [API:I][0.240267] Memory create [CORE:V0][0.240252] Memory desc init by tag [memory] [CORE:I][0.240257] Memory created [memory] [API:I][0.240280] Memory create - strides [CORE:I][0.240267] Memory desc init by Stride [memory] [CORE:I][0.240273] Memory created [memory] [API:I][0.240298] Memory create [CORE:V0][0.240285] Memory desc init by tag [memory] [CORE:I][0.240290] Memory created [memory] [API:I][0.240313] Memory create [CORE:V0][0.240300] Memory desc init by tag [memory] [CORE:I][0.240307] Memory created [memory] [API:I][0.240329] Memory create - strides [CORE:I][0.240315] Memory desc init by Stride [memory] [CORE:I][0.240320] Memory created [memory] [API:I][0.240342] Memory create [CORE:V0][0.240329] Memory desc init by tag [memory] [CORE:I][0.240334] Memory created [memory] [API:I][0.225591] CPU Engine create [CORE:V0][0.240413] CPU Engine created [engine] [CORE:I][0.240417] CPU Engine created [cpu/engine] [API:I][0.225606] CPU Stream create [CORE:I][0.240029] CPU Stream created [stream] [CORE:V0][0.240030] CPU Stream created [cpu/stream] [API:I][0.225628] matmul desc create - no bias [CORE:I][0.240299] matmul desc init [matmul] [API:I][0.225644] matmul primitive_desc create - attr [PROF:I][0.240101] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00324,ms [API:I][0.225662] matmul primitive create [CORE:I][0.240290] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.240295] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.225568] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.226050] zenMatMul_gemm auto_tuner=False 
Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.483ms graph_exe_count=-1 weight_address=0x70d9011f9040 [PROF:I][0.240621] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.506576,ms [API:I][0.226183] matmul desc create - no bias [CORE:I][0.240855] matmul desc init [matmul] [API:I][0.226207] matmul primitive_desc create - attr [PROF:I][0.240661] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00192,ms [API:I][0.226221] matmul primitive create [CORE:I][0.240850] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.240855] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.226129] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.226279] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.15ms graph_exe_count=-1 weight_address=0x4d22f3c0 [PROF:I][0.240848] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.173626,ms [API:I][0.226410] matmul desc create - no bias [CORE:I][0.241082] matmul desc init [matmul] [API:I][0.226423] matmul primitive_desc create - attr [PROF:I][0.240877] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00153,ms 
[API:I][0.226438] matmul primitive create [CORE:I][0.241066] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.241070] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.226344] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.226495] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.15ms graph_exe_count=-1 weight_address=0x4e22f400 [PROF:I][0.241064] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.173706,ms [PROF:V0][0.226625] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.03491,ms [CORE:I][0.241051] CPU Stream deleted [stream] [CORE:I][0.241457] CPU Engine deleted [engine] [API:I][0.241718] Memory create [CORE:V0][0.241707] Memory desc init by tag [memory] [CORE:I][0.241714] Memory created [memory] [API:I][0.241737] Memory create - strides [CORE:I][0.241725] Memory desc init by Stride [memory] [CORE:I][0.241730] Memory created [memory] [API:I][0.241752] Memory create [CORE:V0][0.241740] Memory desc init by tag [memory] [CORE:I][0.241745] Memory created [memory] [API:I][0.241677] matmul desc create - no bias [CORE:I][0.241676] matmul desc init [matmul] [API:I][0.241692] matmul primitive_desc create - attr [PROF:I][0.241475] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00239,ms [API:I][0.241709] matmul primitive create [CORE:I][0.241664] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.241668] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 
Layout: CblasRowMajor(1) [PROF:V0][0.226940] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.227486] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.546ms graph_exe_count=-1 weight_address=0x70d9051fa040 [PROF:I][0.242055] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.567009,ms [API:I][0.242452] Memory create [CORE:V0][0.242440] Memory desc init by tag [memory] [CORE:I][0.242447] Memory created [memory] [API:I][0.242469] Memory create - strides [CORE:I][0.242455] Memory desc init by Stride [memory] [CORE:I][0.242461] Memory created [memory] [API:I][0.242483] Memory create [CORE:V0][0.242470] Memory desc init by tag [memory] [CORE:I][0.242478] Memory created [memory] [API:I][0.242408] matmul desc create - no bias [CORE:I][0.242405] matmul desc init [matmul] [API:I][0.242421] matmul primitive_desc create - attr [PROF:I][0.242203] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00221,ms [API:I][0.242436] matmul primitive create [CORE:I][0.242391] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.242395] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.227667] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.229498] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.831ms graph_exe_count=-1 weight_address=0x70d9091fb040 [PROF:I][0.244070] 
zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.85401,ms [API:I][0.244432] Memory create [CORE:V0][0.244421] Memory desc init by tag [memory] [CORE:I][0.244427] Memory created [memory] [API:I][0.244450] Memory create - strides [CORE:I][0.244438] Memory desc init by Stride [memory] [CORE:I][0.244444] Memory created [memory] [API:I][0.244467] Memory create [CORE:V0][0.244454] Memory desc init by tag [memory] [CORE:I][0.244459] Memory created [memory] [API:I][0.244483] Memory create [CORE:V0][0.244471] Memory desc init by tag [memory] [CORE:I][0.244477] Memory created [memory] [API:I][0.244502] Memory create [CORE:V0][0.244489] Memory desc init by tag [memory] [CORE:I][0.244495] Memory created [memory] [API:I][0.244429] matmul desc create - no bias [CORE:I][0.244427] matmul desc init [matmul] [API:I][0.244445] matmul primitive_desc create - attr [PROF:I][0.244228] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.0026,ms [API:I][0.244462] matmul primitive create [CORE:I][0.244419] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.244424] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.229698] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.231475] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.778ms graph_exe_count=-1 weight_address=0x70d9171fc040 [PROF:I][0.246046] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 
wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.80318,ms [API:I][0.246416] Memory create [CORE:V0][0.246404] Memory desc init by tag [memory] [CORE:I][0.246411] Memory created [memory] [API:I][0.246433] Memory create - strides [CORE:I][0.246420] Memory desc init by Stride [memory] [CORE:I][0.246425] Memory created [memory] [API:I][0.246448] Memory create [CORE:V0][0.246437] Memory desc init by tag [memory] [CORE:I][0.246443] Memory created [memory] [API:I][0.246374] matmul desc create - no bias [CORE:I][0.246374] matmul desc init [matmul] [API:I][0.246391] matmul primitive_desc create - attr [PROF:I][0.246174] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00226,ms [API:I][0.246406] matmul primitive create [CORE:I][0.246363] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.246367] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.231638] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.233197] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.559ms graph_exe_count=-1 weight_address=0x70d9251fd040 [PROF:I][0.247768] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.58032,ms [API:I][0.248238] Memory create [CORE:V0][0.248228] Memory desc init by tag [memory] [CORE:I][0.248235] Memory created [memory] [API:I][0.248258] Memory create - strides [CORE:I][0.248247] Memory desc init by Stride [memory] [CORE:I][0.248252] Memory created [memory] [API:I][0.248273] Memory create 
[CORE:V0][0.248258] Memory desc init by tag [memory] [CORE:I][0.248263] Memory created [memory] [API:I][0.248288] Memory create [CORE:V0][0.248285] Memory desc init by tag [memory] [CORE:I][0.248293] Memory created [memory] [API:I][0.248315] Memory create - strides [CORE:I][0.248302] Memory desc init by Stride [memory] [CORE:I][0.248310] Memory created [memory] [API:I][0.248335] Memory create [CORE:V0][0.248322] Memory desc init by tag [memory] [CORE:I][0.248336] Memory created [memory] [API:I][0.248360] Memory create [CORE:V0][0.248347] Memory desc init by tag [memory] [CORE:I][0.248353] Memory created [memory] [API:I][0.248377] Memory create - strides [CORE:I][0.248363] Memory desc init by Stride [memory] [CORE:I][0.248369] Memory created [memory] [API:I][0.248391] Memory create [CORE:V0][0.248378] Memory desc init by tag [memory] [CORE:I][0.248384] Memory created [memory] [API:I][0.233642] CPU Engine create [CORE:V0][0.248464] CPU Engine created [engine] [CORE:I][0.248476] CPU Engine created [cpu/engine] [API:I][0.233661] CPU Stream create [CORE:I][0.248083] CPU Stream created [stream] [CORE:V0][0.248084] CPU Stream created [cpu/stream] [API:I][0.233681] matmul desc create - no bias [CORE:I][0.248353] matmul desc init [matmul] [API:I][0.233696] matmul primitive_desc create - attr [PROF:I][0.248151] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00249,ms [API:I][0.233710] matmul primitive create [CORE:I][0.248336] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.248340] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.233612] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.234185] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 
beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.573ms graph_exe_count=-1 weight_address=0x70d8cf1f4040 [PROF:I][0.248756] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.59539,ms [API:I][0.234318] matmul desc create - no bias [CORE:I][0.248990] matmul desc init [matmul] [API:I][0.234332] matmul primitive_desc create - attr [PROF:I][0.248786] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00188,ms [API:I][0.234347] matmul primitive create [CORE:I][0.248974] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.248979] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.234252] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.234395] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.143ms graph_exe_count=-1 weight_address=0x4f2374c0 [PROF:I][0.248965] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.166876,ms [API:I][0.234529] matmul desc create - no bias [CORE:I][0.249199] matmul desc init [matmul] [API:I][0.234540] matmul primitive_desc create - attr [PROF:I][0.248993] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00151,ms [API:I][0.234554] matmul primitive create [CORE:I][0.249183] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.249188] M: 8 
N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.234461] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.234603] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.142ms graph_exe_count=-1 weight_address=0x50237500 [PROF:I][0.249172] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.164676,ms [PROF:V0][0.234736] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.09497,ms [CORE:I][0.249162] CPU Stream deleted [stream] [CORE:I][0.249567] CPU Engine deleted [engine] [API:I][0.249840] Memory create [CORE:V0][0.249830] Memory desc init by tag [memory] [CORE:I][0.249837] Memory created [memory] [API:I][0.249859] Memory create - strides [CORE:I][0.249844] Memory desc init by Stride [memory] [CORE:I][0.249849] Memory created [memory] [API:I][0.249871] Memory create [CORE:V0][0.249858] Memory desc init by tag [memory] [CORE:I][0.249865] Memory created [memory] [API:I][0.249796] matmul desc create - no bias [CORE:I][0.249795] matmul desc init [matmul] [API:I][0.249812] matmul primitive_desc create - attr [PROF:I][0.249595] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.002411,ms [API:I][0.249828] matmul primitive create [CORE:I][0.249784] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.249788] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.235060] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.235590] 
zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.53ms graph_exe_count=-1 weight_address=0x70d8d31f5040 [PROF:I][0.250160] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.551708,ms [API:I][0.250573] Memory create [CORE:V0][0.250561] Memory desc init by tag [memory] [CORE:I][0.250568] Memory created [memory] [API:I][0.250590] Memory create - strides [CORE:I][0.250579] Memory desc init by Stride [memory] [CORE:I][0.250584] Memory created [memory] [API:I][0.250607] Memory create [CORE:V0][0.250594] Memory desc init by tag [memory] [CORE:I][0.250600] Memory created [memory] [API:I][0.250530] matmul desc create - no bias [CORE:I][0.250529] matmul desc init [matmul] [API:I][0.250546] matmul primitive_desc create - attr [PROF:I][0.250328] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00222,ms [API:I][0.250562] matmul primitive create [CORE:I][0.250518] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.250523] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.235797] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.237653] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.856ms graph_exe_count=-1 weight_address=0x70d8d71f6040 [PROF:I][0.252225] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.88308,ms [API:I][0.252591] Memory create [CORE:V0][0.252579] Memory desc init by tag [memory] [CORE:I][0.252585] Memory created [memory] [API:I][0.252608] Memory create - strides [CORE:I][0.252594] Memory desc init by Stride [memory] [CORE:I][0.252599] Memory created [memory] [API:I][0.252623] Memory create [CORE:V0][0.252611] Memory desc init by tag [memory] [CORE:I][0.252617] Memory created [memory] [API:I][0.252640] Memory create [CORE:V0][0.252627] Memory desc init by tag [memory] [CORE:I][0.252633] Memory created [memory] [API:I][0.252659] Memory create [CORE:V0][0.252647] Memory desc init by tag [memory] [CORE:I][0.252652] Memory created [memory] [API:I][0.252585] matmul desc create - no bias [CORE:I][0.252584] matmul desc init [matmul] [API:I][0.252601] matmul primitive_desc create - attr [PROF:I][0.252382] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00193,ms [API:I][0.252615] matmul primitive create [CORE:I][0.252570] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.252575] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.237850] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.239566] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.717ms graph_exe_count=-1 weight_address=0x70d8e51f7040 [PROF:I][0.254137] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.74242,ms 
[API:I][0.254504] Memory create [CORE:V0][0.254492] Memory desc init by tag [memory] [CORE:I][0.254499] Memory created [memory] [API:I][0.254522] Memory create - strides [CORE:I][0.254508] Memory desc init by Stride [memory] [CORE:I][0.254514] Memory created [memory] [API:I][0.254536] Memory create [CORE:V0][0.254523] Memory desc init by tag [memory] [CORE:I][0.254528] Memory created [memory] [API:I][0.254458] matmul desc create - no bias [CORE:I][0.254458] matmul desc init [matmul] [API:I][0.254475] matmul primitive_desc create - attr [PROF:I][0.254258] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.002241,ms [API:I][0.254491] matmul primitive create [CORE:I][0.254449] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.254454] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.239728] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.241295] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.568ms graph_exe_count=-1 weight_address=0x70d8f31f8040 [PROF:I][0.255867] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.59386,ms [API:I][0.256331] Memory create [CORE:V0][0.256320] Memory desc init by tag [memory] [CORE:I][0.256327] Memory created [memory] [API:I][0.256350] Memory create - strides [CORE:I][0.256339] Memory desc init by Stride [memory] [CORE:I][0.256344] Memory created [memory] [API:I][0.256368] Memory create [CORE:V0][0.256355] Memory desc init by tag [memory] [CORE:I][0.256362] Memory created [memory] [API:I][0.256386] Memory create 
[CORE:V0][0.256371] Memory desc init by tag [memory] [CORE:I][0.256377] Memory created [memory] [API:I][0.256400] Memory create - strides [CORE:I][0.256386] Memory desc init by Stride [memory] [CORE:I][0.256392] Memory created [memory] [API:I][0.256416] Memory create [CORE:V0][0.256403] Memory desc init by tag [memory] [CORE:I][0.256410] Memory created [memory] [API:I][0.256436] Memory create [CORE:V0][0.256424] Memory desc init by tag [memory] [CORE:I][0.256429] Memory created [memory] [API:I][0.256452] Memory create - strides [CORE:I][0.256441] Memory desc init by Stride [memory] [CORE:I][0.256446] Memory created [memory] [API:I][0.256468] Memory create [CORE:V0][0.256455] Memory desc init by tag [memory] [CORE:I][0.256461] Memory created [memory] [API:I][0.241720] CPU Engine create [CORE:V0][0.256544] CPU Engine created [engine] [CORE:I][0.256551] CPU Engine created [cpu/engine] [API:I][0.241738] CPU Stream create [CORE:I][0.256161] CPU Stream created [stream] [CORE:V0][0.256162] CPU Stream created [cpu/stream] [API:I][0.241758] matmul desc create - no bias [CORE:I][0.256429] matmul desc init [matmul] [API:I][0.241773] matmul primitive_desc create - attr [PROF:I][0.256228] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00265,ms [API:I][0.241788] matmul primitive create [CORE:I][0.256415] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.256420] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.241696] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.242203] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.508ms graph_exe_count=-1 weight_address=0x70d89d1ef040 
[PROF:I][0.256773] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.534098,ms [API:I][0.242339] matmul desc create - no bias [CORE:I][0.257010] matmul desc init [matmul] [API:I][0.242354] matmul primitive_desc create - attr [PROF:I][0.256808] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00188,ms [API:I][0.242368] matmul primitive create [CORE:I][0.256997] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.257002] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.242277] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.242424] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.148ms graph_exe_count=-1 weight_address=0x5123f5c0 [PROF:I][0.256994] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.173006,ms [API:I][0.242557] matmul desc create - no bias [CORE:I][0.257229] matmul desc init [matmul] [API:I][0.242573] matmul primitive_desc create - attr [PROF:I][0.257026] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00145,ms [API:I][0.242586] matmul primitive create [CORE:I][0.257213] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.257218] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: 
CblasRowMajor(1) [PROF:V0][0.242492] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.242637] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.145ms graph_exe_count=-1 weight_address=0x5223f600 [PROF:I][0.257207] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.168846,ms [PROF:V0][0.242771] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05103,ms [CORE:I][0.257197] CPU Stream deleted [stream] [CORE:I][0.257602] CPU Engine deleted [engine] [API:I][0.257871] Memory create [CORE:V0][0.257861] Memory desc init by tag [memory] [CORE:I][0.257868] Memory created [memory] [API:I][0.257890] Memory create - strides [CORE:I][0.257879] Memory desc init by Stride [memory] [CORE:I][0.257884] Memory created [memory] [API:I][0.257907] Memory create [CORE:V0][0.257893] Memory desc init by tag [memory] [CORE:I][0.257898] Memory created [memory] [API:I][0.257831] matmul desc create - no bias [CORE:I][0.257830] matmul desc init [matmul] [API:I][0.257847] matmul primitive_desc create - attr [PROF:I][0.257629] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00255,ms [API:I][0.257862] matmul primitive create [CORE:I][0.257826] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.257830] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.243102] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.243588] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 
n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.487ms graph_exe_count=-1 weight_address=0x70d8a11f0040 [PROF:I][0.258158] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.514817,ms [API:I][0.258563] Memory create [CORE:V0][0.258550] Memory desc init by tag [memory] [CORE:I][0.258555] Memory created [memory] [API:I][0.258579] Memory create - strides [CORE:I][0.258567] Memory desc init by Stride [memory] [CORE:I][0.258572] Memory created [memory] [API:I][0.258595] Memory create [CORE:V0][0.258582] Memory desc init by tag [memory] [CORE:I][0.258587] Memory created [memory] [API:I][0.258518] matmul desc create - no bias [CORE:I][0.258517] matmul desc init [matmul] [API:I][0.258532] matmul primitive_desc create - attr [PROF:I][0.258314] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00242,ms [API:I][0.258548] matmul primitive create [CORE:I][0.258504] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.258508] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.243782] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.245474] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.692ms graph_exe_count=-1 weight_address=0x70d8a51f1040 [PROF:I][0.260047] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.71935,ms [API:I][0.260411] Memory create [CORE:V0][0.260399] Memory desc init by tag 
[memory] [CORE:I][0.260405] Memory created [memory] [API:I][0.260428] Memory create - strides [CORE:I][0.260414] Memory desc init by Stride [memory] [CORE:I][0.260420] Memory created [memory] [API:I][0.260442] Memory create [CORE:V0][0.260429] Memory desc init by tag [memory] [CORE:I][0.260434] Memory created [memory] [API:I][0.260457] Memory create [CORE:V0][0.260445] Memory desc init by tag [memory] [CORE:I][0.260452] Memory created [memory] [API:I][0.260478] Memory create [CORE:V0][0.260465] Memory desc init by tag [memory] [CORE:I][0.260471] Memory created [memory] [API:I][0.260403] matmul desc create - no bias [CORE:I][0.260401] matmul desc init [matmul] [API:I][0.260419] matmul primitive_desc create - attr [PROF:I][0.260201] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.00244,ms [API:I][0.260434] matmul primitive create [CORE:I][0.260388] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.260393] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.245669] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.247377] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.71ms graph_exe_count=-1 weight_address=0x70d8b31f2040 [PROF:I][0.261949] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.73675,ms [API:I][0.262321] Memory create [CORE:V0][0.262308] Memory desc init by tag [memory] [CORE:I][0.262315] Memory created [memory] [API:I][0.262337] 
Memory create - strides [CORE:I][0.262324] Memory desc init by Stride [memory] [CORE:I][0.262329] Memory created [memory] [API:I][0.262352] Memory create [CORE:V0][0.262339] Memory desc init by tag [memory] [CORE:I][0.262344] Memory created [memory] [API:I][0.262276] matmul desc create - no bias [CORE:I][0.262275] matmul desc init [matmul] [API:I][0.262291] matmul primitive_desc create - attr [PROF:I][0.262072] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.001861,ms [API:I][0.262304] matmul primitive create [CORE:I][0.262257] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.262261] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.247532] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.249079] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.547ms graph_exe_count=-1 weight_address=0x70d8c11f3040 [PROF:I][0.263658] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.57704,ms [API:I][0.264132] Memory create [CORE:V0][0.264122] Memory desc init by tag [memory] [CORE:I][0.264129] Memory created [memory] [API:I][0.264151] Memory create - strides [CORE:I][0.264140] Memory desc init by Stride [memory] [CORE:I][0.264145] Memory created [memory] [API:I][0.264169] Memory create [CORE:V0][0.264155] Memory desc init by tag [memory] [CORE:I][0.264161] Memory created [memory] [API:I][0.264185] Memory create [CORE:V0][0.264172] Memory desc init by tag [memory] [CORE:I][0.264178] Memory created [memory] [API:I][0.264201] Memory create - strides [CORE:I][0.264187] 
Memory desc init by Stride [memory] [CORE:I][0.264192] Memory created [memory] [API:I][0.264216] Memory create [CORE:V0][0.264202] Memory desc init by tag [memory] [CORE:I][0.264210] Memory created [memory] [API:I][0.264235] Memory create [CORE:V0][0.264222] Memory desc init by tag [memory] [CORE:I][0.264227] Memory created [memory] [API:I][0.264249] Memory create - strides [CORE:I][0.264236] Memory desc init by Stride [memory] [CORE:I][0.264243] Memory created [memory] [API:I][0.264267] Memory create [CORE:V0][0.264253] Memory desc init by tag [memory] [CORE:I][0.264260] Memory created [memory] [API:I][0.249517] CPU Engine create [CORE:V0][0.264341] CPU Engine created [engine] [CORE:I][0.264346] CPU Engine created [cpu/engine] [API:I][0.249534] CPU Stream create [CORE:I][0.263956] CPU Stream created [stream] [CORE:V0][0.263958] CPU Stream created [cpu/stream] [API:I][0.249554] matmul desc create - no bias [CORE:I][0.264224] matmul desc init [matmul] [API:I][0.249568] matmul primitive_desc create - attr [PROF:I][0.264023] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.00206,ms [API:I][0.249583] matmul primitive create [CORE:I][0.264209] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.264212] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.249485] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.250003] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.519ms graph_exe_count=-1 weight_address=0x70d86b1ea040 [PROF:I][0.264573] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 
dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.539888,ms [API:I][0.250136] matmul desc create - no bias [CORE:I][0.264807] matmul desc init [matmul] [API:I][0.250151] matmul primitive_desc create - attr [PROF:I][0.264605] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00196,ms [API:I][0.250165] matmul primitive create [CORE:I][0.264794] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.264798] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.250072] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.250225] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.153ms graph_exe_count=-1 weight_address=0x532476c0 [PROF:I][0.264796] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.177516,ms [API:I][0.250361] matmul desc create - no bias [CORE:I][0.265033] matmul desc init [matmul] [API:I][0.250375] matmul primitive_desc create - attr [PROF:I][0.264829] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.00174,ms [API:I][0.250389] matmul primitive create [CORE:I][0.265017] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.265022] M: 8 N: 1024 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.250296] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.250442] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, 
transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=1024 lda=4096 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.146ms graph_exe_count=-1 weight_address=0x54247700 [PROF:I][0.265012] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x1024:8x1024,0.171216,ms [PROF:V0][0.250576] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,1.05786,ms [CORE:I][0.265001] CPU Stream deleted [stream] [CORE:I][0.265406] CPU Engine deleted [engine] [API:I][0.265689] Memory create [CORE:V0][0.265677] Memory desc init by tag [memory] [CORE:I][0.265685] Memory created [memory] [API:I][0.265707] Memory create - strides [CORE:I][0.265696] Memory desc init by Stride [memory] [CORE:I][0.265701] Memory created [memory] [API:I][0.265725] Memory create [CORE:V0][0.265711] Memory desc init by tag [memory] [CORE:I][0.265716] Memory created [memory] [API:I][0.265648] matmul desc create - no bias [CORE:I][0.265647] matmul desc init [matmul] [API:I][0.265665] matmul primitive_desc create - attr [PROF:I][0.265446] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.0019,ms [API:I][0.265678] matmul primitive create [CORE:I][0.265633] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.265638] M: 8 N: 4096 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.250912] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.251472] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=4096 lda=4096 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.56ms graph_exe_count=-1 weight_address=0x70d86f1eb040 
[PROF:I][0.266043] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x4096:8x4096,0.58568,ms [API:I][0.266455] Memory create [CORE:V0][0.266443] Memory desc init by tag [memory] [CORE:I][0.266450] Memory created [memory] [API:I][0.266472] Memory create - strides [CORE:I][0.266461] Memory desc init by Stride [memory] [CORE:I][0.266466] Memory created [memory] [API:I][0.266489] Memory create [CORE:V0][0.266476] Memory desc init by tag [memory] [CORE:I][0.266482] Memory created [memory] [API:I][0.266412] matmul desc create - no bias [CORE:I][0.266412] matmul desc init [matmul] [API:I][0.266428] matmul primitive_desc create - attr [PROF:I][0.266209] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,0.00184,ms [API:I][0.266441] matmul primitive create [CORE:I][0.266395] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.266400] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.251674] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.253465] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.792ms graph_exe_count=-1 weight_address=0x70d8731ec040 [PROF:I][0.268036] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x4096:4096x14336:8x14336,1.81683,ms [API:I][0.268402] Memory create [CORE:V0][0.268390] Memory desc init by tag [memory] [CORE:I][0.268396] Memory created [memory] [API:I][0.268418] Memory create - strides [CORE:I][0.268405] Memory desc init by Stride [memory] 
[CORE:I][0.268410] Memory created [memory] [API:I][0.268432] Memory create [CORE:V0][0.268419] Memory desc init by tag [memory] [CORE:I][0.268423] Memory created [memory] [API:I][0.268448] Memory create [CORE:V0][0.268435] Memory desc init by tag [memory] [CORE:I][0.268442] Memory created [memory] [API:I][0.268467] Memory create [CORE:V0][0.268453] Memory desc init by tag [memory] [CORE:I][0.268458] Memory created [memory] [API:I][0.268392] matmul desc create - no bias [CORE:I][0.268391] matmul desc init [matmul] [API:I][0.268410] matmul primitive_desc create - attr [PROF:I][0.268192] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,0.002261,ms [API:I][0.268426] matmul primitive create [CORE:I][0.268383] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.268388] M: 8 N: 14336 K: 4096 transA: N transB: T lda: 4096 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.253663] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.255370] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=4096 n=14336 lda=4096 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.709ms graph_exe_count=-1 weight_address=0x70d8811ed040 [PROF:I][0.269942] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:3:ab ,,8x4096:4096x14336:8x14336,1.73504,ms [API:I][0.270312] Memory create [CORE:V0][0.270300] Memory desc init by tag [memory] [CORE:I][0.270307] Memory created [memory] [API:I][0.270329] Memory create - strides [CORE:I][0.270315] Memory desc init by Stride [memory] [CORE:I][0.270321] Memory created [memory] [API:I][0.270344] Memory 
create [CORE:V0][0.270330] Memory desc init by tag [memory] [CORE:I][0.270335] Memory created [memory] [API:I][0.270266] matmul desc create - no bias [CORE:I][0.270266] matmul desc init [matmul] [API:I][0.270284] matmul primitive_desc create - attr [PROF:I][0.270066] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,0.00245,ms [API:I][0.270298] matmul primitive create [CORE:I][0.270252] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.270255] M: 8 N: 4096 K: 14336 transA: N transB: T lda: 14336 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.255527] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.257152] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasNoTrans, transb=CblasTrans, m=8 k=14336 n=4096 lda=14336 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.625ms graph_exe_count=-1 weight_address=0x70d88f1ee040 [PROF:I][0.271724] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,8x14336:14336x4096:8x4096,1.64781,ms [API:I][0.272118] Memory create [CORE:V0][0.272106] Memory desc init by tag [memory] [CORE:I][0.272113] Memory created [memory] [API:I][0.272136] Memory create - strides [CORE:I][0.272125] Memory desc init by Stride [memory] [CORE:I][0.272131] Memory created [memory] [API:I][0.272153] Memory create [CORE:V0][0.272139] Memory desc init by tag [memory] [CORE:I][0.272145] Memory created [memory] [API:I][0.272076] matmul desc create - no bias [CORE:I][0.272076] matmul desc init [matmul] [CORE:V0][0.272043] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][0.272178] Memory desc init by tag [memory] [CORE:V0][0.272181] Memory desc init by tag [memory] [CORE:V0][0.272187] Memory desc init by tag [memory] [CORE:V0][0.272192] Memory desc 
init by tag [memory] [CORE:V0][0.272069] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][0.272127] matmul primitive_desc create - attr [PROF:I][0.271928] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x32000:1x32000,0.00894,ms [API:I][0.272160] matmul primitive create [CORE:I][0.272113] zendnn_f32_matmul_t::execute_ref [CORE:V0][0.272117] M: 1 N: 32000 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 32000 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][0.257388] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][0.281003] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=32000 lda=1 ldb=4096 ldc=32000 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=23.615ms graph_exe_count=-1 weight_address=0x70dc7cbc9040 [PROF:I][0.295606] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x32000:1x32000,23.6684,ms [API:I][12.136671] Memory create [CORE:V0][12.136673] Memory desc init by tag [memory] [CORE:I][12.136685] Memory created [memory] [API:I][12.136707] Memory create [CORE:V0][12.136692] Memory desc init by tag [memory] [CORE:I][12.136696] Memory created [memory] [API:I][12.136719] Memory create [CORE:V0][12.136704] Memory desc init by tag [memory] [CORE:I][12.136709] Memory created [memory] [API:I][12.136646] matmul desc create - no bias [CORE:I][12.136645] matmul desc init [matmul] [CORE:V0][12.136618] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.136753] Memory desc init by tag [memory] [CORE:V0][12.136756] Memory desc init by tag [memory] [CORE:V0][12.136760] Memory desc init by tag [memory] [CORE:V0][12.136545] ZenDNN Ref gemm_f32_matmul_t::pd_t::init() [CORE:V0][12.136552] ZenDNN Ref 
gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.136717] matmul primitive_desc create - attr [PROF:I][12.136527] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_bmm,matmul,gemm:jit,undef,src_f32::blocked:abc:f0 wei_f32::blocked:abc:f0 dst_f32::blocked:abc:f0,,,1x64x1:1x1x1:1x64x1,0.014351,ms [API:I][12.136760] matmul primitive create [CORE:I][12.136615] ZenDNN Ref gemm_f32_matmul_t::execute_ref [PROF:I][12.136552] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_bmm,matmul,gemm:jit,undef,src_f32::blocked:abc:f0 wei_f32::blocked:abc:f0 dst_f32::blocked:abc:f0,,,1x64x1:1x1x1:1x64x1,0.01323,ms [API:I][12.136960] Memory create [CORE:V0][12.136959] Memory desc init by tag [memory] [CORE:I][12.136966] Memory created [memory] [API:I][12.136987] Memory create - strides [CORE:I][12.136975] Memory desc init by Stride [memory] [CORE:I][12.136981] Memory created [memory] [API:I][12.137002] Memory create [CORE:V0][12.136988] Memory desc init by tag [memory] [CORE:I][12.136995] Memory created [memory] [API:I][12.137017] Memory create [CORE:V0][12.137003] Memory desc init by tag [memory] [CORE:I][12.137008] Memory created [memory] [API:I][12.137029] Memory create - strides [CORE:I][12.137013] Memory desc init by Stride [memory] [CORE:I][12.137017] Memory created [memory] [API:I][12.137039] Memory create [CORE:V0][12.137025] Memory desc init by tag [memory] [CORE:I][12.137028] Memory created [memory] [API:I][12.137051] Memory create [CORE:V0][12.137036] Memory desc init by tag [memory] [CORE:I][12.137041] Memory created [memory] [API:I][12.137062] Memory create - strides [CORE:I][12.137047] Memory desc init by Stride [memory] [CORE:I][12.137051] Memory created [memory] [API:I][12.137072] Memory create [CORE:V0][12.137058] Memory desc init by tag [memory] [CORE:I][12.137062] Memory created [memory] [API:I][12.122319] CPU Engine create [CORE:V0][12.137140] CPU Engine created [engine] [CORE:I][12.137144] CPU Engine created [cpu/engine] 
[API:I][12.122330] CPU Stream create [CORE:I][12.136753] CPU Stream created [stream] [CORE:V0][12.136752] CPU Stream created [cpu/stream] [API:I][12.122347] matmul desc create - no bias [CORE:I][12.137018] matmul desc init [matmul] [CORE:V0][12.136980] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.137112] Memory desc init by tag [memory] [CORE:V0][12.137115] Memory desc init by tag [memory] [CORE:V0][12.137119] Memory desc init by tag [memory] [CORE:V0][12.137123] Memory desc init by tag [memory] [CORE:V0][12.136999] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.122385] matmul primitive_desc create - attr [PROF:I][12.136849] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.006681,ms [API:I][12.122409] matmul primitive create [CORE:I][12.137037] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.137040] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.122314] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.123954] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.641ms graph_exe_count=-1 weight_address=0x70de81ffb040 [PROF:I][12.138527] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.66656,ms [API:I][12.124091] matmul desc create - no bias [CORE:I][12.138761] matmul desc init [matmul] [CORE:V0][12.138722] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.138854] Memory desc init by tag [memory] [CORE:V0][12.138857] Memory desc init by tag [memory] [CORE:V0][12.138860] Memory desc init by tag [memory] 
[CORE:V0][12.138864] Memory desc init by tag [memory] [CORE:V0][12.138739] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.124123] matmul primitive_desc create - attr [PROF:I][12.138583] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00475,ms [API:I][12.124142] matmul primitive create [CORE:I][12.138767] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.138771] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.124041] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.124497] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.456ms graph_exe_count=-1 weight_address=0x1514d7c0 [PROF:I][12.139067] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.475066,ms [API:I][12.124629] matmul desc create - no bias [CORE:I][12.139298] matmul desc init [matmul] [API:I][12.124637] matmul primitive_desc create - attr [PROF:I][12.139089] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.0017,ms [API:I][12.124648] matmul primitive create [CORE:I][12.139274] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.139277] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.124547] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.124882] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, 
transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.335ms graph_exe_count=-1 weight_address=0x1614d800 [PROF:I][12.139452] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.353722,ms [PROF:V0][12.125012] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.69385,ms [CORE:I][12.139436] CPU Stream deleted [stream] [CORE:I][12.139840] CPU Engine deleted [engine] [API:I][12.140264] Memory create [CORE:V0][12.140253] Memory desc init by tag [memory] [CORE:I][12.140259] Memory created [memory] [API:I][12.140282] Memory create - strides [CORE:I][12.140268] Memory desc init by Stride [memory] [CORE:I][12.140273] Memory created [memory] [API:I][12.140294] Memory create [CORE:V0][12.140280] Memory desc init by tag [memory] [CORE:I][12.140284] Memory created [memory] [API:I][12.140215] matmul desc create - no bias [CORE:I][12.140212] matmul desc init [matmul] [CORE:V0][12.140175] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.140308] Memory desc init by tag [memory] [CORE:V0][12.140312] Memory desc init by tag [memory] [CORE:V0][12.140315] Memory desc init by tag [memory] [CORE:V0][12.140319] Memory desc init by tag [memory] [CORE:V0][12.140195] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.140251] matmul primitive_desc create - attr [PROF:I][12.140043] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00641,ms [API:I][12.140275] matmul primitive create [CORE:I][12.140230] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.140233] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) 
[PROF:V0][12.125505] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.127176] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.672ms graph_exe_count=-1 weight_address=0x70de85ffc040 [PROF:I][12.141748] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.69336,ms [API:I][12.142116] Memory create [CORE:V0][12.142104] Memory desc init by tag [memory] [CORE:I][12.142109] Memory created [memory] [API:I][12.142130] Memory create - strides [CORE:I][12.142116] Memory desc init by Stride [memory] [CORE:I][12.142121] Memory created [memory] [API:I][12.142143] Memory create [CORE:V0][12.142129] Memory desc init by tag [memory] [CORE:I][12.142133] Memory created [memory] [API:I][12.142064] matmul desc create - no bias [CORE:I][12.142061] matmul desc init [matmul] [CORE:V0][12.142024] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.142157] Memory desc init by tag [memory] [CORE:V0][12.142160] Memory desc init by tag [memory] [CORE:V0][12.142165] Memory desc init by tag [memory] [CORE:V0][12.142169] Memory desc init by tag [memory] [CORE:V0][12.142045] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.142102] matmul primitive_desc create - attr [PROF:I][12.141891] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.005431,ms [API:I][12.142123] matmul primitive create [CORE:I][12.142076] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.142080] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.127351] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 
[PROF:V0][12.133434] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.082ms graph_exe_count=-1 weight_address=0x70de89ffd040 [PROF:I][12.148015] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.11443,ms [API:I][12.148427] Memory create [CORE:V0][12.148415] Memory desc init by tag [memory] [CORE:I][12.148422] Memory created [memory] [API:I][12.148444] Memory create - strides [CORE:I][12.148429] Memory desc init by Stride [memory] [CORE:I][12.148433] Memory created [memory] [API:I][12.148454] Memory create [CORE:V0][12.148439] Memory desc init by tag [memory] [CORE:I][12.148446] Memory created [memory] [API:I][12.148469] Memory create [CORE:V0][12.148455] Memory desc init by tag [memory] [CORE:I][12.148461] Memory created [memory] [API:I][12.148485] Memory create [CORE:V0][12.148471] Memory desc init by tag [memory] [CORE:I][12.148475] Memory created [memory] [API:I][12.148408] matmul desc create - no bias [CORE:I][12.148407] matmul desc init [matmul] [CORE:V0][12.148374] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.148508] Memory desc init by tag [memory] [CORE:V0][12.148512] Memory desc init by tag [memory] [CORE:V0][12.148517] Memory desc init by tag [memory] [CORE:V0][12.148520] Memory desc init by tag [memory] [CORE:V0][12.148396] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.148455] matmul primitive_desc create - attr [PROF:I][12.148503] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.249438,ms [API:I][12.148739] matmul primitive create [CORE:I][12.148695] 
zendnn_f32_matmul_t::execute_ref [CORE:V0][12.148698] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.133972] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.139910] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=5.939ms graph_exe_count=-1 weight_address=0x70de97ffe040 [PROF:I][12.154494] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,5.97552,ms [API:I][12.154938] Memory create [CORE:V0][12.154928] Memory desc init by tag [memory] [CORE:I][12.154938] Memory created [memory] [API:I][12.154961] Memory create - strides [CORE:I][12.154956] Memory desc init by Stride [memory] [CORE:I][12.154961] Memory created [memory] [API:I][12.154983] Memory create [CORE:V0][12.154968] Memory desc init by tag [memory] [CORE:I][12.154973] Memory created [memory] [API:I][12.154907] matmul desc create - no bias [CORE:I][12.154906] matmul desc init [matmul] [CORE:V0][12.154878] zendnn_f32_matmul_t::pd_t::init() [CORE:V0][12.155012] Memory desc init by tag [memory] [CORE:V0][12.155016] Memory desc init by tag [memory] [CORE:V0][12.155020] Memory desc init by tag [memory] [CORE:V0][12.155023] Memory desc init by tag [memory] [CORE:V0][12.154900] zendnn_gemm_f32_matmul_t::pd_t::check_and_configure_attributes [API:I][12.154960] matmul primitive_desc create - attr [PROF:I][12.154765] zendnn_primitive_create,cache_miss,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.01288,ms [API:I][12.154997] matmul primitive create [CORE:I][12.154952] 
zendnn_f32_matmul_t::execute_ref [CORE:V0][12.154956] M: 1 N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.140229] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.146119] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=5.891ms graph_exe_count=-1 weight_address=0x70dea5fff040 [PROF:I][12.160705] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,5.92887,ms [API:I][12.161200] Memory create [CORE:V0][12.161190] Memory desc init by tag [memory] [CORE:I][12.161199] Memory created [memory] [API:I][12.161221] Memory create - strides [CORE:I][12.161208] Memory desc init by Stride [memory] [CORE:I][12.161211] Memory created [memory] [API:I][12.161233] Memory create [CORE:V0][12.161218] Memory desc init by tag [memory] [CORE:I][12.161223] Memory created [memory] [API:I][12.161245] Memory create [CORE:V0][12.161231] Memory desc init by tag [memory] [CORE:I][12.161236] Memory created [memory] [API:I][12.161258] Memory create - strides [CORE:I][12.161244] Memory desc init by Stride [memory] [CORE:I][12.161248] Memory created [memory] [API:I][12.161271] Memory create [CORE:V0][12.161256] Memory desc init by tag [memory] [CORE:I][12.161260] Memory created [memory] [API:I][12.161283] Memory create [CORE:V0][12.161267] Memory desc init by tag [memory] [CORE:I][12.161271] Memory created [memory] [API:I][12.161292] Memory create - strides [CORE:I][12.161277] Memory desc init by Stride [memory] [CORE:I][12.161282] Memory created [memory] [API:I][12.161303] Memory create [CORE:V0][12.161288] Memory desc init by tag [memory] [CORE:I][12.161292] Memory created [memory] [API:I][12.146548] CPU Engine create 
[CORE:V0][12.161370] CPU Engine created [engine] [CORE:I][12.161374] CPU Engine created [cpu/engine] [API:I][12.146560] CPU Stream create [CORE:I][12.160981] CPU Stream created [stream] [CORE:V0][12.160980] CPU Stream created [cpu/stream] [API:I][12.146578] matmul desc create - no bias [CORE:I][12.161250] matmul desc init [matmul] [API:I][12.146600] matmul primitive_desc create - attr [PROF:I][12.161059] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.004241,ms [API:I][12.146619] matmul primitive create [CORE:I][12.161248] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.161252] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.146526] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.148220] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.694ms graph_exe_count=-1 weight_address=0x70de4fff6040 [PROF:I][12.162790] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71894,ms [API:I][12.148353] matmul desc create - no bias [CORE:I][12.163023] matmul desc init [matmul] [API:I][12.148365] matmul primitive_desc create - attr [PROF:I][12.162818] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.001721,ms [API:I][12.148378] matmul primitive create [CORE:I][12.163003] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.163006] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 
batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.148277] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.148735] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.458ms graph_exe_count=-1 weight_address=0x171558c0 [PROF:I][12.163305] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.477416,ms [API:I][12.148867] matmul desc create - no bias [CORE:I][12.163537] matmul desc init [matmul] [API:I][12.148876] matmul primitive_desc create - attr [PROF:I][12.163328] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00108,ms [API:I][12.148887] matmul primitive create [CORE:I][12.163513] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.163516] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.148786] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.149116] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.33ms graph_exe_count=-1 weight_address=0x18155900 [PROF:I][12.163686] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.349231,ms [PROF:V0][12.149246] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.69897,ms [CORE:I][12.163670] CPU Stream deleted [stream] [CORE:I][12.164075] CPU Engine deleted 
[engine] [API:I][12.164253] Memory create [CORE:V0][12.164243] Memory desc init by tag [memory] [CORE:I][12.164249] Memory created [memory] [API:I][12.164270] Memory create - strides [CORE:I][12.164256] Memory desc init by Stride [memory] [CORE:I][12.164260] Memory created [memory] [API:I][12.164282] Memory create [CORE:V0][12.164268] Memory desc init by tag [memory] [CORE:I][12.164272] Memory created [memory] [API:I][12.164203] matmul desc create - no bias [CORE:I][12.164201] matmul desc init [matmul] [API:I][12.164218] matmul primitive_desc create - attr [PROF:I][12.163999] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00236,ms [API:I][12.164232] matmul primitive create [CORE:I][12.164187] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.164191] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.149462] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.150999] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.537ms graph_exe_count=-1 weight_address=0x70de53ff7040 [PROF:I][12.165570] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.55834,ms [API:I][12.165938] Memory create [CORE:V0][12.165926] Memory desc init by tag [memory] [CORE:I][12.165930] Memory created [memory] [API:I][12.165952] Memory create - strides [CORE:I][12.165937] Memory desc init by Stride [memory] [CORE:I][12.165942] Memory created [memory] [API:I][12.165963] Memory create [CORE:V0][12.165961] Memory desc init by tag [memory] [CORE:I][12.165965] Memory created [memory] [API:I][12.165894] matmul 
desc create - no bias [CORE:I][12.165891] matmul desc init [matmul] [API:I][12.165908] matmul primitive_desc create - attr [PROF:I][12.165689] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.0016,ms [API:I][12.165921] matmul primitive create [CORE:I][12.165876] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.165879] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.151150] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.157593] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.443ms graph_exe_count=-1 weight_address=0x70de57ff8040 [PROF:I][12.172181] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.48022,ms [API:I][12.172602] Memory create [CORE:V0][12.172591] Memory desc init by tag [memory] [CORE:I][12.172599] Memory created [memory] [API:I][12.172621] Memory create - strides [CORE:I][12.172607] Memory desc init by Stride [memory] [CORE:I][12.172611] Memory created [memory] [API:I][12.172632] Memory create [CORE:V0][12.172617] Memory desc init by tag [memory] [CORE:I][12.172622] Memory created [memory] [API:I][12.172647] Memory create [CORE:V0][12.172632] Memory desc init by tag [memory] [CORE:I][12.172637] Memory created [memory] [API:I][12.172662] Memory create [CORE:V0][12.172649] Memory desc init by tag [memory] [CORE:I][12.172653] Memory created [memory] [API:I][12.172587] matmul desc create - no bias [CORE:I][12.172585] matmul desc init [matmul] [API:I][12.172609] matmul primitive_desc create - attr [PROF:I][12.172395] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003491,ms [API:I][12.172629] matmul primitive create [CORE:I][12.172586] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.172589] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.157863] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.164029] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.166ms graph_exe_count=-1 weight_address=0x70de65ff9040 [PROF:I][12.178613] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.20331,ms [API:I][12.179046] Memory create [CORE:V0][12.179034] Memory desc init by tag [memory] [CORE:I][12.179044] Memory created [memory] [API:I][12.179067] Memory create - strides [CORE:I][12.179053] Memory desc init by Stride [memory] [CORE:I][12.179056] Memory created [memory] [API:I][12.179078] Memory create [CORE:V0][12.179063] Memory desc init by tag [memory] [CORE:I][12.179072] Memory created [memory] [API:I][12.179006] matmul desc create - no bias [CORE:I][12.179005] matmul desc init [matmul] [API:I][12.179028] matmul primitive_desc create - attr [PROF:I][12.178814] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00328,ms [API:I][12.179047] matmul primitive create [CORE:I][12.179003] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.179006] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.164280] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.170293] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.013ms graph_exe_count=-1 weight_address=0x70de73ffa040 [PROF:I][12.184878] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.05135,ms [API:I][12.185360] Memory create [CORE:V0][12.185350] Memory desc init by tag [memory] [CORE:I][12.185358] Memory created [memory] [API:I][12.185380] Memory create - strides [CORE:I][12.185366] Memory desc init by Stride [memory] [CORE:I][12.185372] Memory created [memory] [API:I][12.185393] Memory create [CORE:V0][12.185378] Memory desc init by tag [memory] [CORE:I][12.185384] Memory created [memory] [API:I][12.185407] Memory create [CORE:V0][12.185393] Memory desc init by tag [memory] [CORE:I][12.185398] Memory created [memory] [API:I][12.185419] Memory create - strides [CORE:I][12.185403] Memory desc init by Stride [memory] [CORE:I][12.185408] Memory created [memory] [API:I][12.185431] Memory create [CORE:V0][12.185417] Memory desc init by tag [memory] [CORE:I][12.185420] Memory created [memory] [API:I][12.185442] Memory create [CORE:V0][12.185428] Memory desc init by tag [memory] [CORE:I][12.185432] Memory created [memory] [API:I][12.185453] Memory create - strides [CORE:I][12.185438] Memory desc init by Stride [memory] [CORE:I][12.185443] Memory created [memory] [API:I][12.185466] Memory create [CORE:V0][12.185451] Memory desc init by tag [memory] [CORE:I][12.185454] Memory created [memory] [API:I][12.170710] CPU Engine create [CORE:V0][12.185532] CPU Engine created [engine] [CORE:I][12.185536] CPU 
Engine created [cpu/engine] [API:I][12.170723] CPU Stream create [CORE:I][12.185145] CPU Stream created [stream] [CORE:V0][12.185145] CPU Stream created [cpu/stream] [API:I][12.170743] matmul desc create - no bias [CORE:I][12.185414] matmul desc init [matmul] [API:I][12.170762] matmul primitive_desc create - attr [PROF:I][12.185219] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.003741,ms [API:I][12.170780] matmul primitive create [CORE:I][12.185408] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.185412] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.170686] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.172292] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.607ms graph_exe_count=-1 weight_address=0x70ddf9fed040 [PROF:I][12.186863] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.63208,ms [API:I][12.172426] matmul desc create - no bias [CORE:I][12.187097] matmul desc init [matmul] [API:I][12.172438] matmul primitive_desc create - attr [PROF:I][12.186891] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.001301,ms [API:I][12.172451] matmul primitive create [CORE:I][12.187076] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.187079] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.172350] Using AOCL GEMM 
API: aocl_gemm_f32f32f32of32 [PROF:V0][12.172814] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.464ms graph_exe_count=-1 weight_address=0x1b15da40 [PROF:I][12.187384] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.482986,ms [API:I][12.172944] matmul desc create - no bias [CORE:I][12.187614] matmul desc init [matmul] [API:I][12.172954] matmul primitive_desc create - attr [PROF:I][12.187405] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00087,ms [API:I][12.172964] matmul primitive create [CORE:I][12.187589] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.187592] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.172863] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.173181] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.317ms graph_exe_count=-1 weight_address=0x1c15da80 [PROF:I][12.187751] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.336841,ms [PROF:V0][12.173311] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.60107,ms [CORE:I][12.187735] CPU Stream deleted [stream] [CORE:I][12.188139] CPU Engine deleted [engine] [API:I][12.188311] Memory create [CORE:V0][12.188300] Memory desc 
init by tag [memory] [CORE:I][12.188306] Memory created [memory] [API:I][12.188328] Memory create - strides [CORE:I][12.188315] Memory desc init by Stride [memory] [CORE:I][12.188319] Memory created [memory] [API:I][12.188341] Memory create [CORE:V0][12.188327] Memory desc init by tag [memory] [CORE:I][12.188333] Memory created [memory] [API:I][12.188263] matmul desc create - no bias [CORE:I][12.188260] matmul desc init [matmul] [API:I][12.188278] matmul primitive_desc create - attr [PROF:I][12.188060] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00224,ms [API:I][12.188292] matmul primitive create [CORE:I][12.188247] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.188250] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.173522] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.175238] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.717ms graph_exe_count=-1 weight_address=0x70ddfdfee040 [PROF:I][12.189808] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.73677,ms [API:I][12.190173] Memory create [CORE:V0][12.190160] Memory desc init by tag [memory] [CORE:I][12.190165] Memory created [memory] [API:I][12.190186] Memory create - strides [CORE:I][12.190171] Memory desc init by Stride [memory] [CORE:I][12.190175] Memory created [memory] [API:I][12.190196] Memory create [CORE:V0][12.190181] Memory desc init by tag [memory] [CORE:I][12.190186] Memory created [memory] [API:I][12.190114] matmul desc create - no bias [CORE:I][12.190111] matmul desc init [matmul] 
[API:I][12.190127] matmul primitive_desc create - attr [PROF:I][12.189907] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00153,ms [API:I][12.190138] matmul primitive create [CORE:I][12.190094] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.190098] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.175369] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.181655] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.287ms graph_exe_count=-1 weight_address=0x70de01fef040 [PROF:I][12.196240] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.32197,ms [API:I][12.196648] Memory create [CORE:V0][12.196636] Memory desc init by tag [memory] [CORE:I][12.196644] Memory created [memory] [API:I][12.196665] Memory create - strides [CORE:I][12.196651] Memory desc init by Stride [memory] [CORE:I][12.196654] Memory created [memory] [API:I][12.196675] Memory create [CORE:V0][12.196660] Memory desc init by tag [memory] [CORE:I][12.196666] Memory created [memory] [API:I][12.196690] Memory create [CORE:V0][12.196675] Memory desc init by tag [memory] [CORE:I][12.196680] Memory created [memory] [API:I][12.196705] Memory create [CORE:V0][12.196690] Memory desc init by tag [memory] [CORE:I][12.196694] Memory created [memory] [API:I][12.196628] matmul desc create - no bias [CORE:I][12.196626] matmul desc init [matmul] [API:I][12.196647] matmul primitive_desc create - attr [PROF:I][12.196433] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00355,ms [API:I][12.196665] matmul primitive create [CORE:I][12.196622] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.196626] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.181902] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.188454] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.555ms graph_exe_count=-1 weight_address=0x70de0fff0040 [PROF:I][12.203041] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.59589,ms [API:I][12.203462] Memory create [CORE:V0][12.203450] Memory desc init by tag [memory] [CORE:I][12.203460] Memory created [memory] [API:I][12.203483] Memory create - strides [CORE:I][12.203468] Memory desc init by Stride [memory] [CORE:I][12.203472] Memory created [memory] [API:I][12.203494] Memory create [CORE:V0][12.203479] Memory desc init by tag [memory] [CORE:I][12.203484] Memory created [memory] [API:I][12.203418] matmul desc create - no bias [CORE:I][12.203417] matmul desc init [matmul] [API:I][12.203440] matmul primitive_desc create - attr [PROF:I][12.203225] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00312,ms [API:I][12.203458] matmul primitive create [CORE:I][12.203416] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.203420] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.188693] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.195361] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.669ms graph_exe_count=-1 weight_address=0x70de1dff1040 [PROF:I][12.209946] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.70663,ms [API:I][12.210423] Memory create [CORE:V0][12.210414] Memory desc init by tag [memory] [CORE:I][12.210423] Memory created [memory] [API:I][12.210445] Memory create - strides [CORE:I][12.210431] Memory desc init by Stride [memory] [CORE:I][12.210435] Memory created [memory] [API:I][12.210456] Memory create [CORE:V0][12.210441] Memory desc init by tag [memory] [CORE:I][12.210446] Memory created [memory] [API:I][12.210469] Memory create [CORE:V0][12.210455] Memory desc init by tag [memory] [CORE:I][12.210459] Memory created [memory] [API:I][12.210481] Memory create - strides [CORE:I][12.210466] Memory desc init by Stride [memory] [CORE:I][12.210471] Memory created [memory] [API:I][12.210494] Memory create [CORE:V0][12.210479] Memory desc init by tag [memory] [CORE:I][12.210483] Memory created [memory] [API:I][12.210506] Memory create [CORE:V0][12.210491] Memory desc init by tag [memory] [CORE:I][12.210496] Memory created [memory] [API:I][12.210517] Memory create - strides [CORE:I][12.210502] Memory desc init by Stride [memory] [CORE:I][12.210507] Memory created [memory] [API:I][12.210529] Memory create [CORE:V0][12.210515] Memory desc init by tag [memory] [CORE:I][12.210519] Memory created [memory] [API:I][12.195775] CPU Engine create [CORE:V0][12.210598] CPU Engine created [engine] [CORE:I][12.210601] CPU Engine 
created [cpu/engine] [API:I][12.195788] CPU Stream create [CORE:I][12.210209] CPU Stream created [stream] [CORE:V0][12.210207] CPU Stream created [cpu/stream] [API:I][12.195805] matmul desc create - no bias [CORE:I][12.210477] matmul desc init [matmul] [API:I][12.195827] matmul primitive_desc create - attr [PROF:I][12.210282] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00362,ms [API:I][12.195842] matmul primitive create [CORE:I][12.210473] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.210477] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.195752] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.197425] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.674ms graph_exe_count=-1 weight_address=0x70ddc7fe8040 [PROF:I][12.211998] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.70157,ms [API:I][12.197561] matmul desc create - no bias [CORE:I][12.212232] matmul desc init [matmul] [API:I][12.197574] matmul primitive_desc create - attr [PROF:I][12.212026] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.0012,ms [API:I][12.197586] matmul primitive create [CORE:I][12.212212] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.212215] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.197486] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.197916] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.43ms graph_exe_count=-1 weight_address=0x1d165b40 [PROF:I][12.212485] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.449225,ms [API:I][12.198045] matmul desc create - no bias [CORE:I][12.212715] matmul desc init [matmul] [API:I][12.198054] matmul primitive_desc create - attr [PROF:I][12.212506] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00089,ms [API:I][12.198065] matmul primitive create [CORE:I][12.212691] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.212694] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.197964] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.198290] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.325ms graph_exe_count=-1 weight_address=0x1e165b80 [PROF:I][12.212860] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.344501,ms [PROF:V0][12.198420] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.64502,ms [CORE:I][12.212844] CPU Stream deleted [stream] [CORE:I][12.213250] CPU Engine deleted [engine] [API:I][12.213413] Memory create [CORE:V0][12.213402] Memory desc init 
by tag [memory] [CORE:I][12.213408] Memory created [memory] [API:I][12.213429] Memory create - strides [CORE:I][12.213415] Memory desc init by Stride [memory] [CORE:I][12.213419] Memory created [memory] [API:I][12.213440] Memory create [CORE:V0][12.213426] Memory desc init by tag [memory] [CORE:I][12.213430] Memory created [memory] [API:I][12.213360] matmul desc create - no bias [CORE:I][12.213359] matmul desc init [matmul] [API:I][12.213376] matmul primitive_desc create - attr [PROF:I][12.213157] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00227,ms [API:I][12.213391] matmul primitive create [CORE:I][12.213345] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.213348] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.198620] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.200333] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.712ms graph_exe_count=-1 weight_address=0x70ddcbfe9040 [PROF:I][12.214903] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.73389,ms [API:I][12.215272] Memory create [CORE:V0][12.215259] Memory desc init by tag [memory] [CORE:I][12.215264] Memory created [memory] [API:I][12.215285] Memory create - strides [CORE:I][12.215272] Memory desc init by Stride [memory] [CORE:I][12.215275] Memory created [memory] [API:I][12.215297] Memory create [CORE:V0][12.215284] Memory desc init by tag [memory] [CORE:I][12.215288] Memory created [memory] [API:I][12.215217] matmul desc create - no bias [CORE:I][12.215215] matmul desc init [matmul] 
[API:I][12.215231] matmul primitive_desc create - attr [PROF:I][12.215011] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00161,ms [API:I][12.215243] matmul primitive create [CORE:I][12.215198] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.215202] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.200473] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.206760] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.287ms graph_exe_count=-1 weight_address=0x70ddcffea040 [PROF:I][12.221348] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.32514,ms [API:I][12.221765] Memory create [CORE:V0][12.221754] Memory desc init by tag [memory] [CORE:I][12.221764] Memory created [memory] [API:I][12.221786] Memory create - strides [CORE:I][12.221772] Memory desc init by Stride [memory] [CORE:I][12.221777] Memory created [memory] [API:I][12.221799] Memory create [CORE:V0][12.221785] Memory desc init by tag [memory] [CORE:I][12.221791] Memory created [memory] [API:I][12.221817] Memory create [CORE:V0][12.221803] Memory desc init by tag [memory] [CORE:I][12.221807] Memory created [memory] [API:I][12.221830] Memory create [CORE:V0][12.221816] Memory desc init by tag [memory] [CORE:I][12.221820] Memory created [memory] [API:I][12.221754] matmul desc create - no bias [CORE:I][12.221752] matmul desc init [matmul] [API:I][12.221774] matmul primitive_desc create - attr [PROF:I][12.221560] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00321,ms [API:I][12.221793] matmul primitive create [CORE:I][12.221752] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.221755] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.207029] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.213354] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.326ms graph_exe_count=-1 weight_address=0x70ddddfeb040 [PROF:I][12.227940] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.36484,ms [API:I][12.228364] Memory create [CORE:V0][12.228353] Memory desc init by tag [memory] [CORE:I][12.228360] Memory created [memory] [API:I][12.228382] Memory create - strides [CORE:I][12.228369] Memory desc init by Stride [memory] [CORE:I][12.228372] Memory created [memory] [API:I][12.228394] Memory create [CORE:V0][12.228379] Memory desc init by tag [memory] [CORE:I][12.228384] Memory created [memory] [API:I][12.228318] matmul desc create - no bias [CORE:I][12.228316] matmul desc init [matmul] [API:I][12.228340] matmul primitive_desc create - attr [PROF:I][12.228125] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00334,ms [API:I][12.228358] matmul primitive create [CORE:I][12.228316] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.228320] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.213593] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.219798] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.206ms graph_exe_count=-1 weight_address=0x70ddebfec040 [PROF:I][12.234384] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.24468,ms [API:I][12.234870] Memory create [CORE:V0][12.234860] Memory desc init by tag [memory] [CORE:I][12.234870] Memory created [memory] [API:I][12.234891] Memory create - strides [CORE:I][12.234879] Memory desc init by Stride [memory] [CORE:I][12.234883] Memory created [memory] [API:I][12.234904] Memory create [CORE:V0][12.234889] Memory desc init by tag [memory] [CORE:I][12.234895] Memory created [memory] [API:I][12.234917] Memory create [CORE:V0][12.234903] Memory desc init by tag [memory] [CORE:I][12.234907] Memory created [memory] [API:I][12.234929] Memory create - strides [CORE:I][12.234913] Memory desc init by Stride [memory] [CORE:I][12.234917] Memory created [memory] [API:I][12.234940] Memory create [CORE:V0][12.234925] Memory desc init by tag [memory] [CORE:I][12.234929] Memory created [memory] [API:I][12.234951] Memory create [CORE:V0][12.234938] Memory desc init by tag [memory] [CORE:I][12.234942] Memory created [memory] [API:I][12.234963] Memory create - strides [CORE:I][12.234957] Memory desc init by Stride [memory] [CORE:I][12.234961] Memory created [memory] [API:I][12.234982] Memory create [CORE:V0][12.234967] Memory desc init by tag [memory] [CORE:I][12.234972] Memory created [memory] [API:I][12.220227] CPU Engine create [CORE:V0][12.235049] CPU Engine created [engine] [CORE:I][12.235053] CPU Engine 
created [cpu/engine] [API:I][12.220241] CPU Stream create [CORE:I][12.234662] CPU Stream created [stream] [CORE:V0][12.234661] CPU Stream created [cpu/stream] [API:I][12.220261] matmul desc create - no bias [CORE:I][12.234932] matmul desc init [matmul] [API:I][12.220282] matmul primitive_desc create - attr [PROF:I][12.234739] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00368,ms [API:I][12.220300] matmul primitive create [CORE:I][12.234929] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.234933] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.220208] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.221896] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.688ms graph_exe_count=-1 weight_address=0x70dd95fe3040 [PROF:I][12.236468] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71614,ms [API:I][12.222031] matmul desc create - no bias [CORE:I][12.236701] matmul desc init [matmul] [API:I][12.222044] matmul primitive_desc create - attr [PROF:I][12.236497] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00146,ms [API:I][12.222057] matmul primitive create [CORE:I][12.236682] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.236685] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.221956] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.222412] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.455ms graph_exe_count=-1 weight_address=0x1f16dc40 [PROF:I][12.236982] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.475546,ms [API:I][12.222544] matmul desc create - no bias [CORE:I][12.237214] matmul desc init [matmul] [API:I][12.222553] matmul primitive_desc create - attr [PROF:I][12.237004] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00091,ms [API:I][12.222563] matmul primitive create [CORE:I][12.237189] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.237192] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.222463] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.222786] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.324ms graph_exe_count=-1 weight_address=0x2016dc80 [PROF:I][12.237356] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.343161,ms [PROF:V0][12.222917] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.68896,ms [CORE:I][12.237341] CPU Stream deleted [stream] [CORE:I][12.237747] CPU Engine deleted [engine] [API:I][12.237919] Memory create [CORE:V0][12.237908] Memory desc init 
by tag [memory] [CORE:I][12.237916] Memory created [memory] [API:I][12.237937] Memory create - strides [CORE:I][12.237924] Memory desc init by Stride [memory] [CORE:I][12.237927] Memory created [memory] [API:I][12.237948] Memory create [CORE:V0][12.237934] Memory desc init by tag [memory] [CORE:I][12.237939] Memory created [memory] [API:I][12.237884] matmul desc create - no bias [CORE:I][12.237882] matmul desc init [matmul] [API:I][12.237900] matmul primitive_desc create - attr [PROF:I][12.237681] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00227,ms [API:I][12.237914] matmul primitive create [CORE:I][12.237869] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.237872] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.223145] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.224752] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.608ms graph_exe_count=-1 weight_address=0x70dd99fe4040 [PROF:I][12.239323] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.63068,ms [API:I][12.239689] Memory create [CORE:V0][12.239676] Memory desc init by tag [memory] [CORE:I][12.239681] Memory created [memory] [API:I][12.239702] Memory create - strides [CORE:I][12.239688] Memory desc init by Stride [memory] [CORE:I][12.239692] Memory created [memory] [API:I][12.239713] Memory create [CORE:V0][12.239698] Memory desc init by tag [memory] [CORE:I][12.239703] Memory created [memory] [API:I][12.239632] matmul desc create - no bias [CORE:I][12.239629] matmul desc init [matmul] 
[API:I][12.239645] matmul primitive_desc create - attr [PROF:I][12.239425] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00147,ms [API:I][12.239657] matmul primitive create [CORE:I][12.239612] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.239616] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.224887] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.231268] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.381ms graph_exe_count=-1 weight_address=0x70dd9dfe5040 [PROF:I][12.245855] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.41772,ms [API:I][12.246270] Memory create [CORE:V0][12.246258] Memory desc init by tag [memory] [CORE:I][12.246267] Memory created [memory] [API:I][12.246289] Memory create - strides [CORE:I][12.246277] Memory desc init by Stride [memory] [CORE:I][12.246280] Memory created [memory] [API:I][12.246301] Memory create [CORE:V0][12.246286] Memory desc init by tag [memory] [CORE:I][12.246291] Memory created [memory] [API:I][12.246315] Memory create [CORE:V0][12.246301] Memory desc init by tag [memory] [CORE:I][12.246306] Memory created [memory] [API:I][12.246332] Memory create [CORE:V0][12.246318] Memory desc init by tag [memory] [CORE:I][12.246324] Memory created [memory] [API:I][12.246258] matmul desc create - no bias [CORE:I][12.246257] matmul desc init [matmul] [API:I][12.246279] matmul primitive_desc create - attr [PROF:I][12.246064] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00325,ms [API:I][12.246298] matmul primitive create [CORE:I][12.246254] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.246257] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.231532] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.238119] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.588ms graph_exe_count=-1 weight_address=0x70ddabfe6040 [PROF:I][12.252704] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.62544,ms [API:I][12.253125] Memory create [CORE:V0][12.253113] Memory desc init by tag [memory] [CORE:I][12.253123] Memory created [memory] [API:I][12.253146] Memory create - strides [CORE:I][12.253132] Memory desc init by Stride [memory] [CORE:I][12.253136] Memory created [memory] [API:I][12.253157] Memory create [CORE:V0][12.253143] Memory desc init by tag [memory] [CORE:I][12.253149] Memory created [memory] [API:I][12.253082] matmul desc create - no bias [CORE:I][12.253081] matmul desc init [matmul] [API:I][12.253103] matmul primitive_desc create - attr [PROF:I][12.252887] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00321,ms [API:I][12.253120] matmul primitive create [CORE:I][12.253077] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.253080] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.238353] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.244663] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.31ms graph_exe_count=-1 weight_address=0x70ddb9fe7040 [PROF:I][12.259247] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.34661,ms [API:I][12.259723] Memory create [CORE:V0][12.259713] Memory desc init by tag [memory] [CORE:I][12.259722] Memory created [memory] [API:I][12.259743] Memory create - strides [CORE:I][12.259729] Memory desc init by Stride [memory] [CORE:I][12.259734] Memory created [memory] [API:I][12.259755] Memory create [CORE:V0][12.259740] Memory desc init by tag [memory] [CORE:I][12.259745] Memory created [memory] [API:I][12.259768] Memory create [CORE:V0][12.259753] Memory desc init by tag [memory] [CORE:I][12.259757] Memory created [memory] [API:I][12.259779] Memory create - strides [CORE:I][12.259764] Memory desc init by Stride [memory] [CORE:I][12.259767] Memory created [memory] [API:I][12.259790] Memory create [CORE:V0][12.259776] Memory desc init by tag [memory] [CORE:I][12.259779] Memory created [memory] [API:I][12.259802] Memory create [CORE:V0][12.259788] Memory desc init by tag [memory] [CORE:I][12.259793] Memory created [memory] [API:I][12.259814] Memory create - strides [CORE:I][12.259799] Memory desc init by Stride [memory] [CORE:I][12.259803] Memory created [memory] [API:I][12.259824] Memory create [CORE:V0][12.259809] Memory desc init by tag [memory] [CORE:I][12.259813] Memory created [memory] [API:I][12.245069] CPU Engine create [CORE:V0][12.259891] CPU Engine created [engine] [CORE:I][12.259895] CPU Engine 
created [cpu/engine] [API:I][12.245081] CPU Stream create [CORE:I][12.259503] CPU Stream created [stream] [CORE:V0][12.259502] CPU Stream created [cpu/stream] [API:I][12.245100] matmul desc create - no bias [CORE:I][12.259771] matmul desc init [matmul] [API:I][12.245121] matmul primitive_desc create - attr [PROF:I][12.259577] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00428,ms [API:I][12.245137] matmul primitive create [CORE:I][12.259767] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.259771] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.245045] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.246782] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.738ms graph_exe_count=-1 weight_address=0x70dd63fde040 [PROF:I][12.261353] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76313,ms [API:I][12.246916] matmul desc create - no bias [CORE:I][12.261586] matmul desc init [matmul] [API:I][12.246930] matmul primitive_desc create - attr [PROF:I][12.261383] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00174,ms [API:I][12.246943] matmul primitive create [CORE:I][12.261570] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.261573] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.246844] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.247307] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.462ms graph_exe_count=-1 weight_address=0x21175d40 [PROF:I][12.261877] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.482926,ms [API:I][12.247437] matmul desc create - no bias [CORE:I][12.262107] matmul desc init [matmul] [API:I][12.247446] matmul primitive_desc create - attr [PROF:I][12.261897] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00105,ms [API:I][12.247456] matmul primitive create [CORE:I][12.262081] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.262084] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.247355] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.247696] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.341ms graph_exe_count=-1 weight_address=0x22175d80 [PROF:I][12.262266] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.360052,ms [PROF:V0][12.247826] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.75781,ms [CORE:I][12.262250] CPU Stream deleted [stream] [CORE:I][12.262655] CPU Engine deleted [engine] [API:I][12.262821] Memory create [CORE:V0][12.262810] Memory desc init 
by tag [memory] [CORE:I][12.262816] Memory created [memory] [API:I][12.262838] Memory create - strides [CORE:I][12.262825] Memory desc init by Stride [memory] [CORE:I][12.262829] Memory created [memory] [API:I][12.262852] Memory create [CORE:V0][12.262837] Memory desc init by tag [memory] [CORE:I][12.262842] Memory created [memory] [API:I][12.262772] matmul desc create - no bias [CORE:I][12.262769] matmul desc init [matmul] [API:I][12.262785] matmul primitive_desc create - attr [PROF:I][12.262566] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00187,ms [API:I][12.262798] matmul primitive create [CORE:I][12.262753] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.262756] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.248028] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.249749] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.722ms graph_exe_count=-1 weight_address=0x70dd67fdf040 [PROF:I][12.264319] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.74259,ms [API:I][12.264692] Memory create [CORE:V0][12.264679] Memory desc init by tag [memory] [CORE:I][12.264684] Memory created [memory] [API:I][12.264705] Memory create - strides [CORE:I][12.264691] Memory desc init by Stride [memory] [CORE:I][12.264695] Memory created [memory] [API:I][12.264716] Memory create [CORE:V0][12.264701] Memory desc init by tag [memory] [CORE:I][12.264705] Memory created [memory] [API:I][12.264634] matmul desc create - no bias [CORE:I][12.264631] matmul desc init [matmul] 
[API:I][12.264649] matmul primitive_desc create - attr [PROF:I][12.264428] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00139,ms [API:I][12.264662] matmul primitive create [CORE:I][12.264615] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.264618] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.249889] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.256426] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.536ms graph_exe_count=-1 weight_address=0x70dd6bfe0040 [PROF:I][12.271011] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.5711,ms [API:I][12.271421] Memory create [CORE:V0][12.271409] Memory desc init by tag [memory] [CORE:I][12.271417] Memory created [memory] [API:I][12.271439] Memory create - strides [CORE:I][12.271426] Memory desc init by Stride [memory] [CORE:I][12.271430] Memory created [memory] [API:I][12.271452] Memory create [CORE:V0][12.271437] Memory desc init by tag [memory] [CORE:I][12.271441] Memory created [memory] [API:I][12.271465] Memory create [CORE:V0][12.271450] Memory desc init by tag [memory] [CORE:I][12.271456] Memory created [memory] [API:I][12.271483] Memory create [CORE:V0][12.271468] Memory desc init by tag [memory] [CORE:I][12.271473] Memory created [memory] [API:I][12.271406] matmul desc create - no bias [CORE:I][12.271405] matmul desc init [matmul] [API:I][12.271428] matmul primitive_desc create - attr [PROF:I][12.271212] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00304,ms [API:I][12.271445] matmul primitive create [CORE:I][12.271402] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.271406] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.256679] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.262939] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.262ms graph_exe_count=-1 weight_address=0x70dd79fe1040 [PROF:I][12.277527] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.30138,ms [API:I][12.277955] Memory create [CORE:V0][12.277943] Memory desc init by tag [memory] [CORE:I][12.277962] Memory created [memory] [API:I][12.277984] Memory create - strides [CORE:I][12.277970] Memory desc init by Stride [memory] [CORE:I][12.277974] Memory created [memory] [API:I][12.277995] Memory create [CORE:V0][12.277982] Memory desc init by tag [memory] [CORE:I][12.277988] Memory created [memory] [API:I][12.277922] matmul desc create - no bias [CORE:I][12.277920] matmul desc init [matmul] [API:I][12.277942] matmul primitive_desc create - attr [PROF:I][12.277726] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00313,ms [API:I][12.277958] matmul primitive create [CORE:I][12.277915] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.277919] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.263192] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.269313] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.122ms graph_exe_count=-1 weight_address=0x70dd87fe2040 [PROF:I][12.283898] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.15828,ms [API:I][12.284373] Memory create [CORE:V0][12.284364] Memory desc init by tag [memory] [CORE:I][12.284373] Memory created [memory] [API:I][12.284396] Memory create - strides [CORE:I][12.284382] Memory desc init by Stride [memory] [CORE:I][12.284386] Memory created [memory] [API:I][12.284408] Memory create [CORE:V0][12.284393] Memory desc init by tag [memory] [CORE:I][12.284400] Memory created [memory] [API:I][12.284422] Memory create [CORE:V0][12.284408] Memory desc init by tag [memory] [CORE:I][12.284412] Memory created [memory] [API:I][12.284433] Memory create - strides [CORE:I][12.284418] Memory desc init by Stride [memory] [CORE:I][12.284423] Memory created [memory] [API:I][12.284446] Memory create [CORE:V0][12.284432] Memory desc init by tag [memory] [CORE:I][12.284435] Memory created [memory] [API:I][12.284457] Memory create [CORE:V0][12.284443] Memory desc init by tag [memory] [CORE:I][12.284448] Memory created [memory] [API:I][12.284469] Memory create - strides [CORE:I][12.284454] Memory desc init by Stride [memory] [CORE:I][12.284458] Memory created [memory] [API:I][12.284479] Memory create [CORE:V0][12.284465] Memory desc init by tag [memory] [CORE:I][12.284468] Memory created [memory] [API:I][12.269725] CPU Engine create [CORE:V0][12.284547] CPU Engine created [engine] [CORE:I][12.284551] CPU Engine 
created [cpu/engine] [API:I][12.269736] CPU Stream create [CORE:I][12.284158] CPU Stream created [stream] [CORE:V0][12.284158] CPU Stream created [cpu/stream] [API:I][12.269758] matmul desc create - no bias [CORE:I][12.284429] matmul desc init [matmul] [API:I][12.269779] matmul primitive_desc create - attr [PROF:I][12.284237] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.0053,ms [API:I][12.269797] matmul primitive create [CORE:I][12.284426] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.284429] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.269705] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.271329] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.624ms graph_exe_count=-1 weight_address=0x70dd31fd9040 [PROF:I][12.285899] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.65065,ms [API:I][12.271462] matmul desc create - no bias [CORE:I][12.286132] matmul desc init [matmul] [API:I][12.271474] matmul primitive_desc create - attr [PROF:I][12.285927] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00178,ms [API:I][12.271486] matmul primitive create [CORE:I][12.286112] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.286115] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.271386] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.271800] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.415ms graph_exe_count=-1 weight_address=0x2317de40 [PROF:I][12.286370] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.434054,ms [API:I][12.271931] matmul desc create - no bias [CORE:I][12.286601] matmul desc init [matmul] [API:I][12.271940] matmul primitive_desc create - attr [PROF:I][12.286391] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00083,ms [API:I][12.271950] matmul primitive create [CORE:I][12.286576] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.286579] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.271849] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.272187] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.338ms graph_exe_count=-1 weight_address=0x2417de80 [PROF:I][12.286757] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.356462,ms [PROF:V0][12.272317] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.5918,ms [CORE:I][12.286741] CPU Stream deleted [stream] [CORE:I][12.287145] CPU Engine deleted [engine] [API:I][12.287310] Memory create [CORE:V0][12.287299] Memory desc init 
by tag [memory] [CORE:I][12.287304] Memory created [memory] [API:I][12.287326] Memory create - strides [CORE:I][12.287312] Memory desc init by Stride [memory] [CORE:I][12.287317] Memory created [memory] [API:I][12.287338] Memory create [CORE:V0][12.287323] Memory desc init by tag [memory] [CORE:I][12.287329] Memory created [memory] [API:I][12.287259] matmul desc create - no bias [CORE:I][12.287257] matmul desc init [matmul] [API:I][12.287274] matmul primitive_desc create - attr [PROF:I][12.287054] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00188,ms [API:I][12.287286] matmul primitive create [CORE:I][12.287241] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.287244] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.272516] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.274213] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.698ms graph_exe_count=-1 weight_address=0x70dd35fda040 [PROF:I][12.288784] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71944,ms [API:I][12.289150] Memory create [CORE:V0][12.289137] Memory desc init by tag [memory] [CORE:I][12.289141] Memory created [memory] [API:I][12.289163] Memory create - strides [CORE:I][12.289148] Memory desc init by Stride [memory] [CORE:I][12.289151] Memory created [memory] [API:I][12.289172] Memory create [CORE:V0][12.289158] Memory desc init by tag [memory] [CORE:I][12.289163] Memory created [memory] [API:I][12.289091] matmul desc create - no bias [CORE:I][12.289089] matmul desc init [matmul] 
[API:I][12.289104] matmul primitive_desc create - attr [PROF:I][12.288884] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00166,ms [API:I][12.289116] matmul primitive create [CORE:I][12.289071] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.289074] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.274347] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.280827] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.481ms graph_exe_count=-1 weight_address=0x70dd39fdb040 [PROF:I][12.295415] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.5186,ms [API:I][12.295831] Memory create [CORE:V0][12.295819] Memory desc init by tag [memory] [CORE:I][12.295828] Memory created [memory] [API:I][12.295850] Memory create - strides [CORE:I][12.295836] Memory desc init by Stride [memory] [CORE:I][12.295840] Memory created [memory] [API:I][12.295861] Memory create [CORE:V0][12.295846] Memory desc init by tag [memory] [CORE:I][12.295853] Memory created [memory] [API:I][12.295878] Memory create [CORE:V0][12.295863] Memory desc init by tag [memory] [CORE:I][12.295867] Memory created [memory] [API:I][12.295893] Memory create [CORE:V0][12.295878] Memory desc init by tag [memory] [CORE:I][12.295882] Memory created [memory] [API:I][12.295815] matmul desc create - no bias [CORE:I][12.295814] matmul desc init [matmul] [API:I][12.295835] matmul primitive_desc create - attr [PROF:I][12.295620] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003671,ms [API:I][12.295853] matmul primitive create [CORE:I][12.295810] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.295813] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.281098] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.287545] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.448ms graph_exe_count=-1 weight_address=0x70dd47fdc040 [PROF:I][12.302130] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.49628,ms [API:I][12.302557] Memory create [CORE:V0][12.302546] Memory desc init by tag [memory] [CORE:I][12.302554] Memory created [memory] [API:I][12.302576] Memory create - strides [CORE:I][12.302562] Memory desc init by Stride [memory] [CORE:I][12.302567] Memory created [memory] [API:I][12.302588] Memory create [CORE:V0][12.302573] Memory desc init by tag [memory] [CORE:I][12.302580] Memory created [memory] [API:I][12.302513] matmul desc create - no bias [CORE:I][12.302512] matmul desc init [matmul] [API:I][12.302533] matmul primitive_desc create - attr [PROF:I][12.302318] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00342,ms [API:I][12.302551] matmul primitive create [CORE:I][12.302509] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.302513] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.287786] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.294262] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.476ms graph_exe_count=-1 weight_address=0x70dd55fdd040 [PROF:I][12.308844] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.51211,ms [API:I][12.309316] Memory create [CORE:V0][12.309306] Memory desc init by tag [memory] [CORE:I][12.309315] Memory created [memory] [API:I][12.309336] Memory create - strides [CORE:I][12.309323] Memory desc init by Stride [memory] [CORE:I][12.309327] Memory created [memory] [API:I][12.309348] Memory create [CORE:V0][12.309334] Memory desc init by tag [memory] [CORE:I][12.309338] Memory created [memory] [API:I][12.309361] Memory create [CORE:V0][12.309346] Memory desc init by tag [memory] [CORE:I][12.309350] Memory created [memory] [API:I][12.309371] Memory create - strides [CORE:I][12.309356] Memory desc init by Stride [memory] [CORE:I][12.309360] Memory created [memory] [API:I][12.309382] Memory create [CORE:V0][12.309368] Memory desc init by tag [memory] [CORE:I][12.309373] Memory created [memory] [API:I][12.309395] Memory create [CORE:V0][12.309380] Memory desc init by tag [memory] [CORE:I][12.309385] Memory created [memory] [API:I][12.309406] Memory create - strides [CORE:I][12.309390] Memory desc init by Stride [memory] [CORE:I][12.309395] Memory created [memory] [API:I][12.309419] Memory create [CORE:V0][12.309404] Memory desc init by tag [memory] [CORE:I][12.309407] Memory created [memory] [API:I][12.294662] CPU Engine create [CORE:V0][12.309485] CPU Engine created [engine] [CORE:I][12.309489] CPU 
Engine created [cpu/engine] [API:I][12.294675] CPU Stream create [CORE:I][12.309096] CPU Stream created [stream] [CORE:V0][12.309095] CPU Stream created [cpu/stream] [API:I][12.294694] matmul desc create - no bias [CORE:I][12.309365] matmul desc init [matmul] [API:I][12.294712] matmul primitive_desc create - attr [PROF:I][12.309169] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.003551,ms [API:I][12.294729] matmul primitive create [CORE:I][12.309358] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.309362] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.294637] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.296922] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=2.287ms graph_exe_count=-1 weight_address=0x70dcfffd4040 [PROF:I][12.311493] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,2.31193,ms [API:I][12.297058] matmul desc create - no bias [CORE:I][12.311729] matmul desc init [matmul] [API:I][12.297071] matmul primitive_desc create - attr [PROF:I][12.311524] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00177,ms [API:I][12.297083] matmul primitive create [CORE:I][12.311709] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.311712] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.296983] Using AOCL GEMM 
API: aocl_gemm_f32f32f32of32 [PROF:V0][12.297445] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.462ms graph_exe_count=-1 weight_address=0x25185f40 [PROF:I][12.312015] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.481586,ms [API:I][12.297577] matmul desc create - no bias [CORE:I][12.312247] matmul desc init [matmul] [API:I][12.297586] matmul primitive_desc create - attr [PROF:I][12.312037] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00089,ms [API:I][12.297596] matmul primitive create [CORE:I][12.312221] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.312225] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.297495] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.297842] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.347ms graph_exe_count=-1 weight_address=0x26185f80 [PROF:I][12.312412] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.365782,ms [PROF:V0][12.297972] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,3.30884,ms [CORE:I][12.312396] CPU Stream deleted [stream] [CORE:I][12.312800] CPU Engine deleted [engine] [API:I][12.312980] Memory create [CORE:V0][12.312969] Memory desc 
init by tag [memory] [CORE:I][12.312975] Memory created [memory] [API:I][12.312996] Memory create - strides [CORE:I][12.312981] Memory desc init by Stride [memory] [CORE:I][12.312987] Memory created [memory] [API:I][12.313009] Memory create [CORE:V0][12.312994] Memory desc init by tag [memory] [CORE:I][12.312998] Memory created [memory] [API:I][12.312929] matmul desc create - no bias [CORE:I][12.312927] matmul desc init [matmul] [API:I][12.312944] matmul primitive_desc create - attr [PROF:I][12.312725] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.001951,ms [API:I][12.312957] matmul primitive create [CORE:I][12.312911] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.312915] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.298186] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.299864] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.679ms graph_exe_count=-1 weight_address=0x70dd03fd5040 [PROF:I][12.314435] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.69942,ms [API:I][12.314806] Memory create [CORE:V0][12.314793] Memory desc init by tag [memory] [CORE:I][12.314798] Memory created [memory] [API:I][12.314819] Memory create - strides [CORE:I][12.314805] Memory desc init by Stride [memory] [CORE:I][12.314809] Memory created [memory] [API:I][12.314831] Memory create [CORE:V0][12.314817] Memory desc init by tag [memory] [CORE:I][12.314821] Memory created [memory] [API:I][12.314751] matmul desc create - no bias [CORE:I][12.314749] matmul desc init [matmul] 
[API:I][12.314764] matmul primitive_desc create - attr [PROF:I][12.314544] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00149,ms [API:I][12.314776] matmul primitive create [CORE:I][12.314731] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.314734] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.300005] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.306419] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.413ms graph_exe_count=-1 weight_address=0x70dd07fd6040 [PROF:I][12.321007] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.45152,ms [API:I][12.321423] Memory create [CORE:V0][12.321412] Memory desc init by tag [memory] [CORE:I][12.321420] Memory created [memory] [API:I][12.321442] Memory create - strides [CORE:I][12.321429] Memory desc init by Stride [memory] [CORE:I][12.321432] Memory created [memory] [API:I][12.321453] Memory create [CORE:V0][12.321440] Memory desc init by tag [memory] [CORE:I][12.321445] Memory created [memory] [API:I][12.321472] Memory create [CORE:V0][12.321458] Memory desc init by tag [memory] [CORE:I][12.321464] Memory created [memory] [API:I][12.321487] Memory create [CORE:V0][12.321473] Memory desc init by tag [memory] [CORE:I][12.321477] Memory created [memory] [API:I][12.321414] matmul desc create - no bias [CORE:I][12.321413] matmul desc init [matmul] [API:I][12.321434] matmul primitive_desc create - attr [PROF:I][12.321220] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00369,ms [API:I][12.321453] matmul primitive create [CORE:I][12.321409] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.321413] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.306689] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.312974] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.288ms graph_exe_count=-1 weight_address=0x70dd15fd7040 [PROF:I][12.327559] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.32562,ms [API:I][12.327994] Memory create [CORE:V0][12.327983] Memory desc init by tag [memory] [CORE:I][12.327990] Memory created [memory] [API:I][12.328012] Memory create - strides [CORE:I][12.327999] Memory desc init by Stride [memory] [CORE:I][12.328003] Memory created [memory] [API:I][12.328024] Memory create [CORE:V0][12.328010] Memory desc init by tag [memory] [CORE:I][12.328014] Memory created [memory] [API:I][12.327947] matmul desc create - no bias [CORE:I][12.327945] matmul desc init [matmul] [API:I][12.327968] matmul primitive_desc create - attr [PROF:I][12.327753] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00334,ms [API:I][12.327985] matmul primitive create [CORE:I][12.327943] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.327946] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.313220] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.319297] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.078ms graph_exe_count=-1 weight_address=0x70dd23fd8040 [PROF:I][12.333882] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.11683,ms [API:I][12.334365] Memory create [CORE:V0][12.334355] Memory desc init by tag [memory] [CORE:I][12.334364] Memory created [memory] [API:I][12.334386] Memory create - strides [CORE:I][12.334372] Memory desc init by Stride [memory] [CORE:I][12.334376] Memory created [memory] [API:I][12.334398] Memory create [CORE:V0][12.334383] Memory desc init by tag [memory] [CORE:I][12.334387] Memory created [memory] [API:I][12.334411] Memory create [CORE:V0][12.334396] Memory desc init by tag [memory] [CORE:I][12.334400] Memory created [memory] [API:I][12.334422] Memory create - strides [CORE:I][12.334407] Memory desc init by Stride [memory] [CORE:I][12.334411] Memory created [memory] [API:I][12.334435] Memory create [CORE:V0][12.334420] Memory desc init by tag [memory] [CORE:I][12.334424] Memory created [memory] [API:I][12.334446] Memory create [CORE:V0][12.334434] Memory desc init by tag [memory] [CORE:I][12.334438] Memory created [memory] [API:I][12.334459] Memory create - strides [CORE:I][12.334445] Memory desc init by Stride [memory] [CORE:I][12.334449] Memory created [memory] [API:I][12.334472] Memory create [CORE:V0][12.334457] Memory desc init by tag [memory] [CORE:I][12.334461] Memory created [memory] [API:I][12.319717] CPU Engine create [CORE:V0][12.334539] CPU Engine created [engine] [CORE:I][12.334543] CPU Engine 
created [cpu/engine] [API:I][12.319729] CPU Stream create [CORE:I][12.334152] CPU Stream created [stream] [CORE:V0][12.334152] CPU Stream created [cpu/stream] [API:I][12.319752] matmul desc create - no bias [CORE:I][12.334423] matmul desc init [matmul] [API:I][12.319772] matmul primitive_desc create - attr [PROF:I][12.334229] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00344,ms [API:I][12.319789] matmul primitive create [CORE:I][12.334417] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.334421] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.319696] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.321464] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.768ms graph_exe_count=-1 weight_address=0x70dccdfcf040 [PROF:I][12.336035] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.7939,ms [API:I][12.321598] matmul desc create - no bias [CORE:I][12.336269] matmul desc init [matmul] [API:I][12.321611] matmul primitive_desc create - attr [PROF:I][12.336064] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00126,ms [API:I][12.321623] matmul primitive create [CORE:I][12.336249] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.336252] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.321523] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.321989] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.467ms graph_exe_count=-1 weight_address=0x2718e040 [PROF:I][12.336560] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.486806,ms [API:I][12.322121] matmul desc create - no bias [CORE:I][12.336791] matmul desc init [matmul] [API:I][12.322130] matmul primitive_desc create - attr [PROF:I][12.336581] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00088,ms [API:I][12.322140] matmul primitive create [CORE:I][12.336765] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.336769] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.322039] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.322364] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.325ms graph_exe_count=-1 weight_address=0x2818e080 [PROF:I][12.336934] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.344151,ms [PROF:V0][12.322494] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.77783,ms [CORE:I][12.336918] CPU Stream deleted [stream] [CORE:I][12.337323] CPU Engine deleted [engine] [API:I][12.337495] Memory create [CORE:V0][12.337483] Memory desc init 
by tag [memory] [CORE:I][12.337490] Memory created [memory] [API:I][12.337511] Memory create - strides [CORE:I][12.337497] Memory desc init by Stride [memory] [CORE:I][12.337501] Memory created [memory] [API:I][12.337523] Memory create [CORE:V0][12.337509] Memory desc init by tag [memory] [CORE:I][12.337514] Memory created [memory] [API:I][12.337445] matmul desc create - no bias [CORE:I][12.337443] matmul desc init [matmul] [API:I][12.337460] matmul primitive_desc create - attr [PROF:I][12.337241] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00182,ms [API:I][12.337474] matmul primitive create [CORE:I][12.337428] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.337431] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.322702] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.324391] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.689ms graph_exe_count=-1 weight_address=0x70dcd1fd0040 [PROF:I][12.338961] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.70953,ms [API:I][12.339326] Memory create [CORE:V0][12.339313] Memory desc init by tag [memory] [CORE:I][12.339318] Memory created [memory] [API:I][12.339339] Memory create - strides [CORE:I][12.339325] Memory desc init by Stride [memory] [CORE:I][12.339330] Memory created [memory] [API:I][12.339351] Memory create [CORE:V0][12.339338] Memory desc init by tag [memory] [CORE:I][12.339344] Memory created [memory] [API:I][12.339273] matmul desc create - no bias [CORE:I][12.339270] matmul desc init [matmul] 
[API:I][12.339286] matmul primitive_desc create - attr [PROF:I][12.339066] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00137,ms [API:I][12.339298] matmul primitive create [CORE:I][12.339253] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.339257] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.324528] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.330895] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.367ms graph_exe_count=-1 weight_address=0x70dcd5fd1040 [PROF:I][12.345480] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.40325,ms [API:I][12.345888] Memory create [CORE:V0][12.345876] Memory desc init by tag [memory] [CORE:I][12.345885] Memory created [memory] [API:I][12.345909] Memory create - strides [CORE:I][12.345896] Memory desc init by Stride [memory] [CORE:I][12.345898] Memory created [memory] [API:I][12.345921] Memory create [CORE:V0][12.345908] Memory desc init by tag [memory] [CORE:I][12.345913] Memory created [memory] [API:I][12.345938] Memory create [CORE:V0][12.345923] Memory desc init by tag [memory] [CORE:I][12.345928] Memory created [memory] [API:I][12.345954] Memory create [CORE:V0][12.345940] Memory desc init by tag [memory] [CORE:I][12.345944] Memory created [memory] [API:I][12.345887] matmul desc create - no bias [CORE:I][12.345886] matmul desc init [matmul] [API:I][12.345908] matmul primitive_desc create - attr [PROF:I][12.345693] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00372,ms [API:I][12.345926] matmul primitive create [CORE:I][12.345882] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.345886] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.331161] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.337339] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.179ms graph_exe_count=-1 weight_address=0x70dce3fd2040 [PROF:I][12.351924] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.21778,ms [API:I][12.352346] Memory create [CORE:V0][12.352334] Memory desc init by tag [memory] [CORE:I][12.352342] Memory created [memory] [API:I][12.352363] Memory create - strides [CORE:I][12.352349] Memory desc init by Stride [memory] [CORE:I][12.352353] Memory created [memory] [API:I][12.352373] Memory create [CORE:V0][12.352359] Memory desc init by tag [memory] [CORE:I][12.352364] Memory created [memory] [API:I][12.352297] matmul desc create - no bias [CORE:I][12.352295] matmul desc init [matmul] [API:I][12.352318] matmul primitive_desc create - attr [PROF:I][12.352102] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00331,ms [API:I][12.352335] matmul primitive create [CORE:I][12.352293] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.352296] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.337570] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.343798] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.23ms graph_exe_count=-1 weight_address=0x70dcf1fd3040 [PROF:I][12.358385] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.26839,ms [API:I][12.358865] Memory create [CORE:V0][12.358855] Memory desc init by tag [memory] [CORE:I][12.358865] Memory created [memory] [API:I][12.358887] Memory create - strides [CORE:I][12.358874] Memory desc init by Stride [memory] [CORE:I][12.358879] Memory created [memory] [API:I][12.358902] Memory create [CORE:V0][12.358888] Memory desc init by tag [memory] [CORE:I][12.358892] Memory created [memory] [API:I][12.358914] Memory create [CORE:V0][12.358900] Memory desc init by tag [memory] [CORE:I][12.358904] Memory created [memory] [API:I][12.358925] Memory create - strides [CORE:I][12.358910] Memory desc init by Stride [memory] [CORE:I][12.358916] Memory created [memory] [API:I][12.358938] Memory create [CORE:V0][12.358924] Memory desc init by tag [memory] [CORE:I][12.358929] Memory created [memory] [API:I][12.358951] Memory create [CORE:V0][12.358936] Memory desc init by tag [memory] [CORE:I][12.358940] Memory created [memory] [API:I][12.358962] Memory create - strides [CORE:I][12.358946] Memory desc init by Stride [memory] [CORE:I][12.358959] Memory created [memory] [API:I][12.358980] Memory create [CORE:V0][12.358966] Memory desc init by tag [memory] [CORE:I][12.358970] Memory created [memory] [API:I][12.344226] CPU Engine create [CORE:V0][12.359048] CPU Engine created [engine] [CORE:I][12.359052] CPU Engine 
created [cpu/engine] [API:I][12.344238] CPU Stream create [CORE:I][12.358659] CPU Stream created [stream] [CORE:V0][12.358659] CPU Stream created [cpu/stream] [API:I][12.344258] matmul desc create - no bias [CORE:I][12.358928] matmul desc init [matmul] [API:I][12.344278] matmul primitive_desc create - attr [PROF:I][12.358735] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.003861,ms [API:I][12.344295] matmul primitive create [CORE:I][12.358925] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.358928] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.344203] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.345983] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.78ms graph_exe_count=-1 weight_address=0x70dc9bfca040 [PROF:I][12.360553] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.80616,ms [API:I][12.346116] matmul desc create - no bias [CORE:I][12.360786] matmul desc init [matmul] [API:I][12.346128] matmul primitive_desc create - attr [PROF:I][12.360580] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00134,ms [API:I][12.346139] matmul primitive create [CORE:I][12.360765] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.360769] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.346040] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.346507] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.467ms graph_exe_count=-1 weight_address=0x29196140 [PROF:I][12.361077] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.487496,ms [API:I][12.346637] matmul desc create - no bias [CORE:I][12.361307] matmul desc init [matmul] [API:I][12.346645] matmul primitive_desc create - attr [PROF:I][12.361097] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00092,ms [API:I][12.346655] matmul primitive create [CORE:I][12.361281] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.361285] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.346555] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.346889] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.334ms graph_exe_count=-1 weight_address=0x2a196180 [PROF:I][12.361459] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.353281,ms [PROF:V0][12.347020] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.79492,ms [CORE:I][12.361444] CPU Stream deleted [stream] [CORE:I][12.361848] CPU Engine deleted [engine] [API:I][12.362023] Memory create [CORE:V0][12.362012] Memory desc init 
by tag [memory] [CORE:I][12.362018] Memory created [memory] [API:I][12.362039] Memory create - strides [CORE:I][12.362025] Memory desc init by Stride [memory] [CORE:I][12.362029] Memory created [memory] [API:I][12.362052] Memory create [CORE:V0][12.362037] Memory desc init by tag [memory] [CORE:I][12.362042] Memory created [memory] [API:I][12.361974] matmul desc create - no bias [CORE:I][12.361972] matmul desc init [matmul] [API:I][12.361989] matmul primitive_desc create - attr [PROF:I][12.361771] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.0019,ms [API:I][12.362003] matmul primitive create [CORE:I][12.361958] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.361962] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.347233] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.348945] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.712ms graph_exe_count=-1 weight_address=0x70dc9ffcb040 [PROF:I][12.363516] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.73389,ms [API:I][12.363887] Memory create [CORE:V0][12.363874] Memory desc init by tag [memory] [CORE:I][12.363879] Memory created [memory] [API:I][12.363900] Memory create - strides [CORE:I][12.363886] Memory desc init by Stride [memory] [CORE:I][12.363891] Memory created [memory] [API:I][12.363914] Memory create [CORE:V0][12.363899] Memory desc init by tag [memory] [CORE:I][12.363902] Memory created [memory] [API:I][12.363832] matmul desc create - no bias [CORE:I][12.363830] matmul desc init [matmul] 
[API:I][12.363846] matmul primitive_desc create - attr [PROF:I][12.363626] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.0015,ms [API:I][12.363859] matmul primitive create [CORE:I][12.363814] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.363818] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.349098] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.355499] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.401ms graph_exe_count=-1 weight_address=0x70dca3fcc040 [PROF:I][12.370084] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.4457,ms [API:I][12.370499] Memory create [CORE:V0][12.370487] Memory desc init by tag [memory] [CORE:I][12.370497] Memory created [memory] [API:I][12.370519] Memory create - strides [CORE:I][12.370506] Memory desc init by Stride [memory] [CORE:I][12.370509] Memory created [memory] [API:I][12.370530] Memory create [CORE:V0][12.370517] Memory desc init by tag [memory] [CORE:I][12.370522] Memory created [memory] [API:I][12.370549] Memory create [CORE:V0][12.370534] Memory desc init by tag [memory] [CORE:I][12.370539] Memory created [memory] [API:I][12.370563] Memory create [CORE:V0][12.370548] Memory desc init by tag [memory] [CORE:I][12.370552] Memory created [memory] [API:I][12.370486] matmul desc create - no bias [CORE:I][12.370485] matmul desc init [matmul] [API:I][12.370508] matmul primitive_desc create - attr [PROF:I][12.370293] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00332,ms [API:I][12.370526] matmul primitive create [CORE:I][12.370482] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.370485] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.355760] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.362038] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.28ms graph_exe_count=-1 weight_address=0x70dcb1fcd040 [PROF:I][12.376624] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.31876,ms [API:I][12.377058] Memory create [CORE:V0][12.377046] Memory desc init by tag [memory] [CORE:I][12.377054] Memory created [memory] [API:I][12.377076] Memory create - strides [CORE:I][12.377064] Memory desc init by Stride [memory] [CORE:I][12.377068] Memory created [memory] [API:I][12.377089] Memory create [CORE:V0][12.377074] Memory desc init by tag [memory] [CORE:I][12.377081] Memory created [memory] [API:I][12.377014] matmul desc create - no bias [CORE:I][12.377013] matmul desc init [matmul] [API:I][12.377035] matmul primitive_desc create - attr [PROF:I][12.376820] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.003441,ms [API:I][12.377053] matmul primitive create [CORE:I][12.377011] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.377015] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.362288] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.368690] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.402ms graph_exe_count=-1 weight_address=0x70dcbffce040 [PROF:I][12.383274] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.43967,ms [API:I][12.383753] Memory create [CORE:V0][12.383744] Memory desc init by tag [memory] [CORE:I][12.383753] Memory created [memory] [API:I][12.383775] Memory create - strides [CORE:I][12.383763] Memory desc init by Stride [memory] [CORE:I][12.383767] Memory created [memory] [API:I][12.383789] Memory create [CORE:V0][12.383774] Memory desc init by tag [memory] [CORE:I][12.383779] Memory created [memory] [API:I][12.383802] Memory create [CORE:V0][12.383787] Memory desc init by tag [memory] [CORE:I][12.383791] Memory created [memory] [API:I][12.383812] Memory create - strides [CORE:I][12.383798] Memory desc init by Stride [memory] [CORE:I][12.383803] Memory created [memory] [API:I][12.383828] Memory create [CORE:V0][12.383813] Memory desc init by tag [memory] [CORE:I][12.383817] Memory created [memory] [API:I][12.383840] Memory create [CORE:V0][12.383825] Memory desc init by tag [memory] [CORE:I][12.383829] Memory created [memory] [API:I][12.383850] Memory create - strides [CORE:I][12.383836] Memory desc init by Stride [memory] [CORE:I][12.383841] Memory created [memory] [API:I][12.383864] Memory create [CORE:V0][12.383849] Memory desc init by tag [memory] [CORE:I][12.383853] Memory created [memory] [API:I][12.369110] CPU Engine create [CORE:V0][12.383932] CPU Engine created [engine] [CORE:I][12.383936] CPU Engine 
created [cpu/engine] [API:I][12.369123] CPU Stream create [CORE:I][12.383545] CPU Stream created [stream] [CORE:V0][12.383545] CPU Stream created [cpu/stream] [API:I][12.369144] matmul desc create - no bias [CORE:I][12.383814] matmul desc init [matmul] [API:I][12.369163] matmul primitive_desc create - attr [PROF:I][12.383620] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00385,ms [API:I][12.369180] matmul primitive create [CORE:I][12.383811] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.383814] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.369099] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.370647] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.549ms graph_exe_count=-1 weight_address=0x70de2bff2040 [PROF:I][12.385219] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.58661,ms [API:I][12.370781] matmul desc create - no bias [CORE:I][12.385452] matmul desc init [matmul] [API:I][12.370794] matmul primitive_desc create - attr [PROF:I][12.385248] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00138,ms [API:I][12.370807] matmul primitive create [CORE:I][12.385434] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.385437] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.370708] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.371178] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.469ms graph_exe_count=-1 weight_address=0x19155940 [PROF:I][12.385747] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.489396,ms [API:I][12.371308] matmul desc create - no bias [CORE:I][12.385978] matmul desc init [matmul] [API:I][12.371317] matmul primitive_desc create - attr [PROF:I][12.385768] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00094,ms [API:I][12.371327] matmul primitive create [CORE:I][12.385953] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.385957] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.371228] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.371552] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.325ms graph_exe_count=-1 weight_address=0x1a155980 [PROF:I][12.386123] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.345162,ms [PROF:V0][12.371683] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.57178,ms [CORE:I][12.386107] CPU Stream deleted [stream] [CORE:I][12.386512] CPU Engine deleted [engine] [API:I][12.386687] Memory create [CORE:V0][12.386676] Memory desc init 
by tag [memory] [CORE:I][12.386682] Memory created [memory] [API:I][12.386703] Memory create - strides [CORE:I][12.386689] Memory desc init by Stride [memory] [CORE:I][12.386692] Memory created [memory] [API:I][12.386714] Memory create [CORE:V0][12.386699] Memory desc init by tag [memory] [CORE:I][12.386703] Memory created [memory] [API:I][12.386635] matmul desc create - no bias [CORE:I][12.386633] matmul desc init [matmul] [API:I][12.386649] matmul primitive_desc create - attr [PROF:I][12.386430] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00186,ms [API:I][12.386662] matmul primitive create [CORE:I][12.386617] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.386621] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.371892] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.373590] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.698ms graph_exe_count=-1 weight_address=0x70de2fff3040 [PROF:I][12.388160] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71854,ms [API:I][12.388526] Memory create [CORE:V0][12.388513] Memory desc init by tag [memory] [CORE:I][12.388517] Memory created [memory] [API:I][12.388539] Memory create - strides [CORE:I][12.388524] Memory desc init by Stride [memory] [CORE:I][12.388527] Memory created [memory] [API:I][12.388548] Memory create [CORE:V0][12.388534] Memory desc init by tag [memory] [CORE:I][12.388539] Memory created [memory] [API:I][12.388467] matmul desc create - no bias [CORE:I][12.388465] matmul desc init [matmul] 
[API:I][12.388481] matmul primitive_desc create - attr [PROF:I][12.388260] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00143,ms [API:I][12.388492] matmul primitive create [CORE:I][12.388448] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.388451] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.373722] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.380073] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.35ms graph_exe_count=-1 weight_address=0x70de33ff4040 [PROF:I][12.394666] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.39349,ms [API:I][12.395082] Memory create [CORE:V0][12.395070] Memory desc init by tag [memory] [CORE:I][12.395078] Memory created [memory] [API:I][12.395100] Memory create - strides [CORE:I][12.395086] Memory desc init by Stride [memory] [CORE:I][12.395090] Memory created [memory] [API:I][12.395110] Memory create [CORE:V0][12.395095] Memory desc init by tag [memory] [CORE:I][12.395100] Memory created [memory] [API:I][12.395124] Memory create [CORE:V0][12.395110] Memory desc init by tag [memory] [CORE:I][12.395115] Memory created [memory] [API:I][12.395141] Memory create [CORE:V0][12.395127] Memory desc init by tag [memory] [CORE:I][12.395131] Memory created [memory] [API:I][12.395066] matmul desc create - no bias [CORE:I][12.395064] matmul desc init [matmul] [API:I][12.395084] matmul primitive_desc create - attr [PROF:I][12.394870] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003681,ms [API:I][12.395103] matmul primitive create [CORE:I][12.395060] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.395063] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.380338] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.386387] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.051ms graph_exe_count=-1 weight_address=0x70de41ff5040 [PROF:I][12.400973] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.09004,ms [API:I][12.401401] Memory create [CORE:V0][12.401389] Memory desc init by tag [memory] [CORE:I][12.401397] Memory created [memory] [API:I][12.401419] Memory create - strides [CORE:I][12.401406] Memory desc init by Stride [memory] [CORE:I][12.401409] Memory created [memory] [API:I][12.401431] Memory create [CORE:V0][12.401417] Memory desc init by tag [memory] [CORE:I][12.401422] Memory created [memory] [API:I][12.401356] matmul desc create - no bias [CORE:I][12.401355] matmul desc init [matmul] [API:I][12.401378] matmul primitive_desc create - attr [PROF:I][12.401162] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00349,ms [API:I][12.401395] matmul primitive create [CORE:I][12.401353] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.401357] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.386632] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.392625] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=5.993ms graph_exe_count=-1 weight_address=0x70e0ddfff040 [PROF:I][12.407212] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.0357,ms [API:I][12.407699] Memory create [CORE:V0][12.407689] Memory desc init by tag [memory] [CORE:I][12.407699] Memory created [memory] [API:I][12.407721] Memory create - strides [CORE:I][12.407709] Memory desc init by Stride [memory] [CORE:I][12.407712] Memory created [memory] [API:I][12.407733] Memory create [CORE:V0][12.407720] Memory desc init by tag [memory] [CORE:I][12.407727] Memory created [memory] [API:I][12.407750] Memory create [CORE:V0][12.407736] Memory desc init by tag [memory] [CORE:I][12.407740] Memory created [memory] [API:I][12.407761] Memory create - strides [CORE:I][12.407748] Memory desc init by Stride [memory] [CORE:I][12.407751] Memory created [memory] [API:I][12.407776] Memory create [CORE:V0][12.407762] Memory desc init by tag [memory] [CORE:I][12.407766] Memory created [memory] [API:I][12.407788] Memory create [CORE:V0][12.407775] Memory desc init by tag [memory] [CORE:I][12.407780] Memory created [memory] [API:I][12.407801] Memory create - strides [CORE:I][12.407787] Memory desc init by Stride [memory] [CORE:I][12.407791] Memory created [memory] [API:I][12.407813] Memory create [CORE:V0][12.407799] Memory desc init by tag [memory] [CORE:I][12.407803] Memory created [memory] [API:I][12.393059] CPU Engine create [CORE:V0][12.407881] CPU Engine created [engine] [CORE:I][12.407885] CPU 
Engine created [cpu/engine] [API:I][12.393070] CPU Stream create [CORE:I][12.407492] CPU Stream created [stream] [CORE:V0][12.407491] CPU Stream created [cpu/stream] [API:I][12.393090] matmul desc create - no bias [CORE:I][12.407761] matmul desc init [matmul] [API:I][12.393111] matmul primitive_desc create - attr [PROF:I][12.407568] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00427,ms [API:I][12.393128] matmul primitive create [CORE:I][12.407758] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.407762] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.393037] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.394746] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.71ms graph_exe_count=-1 weight_address=0x70e0abffa040 [PROF:I][12.409317] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.73682,ms [API:I][12.394880] matmul desc create - no bias [CORE:I][12.409550] matmul desc init [matmul] [API:I][12.394892] matmul primitive_desc create - attr [PROF:I][12.409345] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00167,ms [API:I][12.394904] matmul primitive create [CORE:I][12.409530] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.409533] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.394804] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.395270] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.465ms graph_exe_count=-1 weight_address=0x2b1a62c0 [PROF:I][12.409839] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.484876,ms [API:I][12.395400] matmul desc create - no bias [CORE:I][12.410071] matmul desc init [matmul] [API:I][12.395409] matmul primitive_desc create - attr [PROF:I][12.409861] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00104,ms [API:I][12.395419] matmul primitive create [CORE:I][12.410045] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.410048] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.395319] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.395643] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.325ms graph_exe_count=-1 weight_address=0x2c1a6300 [PROF:I][12.410213] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.343341,ms [PROF:V0][12.395773] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.71484,ms [CORE:I][12.410198] CPU Stream deleted [stream] [CORE:I][12.410602] CPU Engine deleted [engine] [API:I][12.410764] Memory create [CORE:V0][12.410753] Memory desc init 
by tag [memory] [CORE:I][12.410759] Memory created [memory] [API:I][12.410780] Memory create - strides [CORE:I][12.410766] Memory desc init by Stride [memory] [CORE:I][12.410769] Memory created [memory] [API:I][12.410790] Memory create [CORE:V0][12.410777] Memory desc init by tag [memory] [CORE:I][12.410781] Memory created [memory] [API:I][12.410711] matmul desc create - no bias [CORE:I][12.410708] matmul desc init [matmul] [API:I][12.410727] matmul primitive_desc create - attr [PROF:I][12.410508] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00223,ms [API:I][12.410741] matmul primitive create [CORE:I][12.410696] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.410699] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.395970] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.397754] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.784ms graph_exe_count=-1 weight_address=0x70e0afffb040 [PROF:I][12.412324] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.8044,ms [API:I][12.412695] Memory create [CORE:V0][12.412682] Memory desc init by tag [memory] [CORE:I][12.412687] Memory created [memory] [API:I][12.412708] Memory create - strides [CORE:I][12.412694] Memory desc init by Stride [memory] [CORE:I][12.412698] Memory created [memory] [API:I][12.412721] Memory create [CORE:V0][12.412707] Memory desc init by tag [memory] [CORE:I][12.412711] Memory created [memory] [API:I][12.412641] matmul desc create - no bias [CORE:I][12.412638] matmul desc init [matmul] 
[API:I][12.412654] matmul primitive_desc create - attr [PROF:I][12.412435] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.001641,ms [API:I][12.412668] matmul primitive create [CORE:I][12.412623] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.412626] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.397897] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.404231] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.33301ms graph_exe_count=-1 weight_address=0x70e0b3ffc040 [PROF:I][12.418815] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.36746,ms [API:I][12.419221] Memory create [CORE:V0][12.419210] Memory desc init by tag [memory] [CORE:I][12.419219] Memory created [memory] [API:I][12.419241] Memory create - strides [CORE:I][12.419227] Memory desc init by Stride [memory] [CORE:I][12.419230] Memory created [memory] [API:I][12.419251] Memory create [CORE:V0][12.419237] Memory desc init by tag [memory] [CORE:I][12.419242] Memory created [memory] [API:I][12.419266] Memory create [CORE:V0][12.419251] Memory desc init by tag [memory] [CORE:I][12.419256] Memory created [memory] [API:I][12.419282] Memory create [CORE:V0][12.419268] Memory desc init by tag [memory] [CORE:I][12.419272] Memory created [memory] [API:I][12.419205] matmul desc create - no bias [CORE:I][12.419205] matmul desc init [matmul] [API:I][12.419223] matmul primitive_desc create - attr [PROF:I][12.419009] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.002931,ms [API:I][12.419242] matmul primitive create [CORE:I][12.419199] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.419203] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.404478] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.410901] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.425ms graph_exe_count=-1 weight_address=0x70e0c1ffd040 [PROF:I][12.425484] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.46145,ms [API:I][12.425900] Memory create [CORE:V0][12.425888] Memory desc init by tag [memory] [CORE:I][12.425896] Memory created [memory] [API:I][12.425918] Memory create - strides [CORE:I][12.425905] Memory desc init by Stride [memory] [CORE:I][12.425909] Memory created [memory] [API:I][12.425931] Memory create [CORE:V0][12.425916] Memory desc init by tag [memory] [CORE:I][12.425920] Memory created [memory] [API:I][12.425853] matmul desc create - no bias [CORE:I][12.425851] matmul desc init [matmul] [API:I][12.425879] matmul primitive_desc create - attr [PROF:I][12.425663] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00335,ms [API:I][12.425895] matmul primitive create [CORE:I][12.425852] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.425856] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.411128] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.417588] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.46ms graph_exe_count=-1 weight_address=0x70e0cfffe040 [PROF:I][12.432174] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.49828,ms [API:I][12.432658] Memory create [CORE:V0][12.432648] Memory desc init by tag [memory] [CORE:I][12.432657] Memory created [memory] [API:I][12.432679] Memory create - strides [CORE:I][12.432665] Memory desc init by Stride [memory] [CORE:I][12.432668] Memory created [memory] [API:I][12.432690] Memory create [CORE:V0][12.432675] Memory desc init by tag [memory] [CORE:I][12.432680] Memory created [memory] [API:I][12.432702] Memory create [CORE:V0][12.432688] Memory desc init by tag [memory] [CORE:I][12.432692] Memory created [memory] [API:I][12.432714] Memory create - strides [CORE:I][12.432699] Memory desc init by Stride [memory] [CORE:I][12.432703] Memory created [memory] [API:I][12.432726] Memory create [CORE:V0][12.432711] Memory desc init by tag [memory] [CORE:I][12.432717] Memory created [memory] [API:I][12.432739] Memory create [CORE:V0][12.432724] Memory desc init by tag [memory] [CORE:I][12.432728] Memory created [memory] [API:I][12.432750] Memory create - strides [CORE:I][12.432735] Memory desc init by Stride [memory] [CORE:I][12.432739] Memory created [memory] [API:I][12.432760] Memory create [CORE:V0][12.432745] Memory desc init by tag [memory] [CORE:I][12.432750] Memory created [memory] [API:I][12.418005] CPU Engine create [CORE:V0][12.432828] CPU Engine created [engine] [CORE:I][12.432833] CPU 
Engine created [cpu/engine] [API:I][12.418019] CPU Stream create [CORE:I][12.432441] CPU Stream created [stream] [CORE:V0][12.432441] CPU Stream created [cpu/stream] [API:I][12.418040] matmul desc create - no bias [CORE:I][12.432711] matmul desc init [matmul] [API:I][12.418062] matmul primitive_desc create - attr [PROF:I][12.432521] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00471,ms [API:I][12.418082] matmul primitive create [CORE:I][12.432711] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.432715] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.417989] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.419720] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.732ms graph_exe_count=-1 weight_address=0x70e079ff5040 [PROF:I][12.434292] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.75884,ms [API:I][12.419855] matmul desc create - no bias [CORE:I][12.434526] matmul desc init [matmul] [API:I][12.419867] matmul primitive_desc create - attr [PROF:I][12.434320] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00142,ms [API:I][12.419880] matmul primitive create [CORE:I][12.434506] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.434509] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.419780] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.420233] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.453ms graph_exe_count=-1 weight_address=0x2d1ae3c0 [PROF:I][12.434802] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.471945,ms [API:I][12.420364] matmul desc create - no bias [CORE:I][12.435033] matmul desc init [matmul] [API:I][12.420372] matmul primitive_desc create - attr [PROF:I][12.434825] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00147,ms [API:I][12.420384] matmul primitive create [CORE:I][12.435009] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.435012] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.420283] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.420605] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.321ms graph_exe_count=-1 weight_address=0x2e1ae400 [PROF:I][12.435174] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.340271,ms [PROF:V0][12.420734] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.729,ms [CORE:I][12.435158] CPU Stream deleted [stream] [CORE:I][12.435562] CPU Engine deleted [engine] [API:I][12.435727] Memory create [CORE:V0][12.435716] Memory desc init 
by tag [memory] [CORE:I][12.435722] Memory created [memory] [API:I][12.435744] Memory create - strides [CORE:I][12.435731] Memory desc init by Stride [memory] [CORE:I][12.435735] Memory created [memory] [API:I][12.435757] Memory create [CORE:V0][12.435742] Memory desc init by tag [memory] [CORE:I][12.435746] Memory created [memory] [API:I][12.435676] matmul desc create - no bias [CORE:I][12.435674] matmul desc init [matmul] [API:I][12.435690] matmul primitive_desc create - attr [PROF:I][12.435471] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00189,ms [API:I][12.435703] matmul primitive create [CORE:I][12.435658] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.435662] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.420934] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.422633] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.699ms graph_exe_count=-1 weight_address=0x70e07dff6040 [PROF:I][12.437204] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.72188,ms [API:I][12.437571] Memory create [CORE:V0][12.437558] Memory desc init by tag [memory] [CORE:I][12.437563] Memory created [memory] [API:I][12.437585] Memory create - strides [CORE:I][12.437570] Memory desc init by Stride [memory] [CORE:I][12.437575] Memory created [memory] [API:I][12.437596] Memory create [CORE:V0][12.437582] Memory desc init by tag [memory] [CORE:I][12.437587] Memory created [memory] [API:I][12.437516] matmul desc create - no bias [CORE:I][12.437514] matmul desc init [matmul] 
[API:I][12.437531] matmul primitive_desc create - attr [PROF:I][12.437311] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00148,ms [API:I][12.437543] matmul primitive create [CORE:I][12.437497] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.437501] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.422771] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.429298] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.527ms graph_exe_count=-1 weight_address=0x70e081ff7040 [PROF:I][12.443883] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.56126,ms [API:I][12.444291] Memory create [CORE:V0][12.444279] Memory desc init by tag [memory] [CORE:I][12.444286] Memory created [memory] [API:I][12.444308] Memory create - strides [CORE:I][12.444294] Memory desc init by Stride [memory] [CORE:I][12.444298] Memory created [memory] [API:I][12.444320] Memory create [CORE:V0][12.444308] Memory desc init by tag [memory] [CORE:I][12.444315] Memory created [memory] [API:I][12.444339] Memory create [CORE:V0][12.444324] Memory desc init by tag [memory] [CORE:I][12.444328] Memory created [memory] [API:I][12.444355] Memory create [CORE:V0][12.444340] Memory desc init by tag [memory] [CORE:I][12.444345] Memory created [memory] [API:I][12.444279] matmul desc create - no bias [CORE:I][12.444277] matmul desc init [matmul] [API:I][12.444297] matmul primitive_desc create - attr [PROF:I][12.444082] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003881,ms [API:I][12.444316] matmul primitive create [CORE:I][12.444273] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.444276] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.429551] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.435876] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.328ms graph_exe_count=-1 weight_address=0x70e08fff8040 [PROF:I][12.450462] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.36513,ms [API:I][12.450881] Memory create [CORE:V0][12.450869] Memory desc init by tag [memory] [CORE:I][12.450878] Memory created [memory] [API:I][12.450900] Memory create - strides [CORE:I][12.450886] Memory desc init by Stride [memory] [CORE:I][12.450890] Memory created [memory] [API:I][12.450911] Memory create [CORE:V0][12.450896] Memory desc init by tag [memory] [CORE:I][12.450902] Memory created [memory] [API:I][12.450835] matmul desc create - no bias [CORE:I][12.450834] matmul desc init [matmul] [API:I][12.450856] matmul primitive_desc create - attr [PROF:I][12.450640] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00313,ms [API:I][12.450883] matmul primitive create [CORE:I][12.450841] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.450845] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.436118] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.442405] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.288ms graph_exe_count=-1 weight_address=0x70e09dff9040 [PROF:I][12.456991] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.32587,ms [API:I][12.457475] Memory create [CORE:V0][12.457465] Memory desc init by tag [memory] [CORE:I][12.457473] Memory created [memory] [API:I][12.457495] Memory create - strides [CORE:I][12.457481] Memory desc init by Stride [memory] [CORE:I][12.457485] Memory created [memory] [API:I][12.457506] Memory create [CORE:V0][12.457492] Memory desc init by tag [memory] [CORE:I][12.457496] Memory created [memory] [API:I][12.457518] Memory create [CORE:V0][12.457505] Memory desc init by tag [memory] [CORE:I][12.457509] Memory created [memory] [API:I][12.457530] Memory create - strides [CORE:I][12.457516] Memory desc init by Stride [memory] [CORE:I][12.457520] Memory created [memory] [API:I][12.457543] Memory create [CORE:V0][12.457529] Memory desc init by tag [memory] [CORE:I][12.457533] Memory created [memory] [API:I][12.457555] Memory create [CORE:V0][12.457541] Memory desc init by tag [memory] [CORE:I][12.457546] Memory created [memory] [API:I][12.457567] Memory create - strides [CORE:I][12.457552] Memory desc init by Stride [memory] [CORE:I][12.457557] Memory created [memory] [API:I][12.457578] Memory create [CORE:V0][12.457564] Memory desc init by tag [memory] [CORE:I][12.457568] Memory created [memory] [API:I][12.442823] CPU Engine create [CORE:V0][12.457645] CPU Engine created [engine] [CORE:I][12.457649] CPU 
Engine created [cpu/engine] [API:I][12.442835] CPU Stream create [CORE:I][12.457257] CPU Stream created [stream] [CORE:V0][12.457256] CPU Stream created [cpu/stream] [API:I][12.442855] matmul desc create - no bias [CORE:I][12.457526] matmul desc init [matmul] [API:I][12.442875] matmul primitive_desc create - attr [PROF:I][12.457333] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00344,ms [API:I][12.442893] matmul primitive create [CORE:I][12.457522] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.457526] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.442802] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.444482] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.68ms graph_exe_count=-1 weight_address=0x70e047ff0040 [PROF:I][12.459053] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.70758,ms [API:I][12.444616] matmul desc create - no bias [CORE:I][12.459287] matmul desc init [matmul] [API:I][12.444629] matmul primitive_desc create - attr [PROF:I][12.459082] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00166,ms [API:I][12.444641] matmul primitive create [CORE:I][12.459267] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.459270] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.444541] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.444962] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.422ms graph_exe_count=-1 weight_address=0x2f1b64c0 [PROF:I][12.459532] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.441045,ms [API:I][12.445092] matmul desc create - no bias [CORE:I][12.459762] matmul desc init [matmul] [API:I][12.445100] matmul primitive_desc create - attr [PROF:I][12.459551] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00096,ms [API:I][12.445110] matmul primitive create [CORE:I][12.459736] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.459739] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.445009] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.445338] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.329ms graph_exe_count=-1 weight_address=0x301b6500 [PROF:I][12.459908] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.347791,ms [PROF:V0][12.445469] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.64502,ms [CORE:I][12.459893] CPU Stream deleted [stream] [CORE:I][12.460299] CPU Engine deleted [engine] [API:I][12.460466] Memory create [CORE:V0][12.460455] Memory desc init 
by tag [memory] [CORE:I][12.460461] Memory created [memory] [API:I][12.460483] Memory create - strides [CORE:I][12.460469] Memory desc init by Stride [memory] [CORE:I][12.460473] Memory created [memory] [API:I][12.460494] Memory create [CORE:V0][12.460480] Memory desc init by tag [memory] [CORE:I][12.460486] Memory created [memory] [API:I][12.460416] matmul desc create - no bias [CORE:I][12.460414] matmul desc init [matmul] [API:I][12.460430] matmul primitive_desc create - attr [PROF:I][12.460211] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00208,ms [API:I][12.460443] matmul primitive create [CORE:I][12.460400] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.460403] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.445675] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.447319] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.645ms graph_exe_count=-1 weight_address=0x70e04bff1040 [PROF:I][12.461890] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.66576,ms [API:I][12.462261] Memory create [CORE:V0][12.462248] Memory desc init by tag [memory] [CORE:I][12.462252] Memory created [memory] [API:I][12.462273] Memory create - strides [CORE:I][12.462258] Memory desc init by Stride [memory] [CORE:I][12.462262] Memory created [memory] [API:I][12.462283] Memory create [CORE:V0][12.462268] Memory desc init by tag [memory] [CORE:I][12.462272] Memory created [memory] [API:I][12.462201] matmul desc create - no bias [CORE:I][12.462198] matmul desc init [matmul] 
[API:I][12.462213] matmul primitive_desc create - attr [PROF:I][12.461993] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00136,ms [API:I][12.462225] matmul primitive create [CORE:I][12.462179] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.462183] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.447454] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.453691] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.237ms graph_exe_count=-1 weight_address=0x70e04fff2040 [PROF:I][12.468279] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.27448,ms [API:I][12.468698] Memory create [CORE:V0][12.468686] Memory desc init by tag [memory] [CORE:I][12.468695] Memory created [memory] [API:I][12.468717] Memory create - strides [CORE:I][12.468706] Memory desc init by Stride [memory] [CORE:I][12.468709] Memory created [memory] [API:I][12.468730] Memory create [CORE:V0][12.468716] Memory desc init by tag [memory] [CORE:I][12.468721] Memory created [memory] [API:I][12.468747] Memory create [CORE:V0][12.468732] Memory desc init by tag [memory] [CORE:I][12.468738] Memory created [memory] [API:I][12.468763] Memory create [CORE:V0][12.468749] Memory desc init by tag [memory] [CORE:I][12.468752] Memory created [memory] [API:I][12.468686] matmul desc create - no bias [CORE:I][12.468685] matmul desc init [matmul] [API:I][12.468707] matmul primitive_desc create - attr [PROF:I][12.468492] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003931,ms [API:I][12.468725] matmul primitive create [CORE:I][12.468682] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.468686] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.453959] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.460173] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.215ms graph_exe_count=-1 weight_address=0x70e05dff3040 [PROF:I][12.474760] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.25394,ms [API:I][12.475185] Memory create [CORE:V0][12.475173] Memory desc init by tag [memory] [CORE:I][12.475182] Memory created [memory] [API:I][12.475203] Memory create - strides [CORE:I][12.475190] Memory desc init by Stride [memory] [CORE:I][12.475193] Memory created [memory] [API:I][12.475215] Memory create [CORE:V0][12.475201] Memory desc init by tag [memory] [CORE:I][12.475207] Memory created [memory] [API:I][12.475141] matmul desc create - no bias [CORE:I][12.475139] matmul desc init [matmul] [API:I][12.475163] matmul primitive_desc create - attr [PROF:I][12.474946] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.0026,ms [API:I][12.475179] matmul primitive create [CORE:I][12.475136] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.475140] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.460413] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.466969] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.556ms graph_exe_count=-1 weight_address=0x70e06bff4040 [PROF:I][12.481556] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.59668,ms [API:I][12.482050] Memory create [CORE:V0][12.482040] Memory desc init by tag [memory] [CORE:I][12.482049] Memory created [memory] [API:I][12.482071] Memory create - strides [CORE:I][12.482057] Memory desc init by Stride [memory] [CORE:I][12.482061] Memory created [memory] [API:I][12.482083] Memory create [CORE:V0][12.482069] Memory desc init by tag [memory] [CORE:I][12.482073] Memory created [memory] [API:I][12.482097] Memory create [CORE:V0][12.482082] Memory desc init by tag [memory] [CORE:I][12.482086] Memory created [memory] [API:I][12.482107] Memory create - strides [CORE:I][12.482093] Memory desc init by Stride [memory] [CORE:I][12.482098] Memory created [memory] [API:I][12.482123] Memory create [CORE:V0][12.482109] Memory desc init by tag [memory] [CORE:I][12.482113] Memory created [memory] [API:I][12.482135] Memory create [CORE:V0][12.482121] Memory desc init by tag [memory] [CORE:I][12.482125] Memory created [memory] [API:I][12.482148] Memory create - strides [CORE:I][12.482134] Memory desc init by Stride [memory] [CORE:I][12.482137] Memory created [memory] [API:I][12.482159] Memory create [CORE:V0][12.482145] Memory desc init by tag [memory] [CORE:I][12.482151] Memory created [memory] [API:I][12.467407] CPU Engine create [CORE:V0][12.482229] CPU Engine created [engine] [CORE:I][12.482233] CPU Engine 
created [cpu/engine] [API:I][12.467419] CPU Stream create [CORE:I][12.481840] CPU Stream created [stream] [CORE:V0][12.481842] CPU Stream created [cpu/stream] [API:I][12.467440] matmul desc create - no bias [CORE:I][12.482111] matmul desc init [matmul] [API:I][12.467463] matmul primitive_desc create - attr [PROF:I][12.481919] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.0034,ms [API:I][12.467479] matmul primitive create [CORE:I][12.482110] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.482114] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.467389] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.469044] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.656ms graph_exe_count=-1 weight_address=0x70e015feb040 [PROF:I][12.483616] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.68239,ms [API:I][12.469179] matmul desc create - no bias [CORE:I][12.483850] matmul desc init [matmul] [API:I][12.469191] matmul primitive_desc create - attr [PROF:I][12.483652] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00916,ms [API:I][12.469211] matmul primitive create [CORE:I][12.483837] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.483840] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.469112] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.469561] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.449ms graph_exe_count=-1 weight_address=0x311be5c0 [PROF:I][12.484132] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.470335,ms [API:I][12.469693] matmul desc create - no bias [CORE:I][12.484363] matmul desc init [matmul] [API:I][12.469701] matmul primitive_desc create - attr [PROF:I][12.484152] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00087,ms [API:I][12.469711] matmul primitive create [CORE:I][12.484337] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.484340] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.469611] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.469927] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.316ms graph_exe_count=-1 weight_address=0x321be600 [PROF:I][12.484497] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.335231,ms [PROF:V0][12.470058] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.65088,ms [CORE:I][12.484482] CPU Stream deleted [stream] [CORE:I][12.484886] CPU Engine deleted [engine] [API:I][12.485063] Memory create [CORE:V0][12.485052] Memory desc init 
by tag [memory] [CORE:I][12.485057] Memory created [memory] [API:I][12.485079] Memory create - strides [CORE:I][12.485064] Memory desc init by Stride [memory] [CORE:I][12.485068] Memory created [memory] [API:I][12.485089] Memory create [CORE:V0][12.485076] Memory desc init by tag [memory] [CORE:I][12.485080] Memory created [memory] [API:I][12.485010] matmul desc create - no bias [CORE:I][12.485008] matmul desc init [matmul] [API:I][12.485025] matmul primitive_desc create - attr [PROF:I][12.484806] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00232,ms [API:I][12.485038] matmul primitive create [CORE:I][12.484994] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.484997] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.470269] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.471925] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.657ms graph_exe_count=-1 weight_address=0x70e019fec040 [PROF:I][12.486496] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.67823,ms [API:I][12.486861] Memory create [CORE:V0][12.486848] Memory desc init by tag [memory] [CORE:I][12.486853] Memory created [memory] [API:I][12.486874] Memory create - strides [CORE:I][12.486859] Memory desc init by Stride [memory] [CORE:I][12.486863] Memory created [memory] [API:I][12.486884] Memory create [CORE:V0][12.486869] Memory desc init by tag [memory] [CORE:I][12.486873] Memory created [memory] [API:I][12.486802] matmul desc create - no bias [CORE:I][12.486799] matmul desc init [matmul] 
[API:I][12.486814] matmul primitive_desc create - attr [PROF:I][12.486594] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00126,ms [API:I][12.486826] matmul primitive create [CORE:I][12.486781] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.486785] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.472056] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.478507] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.452ms graph_exe_count=-1 weight_address=0x70e01dfed040 [PROF:I][12.493093] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.48736,ms [API:I][12.493510] Memory create [CORE:V0][12.493498] Memory desc init by tag [memory] [CORE:I][12.493506] Memory created [memory] [API:I][12.493528] Memory create - strides [CORE:I][12.493514] Memory desc init by Stride [memory] [CORE:I][12.493517] Memory created [memory] [API:I][12.493538] Memory create [CORE:V0][12.493523] Memory desc init by tag [memory] [CORE:I][12.493530] Memory created [memory] [API:I][12.493555] Memory create [CORE:V0][12.493540] Memory desc init by tag [memory] [CORE:I][12.493545] Memory created [memory] [API:I][12.493571] Memory create [CORE:V0][12.493556] Memory desc init by tag [memory] [CORE:I][12.493560] Memory created [memory] [API:I][12.493497] matmul desc create - no bias [CORE:I][12.493496] matmul desc init [matmul] [API:I][12.493519] matmul primitive_desc create - attr [PROF:I][12.493304] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00377,ms [API:I][12.493538] matmul primitive create [CORE:I][12.493495] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.493498] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.478774] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.484927] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.155ms graph_exe_count=-1 weight_address=0x70e02bfee040 [PROF:I][12.499513] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.19473,ms [API:I][12.499934] Memory create [CORE:V0][12.499922] Memory desc init by tag [memory] [CORE:I][12.499930] Memory created [memory] [API:I][12.499952] Memory create - strides [CORE:I][12.499938] Memory desc init by Stride [memory] [CORE:I][12.499942] Memory created [memory] [API:I][12.499963] Memory create [CORE:V0][12.499958] Memory desc init by tag [memory] [CORE:I][12.499963] Memory created [memory] [API:I][12.499899] matmul desc create - no bias [CORE:I][12.499897] matmul desc init [matmul] [API:I][12.499921] matmul primitive_desc create - attr [PROF:I][12.499706] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00352,ms [API:I][12.499938] matmul primitive create [CORE:I][12.499896] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.499899] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.485172] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.491045] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=5.874ms graph_exe_count=-1 weight_address=0x70e039fef040 [PROF:I][12.505633] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,5.91383,ms [API:I][12.506127] Memory create [CORE:V0][12.506118] Memory desc init by tag [memory] [CORE:I][12.506127] Memory created [memory] [API:I][12.506149] Memory create - strides [CORE:I][12.506136] Memory desc init by Stride [memory] [CORE:I][12.506139] Memory created [memory] [API:I][12.506161] Memory create [CORE:V0][12.506146] Memory desc init by tag [memory] [CORE:I][12.506152] Memory created [memory] [API:I][12.506174] Memory create [CORE:V0][12.506159] Memory desc init by tag [memory] [CORE:I][12.506163] Memory created [memory] [API:I][12.506186] Memory create - strides [CORE:I][12.506171] Memory desc init by Stride [memory] [CORE:I][12.506175] Memory created [memory] [API:I][12.506198] Memory create [CORE:V0][12.506183] Memory desc init by tag [memory] [CORE:I][12.506187] Memory created [memory] [API:I][12.506210] Memory create [CORE:V0][12.506195] Memory desc init by tag [memory] [CORE:I][12.506200] Memory created [memory] [API:I][12.506221] Memory create - strides [CORE:I][12.506206] Memory desc init by Stride [memory] [CORE:I][12.506211] Memory created [memory] [API:I][12.506232] Memory create [CORE:V0][12.506217] Memory desc init by tag [memory] [CORE:I][12.506221] Memory created [memory] [API:I][12.491478] CPU Engine create [CORE:V0][12.506300] CPU Engine created [engine] [CORE:I][12.506304] CPU Engine 
created [cpu/engine] [API:I][12.491490] CPU Stream create [CORE:I][12.505911] CPU Stream created [stream] [CORE:V0][12.505912] CPU Stream created [cpu/stream] [API:I][12.491510] matmul desc create - no bias [CORE:I][12.506181] matmul desc init [matmul] [API:I][12.491530] matmul primitive_desc create - attr [PROF:I][12.505988] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.0038,ms [API:I][12.491548] matmul primitive create [CORE:I][12.506177] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.506181] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.491456] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.493192] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.737ms graph_exe_count=-1 weight_address=0x70dfe3fe6040 [PROF:I][12.507764] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76419,ms [API:I][12.493328] matmul desc create - no bias [CORE:I][12.507999] matmul desc init [matmul] [API:I][12.493341] matmul primitive_desc create - attr [PROF:I][12.507794] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00164,ms [API:I][12.493353] matmul primitive create [CORE:I][12.507979] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.507982] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.493253] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.493701] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.449ms graph_exe_count=-1 weight_address=0x331c66c0 [PROF:I][12.508271] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.467915,ms [API:I][12.493832] matmul desc create - no bias [CORE:I][12.508502] matmul desc init [matmul] [API:I][12.493840] matmul primitive_desc create - attr [PROF:I][12.508291] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00097,ms [API:I][12.493850] matmul primitive create [CORE:I][12.508476] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.508479] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.493750] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.494074] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.323ms graph_exe_count=-1 weight_address=0x341c6700 [PROF:I][12.508654] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.352982,ms [PROF:V0][12.494214] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.73706,ms [CORE:I][12.508638] CPU Stream deleted [stream] [CORE:I][12.509044] CPU Engine deleted [engine] [API:I][12.509212] Memory create [CORE:V0][12.509201] Memory desc init 
by tag [memory] [CORE:I][12.509207] Memory created [memory] [API:I][12.509228] Memory create - strides [CORE:I][12.509214] Memory desc init by Stride [memory] [CORE:I][12.509219] Memory created [memory] [API:I][12.509241] Memory create [CORE:V0][12.509227] Memory desc init by tag [memory] [CORE:I][12.509232] Memory created [memory] [API:I][12.509162] matmul desc create - no bias [CORE:I][12.509160] matmul desc init [matmul] [API:I][12.509176] matmul primitive_desc create - attr [PROF:I][12.508956] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00176,ms [API:I][12.509188] matmul primitive create [CORE:I][12.509144] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.509148] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.494420] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.496154] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.735ms graph_exe_count=-1 weight_address=0x70dfe7fe7040 [PROF:I][12.510725] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.75759,ms [API:I][12.511095] Memory create [CORE:V0][12.511082] Memory desc init by tag [memory] [CORE:I][12.511087] Memory created [memory] [API:I][12.511108] Memory create - strides [CORE:I][12.511094] Memory desc init by Stride [memory] [CORE:I][12.511098] Memory created [memory] [API:I][12.511119] Memory create [CORE:V0][12.511105] Memory desc init by tag [memory] [CORE:I][12.511111] Memory created [memory] [API:I][12.511040] matmul desc create - no bias [CORE:I][12.511038] matmul desc init [matmul] 
[API:I][12.511053] matmul primitive_desc create - attr [PROF:I][12.510833] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00153,ms [API:I][12.511065] matmul primitive create [CORE:I][12.511020] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.511024] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.496295] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.502673] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.379ms graph_exe_count=-1 weight_address=0x70dfebfe8040 [PROF:I][12.517259] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.41481,ms [API:I][12.517672] Memory create [CORE:V0][12.517660] Memory desc init by tag [memory] [CORE:I][12.517669] Memory created [memory] [API:I][12.517691] Memory create - strides [CORE:I][12.517676] Memory desc init by Stride [memory] [CORE:I][12.517680] Memory created [memory] [API:I][12.517701] Memory create [CORE:V0][12.517687] Memory desc init by tag [memory] [CORE:I][12.517693] Memory created [memory] [API:I][12.517717] Memory create [CORE:V0][12.517702] Memory desc init by tag [memory] [CORE:I][12.517706] Memory created [memory] [API:I][12.517732] Memory create [CORE:V0][12.517719] Memory desc init by tag [memory] [CORE:I][12.517724] Memory created [memory] [API:I][12.517658] matmul desc create - no bias [CORE:I][12.517656] matmul desc init [matmul] [API:I][12.517678] matmul primitive_desc create - attr [PROF:I][12.517463] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003391,ms [API:I][12.517697] matmul primitive create [CORE:I][12.517654] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.517657] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.502933] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.509366] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.435ms graph_exe_count=-1 weight_address=0x70dff9fe9040 [PROF:I][12.523950] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.47262,ms [API:I][12.524373] Memory create [CORE:V0][12.524360] Memory desc init by tag [memory] [CORE:I][12.524369] Memory created [memory] [API:I][12.524391] Memory create - strides [CORE:I][12.524377] Memory desc init by Stride [memory] [CORE:I][12.524381] Memory created [memory] [API:I][12.524401] Memory create [CORE:V0][12.524387] Memory desc init by tag [memory] [CORE:I][12.524391] Memory created [memory] [API:I][12.524325] matmul desc create - no bias [CORE:I][12.524323] matmul desc init [matmul] [API:I][12.524345] matmul primitive_desc create - attr [PROF:I][12.524129] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00345,ms [API:I][12.524361] matmul primitive create [CORE:I][12.524318] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.524322] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.509596] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.515739] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.144ms graph_exe_count=-1 weight_address=0x70e007fea040 [PROF:I][12.530324] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.18214,ms [API:I][12.530797] Memory create [CORE:V0][12.530787] Memory desc init by tag [memory] [CORE:I][12.530796] Memory created [memory] [API:I][12.530818] Memory create - strides [CORE:I][12.530806] Memory desc init by Stride [memory] [CORE:I][12.530809] Memory created [memory] [API:I][12.530832] Memory create [CORE:V0][12.530817] Memory desc init by tag [memory] [CORE:I][12.530822] Memory created [memory] [API:I][12.530844] Memory create [CORE:V0][12.530830] Memory desc init by tag [memory] [CORE:I][12.530835] Memory created [memory] [API:I][12.530856] Memory create - strides [CORE:I][12.530842] Memory desc init by Stride [memory] [CORE:I][12.530847] Memory created [memory] [API:I][12.530871] Memory create [CORE:V0][12.530857] Memory desc init by tag [memory] [CORE:I][12.530861] Memory created [memory] [API:I][12.530883] Memory create [CORE:V0][12.530869] Memory desc init by tag [memory] [CORE:I][12.530873] Memory created [memory] [API:I][12.530894] Memory create - strides [CORE:I][12.530880] Memory desc init by Stride [memory] [CORE:I][12.530884] Memory created [memory] [API:I][12.530905] Memory create [CORE:V0][12.530890] Memory desc init by tag [memory] [CORE:I][12.530894] Memory created [memory] [API:I][12.516150] CPU Engine create [CORE:V0][12.530972] CPU Engine created [engine] [CORE:I][12.530976] CPU 
Engine created [cpu/engine] [API:I][12.516162] CPU Stream create [CORE:I][12.530584] CPU Stream created [stream] [CORE:V0][12.530583] CPU Stream created [cpu/stream] [API:I][12.516182] matmul desc create - no bias [CORE:I][12.530853] matmul desc init [matmul] [API:I][12.516207] matmul primitive_desc create - attr [PROF:I][12.530664] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00414,ms [API:I][12.516224] matmul primitive create [CORE:I][12.530853] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.530856] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.516130] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.517777] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.648ms graph_exe_count=-1 weight_address=0x70db5fdfe040 [PROF:I][12.532348] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.6727,ms [API:I][12.517910] matmul desc create - no bias [CORE:I][12.532581] matmul desc init [matmul] [API:I][12.517923] matmul primitive_desc create - attr [PROF:I][12.532376] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00137,ms [API:I][12.517935] matmul primitive create [CORE:I][12.532561] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.532565] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.517836] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.518266] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.431ms graph_exe_count=-1 weight_address=0x351ce7c0 [PROF:I][12.532836] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.450224,ms [API:I][12.518397] matmul desc create - no bias [CORE:I][12.533067] matmul desc init [matmul] [API:I][12.518406] matmul primitive_desc create - attr [PROF:I][12.532857] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00087,ms [API:I][12.518416] matmul primitive create [CORE:I][12.533042] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.533045] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.518316] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.518625] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.31ms graph_exe_count=-1 weight_address=0x361ce800 [PROF:I][12.533194] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.327861,ms [PROF:V0][12.518753] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.60303,ms [CORE:I][12.533178] CPU Stream deleted [stream] [CORE:I][12.533583] CPU Engine deleted [engine] [API:I][12.533750] Memory create [CORE:V0][12.533739] Memory desc init 
by tag [memory] [CORE:I][12.533745] Memory created [memory] [API:I][12.533766] Memory create - strides [CORE:I][12.533752] Memory desc init by Stride [memory] [CORE:I][12.533756] Memory created [memory] [API:I][12.533778] Memory create [CORE:V0][12.533763] Memory desc init by tag [memory] [CORE:I][12.533769] Memory created [memory] [API:I][12.533700] matmul desc create - no bias [CORE:I][12.533697] matmul desc init [matmul] [API:I][12.533714] matmul primitive_desc create - attr [PROF:I][12.533495] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00177,ms [API:I][12.533728] matmul primitive create [CORE:I][12.533682] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.533685] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.518957] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.520693] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.737ms graph_exe_count=-1 weight_address=0x70dfc3fe3040 [PROF:I][12.535264] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.75743,ms [API:I][12.535628] Memory create [CORE:V0][12.535615] Memory desc init by tag [memory] [CORE:I][12.535620] Memory created [memory] [API:I][12.535642] Memory create - strides [CORE:I][12.535627] Memory desc init by Stride [memory] [CORE:I][12.535631] Memory created [memory] [API:I][12.535653] Memory create [CORE:V0][12.535639] Memory desc init by tag [memory] [CORE:I][12.535642] Memory created [memory] [API:I][12.535572] matmul desc create - no bias [CORE:I][12.535570] matmul desc init [matmul] 
[API:I][12.535585] matmul primitive_desc create - attr [PROF:I][12.535365] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.001731,ms [API:I][12.535597] matmul primitive create [CORE:I][12.535552] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.535556] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.520827] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.527342] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.515ms graph_exe_count=-1 weight_address=0x70db63dff040 [PROF:I][12.541925] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.54845,ms [API:I][12.542327] Memory create [CORE:V0][12.542315] Memory desc init by tag [memory] [CORE:I][12.542323] Memory created [memory] [API:I][12.542344] Memory create - strides [CORE:I][12.542330] Memory desc init by Stride [memory] [CORE:I][12.542334] Memory created [memory] [API:I][12.542356] Memory create [CORE:V0][12.542342] Memory desc init by tag [memory] [CORE:I][12.542347] Memory created [memory] [API:I][12.542371] Memory create [CORE:V0][12.542357] Memory desc init by tag [memory] [CORE:I][12.542361] Memory created [memory] [API:I][12.542386] Memory create [CORE:V0][12.542372] Memory desc init by tag [memory] [CORE:I][12.542375] Memory created [memory] [API:I][12.542308] matmul desc create - no bias [CORE:I][12.542307] matmul desc init [matmul] [API:I][12.542326] matmul primitive_desc create - attr [PROF:I][12.542111] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00373,ms [API:I][12.542345] matmul primitive create [CORE:I][12.542302] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.542305] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.527579] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.534129] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.551ms graph_exe_count=-1 weight_address=0x70dfc7fe4040 [PROF:I][12.548714] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.58829,ms [API:I][12.549127] Memory create [CORE:V0][12.549115] Memory desc init by tag [memory] [CORE:I][12.549125] Memory created [memory] [API:I][12.549146] Memory create - strides [CORE:I][12.549133] Memory desc init by Stride [memory] [CORE:I][12.549137] Memory created [memory] [API:I][12.549158] Memory create [CORE:V0][12.549142] Memory desc init by tag [memory] [CORE:I][12.549147] Memory created [memory] [API:I][12.549082] matmul desc create - no bias [CORE:I][12.549080] matmul desc init [matmul] [API:I][12.549103] matmul primitive_desc create - attr [PROF:I][12.548888] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00321,ms [API:I][12.549120] matmul primitive create [CORE:I][12.549077] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.549081] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.534353] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.540328] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=5.975ms graph_exe_count=-1 weight_address=0x70dfd5fe5040 [PROF:I][12.554914] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.01353,ms [API:I][12.555399] Memory create [CORE:V0][12.555389] Memory desc init by tag [memory] [CORE:I][12.555397] Memory created [memory] [API:I][12.555419] Memory create - strides [CORE:I][12.555405] Memory desc init by Stride [memory] [CORE:I][12.555409] Memory created [memory] [API:I][12.555430] Memory create [CORE:V0][12.555415] Memory desc init by tag [memory] [CORE:I][12.555419] Memory created [memory] [API:I][12.555441] Memory create [CORE:V0][12.555428] Memory desc init by tag [memory] [CORE:I][12.555432] Memory created [memory] [API:I][12.555455] Memory create - strides [CORE:I][12.555440] Memory desc init by Stride [memory] [CORE:I][12.555444] Memory created [memory] [API:I][12.555467] Memory create [CORE:V0][12.555452] Memory desc init by tag [memory] [CORE:I][12.555455] Memory created [memory] [API:I][12.555478] Memory create [CORE:V0][12.555465] Memory desc init by tag [memory] [CORE:I][12.555469] Memory created [memory] [API:I][12.555492] Memory create - strides [CORE:I][12.555477] Memory desc init by Stride [memory] [CORE:I][12.555481] Memory created [memory] [API:I][12.555503] Memory create [CORE:V0][12.555488] Memory desc init by tag [memory] [CORE:I][12.555492] Memory created [memory] [API:I][12.540748] CPU Engine create [CORE:V0][12.555570] CPU Engine created [engine] [CORE:I][12.555574] CPU Engine 
created [cpu/engine] [API:I][12.540759] CPU Stream create [CORE:I][12.555181] CPU Stream created [stream] [CORE:V0][12.555181] CPU Stream created [cpu/stream] [API:I][12.540779] matmul desc create - no bias [CORE:I][12.555450] matmul desc init [matmul] [API:I][12.540799] matmul primitive_desc create - attr [PROF:I][12.555259] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00544,ms [API:I][12.540819] matmul primitive create [CORE:I][12.555447] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.555451] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.540726] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.542462] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.737ms graph_exe_count=-1 weight_address=0x70db2ddf9040 [PROF:I][12.557032] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76211,ms [API:I][12.542595] matmul desc create - no bias [CORE:I][12.557265] matmul desc init [matmul] [API:I][12.542607] matmul primitive_desc create - attr [PROF:I][12.557061] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00189,ms [API:I][12.542620] matmul primitive create [CORE:I][12.557245] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.557248] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.542519] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.542941] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.422ms graph_exe_count=-1 weight_address=0x371d68c0 [PROF:I][12.557512] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.442274,ms [API:I][12.543072] matmul desc create - no bias [CORE:I][12.557742] matmul desc init [matmul] [API:I][12.543080] matmul primitive_desc create - attr [PROF:I][12.557531] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00096,ms [API:I][12.543090] matmul primitive create [CORE:I][12.557715] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.557719] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.542989] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.543332] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.343ms graph_exe_count=-1 weight_address=0x381d6900 [PROF:I][12.557903] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.362542,ms [PROF:V0][12.543463] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.71484,ms [CORE:I][12.557887] CPU Stream deleted [stream] [CORE:I][12.558293] CPU Engine deleted [engine] [API:I][12.558459] Memory create [CORE:V0][12.558449] Memory desc init 
by tag [memory] [CORE:I][12.558455] Memory created [memory] [API:I][12.558477] Memory create - strides [CORE:I][12.558463] Memory desc init by Stride [memory] [CORE:I][12.558466] Memory created [memory] [API:I][12.558487] Memory create [CORE:V0][12.558472] Memory desc init by tag [memory] [CORE:I][12.558477] Memory created [memory] [API:I][12.558408] matmul desc create - no bias [CORE:I][12.558406] matmul desc init [matmul] [API:I][12.558422] matmul primitive_desc create - attr [PROF:I][12.558203] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00215,ms [API:I][12.558435] matmul primitive create [CORE:I][12.558391] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.558394] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.543666] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.545432] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.767ms graph_exe_count=-1 weight_address=0x70db31dfa040 [PROF:I][12.560003] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.78821,ms [API:I][12.560372] Memory create [CORE:V0][12.560359] Memory desc init by tag [memory] [CORE:I][12.560364] Memory created [memory] [API:I][12.560385] Memory create - strides [CORE:I][12.560371] Memory desc init by Stride [memory] [CORE:I][12.560376] Memory created [memory] [API:I][12.560398] Memory create [CORE:V0][12.560383] Memory desc init by tag [memory] [CORE:I][12.560387] Memory created [memory] [API:I][12.560315] matmul desc create - no bias [CORE:I][12.560313] matmul desc init [matmul] 
[API:I][12.560328] matmul primitive_desc create - attr [PROF:I][12.560108] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00156,ms [API:I][12.560340] matmul primitive create [CORE:I][12.560294] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.560298] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.545568] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.552110] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.541ms graph_exe_count=-1 weight_address=0x70db35dfb040 [PROF:I][12.566696] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.57725,ms [API:I][12.567104] Memory create [CORE:V0][12.567093] Memory desc init by tag [memory] [CORE:I][12.567102] Memory created [memory] [API:I][12.567123] Memory create - strides [CORE:I][12.567112] Memory desc init by Stride [memory] [CORE:I][12.567115] Memory created [memory] [API:I][12.567136] Memory create [CORE:V0][12.567122] Memory desc init by tag [memory] [CORE:I][12.567127] Memory created [memory] [API:I][12.567153] Memory create [CORE:V0][12.567138] Memory desc init by tag [memory] [CORE:I][12.567142] Memory created [memory] [API:I][12.567166] Memory create [CORE:V0][12.567152] Memory desc init by tag [memory] [CORE:I][12.567156] Memory created [memory] [API:I][12.567089] matmul desc create - no bias [CORE:I][12.567088] matmul desc init [matmul] [API:I][12.567109] matmul primitive_desc create - attr [PROF:I][12.566894] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00306,ms [API:I][12.567127] matmul primitive create [CORE:I][12.567083] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.567086] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.552361] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.558633] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.275ms graph_exe_count=-1 weight_address=0x70db43dfc040 [PROF:I][12.573219] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.3126,ms [API:I][12.573639] Memory create [CORE:V0][12.573627] Memory desc init by tag [memory] [CORE:I][12.573636] Memory created [memory] [API:I][12.573658] Memory create - strides [CORE:I][12.573645] Memory desc init by Stride [memory] [CORE:I][12.573651] Memory created [memory] [API:I][12.573672] Memory create [CORE:V0][12.573658] Memory desc init by tag [memory] [CORE:I][12.573663] Memory created [memory] [API:I][12.573595] matmul desc create - no bias [CORE:I][12.573593] matmul desc init [matmul] [API:I][12.573615] matmul primitive_desc create - attr [PROF:I][12.573401] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00396,ms [API:I][12.573633] matmul primitive create [CORE:I][12.573590] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.573594] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.558867] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.564930] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.063ms graph_exe_count=-1 weight_address=0x70db51dfd040 [PROF:I][12.579516] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.10243,ms [API:I][12.580008] Memory create [CORE:V0][12.579998] Memory desc init by tag [memory] [CORE:I][12.580007] Memory created [memory] [API:I][12.580029] Memory create - strides [CORE:I][12.580018] Memory desc init by Stride [memory] [CORE:I][12.580021] Memory created [memory] [API:I][12.580044] Memory create [CORE:V0][12.580029] Memory desc init by tag [memory] [CORE:I][12.580033] Memory created [memory] [API:I][12.580056] Memory create [CORE:V0][12.580041] Memory desc init by tag [memory] [CORE:I][12.580046] Memory created [memory] [API:I][12.580068] Memory create - strides [CORE:I][12.580053] Memory desc init by Stride [memory] [CORE:I][12.580057] Memory created [memory] [API:I][12.580080] Memory create [CORE:V0][12.580065] Memory desc init by tag [memory] [CORE:I][12.580069] Memory created [memory] [API:I][12.580091] Memory create [CORE:V0][12.580076] Memory desc init by tag [memory] [CORE:I][12.580081] Memory created [memory] [API:I][12.580104] Memory create - strides [CORE:I][12.580089] Memory desc init by Stride [memory] [CORE:I][12.580093] Memory created [memory] [API:I][12.580114] Memory create [CORE:V0][12.580100] Memory desc init by tag [memory] [CORE:I][12.580103] Memory created [memory] [API:I][12.565360] CPU Engine create [CORE:V0][12.580183] CPU Engine created [engine] [CORE:I][12.580187] CPU Engine 
created [cpu/engine] [API:I][12.565373] CPU Stream create [CORE:I][12.579795] CPU Stream created [stream] [CORE:V0][12.579795] CPU Stream created [cpu/stream] [API:I][12.565393] matmul desc create - no bias [CORE:I][12.580064] matmul desc init [matmul] [API:I][12.565414] matmul primitive_desc create - attr [PROF:I][12.579871] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00366,ms [API:I][12.565431] matmul primitive create [CORE:I][12.580060] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.580065] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.565339] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.567002] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.663ms graph_exe_count=-1 weight_address=0x70dafbdf4040 [PROF:I][12.581572] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.6888,ms [API:I][12.567135] matmul desc create - no bias [CORE:I][12.581806] matmul desc init [matmul] [API:I][12.567148] matmul primitive_desc create - attr [PROF:I][12.581601] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00142,ms [API:I][12.567161] matmul primitive create [CORE:I][12.581787] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.581790] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.567061] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.567545] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.484ms graph_exe_count=-1 weight_address=0x391de9c0 [PROF:I][12.582115] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.504137,ms [API:I][12.567677] matmul desc create - no bias [CORE:I][12.582347] matmul desc init [matmul] [API:I][12.567686] matmul primitive_desc create - attr [PROF:I][12.582138] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00095,ms [API:I][12.567697] matmul primitive create [CORE:I][12.582322] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.582325] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.567596] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.567919] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.324ms graph_exe_count=-1 weight_address=0x3a1dea00 [PROF:I][12.582488] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.341641,ms [PROF:V0][12.568048] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.68799,ms [CORE:I][12.582472] CPU Stream deleted [stream] [CORE:I][12.582877] CPU Engine deleted [engine] [API:I][12.583055] Memory create [CORE:V0][12.583044] Memory desc init 
by tag [memory] [CORE:I][12.583050] Memory created [memory] [API:I][12.583071] Memory create - strides [CORE:I][12.583057] Memory desc init by Stride [memory] [CORE:I][12.583060] Memory created [memory] [API:I][12.583081] Memory create [CORE:V0][12.583067] Memory desc init by tag [memory] [CORE:I][12.583072] Memory created [memory] [API:I][12.583003] matmul desc create - no bias [CORE:I][12.583000] matmul desc init [matmul] [API:I][12.583018] matmul primitive_desc create - attr [PROF:I][12.582799] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00209,ms [API:I][12.583031] matmul primitive create [CORE:I][12.582986] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.582989] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.568261] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.569954] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.694ms graph_exe_count=-1 weight_address=0x70daffdf5040 [PROF:I][12.584525] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71502,ms [API:I][12.584891] Memory create [CORE:V0][12.584878] Memory desc init by tag [memory] [CORE:I][12.584883] Memory created [memory] [API:I][12.584904] Memory create - strides [CORE:I][12.584890] Memory desc init by Stride [memory] [CORE:I][12.584894] Memory created [memory] [API:I][12.584915] Memory create [CORE:V0][12.584901] Memory desc init by tag [memory] [CORE:I][12.584905] Memory created [memory] [API:I][12.584834] matmul desc create - no bias [CORE:I][12.584831] matmul desc init [matmul] 
[API:I][12.584847] matmul primitive_desc create - attr [PROF:I][12.584627] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00142,ms [API:I][12.584859] matmul primitive create [CORE:I][12.584815] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.584818] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.570098] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.576433] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.335ms graph_exe_count=-1 weight_address=0x70db03df6040 [PROF:I][12.591020] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.38152,ms [API:I][12.591439] Memory create [CORE:V0][12.591427] Memory desc init by tag [memory] [CORE:I][12.591436] Memory created [memory] [API:I][12.591458] Memory create - strides [CORE:I][12.591445] Memory desc init by Stride [memory] [CORE:I][12.591450] Memory created [memory] [API:I][12.591472] Memory create [CORE:V0][12.591458] Memory desc init by tag [memory] [CORE:I][12.591463] Memory created [memory] [API:I][12.591487] Memory create [CORE:V0][12.591473] Memory desc init by tag [memory] [CORE:I][12.591477] Memory created [memory] [API:I][12.591504] Memory create [CORE:V0][12.591489] Memory desc init by tag [memory] [CORE:I][12.591493] Memory created [memory] [API:I][12.591426] matmul desc create - no bias [CORE:I][12.591425] matmul desc init [matmul] [API:I][12.591448] matmul primitive_desc create - attr [PROF:I][12.591234] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00384,ms [API:I][12.591467] matmul primitive create [CORE:I][12.591423] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.591427] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.576701] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.583197] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.497ms graph_exe_count=-1 weight_address=0x70db11df7040 [PROF:I][12.597782] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.53533,ms [API:I][12.598201] Memory create [CORE:V0][12.598189] Memory desc init by tag [memory] [CORE:I][12.598197] Memory created [memory] [API:I][12.598219] Memory create - strides [CORE:I][12.598205] Memory desc init by Stride [memory] [CORE:I][12.598209] Memory created [memory] [API:I][12.598229] Memory create [CORE:V0][12.598214] Memory desc init by tag [memory] [CORE:I][12.598220] Memory created [memory] [API:I][12.598153] matmul desc create - no bias [CORE:I][12.598152] matmul desc init [matmul] [API:I][12.598175] matmul primitive_desc create - attr [PROF:I][12.597959] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00336,ms [API:I][12.598192] matmul primitive create [CORE:I][12.598151] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.598156] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.583430] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.589510] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.081ms graph_exe_count=-1 weight_address=0x70db1fdf8040 [PROF:I][12.604094] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.12018,ms [API:I][12.604570] Memory create [CORE:V0][12.604560] Memory desc init by tag [memory] [CORE:I][12.604570] Memory created [memory] [API:I][12.604592] Memory create - strides [CORE:I][12.604578] Memory desc init by Stride [memory] [CORE:I][12.604583] Memory created [memory] [API:I][12.604604] Memory create [CORE:V0][12.604589] Memory desc init by tag [memory] [CORE:I][12.604595] Memory created [memory] [API:I][12.604618] Memory create [CORE:V0][12.604603] Memory desc init by tag [memory] [CORE:I][12.604607] Memory created [memory] [API:I][12.604629] Memory create - strides [CORE:I][12.604615] Memory desc init by Stride [memory] [CORE:I][12.604619] Memory created [memory] [API:I][12.604642] Memory create [CORE:V0][12.604628] Memory desc init by tag [memory] [CORE:I][12.604632] Memory created [memory] [API:I][12.604654] Memory create [CORE:V0][12.604641] Memory desc init by tag [memory] [CORE:I][12.604646] Memory created [memory] [API:I][12.604669] Memory create - strides [CORE:I][12.604654] Memory desc init by Stride [memory] [CORE:I][12.604659] Memory created [memory] [API:I][12.604680] Memory create [CORE:V0][12.604666] Memory desc init by tag [memory] [CORE:I][12.604670] Memory created [memory] [API:I][12.589926] CPU Engine create [CORE:V0][12.604748] CPU Engine created [engine] [CORE:I][12.604753] CPU Engine 
created [cpu/engine] [API:I][12.589939] CPU Stream create [CORE:I][12.604360] CPU Stream created [stream] [CORE:V0][12.604359] CPU Stream created [cpu/stream] [API:I][12.589957] matmul desc create - no bias [CORE:I][12.604627] matmul desc init [matmul] [API:I][12.589975] matmul primitive_desc create - attr [PROF:I][12.604433] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00343,ms [API:I][12.589993] matmul primitive create [CORE:I][12.604622] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.604626] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.589900] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.591600] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.701ms graph_exe_count=-1 weight_address=0x70dac9def040 [PROF:I][12.606173] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.72813,ms [API:I][12.591738] matmul desc create - no bias [CORE:I][12.606409] matmul desc init [matmul] [API:I][12.591753] matmul primitive_desc create - attr [PROF:I][12.606207] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00177,ms [API:I][12.591766] matmul primitive create [CORE:I][12.606393] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.606396] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.591668] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.592136] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.468ms graph_exe_count=-1 weight_address=0x3b1e6ac0 [PROF:I][12.606707] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.489746,ms [API:I][12.592268] matmul desc create - no bias [CORE:I][12.606938] matmul desc init [matmul] [API:I][12.592277] matmul primitive_desc create - attr [PROF:I][12.606729] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00116,ms [API:I][12.592288] matmul primitive create [CORE:I][12.606914] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.606917] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.592188] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.592518] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.33ms graph_exe_count=-1 weight_address=0x3c1e6b00 [PROF:I][12.607088] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.349531,ms [PROF:V0][12.592649] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.72192,ms [CORE:I][12.607073] CPU Stream deleted [stream] [CORE:I][12.607478] CPU Engine deleted [engine] [API:I][12.607650] Memory create [CORE:V0][12.607639] Memory desc init 
by tag [memory] [CORE:I][12.607645] Memory created [memory] [API:I][12.607667] Memory create - strides [CORE:I][12.607654] Memory desc init by Stride [memory] [CORE:I][12.607657] Memory created [memory] [API:I][12.607679] Memory create [CORE:V0][12.607665] Memory desc init by tag [memory] [CORE:I][12.607670] Memory created [memory] [API:I][12.607600] matmul desc create - no bias [CORE:I][12.607598] matmul desc init [matmul] [API:I][12.607615] matmul primitive_desc create - attr [PROF:I][12.607396] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00194,ms [API:I][12.607628] matmul primitive create [CORE:I][12.607583] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.607586] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.592858] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.594527] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.668ms graph_exe_count=-1 weight_address=0x70dacddf0040 [PROF:I][12.609099] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.69162,ms [API:I][12.609474] Memory create [CORE:V0][12.609462] Memory desc init by tag [memory] [CORE:I][12.609467] Memory created [memory] [API:I][12.609489] Memory create - strides [CORE:I][12.609475] Memory desc init by Stride [memory] [CORE:I][12.609478] Memory created [memory] [API:I][12.609499] Memory create [CORE:V0][12.609485] Memory desc init by tag [memory] [CORE:I][12.609489] Memory created [memory] [API:I][12.609418] matmul desc create - no bias [CORE:I][12.609416] matmul desc init [matmul] 
[API:I][12.609431] matmul primitive_desc create - attr [PROF:I][12.609212] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00163,ms [API:I][12.609443] matmul primitive create [CORE:I][12.609398] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.609402] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.594673] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.601255] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.583ms graph_exe_count=-1 weight_address=0x70dad1df1040 [PROF:I][12.615840] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.61717,ms [API:I][12.616246] Memory create [CORE:V0][12.616234] Memory desc init by tag [memory] [CORE:I][12.616242] Memory created [memory] [API:I][12.616264] Memory create - strides [CORE:I][12.616250] Memory desc init by Stride [memory] [CORE:I][12.616254] Memory created [memory] [API:I][12.616274] Memory create [CORE:V0][12.616259] Memory desc init by tag [memory] [CORE:I][12.616264] Memory created [memory] [API:I][12.616288] Memory create [CORE:V0][12.616274] Memory desc init by tag [memory] [CORE:I][12.616278] Memory created [memory] [API:I][12.616305] Memory create [CORE:V0][12.616290] Memory desc init by tag [memory] [CORE:I][12.616294] Memory created [memory] [API:I][12.616228] matmul desc create - no bias [CORE:I][12.616227] matmul desc init [matmul] [API:I][12.616247] matmul primitive_desc create - attr [PROF:I][12.616032] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00362,ms [API:I][12.616265] matmul primitive create [CORE:I][12.616221] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.616225] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.601499] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.607860] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.362ms graph_exe_count=-1 weight_address=0x70dadfdf2040 [PROF:I][12.622444] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.3986,ms [API:I][12.622855] Memory create [CORE:V0][12.622843] Memory desc init by tag [memory] [CORE:I][12.622851] Memory created [memory] [API:I][12.622872] Memory create - strides [CORE:I][12.622858] Memory desc init by Stride [memory] [CORE:I][12.622861] Memory created [memory] [API:I][12.622882] Memory create [CORE:V0][12.622867] Memory desc init by tag [memory] [CORE:I][12.622874] Memory created [memory] [API:I][12.622805] matmul desc create - no bias [CORE:I][12.622803] matmul desc init [matmul] [API:I][12.622825] matmul primitive_desc create - attr [PROF:I][12.622610] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00325,ms [API:I][12.622843] matmul primitive create [CORE:I][12.622799] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.622803] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.608076] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.614206] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.13ms graph_exe_count=-1 weight_address=0x70daeddf3040 [PROF:I][12.628793] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.16985,ms [API:I][12.629276] Memory create [CORE:V0][12.629267] Memory desc init by tag [memory] [CORE:I][12.629279] Memory created [memory] [API:I][12.629301] Memory create - strides [CORE:I][12.629287] Memory desc init by Stride [memory] [CORE:I][12.629291] Memory created [memory] [API:I][12.629312] Memory create [CORE:V0][12.629297] Memory desc init by tag [memory] [CORE:I][12.629302] Memory created [memory] [API:I][12.629325] Memory create [CORE:V0][12.629310] Memory desc init by tag [memory] [CORE:I][12.629315] Memory created [memory] [API:I][12.629336] Memory create - strides [CORE:I][12.629323] Memory desc init by Stride [memory] [CORE:I][12.629327] Memory created [memory] [API:I][12.629351] Memory create [CORE:V0][12.629337] Memory desc init by tag [memory] [CORE:I][12.629343] Memory created [memory] [API:I][12.629365] Memory create [CORE:V0][12.629350] Memory desc init by tag [memory] [CORE:I][12.629354] Memory created [memory] [API:I][12.629375] Memory create - strides [CORE:I][12.629361] Memory desc init by Stride [memory] [CORE:I][12.629365] Memory created [memory] [API:I][12.629386] Memory create [CORE:V0][12.629372] Memory desc init by tag [memory] [CORE:I][12.629376] Memory created [memory] [API:I][12.614631] CPU Engine create [CORE:V0][12.629454] CPU Engine created [engine] [CORE:I][12.629458] CPU Engine 
created [cpu/engine] [API:I][12.614644] CPU Stream create [CORE:I][12.629065] CPU Stream created [stream] [CORE:V0][12.629064] CPU Stream created [cpu/stream] [API:I][12.614662] matmul desc create - no bias [CORE:I][12.629334] matmul desc init [matmul] [API:I][12.614683] matmul primitive_desc create - attr [PROF:I][12.629140] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.003581,ms [API:I][12.614701] matmul primitive create [CORE:I][12.629330] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.629334] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.614609] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.616391] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.783ms graph_exe_count=-1 weight_address=0x70da97dea040 [PROF:I][12.630963] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.80918,ms [API:I][12.616526] matmul desc create - no bias [CORE:I][12.631196] matmul desc init [matmul] [API:I][12.616538] matmul primitive_desc create - attr [PROF:I][12.630990] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00128,ms [API:I][12.616549] matmul primitive create [CORE:I][12.631175] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.631178] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.616450] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.616899] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.449ms graph_exe_count=-1 weight_address=0x3d1eebc0 [PROF:I][12.631469] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.469075,ms [API:I][12.617029] matmul desc create - no bias [CORE:I][12.631699] matmul desc init [matmul] [API:I][12.617037] matmul primitive_desc create - attr [PROF:I][12.631488] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00079,ms [API:I][12.617047] matmul primitive create [CORE:I][12.631673] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.631676] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.616946] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.617281] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.335ms graph_exe_count=-1 weight_address=0x3e1eec00 [PROF:I][12.631852] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.354382,ms [PROF:V0][12.617412] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.78101,ms [CORE:I][12.631836] CPU Stream deleted [stream] [CORE:I][12.632241] CPU Engine deleted [engine] [API:I][12.632408] Memory create [CORE:V0][12.632397] Memory desc init 
by tag [memory] [CORE:I][12.632403] Memory created [memory] [API:I][12.632425] Memory create - strides [CORE:I][12.632411] Memory desc init by Stride [memory] [CORE:I][12.632415] Memory created [memory] [API:I][12.632435] Memory create [CORE:V0][12.632420] Memory desc init by tag [memory] [CORE:I][12.632425] Memory created [memory] [API:I][12.632356] matmul desc create - no bias [CORE:I][12.632353] matmul desc init [matmul] [API:I][12.632371] matmul primitive_desc create - attr [PROF:I][12.632152] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00212,ms [API:I][12.632384] matmul primitive create [CORE:I][12.632339] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.632343] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.617614] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.619310] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.697ms graph_exe_count=-1 weight_address=0x70da9bdeb040 [PROF:I][12.633881] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71747,ms [API:I][12.634247] Memory create [CORE:V0][12.634234] Memory desc init by tag [memory] [CORE:I][12.634239] Memory created [memory] [API:I][12.634260] Memory create - strides [CORE:I][12.634245] Memory desc init by Stride [memory] [CORE:I][12.634250] Memory created [memory] [API:I][12.634271] Memory create [CORE:V0][12.634257] Memory desc init by tag [memory] [CORE:I][12.634261] Memory created [memory] [API:I][12.634191] matmul desc create - no bias [CORE:I][12.634188] matmul desc init [matmul] 
[API:I][12.634203] matmul primitive_desc create - attr [PROF:I][12.633984] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00131,ms [API:I][12.634217] matmul primitive create [CORE:I][12.634171] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.634174] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.619445] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.625792] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.347ms graph_exe_count=-1 weight_address=0x70da9fdec040 [PROF:I][12.640380] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.38491,ms [API:I][12.640799] Memory create [CORE:V0][12.640787] Memory desc init by tag [memory] [CORE:I][12.640798] Memory created [memory] [API:I][12.640819] Memory create - strides [CORE:I][12.640805] Memory desc init by Stride [memory] [CORE:I][12.640809] Memory created [memory] [API:I][12.640830] Memory create [CORE:V0][12.640816] Memory desc init by tag [memory] [CORE:I][12.640821] Memory created [memory] [API:I][12.640846] Memory create [CORE:V0][12.640831] Memory desc init by tag [memory] [CORE:I][12.640836] Memory created [memory] [API:I][12.640862] Memory create [CORE:V0][12.640847] Memory desc init by tag [memory] [CORE:I][12.640851] Memory created [memory] [API:I][12.640784] matmul desc create - no bias [CORE:I][12.640782] matmul desc init [matmul] [API:I][12.640804] matmul primitive_desc create - attr [PROF:I][12.640590] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00441,ms [API:I][12.640823] matmul primitive create [CORE:I][12.640779] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.640783] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.626058] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.632249] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.194ms graph_exe_count=-1 weight_address=0x70daadded040 [PROF:I][12.646834] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.23128,ms [API:I][12.647255] Memory create [CORE:V0][12.647242] Memory desc init by tag [memory] [CORE:I][12.647250] Memory created [memory] [API:I][12.647272] Memory create - strides [CORE:I][12.647258] Memory desc init by Stride [memory] [CORE:I][12.647261] Memory created [memory] [API:I][12.647282] Memory create [CORE:V0][12.647267] Memory desc init by tag [memory] [CORE:I][12.647272] Memory created [memory] [API:I][12.647206] matmul desc create - no bias [CORE:I][12.647204] matmul desc init [matmul] [API:I][12.647228] matmul primitive_desc create - attr [PROF:I][12.647013] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.003411,ms [API:I][12.647245] matmul primitive create [CORE:I][12.647205] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.647208] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.632482] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.638874] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.392ms graph_exe_count=-1 weight_address=0x70dabbdee040 [PROF:I][12.653459] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.43213,ms [API:I][12.653940] Memory create [CORE:V0][12.653932] Memory desc init by tag [memory] [CORE:I][12.653940] Memory created [memory] [API:I][12.653962] Memory create - strides [CORE:I][12.653959] Memory desc init by Stride [memory] [CORE:I][12.653962] Memory created [memory] [API:I][12.653984] Memory create [CORE:V0][12.653970] Memory desc init by tag [memory] [CORE:I][12.653975] Memory created [memory] [API:I][12.653998] Memory create [CORE:V0][12.653984] Memory desc init by tag [memory] [CORE:I][12.653988] Memory created [memory] [API:I][12.654009] Memory create - strides [CORE:I][12.653995] Memory desc init by Stride [memory] [CORE:I][12.653999] Memory created [memory] [API:I][12.654022] Memory create [CORE:V0][12.654007] Memory desc init by tag [memory] [CORE:I][12.654011] Memory created [memory] [API:I][12.654033] Memory create [CORE:V0][12.654018] Memory desc init by tag [memory] [CORE:I][12.654023] Memory created [memory] [API:I][12.654044] Memory create - strides [CORE:I][12.654029] Memory desc init by Stride [memory] [CORE:I][12.654034] Memory created [memory] [API:I][12.654055] Memory create [CORE:V0][12.654040] Memory desc init by tag [memory] [CORE:I][12.654044] Memory created [memory] [API:I][12.639300] CPU Engine create [CORE:V0][12.654122] CPU Engine created [engine] [CORE:I][12.654126] CPU 
Engine created [cpu/engine] [API:I][12.639311] CPU Stream create [CORE:I][12.653733] CPU Stream created [stream] [CORE:V0][12.653732] CPU Stream created [cpu/stream] [API:I][12.639331] matmul desc create - no bias [CORE:I][12.654002] matmul desc init [matmul] [API:I][12.639351] matmul primitive_desc create - attr [PROF:I][12.653808] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00388,ms [API:I][12.639368] matmul primitive create [CORE:I][12.653998] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.654002] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.639277] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.640941] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.665ms graph_exe_count=-1 weight_address=0x70da65de5040 [PROF:I][12.655512] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.69121,ms [API:I][12.641075] matmul desc create - no bias [CORE:I][12.655745] matmul desc init [matmul] [API:I][12.641088] matmul primitive_desc create - attr [PROF:I][12.655540] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00158,ms [API:I][12.641099] matmul primitive create [CORE:I][12.655725] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.655728] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.640999] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.641470] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.471ms graph_exe_count=-1 weight_address=0x3f1f6cc0 [PROF:I][12.656040] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.490257,ms [API:I][12.641600] matmul desc create - no bias [CORE:I][12.656270] matmul desc init [matmul] [API:I][12.641608] matmul primitive_desc create - attr [PROF:I][12.656059] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00087,ms [API:I][12.641619] matmul primitive create [CORE:I][12.656244] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.656248] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.641519] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.641846] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.327ms graph_exe_count=-1 weight_address=0x401f6d00 [PROF:I][12.656415] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.346501,ms [PROF:V0][12.641975] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.67603,ms [CORE:I][12.656399] CPU Stream deleted [stream] [CORE:I][12.656803] CPU Engine deleted [engine] [API:I][12.656975] Memory create [CORE:V0][12.656964] Memory desc init 
by tag [memory] [CORE:I][12.656970] Memory created [memory] [API:I][12.656992] Memory create - strides [CORE:I][12.656977] Memory desc init by Stride [memory] [CORE:I][12.656982] Memory created [memory] [API:I][12.657003] Memory create [CORE:V0][12.656990] Memory desc init by tag [memory] [CORE:I][12.656995] Memory created [memory] [API:I][12.656926] matmul desc create - no bias [CORE:I][12.656923] matmul desc init [matmul] [API:I][12.656940] matmul primitive_desc create - attr [PROF:I][12.656722] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00192,ms [API:I][12.656956] matmul primitive create [CORE:I][12.656910] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.656914] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.642186] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.643874] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.689ms graph_exe_count=-1 weight_address=0x70da69de6040 [PROF:I][12.658445] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71078,ms [API:I][12.658814] Memory create [CORE:V0][12.658802] Memory desc init by tag [memory] [CORE:I][12.658807] Memory created [memory] [API:I][12.658828] Memory create - strides [CORE:I][12.658814] Memory desc init by Stride [memory] [CORE:I][12.658818] Memory created [memory] [API:I][12.658839] Memory create [CORE:V0][12.658825] Memory desc init by tag [memory] [CORE:I][12.658829] Memory created [memory] [API:I][12.658758] matmul desc create - no bias [CORE:I][12.658756] matmul desc init [matmul] 
[API:I][12.658772] matmul primitive_desc create - attr [PROF:I][12.658551] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00131,ms [API:I][12.658783] matmul primitive create [CORE:I][12.658738] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.658741] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.644012] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.650453] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.441ms graph_exe_count=-1 weight_address=0x70da6dde7040 [PROF:I][12.665039] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.47634,ms [API:I][12.665449] Memory create [CORE:V0][12.665436] Memory desc init by tag [memory] [CORE:I][12.665445] Memory created [memory] [API:I][12.665467] Memory create - strides [CORE:I][12.665452] Memory desc init by Stride [memory] [CORE:I][12.665457] Memory created [memory] [API:I][12.665478] Memory create [CORE:V0][12.665464] Memory desc init by tag [memory] [CORE:I][12.665469] Memory created [memory] [API:I][12.665498] Memory create [CORE:V0][12.665484] Memory desc init by tag [memory] [CORE:I][12.665489] Memory created [memory] [API:I][12.665512] Memory create [CORE:V0][12.665499] Memory desc init by tag [memory] [CORE:I][12.665503] Memory created [memory] [API:I][12.665437] matmul desc create - no bias [CORE:I][12.665435] matmul desc init [matmul] [API:I][12.665458] matmul primitive_desc create - attr [PROF:I][12.665242] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00359,ms [API:I][12.665475] matmul primitive create [CORE:I][12.665434] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.665438] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.650711] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.656900] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.191ms graph_exe_count=-1 weight_address=0x70da7bde8040 [PROF:I][12.671485] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.2279,ms [API:I][12.671900] Memory create [CORE:V0][12.671888] Memory desc init by tag [memory] [CORE:I][12.671897] Memory created [memory] [API:I][12.671919] Memory create - strides [CORE:I][12.671904] Memory desc init by Stride [memory] [CORE:I][12.671907] Memory created [memory] [API:I][12.671929] Memory create [CORE:V0][12.671914] Memory desc init by tag [memory] [CORE:I][12.671919] Memory created [memory] [API:I][12.671853] matmul desc create - no bias [CORE:I][12.671851] matmul desc init [matmul] [API:I][12.671881] matmul primitive_desc create - attr [PROF:I][12.671667] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.0043,ms [API:I][12.671899] matmul primitive create [CORE:I][12.671857] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.671860] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.657134] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.663723] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.589ms graph_exe_count=-1 weight_address=0x70da89de9040 [PROF:I][12.678309] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.62859,ms [API:I][12.678786] Memory create [CORE:V0][12.678777] Memory desc init by tag [memory] [CORE:I][12.678787] Memory created [memory] [API:I][12.678809] Memory create - strides [CORE:I][12.678795] Memory desc init by Stride [memory] [CORE:I][12.678799] Memory created [memory] [API:I][12.678821] Memory create [CORE:V0][12.678806] Memory desc init by tag [memory] [CORE:I][12.678810] Memory created [memory] [API:I][12.678833] Memory create [CORE:V0][12.678818] Memory desc init by tag [memory] [CORE:I][12.678822] Memory created [memory] [API:I][12.678844] Memory create - strides [CORE:I][12.678829] Memory desc init by Stride [memory] [CORE:I][12.678834] Memory created [memory] [API:I][12.678858] Memory create [CORE:V0][12.678843] Memory desc init by tag [memory] [CORE:I][12.678847] Memory created [memory] [API:I][12.678869] Memory create [CORE:V0][12.678855] Memory desc init by tag [memory] [CORE:I][12.678859] Memory created [memory] [API:I][12.678880] Memory create - strides [CORE:I][12.678865] Memory desc init by Stride [memory] [CORE:I][12.678869] Memory created [memory] [API:I][12.678891] Memory create [CORE:V0][12.678876] Memory desc init by tag [memory] [CORE:I][12.678881] Memory created [memory] [API:I][12.664136] CPU Engine create [CORE:V0][12.678959] CPU Engine created [engine] [CORE:I][12.678962] CPU Engine 
created [cpu/engine] [API:I][12.664148] CPU Stream create [CORE:I][12.678570] CPU Stream created [stream] [CORE:V0][12.678569] CPU Stream created [cpu/stream] [API:I][12.664167] matmul desc create - no bias [CORE:I][12.678838] matmul desc init [matmul] [API:I][12.664187] matmul primitive_desc create - attr [PROF:I][12.678653] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00328,ms [API:I][12.664213] matmul primitive create [CORE:I][12.678841] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.678845] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.664121] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.665863] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.743ms graph_exe_count=-1 weight_address=0x70da5dde3040 [PROF:I][12.680434] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76988,ms [API:I][12.665996] matmul desc create - no bias [CORE:I][12.680666] matmul desc init [matmul] [API:I][12.666008] matmul primitive_desc create - attr [PROF:I][12.680461] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00133,ms [API:I][12.666020] matmul primitive create [CORE:I][12.680645] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.680649] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.665920] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.666386] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.467ms graph_exe_count=-1 weight_address=0x411f6d40 [PROF:I][12.680956] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.486526,ms [API:I][12.666517] matmul desc create - no bias [CORE:I][12.681187] matmul desc init [matmul] [API:I][12.666526] matmul primitive_desc create - attr [PROF:I][12.680977] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00086,ms [API:I][12.666536] matmul primitive create [CORE:I][12.681162] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.681165] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.666435] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.666758] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.323ms graph_exe_count=-1 weight_address=0x421f6d80 [PROF:I][12.681328] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.342081,ms [PROF:V0][12.666889] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.75195,ms [CORE:I][12.681313] CPU Stream deleted [stream] [CORE:I][12.681717] CPU Engine deleted [engine] [API:I][12.681884] Memory create [CORE:V0][12.681872] Memory desc init 
by tag [memory] [CORE:I][12.681878] Memory created [memory] [API:I][12.681900] Memory create - strides [CORE:I][12.681886] Memory desc init by Stride [memory] [CORE:I][12.681889] Memory created [memory] [API:I][12.681911] Memory create [CORE:V0][12.681898] Memory desc init by tag [memory] [CORE:I][12.681902] Memory created [memory] [API:I][12.681833] matmul desc create - no bias [CORE:I][12.681830] matmul desc init [matmul] [API:I][12.681848] matmul primitive_desc create - attr [PROF:I][12.681630] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00184,ms [API:I][12.681861] matmul primitive create [CORE:I][12.681817] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.681834] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.667106] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.668652] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.547ms graph_exe_count=-1 weight_address=0x70da61de4040 [PROF:I][12.683222] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.58176,ms [API:I][12.683587] Memory create [CORE:V0][12.683574] Memory desc init by tag [memory] [CORE:I][12.683578] Memory created [memory] [API:I][12.683599] Memory create - strides [CORE:I][12.683586] Memory desc init by Stride [memory] [CORE:I][12.683590] Memory created [memory] [API:I][12.683611] Memory create [CORE:V0][12.683597] Memory desc init by tag [memory] [CORE:I][12.683601] Memory created [memory] [API:I][12.683530] matmul desc create - no bias [CORE:I][12.683527] matmul desc init [matmul] 
[API:I][12.683542] matmul primitive_desc create - attr [PROF:I][12.683322] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00142,ms [API:I][12.683553] matmul primitive create [CORE:I][12.683508] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.683512] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.668783] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.675030] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.246ms graph_exe_count=-1 weight_address=0x70dc52bc6040 [PROF:I][12.689618] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.28473,ms [API:I][12.690045] Memory create [CORE:V0][12.690033] Memory desc init by tag [memory] [CORE:I][12.690042] Memory created [memory] [API:I][12.690064] Memory create - strides [CORE:I][12.690050] Memory desc init by Stride [memory] [CORE:I][12.690053] Memory created [memory] [API:I][12.690075] Memory create [CORE:V0][12.690061] Memory desc init by tag [memory] [CORE:I][12.690068] Memory created [memory] [API:I][12.690093] Memory create [CORE:V0][12.690078] Memory desc init by tag [memory] [CORE:I][12.690083] Memory created [memory] [API:I][12.690108] Memory create [CORE:V0][12.690095] Memory desc init by tag [memory] [CORE:I][12.690100] Memory created [memory] [API:I][12.690036] matmul desc create - no bias [CORE:I][12.690034] matmul desc init [matmul] [API:I][12.690056] matmul primitive_desc create - attr [PROF:I][12.689841] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00396,ms [API:I][12.690075] matmul primitive create [CORE:I][12.690033] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.690036] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.675311] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.681890] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.58ms graph_exe_count=-1 weight_address=0x70dc60bc7040 [PROF:I][12.696478] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.62197,ms [API:I][12.696901] Memory create [CORE:V0][12.696889] Memory desc init by tag [memory] [CORE:I][12.696898] Memory created [memory] [API:I][12.696919] Memory create - strides [CORE:I][12.696905] Memory desc init by Stride [memory] [CORE:I][12.696910] Memory created [memory] [API:I][12.696931] Memory create [CORE:V0][12.696918] Memory desc init by tag [memory] [CORE:I][12.696923] Memory created [memory] [API:I][12.696857] matmul desc create - no bias [CORE:I][12.696855] matmul desc init [matmul] [API:I][12.696888] matmul primitive_desc create - attr [PROF:I][12.696674] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00356,ms [API:I][12.696907] matmul primitive create [CORE:I][12.696864] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.696868] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.682141] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.688555] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.416ms graph_exe_count=-1 weight_address=0x70dc6ebc8040 [PROF:I][12.703141] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.45352,ms [API:I][12.703625] Memory create [CORE:V0][12.703614] Memory desc init by tag [memory] [CORE:I][12.703624] Memory created [memory] [API:I][12.703646] Memory create - strides [CORE:I][12.703632] Memory desc init by Stride [memory] [CORE:I][12.703636] Memory created [memory] [API:I][12.703657] Memory create [CORE:V0][12.703643] Memory desc init by tag [memory] [CORE:I][12.703648] Memory created [memory] [API:I][12.703671] Memory create [CORE:V0][12.703656] Memory desc init by tag [memory] [CORE:I][12.703659] Memory created [memory] [API:I][12.703681] Memory create - strides [CORE:I][12.703666] Memory desc init by Stride [memory] [CORE:I][12.703671] Memory created [memory] [API:I][12.703694] Memory create [CORE:V0][12.703679] Memory desc init by tag [memory] [CORE:I][12.703683] Memory created [memory] [API:I][12.703705] Memory create [CORE:V0][12.703690] Memory desc init by tag [memory] [CORE:I][12.703694] Memory created [memory] [API:I][12.703716] Memory create - strides [CORE:I][12.703701] Memory desc init by Stride [memory] [CORE:I][12.703705] Memory created [memory] [API:I][12.703726] Memory create [CORE:V0][12.703711] Memory desc init by tag [memory] [CORE:I][12.703717] Memory created [memory] [API:I][12.688973] CPU Engine create [CORE:V0][12.703795] CPU Engine created [engine] [CORE:I][12.703799] CPU Engine 
created [cpu/engine] [API:I][12.688985] CPU Stream create [CORE:I][12.703407] CPU Stream created [stream] [CORE:V0][12.703406] CPU Stream created [cpu/stream] [API:I][12.689006] matmul desc create - no bias [CORE:I][12.703677] matmul desc init [matmul] [API:I][12.689026] matmul primitive_desc create - attr [PROF:I][12.703485] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00467,ms [API:I][12.689045] matmul primitive create [CORE:I][12.703674] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.703678] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.688953] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.690663] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.711ms graph_exe_count=-1 weight_address=0x70dc20bc1040 [PROF:I][12.705234] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.73708,ms [API:I][12.690798] matmul desc create - no bias [CORE:I][12.705469] matmul desc init [matmul] [API:I][12.690811] matmul primitive_desc create - attr [PROF:I][12.705264] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00162,ms [API:I][12.690824] matmul primitive create [CORE:I][12.705450] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.705453] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.690724] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.691195] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.47ms graph_exe_count=-1 weight_address=0x43206ec0 [PROF:I][12.705765] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.490526,ms [API:I][12.691326] matmul desc create - no bias [CORE:I][12.705996] matmul desc init [matmul] [API:I][12.691334] matmul primitive_desc create - attr [PROF:I][12.705785] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00087,ms [API:I][12.691344] matmul primitive create [CORE:I][12.705970] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.705973] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.691244] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.691586] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.342ms graph_exe_count=-1 weight_address=0x44206f00 [PROF:I][12.706156] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.360822,ms [PROF:V0][12.691715] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.74316,ms [CORE:I][12.706139] CPU Stream deleted [stream] [CORE:I][12.706544] CPU Engine deleted [engine] [API:I][12.706713] Memory create [CORE:V0][12.706703] Memory desc init 
by tag [memory] [CORE:I][12.706710] Memory created [memory] [API:I][12.706731] Memory create - strides [CORE:I][12.706717] Memory desc init by Stride [memory] [CORE:I][12.706720] Memory created [memory] [API:I][12.706743] Memory create [CORE:V0][12.706729] Memory desc init by tag [memory] [CORE:I][12.706733] Memory created [memory] [API:I][12.706664] matmul desc create - no bias [CORE:I][12.706662] matmul desc init [matmul] [API:I][12.706679] matmul primitive_desc create - attr [PROF:I][12.706461] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00203,ms [API:I][12.706693] matmul primitive create [CORE:I][12.706648] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.706652] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.691923] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.693607] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.684ms graph_exe_count=-1 weight_address=0x70dc24bc2040 [PROF:I][12.708177] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.7048,ms [API:I][12.708545] Memory create [CORE:V0][12.708532] Memory desc init by tag [memory] [CORE:I][12.708537] Memory created [memory] [API:I][12.708560] Memory create - strides [CORE:I][12.708545] Memory desc init by Stride [memory] [CORE:I][12.708549] Memory created [memory] [API:I][12.708570] Memory create [CORE:V0][12.708555] Memory desc init by tag [memory] [CORE:I][12.708560] Memory created [memory] [API:I][12.708489] matmul desc create - no bias [CORE:I][12.708486] matmul desc init [matmul] 
[API:I][12.708501] matmul primitive_desc create - attr [PROF:I][12.708281] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00135,ms [API:I][12.708514] matmul primitive create [CORE:I][12.708468] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.708472] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.693743] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.700250] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.506ms graph_exe_count=-1 weight_address=0x70dc28bc3040 [PROF:I][12.714839] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.5465,ms [API:I][12.715261] Memory create [CORE:V0][12.715249] Memory desc init by tag [memory] [CORE:I][12.715259] Memory created [memory] [API:I][12.715280] Memory create - strides [CORE:I][12.715266] Memory desc init by Stride [memory] [CORE:I][12.715273] Memory created [memory] [API:I][12.715294] Memory create [CORE:V0][12.715279] Memory desc init by tag [memory] [CORE:I][12.715285] Memory created [memory] [API:I][12.715310] Memory create [CORE:V0][12.715296] Memory desc init by tag [memory] [CORE:I][12.715300] Memory created [memory] [API:I][12.715326] Memory create [CORE:V0][12.715311] Memory desc init by tag [memory] [CORE:I][12.715315] Memory created [memory] [API:I][12.715249] matmul desc create - no bias [CORE:I][12.715249] matmul desc init [matmul] [API:I][12.715271] matmul primitive_desc create - attr [PROF:I][12.715056] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00378,ms [API:I][12.715290] matmul primitive create [CORE:I][12.715247] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.715251] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.700526] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.706798] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.274ms graph_exe_count=-1 weight_address=0x70dc36bc4040 [PROF:I][12.721382] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.31079,ms [API:I][12.721801] Memory create [CORE:V0][12.721789] Memory desc init by tag [memory] [CORE:I][12.721798] Memory created [memory] [API:I][12.721820] Memory create - strides [CORE:I][12.721806] Memory desc init by Stride [memory] [CORE:I][12.721810] Memory created [memory] [API:I][12.721830] Memory create [CORE:V0][12.721816] Memory desc init by tag [memory] [CORE:I][12.721821] Memory created [memory] [API:I][12.721754] matmul desc create - no bias [CORE:I][12.721752] matmul desc init [matmul] [API:I][12.721773] matmul primitive_desc create - attr [PROF:I][12.721557] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00321,ms [API:I][12.721790] matmul primitive create [CORE:I][12.721748] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.721751] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.707024] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.713173] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.149ms graph_exe_count=-1 weight_address=0x70dc44bc5040 [PROF:I][12.727760] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.18823,ms [API:I][12.728240] Memory create [CORE:V0][12.728230] Memory desc init by tag [memory] [CORE:I][12.728241] Memory created [memory] [API:I][12.728263] Memory create - strides [CORE:I][12.728249] Memory desc init by Stride [memory] [CORE:I][12.728254] Memory created [memory] [API:I][12.728275] Memory create [CORE:V0][12.728260] Memory desc init by tag [memory] [CORE:I][12.728266] Memory created [memory] [API:I][12.728288] Memory create [CORE:V0][12.728275] Memory desc init by tag [memory] [CORE:I][12.728279] Memory created [memory] [API:I][12.728300] Memory create - strides [CORE:I][12.728285] Memory desc init by Stride [memory] [CORE:I][12.728291] Memory created [memory] [API:I][12.728313] Memory create [CORE:V0][12.728299] Memory desc init by tag [memory] [CORE:I][12.728303] Memory created [memory] [API:I][12.728325] Memory create [CORE:V0][12.728311] Memory desc init by tag [memory] [CORE:I][12.728315] Memory created [memory] [API:I][12.728336] Memory create - strides [CORE:I][12.728321] Memory desc init by Stride [memory] [CORE:I][12.728326] Memory created [memory] [API:I][12.728349] Memory create [CORE:V0][12.728334] Memory desc init by tag [memory] [CORE:I][12.728339] Memory created [memory] [API:I][12.713595] CPU Engine create [CORE:V0][12.728417] CPU Engine created [engine] [CORE:I][12.728421] CPU Engine 
created [cpu/engine] [API:I][12.713607] CPU Stream create [CORE:I][12.728028] CPU Stream created [stream] [CORE:V0][12.728030] CPU Stream created [cpu/stream] [API:I][12.713629] matmul desc create - no bias [CORE:I][12.728300] matmul desc init [matmul] [API:I][12.713648] matmul primitive_desc create - attr [PROF:I][12.728105] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00337,ms [API:I][12.713665] matmul primitive create [CORE:I][12.728295] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.728299] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.713574] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.715241] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.668ms graph_exe_count=-1 weight_address=0x70dbeebbc040 [PROF:I][12.729812] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.69421,ms [API:I][12.715375] matmul desc create - no bias [CORE:I][12.730046] matmul desc init [matmul] [API:I][12.715387] matmul primitive_desc create - attr [PROF:I][12.729841] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00168,ms [API:I][12.715400] matmul primitive create [CORE:I][12.730026] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.730029] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.715300] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.715713] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.414ms graph_exe_count=-1 weight_address=0x4520efc0 [PROF:I][12.730283] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.433234,ms [API:I][12.715844] matmul desc create - no bias [CORE:I][12.730514] matmul desc init [matmul] [API:I][12.715852] matmul primitive_desc create - attr [PROF:I][12.730303] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00083,ms [API:I][12.715862] matmul primitive create [CORE:I][12.730487] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.730490] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.715761] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.716074] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.314ms graph_exe_count=-1 weight_address=0x4620f000 [PROF:I][12.730659] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.347601,ms [PROF:V0][12.716219] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.625,ms [CORE:I][12.730644] CPU Stream deleted [stream] [CORE:I][12.731048] CPU Engine deleted [engine] [API:I][12.731220] Memory create [CORE:V0][12.731209] Memory desc init 
by tag [memory] [CORE:I][12.731215] Memory created [memory] [API:I][12.731236] Memory create - strides [CORE:I][12.731222] Memory desc init by Stride [memory] [CORE:I][12.731225] Memory created [memory] [API:I][12.731247] Memory create [CORE:V0][12.731232] Memory desc init by tag [memory] [CORE:I][12.731236] Memory created [memory] [API:I][12.731167] matmul desc create - no bias [CORE:I][12.731164] matmul desc init [matmul] [API:I][12.731181] matmul primitive_desc create - attr [PROF:I][12.730962] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00196,ms [API:I][12.731195] matmul primitive create [CORE:I][12.731149] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.731152] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.716423] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.718098] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.675ms graph_exe_count=-1 weight_address=0x70dbf2bbd040 [PROF:I][12.732669] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.69602,ms [API:I][12.733035] Memory create [CORE:V0][12.733022] Memory desc init by tag [memory] [CORE:I][12.733026] Memory created [memory] [API:I][12.733047] Memory create - strides [CORE:I][12.733033] Memory desc init by Stride [memory] [CORE:I][12.733036] Memory created [memory] [API:I][12.733057] Memory create [CORE:V0][12.733043] Memory desc init by tag [memory] [CORE:I][12.733047] Memory created [memory] [API:I][12.732977] matmul desc create - no bias [CORE:I][12.732975] matmul desc init [matmul] 
[API:I][12.732990] matmul primitive_desc create - attr [PROF:I][12.732771] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00171,ms [API:I][12.733003] matmul primitive create [CORE:I][12.732958] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.732962] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.718233] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.724540] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.306ms graph_exe_count=-1 weight_address=0x70dbf6bbe040 [PROF:I][12.739127] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.34331,ms [API:I][12.739536] Memory create [CORE:V0][12.739524] Memory desc init by tag [memory] [CORE:I][12.739532] Memory created [memory] [API:I][12.739554] Memory create - strides [CORE:I][12.739540] Memory desc init by Stride [memory] [CORE:I][12.739543] Memory created [memory] [API:I][12.739564] Memory create [CORE:V0][12.739549] Memory desc init by tag [memory] [CORE:I][12.739553] Memory created [memory] [API:I][12.739578] Memory create [CORE:V0][12.739564] Memory desc init by tag [memory] [CORE:I][12.739569] Memory created [memory] [API:I][12.739595] Memory create [CORE:V0][12.739580] Memory desc init by tag [memory] [CORE:I][12.739585] Memory created [memory] [API:I][12.739521] matmul desc create - no bias [CORE:I][12.739519] matmul desc init [matmul] [API:I][12.739540] matmul primitive_desc create - attr [PROF:I][12.739324] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.0036,ms [API:I][12.739557] matmul primitive create [CORE:I][12.739516] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.739520] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.724794] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.731080] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.287ms graph_exe_count=-1 weight_address=0x70dc04bbf040 [PROF:I][12.745674] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.33391,ms [API:I][12.746097] Memory create [CORE:V0][12.746085] Memory desc init by tag [memory] [CORE:I][12.746093] Memory created [memory] [API:I][12.746115] Memory create - strides [CORE:I][12.746101] Memory desc init by Stride [memory] [CORE:I][12.746105] Memory created [memory] [API:I][12.746126] Memory create [CORE:V0][12.746111] Memory desc init by tag [memory] [CORE:I][12.746119] Memory created [memory] [API:I][12.746052] matmul desc create - no bias [CORE:I][12.746050] matmul desc init [matmul] [API:I][12.746073] matmul primitive_desc create - attr [PROF:I][12.745857] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00274,ms [API:I][12.746089] matmul primitive create [CORE:I][12.746050] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.746054] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.731326] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.737682] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.357ms graph_exe_count=-1 weight_address=0x70dc12bc0040 [PROF:I][12.752268] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.3963,ms [API:I][12.752747] Memory create [CORE:V0][12.752737] Memory desc init by tag [memory] [CORE:I][12.752746] Memory created [memory] [API:I][12.752768] Memory create - strides [CORE:I][12.752754] Memory desc init by Stride [memory] [CORE:I][12.752758] Memory created [memory] [API:I][12.752779] Memory create [CORE:V0][12.752765] Memory desc init by tag [memory] [CORE:I][12.752769] Memory created [memory] [API:I][12.752792] Memory create [CORE:V0][12.752778] Memory desc init by tag [memory] [CORE:I][12.752782] Memory created [memory] [API:I][12.752804] Memory create - strides [CORE:I][12.752789] Memory desc init by Stride [memory] [CORE:I][12.752794] Memory created [memory] [API:I][12.752817] Memory create [CORE:V0][12.752803] Memory desc init by tag [memory] [CORE:I][12.752807] Memory created [memory] [API:I][12.752829] Memory create [CORE:V0][12.752814] Memory desc init by tag [memory] [CORE:I][12.752818] Memory created [memory] [API:I][12.752839] Memory create - strides [CORE:I][12.752825] Memory desc init by Stride [memory] [CORE:I][12.752831] Memory created [memory] [API:I][12.752852] Memory create [CORE:V0][12.752837] Memory desc init by tag [memory] [CORE:I][12.752842] Memory created [memory] [API:I][12.738098] CPU Engine create [CORE:V0][12.752921] CPU Engine created [engine] [CORE:I][12.752925] CPU Engine 
created [cpu/engine] [API:I][12.738112] CPU Stream create [CORE:I][12.752533] CPU Stream created [stream] [CORE:V0][12.752533] CPU Stream created [cpu/stream] [API:I][12.738132] matmul desc create - no bias [CORE:I][12.752803] matmul desc init [matmul] [API:I][12.738152] matmul primitive_desc create - attr [PROF:I][12.752610] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.003501,ms [API:I][12.738170] matmul primitive create [CORE:I][12.752798] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.752803] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.738077] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.739816] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.74ms graph_exe_count=-1 weight_address=0x70dbbcbb7040 [PROF:I][12.754387] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76493,ms [API:I][12.739949] matmul desc create - no bias [CORE:I][12.754619] matmul desc init [matmul] [API:I][12.739961] matmul primitive_desc create - attr [PROF:I][12.754415] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.0022,ms [API:I][12.739974] matmul primitive create [CORE:I][12.754600] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.754603] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.739874] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.740355] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.481ms graph_exe_count=-1 weight_address=0x472170c0 [PROF:I][12.754925] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.500577,ms [API:I][12.740486] matmul desc create - no bias [CORE:I][12.755156] matmul desc init [matmul] [API:I][12.740494] matmul primitive_desc create - attr [PROF:I][12.754945] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00103,ms [API:I][12.740504] matmul primitive create [CORE:I][12.755130] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.755133] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.740404] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.740724] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.32ms graph_exe_count=-1 weight_address=0x48217100 [PROF:I][12.755294] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.338541,ms [PROF:V0][12.740853] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.7561,ms [CORE:I][12.755278] CPU Stream deleted [stream] [CORE:I][12.755684] CPU Engine deleted [engine] [API:I][12.755842] Memory create [CORE:V0][12.755831] Memory desc init 
by tag [memory] [CORE:I][12.755837] Memory created [memory] [API:I][12.755858] Memory create - strides [CORE:I][12.755844] Memory desc init by Stride [memory] [CORE:I][12.755847] Memory created [memory] [API:I][12.755869] Memory create [CORE:V0][12.755855] Memory desc init by tag [memory] [CORE:I][12.755859] Memory created [memory] [API:I][12.755789] matmul desc create - no bias [CORE:I][12.755787] matmul desc init [matmul] [API:I][12.755802] matmul primitive_desc create - attr [PROF:I][12.755583] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.002,ms [API:I][12.755815] matmul primitive create [CORE:I][12.755771] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.755775] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.741046] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.742731] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.685ms graph_exe_count=-1 weight_address=0x70dbc0bb8040 [PROF:I][12.757301] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.70636,ms [API:I][12.757671] Memory create [CORE:V0][12.757658] Memory desc init by tag [memory] [CORE:I][12.757662] Memory created [memory] [API:I][12.757684] Memory create - strides [CORE:I][12.757669] Memory desc init by Stride [memory] [CORE:I][12.757674] Memory created [memory] [API:I][12.757695] Memory create [CORE:V0][12.757681] Memory desc init by tag [memory] [CORE:I][12.757686] Memory created [memory] [API:I][12.757615] matmul desc create - no bias [CORE:I][12.757612] matmul desc init [matmul] 
[API:I][12.757629] matmul primitive_desc create - attr [PROF:I][12.757409] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00139,ms [API:I][12.757641] matmul primitive create [CORE:I][12.757596] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.757601] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.742872] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.749139] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.268ms graph_exe_count=-1 weight_address=0x70dbc4bb9040 [PROF:I][12.763725] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.30462,ms [API:I][12.764141] Memory create [CORE:V0][12.764128] Memory desc init by tag [memory] [CORE:I][12.764138] Memory created [memory] [API:I][12.764160] Memory create - strides [CORE:I][12.764146] Memory desc init by Stride [memory] [CORE:I][12.764151] Memory created [memory] [API:I][12.764172] Memory create [CORE:V0][12.764157] Memory desc init by tag [memory] [CORE:I][12.764161] Memory created [memory] [API:I][12.764186] Memory create [CORE:V0][12.764171] Memory desc init by tag [memory] [CORE:I][12.764176] Memory created [memory] [API:I][12.764202] Memory create [CORE:V0][12.764188] Memory desc init by tag [memory] [CORE:I][12.764192] Memory created [memory] [API:I][12.764127] matmul desc create - no bias [CORE:I][12.764126] matmul desc init [matmul] [API:I][12.764149] matmul primitive_desc create - attr [PROF:I][12.763934] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00363,ms [API:I][12.764167] matmul primitive create [CORE:I][12.764125] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.764128] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.749402] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.755842] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.442ms graph_exe_count=-1 weight_address=0x70dbd2bba040 [PROF:I][12.770427] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.47887,ms [API:I][12.770847] Memory create [CORE:V0][12.770835] Memory desc init by tag [memory] [CORE:I][12.770844] Memory created [memory] [API:I][12.770866] Memory create - strides [CORE:I][12.770855] Memory desc init by Stride [memory] [CORE:I][12.770858] Memory created [memory] [API:I][12.770879] Memory create [CORE:V0][12.770863] Memory desc init by tag [memory] [CORE:I][12.770869] Memory created [memory] [API:I][12.770805] matmul desc create - no bias [CORE:I][12.770803] matmul desc init [matmul] [API:I][12.770825] matmul primitive_desc create - attr [PROF:I][12.770610] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.003471,ms [API:I][12.770842] matmul primitive create [CORE:I][12.770802] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.770806] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.756079] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.761843] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=5.765ms graph_exe_count=-1 weight_address=0x70dbe0bbb040 [PROF:I][12.776430] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,5.80622,ms [API:I][12.776914] Memory create [CORE:V0][12.776904] Memory desc init by tag [memory] [CORE:I][12.776913] Memory created [memory] [API:I][12.776935] Memory create - strides [CORE:I][12.776921] Memory desc init by Stride [memory] [CORE:I][12.776925] Memory created [memory] [API:I][12.776946] Memory create [CORE:V0][12.776931] Memory desc init by tag [memory] [CORE:I][12.776936] Memory created [memory] [API:I][12.776958] Memory create [CORE:V0][12.776944] Memory desc init by tag [memory] [CORE:I][12.776955] Memory created [memory] [API:I][12.776977] Memory create - strides [CORE:I][12.776962] Memory desc init by Stride [memory] [CORE:I][12.776966] Memory created [memory] [API:I][12.776989] Memory create [CORE:V0][12.776974] Memory desc init by tag [memory] [CORE:I][12.776978] Memory created [memory] [API:I][12.777000] Memory create [CORE:V0][12.776985] Memory desc init by tag [memory] [CORE:I][12.776990] Memory created [memory] [API:I][12.777011] Memory create - strides [CORE:I][12.776997] Memory desc init by Stride [memory] [CORE:I][12.777000] Memory created [memory] [API:I][12.777021] Memory create [CORE:V0][12.777006] Memory desc init by tag [memory] [CORE:I][12.777012] Memory created [memory] [API:I][12.762267] CPU Engine create [CORE:V0][12.777089] CPU Engine created [engine] [CORE:I][12.777092] CPU 
Engine created [cpu/engine] [API:I][12.762278] CPU Stream create [CORE:I][12.776700] CPU Stream created [stream] [CORE:V0][12.776700] CPU Stream created [cpu/stream] [API:I][12.762299] matmul desc create - no bias [CORE:I][12.776970] matmul desc init [matmul] [API:I][12.762319] matmul primitive_desc create - attr [PROF:I][12.776776] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00394,ms [API:I][12.762336] matmul primitive create [CORE:I][12.776966] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.776970] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.762244] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.764033] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.789ms graph_exe_count=-1 weight_address=0x70db8abb2040 [PROF:I][12.778604] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.81572,ms [API:I][12.764168] matmul desc create - no bias [CORE:I][12.778839] matmul desc init [matmul] [API:I][12.764181] matmul primitive_desc create - attr [PROF:I][12.778634] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.001401,ms [API:I][12.764194] matmul primitive create [CORE:I][12.778831] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.778835] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.764107] Using AOCL GEMM 
API: aocl_gemm_f32f32f32of32 [PROF:V0][12.764520] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.413ms graph_exe_count=-1 weight_address=0x4921f1c0 [PROF:I][12.779090] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.434395,ms [API:I][12.764650] matmul desc create - no bias [CORE:I][12.779320] matmul desc init [matmul] [API:I][12.764658] matmul primitive_desc create - attr [PROF:I][12.779110] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00087,ms [API:I][12.764668] matmul primitive create [CORE:I][12.779294] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.779297] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.764567] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.764902] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.335ms graph_exe_count=-1 weight_address=0x4a21f200 [PROF:I][12.779472] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.353512,ms [PROF:V0][12.765032] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.76489,ms [CORE:I][12.779456] CPU Stream deleted [stream] [CORE:I][12.779861] CPU Engine deleted [engine] [API:I][12.780043] Memory create [CORE:V0][12.780032] Memory desc 
init by tag [memory] [CORE:I][12.780039] Memory created [memory] [API:I][12.780060] Memory create - strides [CORE:I][12.780046] Memory desc init by Stride [memory] [CORE:I][12.780050] Memory created [memory] [API:I][12.780071] Memory create [CORE:V0][12.780057] Memory desc init by tag [memory] [CORE:I][12.780062] Memory created [memory] [API:I][12.779991] matmul desc create - no bias [CORE:I][12.779989] matmul desc init [matmul] [API:I][12.780006] matmul primitive_desc create - attr [PROF:I][12.779787] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00182,ms [API:I][12.780019] matmul primitive create [CORE:I][12.779975] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.779979] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.765250] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.766990] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.74ms graph_exe_count=-1 weight_address=0x70db8ebb3040 [PROF:I][12.781561] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76222,ms [API:I][12.781927] Memory create [CORE:V0][12.781913] Memory desc init by tag [memory] [CORE:I][12.781918] Memory created [memory] [API:I][12.781940] Memory create - strides [CORE:I][12.781925] Memory desc init by Stride [memory] [CORE:I][12.781929] Memory created [memory] [API:I][12.781950] Memory create [CORE:V0][12.781936] Memory desc init by tag [memory] [CORE:I][12.781940] Memory created [memory] [API:I][12.781869] matmul desc create - no bias [CORE:I][12.781874] matmul desc init [matmul] 
[API:I][12.781889] matmul primitive_desc create - attr [PROF:I][12.781672] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00141,ms [API:I][12.781903] matmul primitive create [CORE:I][12.781860] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.781867] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.767141] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.773527] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.385ms graph_exe_count=-1 weight_address=0x70db92bb4040 [PROF:I][12.788115] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.42992,ms [API:I][12.788543] Memory create [CORE:V0][12.788532] Memory desc init by tag [memory] [CORE:I][12.788542] Memory created [memory] [API:I][12.788566] Memory create - strides [CORE:I][12.788554] Memory desc init by Stride [memory] [CORE:I][12.788559] Memory created [memory] [API:I][12.788582] Memory create [CORE:V0][12.788569] Memory desc init by tag [memory] [CORE:I][12.788577] Memory created [memory] [API:I][12.788604] Memory create [CORE:V0][12.788590] Memory desc init by tag [memory] [CORE:I][12.788597] Memory created [memory] [API:I][12.788626] Memory create [CORE:V0][12.788613] Memory desc init by tag [memory] [CORE:I][12.788620] Memory created [memory] [API:I][12.788560] matmul desc create - no bias [CORE:I][12.788559] matmul desc init [matmul] [API:I][12.788585] matmul primitive_desc create - attr [PROF:I][12.788374] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00437,ms [API:I][12.788608] matmul primitive create [CORE:I][12.788567] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.788571] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.773846] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.780400] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.557ms graph_exe_count=-1 weight_address=0x70dba0bb5040 [PROF:I][12.794988] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.59796,ms [API:I][12.795423] Memory create [CORE:V0][12.795413] Memory desc init by tag [memory] [CORE:I][12.795424] Memory created [memory] [API:I][12.795446] Memory create - strides [CORE:I][12.795433] Memory desc init by Stride [memory] [CORE:I][12.795439] Memory created [memory] [API:I][12.795462] Memory create [CORE:V0][12.795449] Memory desc init by tag [memory] [CORE:I][12.795454] Memory created [memory] [API:I][12.795388] matmul desc create - no bias [CORE:I][12.795386] matmul desc init [matmul] [API:I][12.795412] matmul primitive_desc create - attr [PROF:I][12.795200] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.003661,ms [API:I][12.795433] matmul primitive create [CORE:I][12.795393] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.795397] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.780672] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.787380] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.709ms graph_exe_count=-1 weight_address=0x70dbaebb6040 [PROF:I][12.801966] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.74896,ms [API:I][12.802454] Memory create [CORE:V0][12.802446] Memory desc init by tag [memory] [CORE:I][12.802457] Memory created [memory] [API:I][12.802479] Memory create - strides [CORE:I][12.802467] Memory desc init by Stride [memory] [CORE:I][12.802473] Memory created [memory] [API:I][12.802496] Memory create [CORE:V0][12.802484] Memory desc init by tag [memory] [CORE:I][12.802489] Memory created [memory] [API:I][12.802515] Memory create [CORE:V0][12.802501] Memory desc init by tag [memory] [CORE:I][12.802507] Memory created [memory] [API:I][12.802531] Memory create - strides [CORE:I][12.802517] Memory desc init by Stride [memory] [CORE:I][12.802523] Memory created [memory] [API:I][12.802546] Memory create [CORE:V0][12.802531] Memory desc init by tag [memory] [CORE:I][12.802537] Memory created [memory] [API:I][12.802562] Memory create [CORE:V0][12.802548] Memory desc init by tag [memory] [CORE:I][12.802556] Memory created [memory] [API:I][12.802579] Memory create - strides [CORE:I][12.802566] Memory desc init by Stride [memory] [CORE:I][12.802571] Memory created [memory] [API:I][12.802593] Memory create [CORE:V0][12.802580] Memory desc init by tag [memory] [CORE:I][12.802585] Memory created [memory] [API:I][12.787843] CPU Engine create [CORE:V0][12.802667] CPU Engine created [engine] [CORE:I][12.802673] CPU 
Engine created [cpu/engine] [API:I][12.787861] CPU Stream create [CORE:I][12.802284] CPU Stream created [stream] [CORE:V0][12.802285] CPU Stream created [cpu/stream] [API:I][12.787885] matmul desc create - no bias [CORE:I][12.802558] matmul desc init [matmul] [API:I][12.787910] matmul primitive_desc create - attr [PROF:I][12.802370] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.005,ms [API:I][12.787932] matmul primitive create [CORE:I][12.802562] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.802567] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.787841] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.789467] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.626ms graph_exe_count=-1 weight_address=0x70db74baf040 [PROF:I][12.804039] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.65292,ms [API:I][12.789604] matmul desc create - no bias [CORE:I][12.804275] matmul desc init [matmul] [API:I][12.789620] matmul primitive_desc create - attr [PROF:I][12.804075] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.002431,ms [API:I][12.789635] matmul primitive create [CORE:I][12.804265] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.804269] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.789544] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.790001] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.458ms graph_exe_count=-1 weight_address=0x4b2272c0 [PROF:I][12.804571] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.482416,ms [API:I][12.790135] matmul desc create - no bias [CORE:I][12.804806] matmul desc init [matmul] [API:I][12.790147] matmul primitive_desc create - attr [PROF:I][12.804600] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00157,ms [API:I][12.790161] matmul primitive create [CORE:I][12.804790] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.804795] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.790069] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.790395] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.327ms graph_exe_count=-1 weight_address=0x4c227300 [PROF:I][12.804966] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.351982,ms [PROF:V0][12.790529] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.68506,ms [CORE:I][12.804954] CPU Stream deleted [stream] [CORE:I][12.805359] CPU Engine deleted [engine] [API:I][12.805523] Memory create [CORE:V0][12.805513] Memory desc init 
by tag [memory] [CORE:I][12.805521] Memory created [memory] [API:I][12.805544] Memory create - strides [CORE:I][12.805530] Memory desc init by Stride [memory] [CORE:I][12.805536] Memory created [memory] [API:I][12.805559] Memory create [CORE:V0][12.805546] Memory desc init by tag [memory] [CORE:I][12.805553] Memory created [memory] [API:I][12.805486] matmul desc create - no bias [CORE:I][12.805485] matmul desc init [matmul] [API:I][12.805505] matmul primitive_desc create - attr [PROF:I][12.805288] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00262,ms [API:I][12.805521] matmul primitive create [CORE:I][12.805479] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.805483] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.790755] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.792463] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.709ms graph_exe_count=-1 weight_address=0x70db78bb0040 [PROF:I][12.807034] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.73074,ms [API:I][12.807407] Memory create [CORE:V0][12.807395] Memory desc init by tag [memory] [CORE:I][12.807402] Memory created [memory] [API:I][12.807425] Memory create - strides [CORE:I][12.807411] Memory desc init by Stride [memory] [CORE:I][12.807416] Memory created [memory] [API:I][12.807439] Memory create [CORE:V0][12.807426] Memory desc init by tag [memory] [CORE:I][12.807432] Memory created [memory] [API:I][12.807363] matmul desc create - no bias [CORE:I][12.807361] matmul desc init [matmul] 
[API:I][12.807376] matmul primitive_desc create - attr [PROF:I][12.807157] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00156,ms [API:I][12.807388] matmul primitive create [CORE:I][12.807343] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.807347] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.792619] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.799020] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.402ms graph_exe_count=-1 weight_address=0x70d9331fe040 [PROF:I][12.813607] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.43928,ms [API:I][12.814037] Memory create [CORE:V0][12.814027] Memory desc init by tag [memory] [CORE:I][12.814038] Memory created [memory] [API:I][12.814060] Memory create - strides [CORE:I][12.814049] Memory desc init by Stride [memory] [CORE:I][12.814054] Memory created [memory] [API:I][12.814076] Memory create [CORE:V0][12.814063] Memory desc init by tag [memory] [CORE:I][12.814070] Memory created [memory] [API:I][12.814098] Memory create [CORE:V0][12.814084] Memory desc init by tag [memory] [CORE:I][12.814091] Memory created [memory] [API:I][12.814118] Memory create [CORE:V0][12.814105] Memory desc init by tag [memory] [CORE:I][12.814110] Memory created [memory] [API:I][12.814048] matmul desc create - no bias [CORE:I][12.814047] matmul desc init [matmul] [API:I][12.814070] matmul primitive_desc create - attr [PROF:I][12.813857] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00439,ms [API:I][12.814092] matmul primitive create [CORE:I][12.814052] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.814056] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.799333] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.805855] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.523ms graph_exe_count=-1 weight_address=0x70d9411ff040 [PROF:I][12.820441] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.56597,ms [API:I][12.820870] Memory create [CORE:V0][12.820858] Memory desc init by tag [memory] [CORE:I][12.820869] Memory created [memory] [API:I][12.820892] Memory create - strides [CORE:I][12.820879] Memory desc init by Stride [memory] [CORE:I][12.820884] Memory created [memory] [API:I][12.820907] Memory create [CORE:V0][12.820894] Memory desc init by tag [memory] [CORE:I][12.820904] Memory created [memory] [API:I][12.820840] matmul desc create - no bias [CORE:I][12.820840] matmul desc init [matmul] [API:I][12.820863] matmul primitive_desc create - attr [PROF:I][12.820657] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.0046,ms [API:I][12.820890] matmul primitive create [CORE:I][12.820850] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.820854] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.806128] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.812134] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.006ms graph_exe_count=-1 weight_address=0x70db7cbb1040 [PROF:I][12.826722] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.04768,ms [API:I][12.827217] Memory create [CORE:V0][12.827209] Memory desc init by tag [memory] [CORE:I][12.827220] Memory created [memory] [API:I][12.827242] Memory create - strides [CORE:I][12.827229] Memory desc init by Stride [memory] [CORE:I][12.827235] Memory created [memory] [API:I][12.827258] Memory create [CORE:V0][12.827245] Memory desc init by tag [memory] [CORE:I][12.827251] Memory created [memory] [API:I][12.827274] Memory create [CORE:V0][12.827259] Memory desc init by tag [memory] [CORE:I][12.827263] Memory created [memory] [API:I][12.827284] Memory create - strides [CORE:I][12.827270] Memory desc init by Stride [memory] [CORE:I][12.827276] Memory created [memory] [API:I][12.827302] Memory create [CORE:V0][12.827289] Memory desc init by tag [memory] [CORE:I][12.827295] Memory created [memory] [API:I][12.827319] Memory create [CORE:V0][12.827306] Memory desc init by tag [memory] [CORE:I][12.827311] Memory created [memory] [API:I][12.827332] Memory create - strides [CORE:I][12.827320] Memory desc init by Stride [memory] [CORE:I][12.827326] Memory created [memory] [API:I][12.827348] Memory create [CORE:V0][12.827335] Memory desc init by tag [memory] [CORE:I][12.827342] Memory created [memory] [API:I][12.812599] CPU Engine create [CORE:V0][12.827423] CPU Engine created [engine] [CORE:I][12.827430] CPU Engine 
created [cpu/engine] [API:I][12.812619] CPU Stream create [CORE:I][12.827043] CPU Stream created [stream] [CORE:V0][12.827044] CPU Stream created [cpu/stream] [API:I][12.812645] matmul desc create - no bias [CORE:I][12.827317] matmul desc init [matmul] [API:I][12.812669] matmul primitive_desc create - attr [PROF:I][12.827129] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00434,ms [API:I][12.812690] matmul primitive create [CORE:I][12.827321] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.827325] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.812600] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.814318] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.719ms graph_exe_count=-1 weight_address=0x70d9011f9040 [PROF:I][12.828889] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.74508,ms [API:I][12.814452] matmul desc create - no bias [CORE:I][12.829122] matmul desc init [matmul] [API:I][12.814467] matmul primitive_desc create - attr [PROF:I][12.828922] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00264,ms [API:I][12.814483] matmul primitive create [CORE:I][12.829113] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.829118] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.814393] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.814828] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.436ms graph_exe_count=-1 weight_address=0x4d22f3c0 [PROF:I][12.829398] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.461395,ms [API:I][12.814963] matmul desc create - no bias [CORE:I][12.829633] matmul desc init [matmul] [API:I][12.814974] matmul primitive_desc create - attr [PROF:I][12.829427] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00162,ms [API:I][12.814988] matmul primitive create [CORE:I][12.829618] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.829622] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.814896] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.815232] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.336ms graph_exe_count=-1 weight_address=0x4e22f400 [PROF:I][12.829803] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.360562,ms [PROF:V0][12.815366] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.76709,ms [CORE:I][12.829791] CPU Stream deleted [stream] [CORE:I][12.830196] CPU Engine deleted [engine] [API:I][12.830370] Memory create [CORE:V0][12.830359] Memory desc init 
by tag [memory] [CORE:I][12.830368] Memory created [memory] [API:I][12.830391] Memory create - strides [CORE:I][12.830377] Memory desc init by Stride [memory] [CORE:I][12.830382] Memory created [memory] [API:I][12.830405] Memory create [CORE:V0][12.830392] Memory desc init by tag [memory] [CORE:I][12.830398] Memory created [memory] [API:I][12.830330] matmul desc create - no bias [CORE:I][12.830327] matmul desc init [matmul] [API:I][12.830346] matmul primitive_desc create - attr [PROF:I][12.830129] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00259,ms [API:I][12.830363] matmul primitive create [CORE:I][12.830320] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.830324] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.815595] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.817286] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.691ms graph_exe_count=-1 weight_address=0x70d9051fa040 [PROF:I][12.831857] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.71297,ms [API:I][12.832225] Memory create [CORE:V0][12.832213] Memory desc init by tag [memory] [CORE:I][12.832220] Memory created [memory] [API:I][12.832242] Memory create - strides [CORE:I][12.832231] Memory desc init by Stride [memory] [CORE:I][12.832238] Memory created [memory] [API:I][12.832262] Memory create [CORE:V0][12.832249] Memory desc init by tag [memory] [CORE:I][12.832255] Memory created [memory] [API:I][12.832185] matmul desc create - no bias [CORE:I][12.832183] matmul desc init [matmul] 
[API:I][12.832200] matmul primitive_desc create - attr [PROF:I][12.831983] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00221,ms [API:I][12.832216] matmul primitive create [CORE:I][12.832174] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.832177] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.817448] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.823807] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.358ms graph_exe_count=-1 weight_address=0x70d9091fb040 [PROF:I][12.838391] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.39222,ms [API:I][12.838792] Memory create [CORE:V0][12.838780] Memory desc init by tag [memory] [CORE:I][12.838789] Memory created [memory] [API:I][12.838815] Memory create - strides [CORE:I][12.838803] Memory desc init by Stride [memory] [CORE:I][12.838808] Memory created [memory] [API:I][12.838830] Memory create [CORE:V0][12.838817] Memory desc init by tag [memory] [CORE:I][12.838824] Memory created [memory] [API:I][12.838849] Memory create [CORE:V0][12.838834] Memory desc init by tag [memory] [CORE:I][12.838841] Memory created [memory] [API:I][12.838869] Memory create [CORE:V0][12.838855] Memory desc init by tag [memory] [CORE:I][12.838862] Memory created [memory] [API:I][12.838796] matmul desc create - no bias [CORE:I][12.838794] matmul desc init [matmul] [API:I][12.838817] matmul primitive_desc create - attr [PROF:I][12.838602] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003281,ms [API:I][12.838836] matmul primitive create [CORE:I][12.838795] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.838800] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.824074] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.830552] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.48ms graph_exe_count=-1 weight_address=0x70d9171fc040 [PROF:I][12.845137] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.51797,ms [API:I][12.845565] Memory create [CORE:V0][12.845555] Memory desc init by tag [memory] [CORE:I][12.845566] Memory created [memory] [API:I][12.845588] Memory create - strides [CORE:I][12.845577] Memory desc init by Stride [memory] [CORE:I][12.845582] Memory created [memory] [API:I][12.845603] Memory create [CORE:V0][12.845589] Memory desc init by tag [memory] [CORE:I][12.845596] Memory created [memory] [API:I][12.845534] matmul desc create - no bias [CORE:I][12.845533] matmul desc init [matmul] [API:I][12.845558] matmul primitive_desc create - attr [PROF:I][12.845345] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00396,ms [API:I][12.845578] matmul primitive create [CORE:I][12.845538] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.845542] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.830815] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.837470] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.656ms graph_exe_count=-1 weight_address=0x70d9251fd040 [PROF:I][12.852053] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.69203,ms [API:I][12.852525] Memory create [CORE:V0][12.852517] Memory desc init by tag [memory] [CORE:I][12.852528] Memory created [memory] [API:I][12.852551] Memory create - strides [CORE:I][12.852539] Memory desc init by Stride [memory] [CORE:I][12.852545] Memory created [memory] [API:I][12.852568] Memory create [CORE:V0][12.852554] Memory desc init by tag [memory] [CORE:I][12.852561] Memory created [memory] [API:I][12.852587] Memory create [CORE:V0][12.852574] Memory desc init by tag [memory] [CORE:I][12.852580] Memory created [memory] [API:I][12.852603] Memory create - strides [CORE:I][12.852591] Memory desc init by Stride [memory] [CORE:I][12.852598] Memory created [memory] [API:I][12.852623] Memory create [CORE:V0][12.852610] Memory desc init by tag [memory] [CORE:I][12.852616] Memory created [memory] [API:I][12.852642] Memory create [CORE:V0][12.852629] Memory desc init by tag [memory] [CORE:I][12.852635] Memory created [memory] [API:I][12.852657] Memory create - strides [CORE:I][12.852645] Memory desc init by Stride [memory] [CORE:I][12.852650] Memory created [memory] [API:I][12.852674] Memory create [CORE:V0][12.852661] Memory desc init by tag [memory] [CORE:I][12.852666] Memory created [memory] [API:I][12.837925] CPU Engine create [CORE:V0][12.852748] CPU Engine created [engine] [CORE:I][12.852752] CPU Engine 
created [cpu/engine] [API:I][12.837940] CPU Stream create [CORE:I][12.852363] CPU Stream created [stream] [CORE:V0][12.852364] CPU Stream created [cpu/stream] [API:I][12.837964] matmul desc create - no bias [CORE:I][12.852636] matmul desc init [matmul] [API:I][12.837985] matmul primitive_desc create - attr [PROF:I][12.852443] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00364,ms [API:I][12.838003] matmul primitive create [CORE:I][12.852632] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.852636] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.837910] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.839628] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.717ms graph_exe_count=-1 weight_address=0x70d8cf1f4040 [PROF:I][12.854200] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.74454,ms [API:I][12.839767] matmul desc create - no bias [CORE:I][12.854438] matmul desc init [matmul] [API:I][12.839784] matmul primitive_desc create - attr [PROF:I][12.854240] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00295,ms [API:I][12.839800] matmul primitive create [CORE:I][12.854432] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.854437] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.839712] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.840177] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.465ms graph_exe_count=-1 weight_address=0x4f2374c0 [PROF:I][12.854747] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.491886,ms [API:I][12.840311] matmul desc create - no bias [CORE:I][12.854981] matmul desc init [matmul] [API:I][12.840322] matmul primitive_desc create - attr [PROF:I][12.854776] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00151,ms [API:I][12.840337] matmul primitive create [CORE:I][12.854967] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.854971] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.840245] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.840586] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.34ms graph_exe_count=-1 weight_address=0x50237500 [PROF:I][12.855156] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.365582,ms [PROF:V0][12.840719] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.79492,ms [CORE:I][12.855144] CPU Stream deleted [stream] [CORE:I][12.855549] CPU Engine deleted [engine] [API:I][12.855725] Memory create [CORE:V0][12.855715] Memory desc init 
by tag [memory] [CORE:I][12.855723] Memory created [memory] [API:I][12.855746] Memory create - strides [CORE:I][12.855732] Memory desc init by Stride [memory] [CORE:I][12.855737] Memory created [memory] [API:I][12.855760] Memory create [CORE:V0][12.855747] Memory desc init by tag [memory] [CORE:I][12.855752] Memory created [memory] [API:I][12.855685] matmul desc create - no bias [CORE:I][12.855683] matmul desc init [matmul] [API:I][12.855702] matmul primitive_desc create - attr [PROF:I][12.855485] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00262,ms [API:I][12.855719] matmul primitive create [CORE:I][12.855676] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.855679] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.840951] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.842713] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.763ms graph_exe_count=-1 weight_address=0x70d8d31f5040 [PROF:I][12.857285] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.78534,ms [API:I][12.857662] Memory create [CORE:V0][12.857649] Memory desc init by tag [memory] [CORE:I][12.857654] Memory created [memory] [API:I][12.857678] Memory create - strides [CORE:I][12.857667] Memory desc init by Stride [memory] [CORE:I][12.857673] Memory created [memory] [API:I][12.857696] Memory create [CORE:V0][12.857683] Memory desc init by tag [memory] [CORE:I][12.857691] Memory created [memory] [API:I][12.857622] matmul desc create - no bias [CORE:I][12.857622] matmul desc init [matmul] 
[API:I][12.857640] matmul primitive_desc create - attr [PROF:I][12.857422] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00194,ms [API:I][12.857658] matmul primitive create [CORE:I][12.857615] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.857619] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.842894] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.849221] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.326ms graph_exe_count=-1 weight_address=0x70d8d71f6040 [PROF:I][12.863809] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.36962,ms [API:I][12.864234] Memory create [CORE:V0][12.864222] Memory desc init by tag [memory] [CORE:I][12.864232] Memory created [memory] [API:I][12.864257] Memory create - strides [CORE:I][12.864245] Memory desc init by Stride [memory] [CORE:I][12.864250] Memory created [memory] [API:I][12.864274] Memory create [CORE:V0][12.864261] Memory desc init by tag [memory] [CORE:I][12.864269] Memory created [memory] [API:I][12.864295] Memory create [CORE:V0][12.864281] Memory desc init by tag [memory] [CORE:I][12.864287] Memory created [memory] [API:I][12.864315] Memory create [CORE:V0][12.864302] Memory desc init by tag [memory] [CORE:I][12.864308] Memory created [memory] [API:I][12.864244] matmul desc create - no bias [CORE:I][12.864243] matmul desc init [matmul] [API:I][12.864267] matmul primitive_desc create - attr [PROF:I][12.864053] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.003701,ms [API:I][12.864286] matmul primitive create [CORE:I][12.864244] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.864248] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.849525] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.855796] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.272ms graph_exe_count=-1 weight_address=0x70d8e51f7040 [PROF:I][12.870385] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.31777,ms [API:I][12.870821] Memory create [CORE:V0][12.870810] Memory desc init by tag [memory] [CORE:I][12.870822] Memory created [memory] [API:I][12.870845] Memory create - strides [CORE:I][12.870836] Memory desc init by Stride [memory] [CORE:I][12.870841] Memory created [memory] [API:I][12.870864] Memory create [CORE:V0][12.870851] Memory desc init by tag [memory] [CORE:I][12.870857] Memory created [memory] [API:I][12.870792] matmul desc create - no bias [CORE:I][12.870790] matmul desc init [matmul] [API:I][12.870815] matmul primitive_desc create - attr [PROF:I][12.870600] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00339,ms [API:I][12.870833] matmul primitive create [CORE:I][12.870792] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.870796] M: 1 
N: 4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.856074] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.862724] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.65ms graph_exe_count=-1 weight_address=0x70d8f31f8040 [PROF:I][12.877313] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.69814,ms [API:I][12.877811] Memory create [CORE:V0][12.877801] Memory desc init by tag [memory] [CORE:I][12.877811] Memory created [memory] [API:I][12.877833] Memory create - strides [CORE:I][12.877822] Memory desc init by Stride [memory] [CORE:I][12.877827] Memory created [memory] [API:I][12.877849] Memory create [CORE:V0][12.877836] Memory desc init by tag [memory] [CORE:I][12.877843] Memory created [memory] [API:I][12.877867] Memory create [CORE:V0][12.877853] Memory desc init by tag [memory] [CORE:I][12.877860] Memory created [memory] [API:I][12.877883] Memory create - strides [CORE:I][12.877869] Memory desc init by Stride [memory] [CORE:I][12.877876] Memory created [memory] [API:I][12.877900] Memory create [CORE:V0][12.877885] Memory desc init by tag [memory] [CORE:I][12.877890] Memory created [memory] [API:I][12.877914] Memory create [CORE:V0][12.877901] Memory desc init by tag [memory] [CORE:I][12.877908] Memory created [memory] [API:I][12.877932] Memory create - strides [CORE:I][12.877918] Memory desc init by Stride [memory] [CORE:I][12.877924] Memory created [memory] [API:I][12.877947] Memory create [CORE:V0][12.877934] Memory desc init by tag [memory] [CORE:I][12.877940] Memory created [memory] [API:I][12.863204] CPU Engine create [CORE:V0][12.878028] CPU Engine created [engine] [CORE:I][12.878034] CPU 
Engine created [cpu/engine] [API:I][12.863222] CPU Stream create [CORE:I][12.877645] CPU Stream created [stream] [CORE:V0][12.877646] CPU Stream created [cpu/stream] [API:I][12.863247] matmul desc create - no bias [CORE:I][12.877919] matmul desc init [matmul] [API:I][12.863270] matmul primitive_desc create - attr [PROF:I][12.877730] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.004801,ms [API:I][12.863292] matmul primitive create [CORE:I][12.877924] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.877928] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.863203] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.864918] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.716ms graph_exe_count=-1 weight_address=0x70d89d1ef040 [PROF:I][12.879489] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.74252,ms [API:I][12.865054] matmul desc create - no bias [CORE:I][12.879725] matmul desc init [matmul] [API:I][12.865072] matmul primitive_desc create - attr [PROF:I][12.879526] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.0017,ms [API:I][12.865086] matmul primitive create [CORE:I][12.879715] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.879720] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.864992] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.865460] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.467ms graph_exe_count=-1 weight_address=0x5123f5c0 [PROF:I][12.880031] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.492256,ms [API:I][12.865597] matmul desc create - no bias [CORE:I][12.880267] matmul desc init [matmul] [API:I][12.865610] matmul primitive_desc create - attr [PROF:I][12.880065] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00243,ms [API:I][12.865624] matmul primitive create [CORE:I][12.880255] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.880260] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.865534] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.865847] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.313ms graph_exe_count=-1 weight_address=0x5223f600 [PROF:I][12.880418] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.338691,ms [PROF:V0][12.865981] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.77783,ms [CORE:I][12.880406] CPU Stream deleted [stream] [CORE:I][12.880811] CPU Engine deleted [engine] [API:I][12.881004] Memory create [CORE:V0][12.880994] Memory desc init 
by tag [memory] [CORE:I][12.881002] Memory created [memory] [API:I][12.881025] Memory create - strides [CORE:I][12.881012] Memory desc init by Stride [memory] [CORE:I][12.881018] Memory created [memory] [API:I][12.881040] Memory create [CORE:V0][12.881026] Memory desc init by tag [memory] [CORE:I][12.881033] Memory created [memory] [API:I][12.880966] matmul desc create - no bias [CORE:I][12.880965] matmul desc init [matmul] [API:I][12.880985] matmul primitive_desc create - attr [PROF:I][12.880768] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.002851,ms [API:I][12.881002] matmul primitive create [CORE:I][12.880961] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.880965] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.866241] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.867964] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.724ms graph_exe_count=-1 weight_address=0x70d8a11f0040 [PROF:I][12.882536] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.75047,ms [API:I][12.882914] Memory create [CORE:V0][12.882901] Memory desc init by tag [memory] [CORE:I][12.882908] Memory created [memory] [API:I][12.882931] Memory create - strides [CORE:I][12.882919] Memory desc init by Stride [memory] [CORE:I][12.882924] Memory created [memory] [API:I][12.882948] Memory create [CORE:V0][12.882934] Memory desc init by tag [memory] [CORE:I][12.882939] Memory created [memory] [API:I][12.882868] matmul desc create - no bias [CORE:I][12.882874] matmul desc init [matmul] 
[API:I][12.882891] matmul primitive_desc create - attr [PROF:I][12.882672] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00178,ms [API:I][12.882904] matmul primitive create [CORE:I][12.882860] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.882865] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.868139] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.874548] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.409ms graph_exe_count=-1 weight_address=0x70d8a51f1040 [PROF:I][12.889139] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.4544,ms [API:I][12.889569] Memory create [CORE:V0][12.889559] Memory desc init by tag [memory] [CORE:I][12.889572] Memory created [memory] [API:I][12.889594] Memory create - strides [CORE:I][12.889583] Memory desc init by Stride [memory] [CORE:I][12.889588] Memory created [memory] [API:I][12.889611] Memory create [CORE:V0][12.889597] Memory desc init by tag [memory] [CORE:I][12.889606] Memory created [memory] [API:I][12.889634] Memory create [CORE:V0][12.889620] Memory desc init by tag [memory] [CORE:I][12.889627] Memory created [memory] [API:I][12.889654] Memory create [CORE:V0][12.889641] Memory desc init by tag [memory] [CORE:I][12.889647] Memory created [memory] [API:I][12.889586] matmul desc create - no bias [CORE:I][12.889585] matmul desc init [matmul] [API:I][12.889610] matmul primitive_desc create - attr [PROF:I][12.889398] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.00403,ms [API:I][12.889631] matmul primitive create [CORE:I][12.889591] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.889594] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.874868] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.881496] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.627ms graph_exe_count=-1 weight_address=0x70d8b31f2040 [PROF:I][12.896080] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.66636,ms [API:I][12.896505] Memory create [CORE:V0][12.896495] Memory desc init by tag [memory] [CORE:I][12.896505] Memory created [memory] [API:I][12.896528] Memory create - strides [CORE:I][12.896515] Memory desc init by Stride [memory] [CORE:I][12.896521] Memory created [memory] [API:I][12.896544] Memory create [CORE:V0][12.896530] Memory desc init by tag [memory] [CORE:I][12.896538] Memory created [memory] [API:I][12.896472] matmul desc create - no bias [CORE:I][12.896470] matmul desc init [matmul] [API:I][12.896496] matmul primitive_desc create - attr [PROF:I][12.896282] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00326,ms [API:I][12.896516] matmul primitive create [CORE:I][12.896476] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.896479] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.881753] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.887979] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.226ms graph_exe_count=-1 weight_address=0x70d8c11f3040 [PROF:I][12.902564] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.26617,ms [API:I][12.903057] Memory create [CORE:V0][12.903047] Memory desc init by tag [memory] [CORE:I][12.903057] Memory created [memory] [API:I][12.903079] Memory create - strides [CORE:I][12.903069] Memory desc init by Stride [memory] [CORE:I][12.903077] Memory created [memory] [API:I][12.903100] Memory create [CORE:V0][12.903086] Memory desc init by tag [memory] [CORE:I][12.903096] Memory created [memory] [API:I][12.903120] Memory create [CORE:V0][12.903108] Memory desc init by tag [memory] [CORE:I][12.903114] Memory created [memory] [API:I][12.903138] Memory create - strides [CORE:I][12.903126] Memory desc init by Stride [memory] [CORE:I][12.903133] Memory created [memory] [API:I][12.903157] Memory create [CORE:V0][12.903144] Memory desc init by tag [memory] [CORE:I][12.903150] Memory created [memory] [API:I][12.903175] Memory create [CORE:V0][12.903162] Memory desc init by tag [memory] [CORE:I][12.903168] Memory created [memory] [API:I][12.903190] Memory create - strides [CORE:I][12.903178] Memory desc init by Stride [memory] [CORE:I][12.903184] Memory created [memory] [API:I][12.903206] Memory create [CORE:V0][12.903191] Memory desc init by tag [memory] [CORE:I][12.903197] Memory created [memory] [API:I][12.888455] CPU Engine create [CORE:V0][12.903278] CPU Engine created [engine] [CORE:I][12.903284] CPU Engine 
created [cpu/engine] [API:I][12.888471] CPU Stream create [CORE:I][12.902894] CPU Stream created [stream] [CORE:V0][12.902894] CPU Stream created [cpu/stream] [API:I][12.888495] matmul desc create - no bias [CORE:I][12.903167] matmul desc init [matmul] [API:I][12.888517] matmul primitive_desc create - attr [PROF:I][12.902976] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00428,ms [API:I][12.888536] matmul primitive create [CORE:I][12.903168] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.903172] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.888447] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.890257] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.81ms graph_exe_count=-1 weight_address=0x70d86b1ea040 [PROF:I][12.904828] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.83688,ms [API:I][12.890394] matmul desc create - no bias [CORE:I][12.905066] matmul desc init [matmul] [API:I][12.890412] matmul primitive_desc create - attr [PROF:I][12.904867] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.0024,ms [API:I][12.890427] matmul primitive create [CORE:I][12.905057] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.905062] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.890336] Using AOCL GEMM API: 
aocl_gemm_f32f32f32of32 [PROF:V0][12.890789] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.453ms graph_exe_count=-1 weight_address=0x532476c0 [PROF:I][12.905360] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.478966,ms [API:I][12.890923] matmul desc create - no bias [CORE:I][12.905594] matmul desc init [matmul] [API:I][12.890935] matmul primitive_desc create - attr [PROF:I][12.905388] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.00151,ms [API:I][12.890949] matmul primitive create [CORE:I][12.905579] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.905584] M: 1 N: 1024 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 1024 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.890857] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.891196] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=1024 lda=1 ldb=4096 ldc=1024 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=0.339ms graph_exe_count=-1 weight_address=0x54247700 [PROF:I][12.905767] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x1024:1x1024,0.364102,ms [PROF:V0][12.891330] zendnn_custom_op_execute,cpu,plugin_op:zentorch::zentorch_attn_qkv_fusion,num_ops:3,dims:,alg:mlp_parallel,2.875,ms [CORE:I][12.905755] CPU Stream deleted [stream] [CORE:I][12.906161] CPU Engine deleted [engine] [API:I][12.906340] Memory create [CORE:V0][12.906330] Memory desc init 
by tag [memory] [CORE:I][12.906337] Memory created [memory] [API:I][12.906360] Memory create - strides [CORE:I][12.906348] Memory desc init by Stride [memory] [CORE:I][12.906353] Memory created [memory] [API:I][12.906377] Memory create [CORE:V0][12.906363] Memory desc init by tag [memory] [CORE:I][12.906370] Memory created [memory] [API:I][12.906301] matmul desc create - no bias [CORE:I][12.906299] matmul desc init [matmul] [API:I][12.906316] matmul primitive_desc create - attr [PROF:I][12.906099] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,0.00261,ms [API:I][12.906333] matmul primitive create [CORE:I][12.906293] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.906298] M: 1 N: 4096 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.891571] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.893314] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=4096 lda=1 ldb=4096 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=1.742ms graph_exe_count=-1 weight_address=0x70d86f1eb040 [PROF:I][12.907885] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x4096:1x4096,1.76857,ms [API:I][12.908263] Memory create [CORE:V0][12.908250] Memory desc init by tag [memory] [CORE:I][12.908257] Memory created [memory] [API:I][12.908280] Memory create - strides [CORE:I][12.908267] Memory desc init by Stride [memory] [CORE:I][12.908272] Memory created [memory] [API:I][12.908294] Memory create [CORE:V0][12.908281] Memory desc init by tag [memory] [CORE:I][12.908288] Memory created [memory] [API:I][12.908219] matmul desc create - no bias [CORE:I][12.908218] matmul desc init [matmul] 
[API:I][12.908238] matmul primitive_desc create - attr [PROF:I][12.908018] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,0.00174,ms [API:I][12.908250] matmul primitive create [CORE:I][12.908207] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.908212] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.893486] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.899960] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.474ms graph_exe_count=-1 weight_address=0x70d8731ec040 [PROF:I][12.914549] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x14336:1x14336,6.51741,ms [API:I][12.914984] Memory create [CORE:V0][12.914973] Memory desc init by tag [memory] [CORE:I][12.914985] Memory created [memory] [API:I][12.915009] Memory create - strides [CORE:I][12.914997] Memory desc init by Stride [memory] [CORE:I][12.915003] Memory created [memory] [API:I][12.915026] Memory create [CORE:V0][12.915012] Memory desc init by tag [memory] [CORE:I][12.915020] Memory created [memory] [API:I][12.915046] Memory create [CORE:V0][12.915031] Memory desc init by tag [memory] [CORE:I][12.915037] Memory created [memory] [API:I][12.915066] Memory create [CORE:V0][12.915053] Memory desc init by tag [memory] [CORE:I][12.915060] Memory created [memory] [API:I][12.914997] matmul desc create - no bias [CORE:I][12.914997] matmul desc init [matmul] [API:I][12.915020] matmul primitive_desc create - attr [PROF:I][12.914806] 
zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,0.0036,ms [API:I][12.915040] matmul primitive create [CORE:I][12.914999] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.915003] M: 1 N: 14336 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 14336 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.900281] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.906680] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=14336 lda=1 ldb=4096 ldc=14336 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.401ms graph_exe_count=-1 weight_address=0x70d8811ed040 [PROF:I][12.921269] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm_silu_mul,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_swish:1+binary_mul:f32:2 ,,1x4096:4096x14336:1x14336,6.44705,ms [API:I][12.921704] Memory create [CORE:V0][12.921693] Memory desc init by tag [memory] [CORE:I][12.921705] Memory created [memory] [API:I][12.921729] Memory create - strides [CORE:I][12.921718] Memory desc init by Stride [memory] [CORE:I][12.921723] Memory created [memory] [API:I][12.921747] Memory create [CORE:V0][12.921734] Memory desc init by tag [memory] [CORE:I][12.921742] Memory created [memory] [API:I][12.921679] matmul desc create - no bias [CORE:I][12.921677] matmul desc init [matmul] [API:I][12.921703] matmul primitive_desc create - attr [PROF:I][12.921490] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,0.00472,ms [API:I][12.921724] matmul primitive create [CORE:I][12.921685] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.921689] M: 1 N: 
4096 K: 14336 transA: T transB: T lda: 1 ldb: 14336 ldc: 4096 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.906962] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.913164] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=14336 n=4096 lda=1 ldb=14336 ldc=4096 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=6.203ms graph_exe_count=-1 weight_address=0x70d88f1ee040 [PROF:I][12.927752] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x14336:14336x4096:1x4096,6.24444,ms [API:I][12.928193] Memory create [CORE:V0][12.928182] Memory desc init by tag [memory] [CORE:I][12.928193] Memory created [memory] [API:I][12.928215] Memory create - strides [CORE:I][12.928202] Memory desc init by Stride [memory] [CORE:I][12.928208] Memory created [memory] [API:I][12.928231] Memory create [CORE:V0][12.928219] Memory desc init by tag [memory] [CORE:I][12.928229] Memory created [memory] [API:I][12.928165] matmul desc create - no bias [CORE:I][12.928165] matmul desc init [matmul] [API:I][12.928191] matmul primitive_desc create - attr [PROF:I][12.927979] zendnn_primitive_create,cache_hit,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x32000:1x32000,0.0046,ms [API:I][12.928212] matmul primitive create [CORE:I][12.928173] zendnn_f32_matmul_t::execute_ref [CORE:V0][12.928178] M: 1 N: 32000 K: 4096 transA: T transB: T lda: 1 ldb: 4096 ldc: 32000 alpha: 1 beta: 0 batch: 1 Layout: CblasRowMajor(1) [PROF:V0][12.913454] Using AOCL GEMM API: aocl_gemm_f32f32f32of32 [PROF:V0][12.928772] zenMatMul_gemm auto_tuner=False Layout=CblasRowMajor, transa=CblasTrans, transb=CblasTrans, m=1 k=4096 n=32000 lda=1 ldb=4096 ldc=32000 alpha=1 beta=0 relu=0 gelu=0 algo_type=3 weight_caching=True Time=15.318ms graph_exe_count=-1 
weight_address=0x70dc7cbc9040
[PROF:I][12.943364] zendnn_primitive_execute,cpu,plugin_op:zentorch::zentorch_mm,matmul,zendnn,undef,src_f32::blocked:ab:f0 wei_f32::blocked:ba:f0 dst_f32::blocked:ab:f0,,,1x4096:4096x32000:1x32000,15.3677,ms
Traceback (most recent call last):
  File "/media/nvme2/docker/hf-transformers/test_llm_int8_v3.py", line 192, in <module>
    main()
  File "/media/nvme2/docker/hf-transformers/test_llm_int8_v3.py", line 153, in main
    output = q_model.generate(
             ^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/transformers/generation/utils.py", line 2633, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/transformers/generation/utils.py", line 3617, in _sample
    outputs = model_forward(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
    raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 760, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 745, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1295, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1197, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2083, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2130, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 2747, in load_by_key_path
    mod = _reload_python_module(key, path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py", line 36, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_linux_admin/sg/csgnp63kydnqrqjq7pxyvamsbo5g7fnhax3ntpewon4c7rf5vhts.py", line 12491, in <module>
    async_compile.wait(globals())
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 424, in wait
    self._wait_futures(scope)
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 445, in _wait_futures
    scope[key] = result.result()
                 ^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 3224, in result
    return self.result_fn()
           ^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 2242, in future
    result = get_result()
             ^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 2050, in load_fn
    future.result()
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 2079, in _worker_compile_cpp
    cpp_builder.build()
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/cpp_builder.py", line 1601, in build
    run_compile_cmd(build_cmd, cwd=_build_tmp_dir)
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/cpp_builder.py", line 355, in run_compile_cmd
    _run_compile_cmd(cmd_line, cwd)
  File "/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/_inductor/cpp_builder.py", line 350, in _run_compile_cmd
    raise exc.CppCompileError(cmd, output) from e
torch._inductor.exc.InductorError: CppCompileError: C++ compile error

Command:
/home/linux_admin/ls/envs/quark-zendnn-cpu/bin/x86_64-conda-linux-gnu-c++ /tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp -D TORCH_INDUCTOR_CPP_WRAPPER -D STANDALONE_TORCH_HEADER -D C10_USING_CUSTOM_GENERATED_MACROS -D CPU_CAPABILITY_AVX2 -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -fopenmp -I/home/linux_admin/ls/envs/quark-zendnn-cpu/include/python3.12 -I/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include -I/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -mavx2 -mfma -mf16c -D_GLIBCXX_USE_CXX11_ABI=1 -ltorch -ltorch_cpu -ltorch_python -lgomp -L/home/linux_admin/ls/envs/quark-zendnn-cpu/lib -L/home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/lib -o /tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.so

Output:
In file included from /home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include/ATen/NumericUtils.h:7,
                 from /tmp/torchinductor_linux_admin/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h:19,
                 from /tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp:2:
/tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp: In function 'void kernel(float*, const float*, const float*, const float*, const float*, const float*, const float*, const float*, const float*, const float*, const int64_t*, const int64_t*, float*, float*, float*, float*, float*, float*, float*, float*, float*, float*, float*, float*, int64_t, int64_t, int64_t)':
/tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp:234:62: error: 'tmp3' was not declared in this scope; did you mean 'tmp0'?
234 | TORCH_CHECK((at::vec::VecMask(tmp3 < at::vec::VectorizedN(ks2))).all_masked(), "index out of bounds: tmp3 < ks2"); | ^~~~ /home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include/c10/macros/Macros.h:193:64: note: in definition of macro 'C10_UNLIKELY' 193 | #define C10_UNLIKELY(expr) (__builtin_expect(static_cast(expr), 0)) | ^~~~ /home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include/c10/util/Exception.h:554:7: note: in expansion of macro 'C10_UNLIKELY_OR_CONST' 554 | if (C10_UNLIKELY_OR_CONST(!(cond))) { \ | ^~~~~~~~~~~~~~~~~~~~~ /tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp:234:21: note: in expansion of macro 'TORCH_CHECK' 234 | TORCH_CHECK((at::vec::VecMask(tmp3 < at::vec::VectorizedN(ks2))).all_masked(), "index out of bounds: tmp3 < ks2"); | ^~~~~~~~~~~ /tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp:263:37: error: 'tmp2' was not declared in this scope; did you mean 'tmp0'? 263 | TORCH_CHECK(tmp2 < ks2, "index out of bounds: tmp2 < ks2"); | ^~~~ /home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include/c10/macros/Macros.h:193:64: note: in definition of macro 'C10_UNLIKELY' 193 | #define C10_UNLIKELY(expr) (__builtin_expect(static_cast(expr), 0)) | ^~~~ /home/linux_admin/ls/envs/quark-zendnn-cpu/lib/python3.12/site-packages/torch/include/c10/util/Exception.h:554:7: note: in expansion of macro 'C10_UNLIKELY_OR_CONST' 554 | if (C10_UNLIKELY_OR_CONST(!(cond))) { \ | ^~~~~~~~~~~~~~~~~~~~~ /tmp/torchinductor_linux_admin/js/cjs5mavnshmwwcs4h6gs2p5j4npz63tfg6izrsbu54iert6qcpv4.cpp:263:25: note: in expansion of macro 'TORCH_CHECK' 263 | TORCH_CHECK(tmp2 < ks2, "index out of bounds: tmp2 < ks2"); | ^~~~~~~~~~~ Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). 
For even more developer context, set TORCH_LOGS="+dynamo" [CORE:I][61.525394] CPU Stream deleted [stream] [CORE:I][61.526104] CPU Engine deleted [engine] (quark-zendnn-cpu) linux_admin@ai-server:/media/nvme2/docker/hf-transformers$