 Efficiency - Maple Help

Efficiency Improvements in Maple 13

Maple 13 includes numerous efficiency improvements.  Some highlights where dramatic speed-ups were achieved are described below. Linear systems over Q and over cyclotomic fields

 • A new dense p-adic lift algorithm has improved the efficiency of LinearAlgebra[LinearSolve] and LinearAlgebra[Modular][IntegerLinearSolve]. See Enhancements to LinearAlgebra for details.
 • Maple 13 includes a fast modular algorithm for solving linear systems over cyclotomic fields. For example, if a linear system involves $\mathrm{RootOf}\left(f,x\right)$ where $f\left(x\right)$ is a cyclotomic polynomial of degree > 1, then the fast code will be used. See also numtheory[cyclotomic].
 As a comparison, consider the following example:
 > r := RootOf(x^4+x^3+x^2+x+1);
 ${r}{≔}{\mathrm{RootOf}}{}\left({{\mathrm{_Z}}}^{{4}}{+}{{\mathrm{_Z}}}^{{3}}{+}{{\mathrm{_Z}}}^{{2}}{+}{\mathrm{_Z}}{+}{1}\right)$ (1)
 > cf := ()->randpoly(r, degree = 3):
 > nv := 50:
 > vars := seq(x[i],i=1..nv):
 > sys := {seq(randpoly([vars],degree=1,coeffs=cf),i=1..nv)}:
 > tt := time():
 > solve(sys,{vars}):
 > time()-tt;
 ${0.212}$ (2)
 In Maple 12, this same problem took almost 7 times as long. More LinearAlgebra efficiency improvements

Copying of a Matrix to a result with a shape (indexing function) is faster. The following Cholesky decomposition operation is approximately twenty times as fast as in Maple 12.

 > with(LinearAlgebra):
 > N := 1000: M := RandomMatrix(N,N,generator=-0.5..0.5, outputoptions=[datatype=float]):
 > L := Matrix(M,shape=triangular[lower]):
 > M := Matrix(L.Transpose(L)):
 > for i from 1 to N do M[i,i]:=i; end do:
 > for i from 1 to N do M[i,i] := i; end do:
 > st := time():
 > LUDecomposition(M,method=Cholesky):
 > time() - st;
 ${0.02}$ (3)

The following LU decomposition operation is approximately ten times as fast and requires only 80% of the memory allocation as compared to Maple 12.

 > with(LinearAlgebra):
 > N := 1000: M := RandomMatrix(N,N,generator=-0.5..0.5, outputoptions=[datatype=float]):
 > st := time():
 > LUDecomposition(M):
 > time()-st;
 ${0.03}$ (4)

The new thin option to SingularValues gives a result with only $\mathrm{min}\left(m,n\right)$ right and left singular vectors. This produces a marked speedup for LeastSquares solving when the number of rows is much larger than the number of columns. The following singular values decomposition operation is approximately thirty times as fast and requires only 4% of the memory allocation as compared to the non-thin computation available in Maple 12.

 > with(LinearAlgebra):
 > m := 10000: n := 100: M := RandomMatrix(m,n,generator=-1.0..1.0, outputoptions=[datatype=float]):
 > st := time():
 > SingularValues(M,output=[U,S,Vt],thin=true):
 > time()-st;
 ${1.78}$ (5) Modular multivariate polynomial multiplication and division

New heap-based modular multivariate polynomial multiplication and division algorithms have been added to Maple which are faster than the prior multiplication and division implementations.

This expansion has sped up by approximately 50%:

 > p1 := randpoly([x,y,z],degree=20,terms=50):
 > tt := time():
 > p2 := modp(Expand(p1 ^ 3),13):
 > time()-tt;
 ${0.020}$ (6)

This division has sped up by approximately 60%:

 > tt := time():
 > modp(Divide(p2,p1),13);
 ${\mathrm{true}}$ (7)
 > time()-tt;
 ${0.019}$ (8) New sparse modular algorithm for Greatest Common Divisors over algebraic number and function fields

A new algorithm has been added for evaluating Greatest Common Divisors over algebraic number and function fields, which is particularly effective for sparse polynomials. For example, this Gcd computation is about 95% faster:

 > alias(p=RootOf(z^3+t*z^2+s*t,z)):
 > alias(q=RootOf(z^2+2)):
 > g := (y+p*t+q)*(t*x+s*y*p+q):
 > a := t*x+s*y+p*s*t+p^2+q:
 > b := x+s*x*y-t*p^2+p+t+q*s:
 > A := evala(Expand(g*a)):
 > B := evala(Expand(g*b)):
 > tt := time():
 > evala(Gcd(A, B));
 > time()-tt;
 ${0.5}$ (9) Efficiency improvements to lcoeff and tcoeff

The efficiency of lcoeff and tcoeff has been improved. The following computation of the leading and trailing coefficients took more than twice as long and used considerably more memory in Maple 12:

 > vars := [seq(x[i],i=0..19)];
 ${\mathrm{vars}}{≔}\left[{{x}}_{{0}}{,}{{x}}_{{1}}{,}{{x}}_{{2}}{,}{{x}}_{{3}}{,}{{x}}_{{4}}{,}{{x}}_{{5}}{,}{{x}}_{{6}}{,}{{x}}_{{7}}{,}{{x}}_{{8}}{,}{{x}}_{{9}}{,}{{x}}_{{10}}{,}{{x}}_{{11}}{,}{{x}}_{{12}}{,}{{x}}_{{13}}{,}{{x}}_{{14}}{,}{{x}}_{{15}}{,}{{x}}_{{16}}{,}{{x}}_{{17}}{,}{{x}}_{{18}}{,}{{x}}_{{19}}\right]$ (10)
 ${f}{≔}{{x}}_{{0}}{+}{2}{}{{x}}_{{1}}{+}{3}{}{{x}}_{{2}}{+}{4}{}{{x}}_{{3}}{+}{5}{}{{x}}_{{4}}{+}{6}{}{{x}}_{{5}}{+}{7}{}{{x}}_{{6}}{+}{8}{}{{x}}_{{7}}{+}{9}{}{{x}}_{{8}}{+}{10}{}{{x}}_{{9}}{+}{11}{}{{x}}_{{10}}{+}{12}{}{{x}}_{{11}}{+}{13}{}{{x}}_{{12}}{+}{14}{}{{x}}_{{13}}{+}{15}{}{{x}}_{{14}}{+}{16}{}{{x}}_{{15}}{+}{17}{}{{x}}_{{16}}{+}{18}{}{{x}}_{{17}}{+}{19}{}{{x}}_{{18}}{+}{20}{}{{x}}_{{19}}$ (11)
 > g := expand(f^6):
 > (nops,length)(g);
 ${177100}{,}{10915266}$ (12)
 > t := time(): (lcoeff,tcoeff)(g, vars); time() - t;
 ${1}{,}{64000000}$
 ${0.008}$ (13)

Other improvements to lcoeff and tcoeff are described in Enhancements to Symbolic Capabilities in Maple 13. heap[extract] is more efficient

The heap[extract] function now performs fewer comparisons than in prior Maple versions. The number of comparisons performed in extracting all elements from an $n$ element heap for Maple 13 versus Maple 12 is provided in the following table:

 Number of Elements Maple 12 Maple 13 10 27 26 20 89 68 30 158 116 40 239 170 50 338 242 60 416 295 70 517 345 80 642 416 90 737 483 100 841 554 Many parts of the Maple evaluation engine have been updated to improve the performance when executing in parallel. Although there is still more work to be done, the performance in parallel is significantly faster.

 > p := proc(A::Vector,B::Vector,C::Vector,m::integer,n::integer)     option hfloat;     local i;     for i from m to n     do         C[i] := A[i]^2+B[i]^2;     end do; end proc: n := 10^7: m := n/2: A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=hfloat] ): B := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=hfloat] ): C := Vector( 1..n, datatype=hfloat ):
 > tt := time[real](): p(A,B,C,1,m): p(A,B,C,m+1,n): time[real]() - tt;
 ${33.089}$ (14)
 > tt := time[real](): id1 := Threads:-Create( p(A,B,C,1,m) ): p(A,B,C,m+1,n): Threads:-Wait( id1 ): time[real]() - tt;
 ${22.873}$ (15)

In Maple 12, the parallel case takes 35.568 seconds, slightly longer than the single threaded case. Element-wise operations

The elementwise operator syntax can be more efficient than zip or map in some cases. Specific operations have been optimized compared to the general code used by zip or map. In other cases, depending on the input, built-in hardware floating-point routines can be used directly.

 > A := LinearAlgebra:-RandomMatrix(1000):
 > time(zip(*,A,2));
 ${0.608}$ (16)
 > time( A *~ 2 );
 ${0.096}$ (17)
 > B := LinearAlgebra:-RandomMatrix(1000,outputoptions=[datatype=float]):
 > time(zip(*,B,2));
 ${1.684}$ (18)
 > time( B *~ 2 );
 ${0.017}$ (19)
 > time(map(sin,B));
 ${5.456}$ (20)
 > time( sin~(B));
 ${0.116}$ (21) Improvements to GetSubImage in the ImageTools package

The GetSubImage routine of the ImageTools package is faster for the in-place case where the output is specified. The following repeated GetSubImage operation is approximately twice as fast as in Maple 12.

 > N := 500:
 > n := 100:
 > image := Array(1..N,1..N,datatype=float,order=C_order):
 > container := Array(1..n,1..n,datatype=float,order=C_order):
 > st := time():
 > for i from 1 to 1000 do   ImageTools:-GetSubImage(image,10,10,n,n,output=container); end do:
 > time() - st;
 ${0.220}$ (22) Decreased garbage collection frequency

In previous versions of Maple, a garbage collection cycle was initiated after allocating approximately 10^6 words of memory. Given the increase in the amount of memory available on current computers the frequency of garbage collection can be reduced without placing a significant burden on the memory subsystem. In Maple 13, the garbage collector is activated after allocating a minimum of 10^7 words. On average, this yields an improvement of 19% on 32-bit processors and about 10% for 64-bit processors with a modest increase in memory usage. The frequency of garbage collection can be tuned to achieve the best performance for your application by adjusting the gcfreq variable using the kernelopts mechanism. For example,

 > with(LinearAlgebra):
 > gcTest := proc()           local A, n, resultList;           for n to 30 do               A := RandomMatrix(n,n);               resultList:= LUDecomposition(A,output=['P','L','U1','R']);           end do; end proc:
 > kernelopts(gcfreq=10^6);
 $\left[{10000000}{,}{0.1}\right]$ (23)
 > gc(): start := time(): gcTest(): t1 := time() - start;
 ${\mathrm{t1}}{≔}{0.924}$ (24)
 > kernelopts(gcfreq=10^7);
 ${1000000}$ (25)
 > gc(): start := time(): gcTest(): t2:= time() - start;
 ${\mathrm{t2}}{≔}{0.860}$ (26)