Translate a book written in LaTeX from Slovenian into English
With the permission of the author, we will demonstrate how to translate the book Euclidean Plane Geometry, written by Milan Mitrović, from Slovenian into English without modifying any of the LaTeX commands.
To achieve this, we will first split the book into chunks, each roughly a page long, then translate each chunk into English, and finally stitch them back together.
1. Read in the data
import openai
from transformers import GPT2Tokenizer
# OpenAI GPT-2 tokenizer is the same as GPT-3 tokenizer
# we use it to count the number of tokens in the text
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
with open("data/geometry_slovenian.tex", "r") as f:
    text = f.read()
1485565
1.1 Count the tokens in each chunk
chunks = text.split('\n\n')
ntokens = []
for chunk in chunks:
    ntokens.append(len(tokenizer.encode(chunk)))
max(ntokens)
Token indices sequence length is longer than the specified maximum sequence length for this model (1327 > 1024). Running this sequence through the model will result in indexing errors
1473
It turns out that a double newline is a good separator in this case, because splitting on it does not break the flow of the text. Also, no individual chunk is larger than 1500 tokens. The model we will use is text-davinci-002, which has a limit of 4096 tokens shared between the prompt and the completion, so we don't need to worry about breaking the chunks down further.
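The arithmetic behind that claim can be checked with a short sketch. It assumes the 1500-token completion budget that the translation call uses later in this walkthrough; the `PROMPT_OVERHEAD` figure is a rough allowance for the instruction and the sample command, not a measured value, and `fits_in_context` is a hypothetical helper name.

```python
# Illustrative budget check. PROMPT_OVERHEAD is an assumed allowance,
# not a measured token count.
CONTEXT_LIMIT = 4096    # limit quoted above for text-davinci-002
MAX_COMPLETION = 1500   # max_tokens reserved for the translated output
PROMPT_OVERHEAD = 100   # assumption: instruction + sample commands


def fits_in_context(chunk_tokens, limit=CONTEXT_LIMIT,
                    completion=MAX_COMPLETION, overhead=PROMPT_OVERHEAD):
    """Return True if prompt and completion together stay within the limit."""
    return chunk_tokens + overhead + completion <= limit


largest_chunk = 1473  # the value reported by max(ntokens)
print(fits_in_context(largest_chunk))
# Largest chunk that would still fit under these assumptions:
print(CONTEXT_LIMIT - MAX_COMPLETION - PROMPT_OVERHEAD)
```

Under these assumptions the largest chunk leaves a comfortable margin, which is why no further splitting is needed.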
We will group the shorter chunks into chunks of around 1000 tokens, to increase the coherence of the text, and decrease the frequency of breaks within the text.
def group_chunks(chunks, ntokens, max_len=1000):
    """
    Group very short chunks, to form approximately page-long chunks.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0

    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        # if adding this chunk would exceed the max length, finalize the current batch and start a new one
        # (+2 accounts for the newlines between chunks)
        if cur_batch and cur_tokens + ntoken + 2 > max_len:
            batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken  # reset the running count for the new batch
        else:
            cur_batch = cur_batch + "\n\n" + chunk if cur_batch else chunk
            cur_tokens += ntoken + 2
    batches.append(cur_batch)
    return batches
chunks = group_chunks(chunks, ntokens)
len(chunks)
869
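To make the grouping behaviour concrete, here is a toy sketch of greedy packing on four made-up paragraphs. It uses word counts as a stand-in for token counts, and `group_by_budget` is a hypothetical helper written for this illustration, not the function used on the book.

```python
def group_by_budget(pieces, sizes, max_len):
    """Greedily pack consecutive pieces while the running size stays within max_len."""
    batches, current, current_size = [], [], 0
    for piece, size in zip(pieces, sizes):
        # start a new batch when this piece would overflow a non-empty one
        if current and current_size + size > max_len:
            batches.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(piece)
        current_size += size
    if current:
        batches.append("\n\n".join(current))
    return batches


paragraphs = ["a b c", "d e", "f g h i", "j"]
sizes = [len(p.split()) for p in paragraphs]  # stand-in sizes: 3, 2, 4, 1
batches = group_by_budget(paragraphs, sizes, max_len=5)
print(batches)

# Joining the batches again reproduces the original text, so nothing is lost.
print("\n\n".join(batches) == "\n\n".join(paragraphs))
```

The round-trip check at the end is the property that matters for stitching the translated chunks back together.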
Notice that adding a sample untranslated and translated first command, where only the content of the chapter name needs to be translated, helps to get more consistent results.
The format of the prompt sent to the model consists of:
1. A high-level instruction to translate only the text, but not the commands, into the desired language
2. A sample untranslated command, where only the content of the chapter name needs to be translated
3. The chunk of text to be translated
4. The translated sample command from 2, which shows the model the beginning of the translation process
The expected output is the translated chunk of text.
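The layout described above can be previewed without calling the API. The following is a minimal sketch: `build_prompt` is a hypothetical helper, and the one-sentence Slovenian chunk is a made-up placeholder rather than text from the book.

```python
def build_prompt(chunk, dest_language, sample_source, sample_target):
    """Assemble instruction, quoted source (sample + chunk), and translated sample."""
    return (
        f"Translate only the text from the following LaTeX document into {dest_language}. "
        "Leave all LaTeX commands unchanged\n"
        '"""\n'
        f"{sample_source}\n"
        f'{chunk}"""\n'
        f"{sample_target}\n"
    )


# Made-up placeholder chunk, for illustration only.
toy_chunk = r"Naj bo $ABC$ trikotnik."
prompt = build_prompt(
    toy_chunk,
    "English",
    r"\poglavje{Osnove Geometrije} \label{osn9Geom}",
    r"\poglavje{The basics of Geometry} \label{osn9Geom}",
)
print(prompt)
```

Because the prompt ends with the already translated sample command, the model's completion is expected to continue with the translation of the chunk itself.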
def translate_chunk(chunk, engine='text-davinci-002',
                    dest_language='English',
                    # raw strings keep the LaTeX backslashes from being read as escape sequences
                    sample_translation=(r"\poglavje{Osnove Geometrije} \label{osn9Geom}", r"\poglavje{The basics of Geometry} \label{osn9Geom}")
                    ):
    prompt = f'''Translate only the text from the following LaTeX document into {dest_language}. Leave all LaTeX commands unchanged
"""
{sample_translation[0]}
{chunk}"""
{sample_translation[1]}
'''
    response = openai.Completion.create(
        prompt=prompt,
        engine=engine,
        temperature=0,
        top_p=1,
        max_tokens=1500,
    )
    result = response['choices'][0]['text'].strip()
    result = result.replace('"""', '')  # remove the triple quotes, as we used them to surround the text
    return result
print(translate_chunk(chunks[800], engine='text-davinci-002', dest_language='English'))
Let $\mathcal{I}=\mathcal{S}_{AB} \circ\mathcal{S}_{CA} \circ\mathcal{S}_{BC}$. By \ref{izoZrcdrsprq} is $\mathcal{I}$ a mirror reflection. Let $A_1$, $B_1$ and $C_1$ be in order the center points of the lines $BC$, $AC$ and $AB$ of the triangle $ABC$. Because it is a right triangle is $\mathcal{I}(A_1C_1)=A_1C_1$, which means that the line $A_1C_1$ is of this mirror reflection. It is not difficult to prove that for the point $A'_1=\mathcal{I}(A_1)$ (both lie on the axis $A_1C_1$) is $\overrightarrow{A_1A'_1}=3\overrightarrow{A_1C_1}$, so $\mathcal{I}=\mathcal{G}_{3\overrightarrow{A_1C_1}}$. \item \res{Given are the points $A$ and $B$ on the same side of the line $p$. Draw the line $XY$, which lies on the line $p$ and is consistent with the given line $l$, so that the sum $|AX|+|XY|+|YB|$ is minimal.} Let $A'=\mathcal{G}_{\overrightarrow{MN}}(A)$ (where $M,N\in p$ and $MN\cong l$). The point $Y$ is obtained as the intersection of the lines $p$ and $X'Y$ (see also example \ref{HeronProbl}). \item \res{Let $ABC$ be an isosceles right triangle with a right angle at the vertex $A$. What does the composite $\mathcal{G}_{\overrightarrow{AB}}\circ \mathcal{G}_{\overrightarrow{CA}}$ represent?} Let $p$ and $q$ be the simetrali of the sides $CA$ and $AB$ of the triangle $ABC$. By \ref{izoZrcDrsKompSrOsn} is: $$\mathcal{G}_{\overrightarrow{AB}}\circ \mathcal{G}_{\overrightarrow{CA}}= \mathcal{S}_q\circ\mathcal{S}_A\circ\mathcal{S}_A\circ\mathcal{S}_p= \mathcal{S}_q\circ\mathcal{S}_p.$$ Because $ABC$ is an isosceles right triangle with a right angle at the vertex $A$, the lines $p$ and $q$ are perpendicular and intersect at the center $S$ of the hypotenuse $BC$. Therefore $\mathcal{G}_{\overrightarrow{AB}}\circ \mathcal{G}_{\overrightarrow{CA}}=\mathcal{S}_q \circ\mathcal{S}_p=\mathcal{S}_S$. \item \res{In the same plane are given the lines $a$, $b$ and $c$. Draw the points $A\in a$ and $B\in b$ so that $\mathcal{S}_c(A)=B$.}
We can see that, for this chunk at least, the model translated only the text and left the LaTeX commands intact.
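Inspecting one chunk by eye does not scale to the whole book, so a rough automated check may help. The sketch below compares how often each LaTeX control word (such as `\ref` or `\mathcal`) appears in a source chunk and in its translation. `command_diff` is a hypothetical helper, the regular expression only covers alphabetic command names, and the sample strings are made up for illustration.

```python
import re
from collections import Counter

# Matches control words such as \ref or \mathcal (alphabetic names only).
COMMAND_RE = re.compile(r"\\[A-Za-z]+")


def command_diff(source, translation):
    """Return commands whose counts differ, as {command: (in_source, in_translation)}."""
    src = Counter(COMMAND_RE.findall(source))
    dst = Counter(COMMAND_RE.findall(translation))
    return {
        cmd: (src[cmd], dst[cmd])
        for cmd in src.keys() | dst.keys()
        if src[cmd] != dst[cmd]
    }


# Made-up sample strings, for illustration only.
source = r"\item \res{Naj bo $\mathcal{S}_p$ zrcaljenje.} Glej \ref{HeronProbl}."
good = r"\item \res{Let $\mathcal{S}_p$ be a reflection.} See \ref{HeronProbl}."
bad = r"\item Let $\mathcal{S}_p$ be a reflection. See \ref{HeronProbl}."

print(command_diff(source, good))  # empty: every command survived
print(command_diff(source, bad))   # reports that \res was dropped
```

A non-empty result does not prove the translation is wrong, but it flags chunks worth reviewing by hand.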
Let's now translate all the chunks in the book - this will take 2-3 hours, as we're processing requests sequentially.
dest_language = "English"
translated_chunks = []
for i, chunk in enumerate(chunks):
    print(str(i+1) + " / " + str(len(chunks)))
    # translate each chunk
    translated_chunks.append(translate_chunk(chunk, engine='text-davinci-002', dest_language=dest_language))

# join the chunks together
result = '\n\n'.join(translated_chunks)

# save the final result
with open(f"data/geometry_{dest_language}.tex", "w") as f:
    f.write(result)
0 / 869
1 / 869
2 / 869
...
868 / 869
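Since the loop runs for hours and processes requests one after another, a single failed request would abort the whole run. One possible safeguard, sketched here under assumptions, is to wrap each call in a retry with exponential backoff. `with_retries` is a hypothetical helper, and `flaky_translate` is a stand-in that simulates a failing call rather than the real API.

```python
import time


def with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_translate(chunk):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return chunk.upper()


delays = []  # record the delays instead of actually sleeping
result = with_retries(flaky_translate, "besedilo", sleep=delays.append)
print(result, delays)
```

In the translation loop, the call could then be written as `with_retries(translate_chunk, chunk, engine='text-davinci-002', dest_language=dest_language)`. Catching every exception is a simplification; a real run would likely retry only on transient errors such as rate limits or timeouts.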