China Telecom has made a significant stride in the field of semantic large models by open sourcing TeleChat-12B, the 12-billion-parameter model in its Star semantic large-model series. The move is part of a broader initiative, with plans to open source a model with hundreds of billions of parameters within the year.
Compared with the 7B version open sourced in January, TeleChat-12B delivers an overall improvement of roughly 30% across content quality, performance, and applications. Notably, multi-round reasoning and security-related capabilities have improved by more than 40%.
TeleChat-12B doubles the training corpus, from the 7B version's 1.5T of training data to 3T. This expansion, coupled with optimized data-cleaning and annotation strategies, has greatly improved data quality. The team also continues to build task-specific SFT (supervised fine-tuning) data and to refine its data-construction specifications.
TeleChat-12B also makes significant changes to the model structure. Small-scale models were first trained with various combinations of architectural choices to identify the optimal structure. The resulting model decouples the word embedding layer from the output layer, which enhances training stability and convergence.
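The article does not include code, but decoupling the embedding and output layers is commonly implemented by keeping two independent weight matrices rather than tying them. A minimal PyTorch sketch for illustration (the class name and the omitted transformer blocks are assumptions, not TeleChat's actual implementation):

```python
import torch
import torch.nn as nn

class DecoupledLMHead(nn.Module):
    """Minimal sketch: input embedding and output projection kept as
    separate, independently trained weight matrices (untied)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)  # input side
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # output side
        # A weight-tied design would instead share one matrix:
        #   self.lm_head.weight = self.embed.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(input_ids)   # (batch, seq, hidden)
        # ... transformer blocks would go here ...
        return self.lm_head(hidden)      # (batch, seq, vocab)
```

With tied weights, every gradient step on the output projection also moves the input embedding; untying the two gives each layer its own learning dynamics, which is the stability benefit the article alludes to.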
The training data for TeleChat-12B covers a wide range of Chinese and English sources, including books, encyclopedias, news, government affairs, law, medicine, patents, academic papers, mathematics, and code. An optimized data-cleaning strategy has greatly improved the corpus's text cleanliness, unbiasedness, content validity, and formatting consistency.
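To make "data cleaning" concrete, a cleaning pass of this kind typically applies rule-based quality filters to each document. The heuristics and thresholds below are illustrative assumptions, not China Telecom's published pipeline:

```python
import re

def keep_document(text: str) -> bool:
    """Toy quality filter: assumed heuristics, not TeleChat's actual rules."""
    if len(text) < 200:                      # drop short fragments
        return False
    alnum = sum(ch.isalnum() for ch in text)
    if alnum / len(text) < 0.6:              # drop encoding debris / symbol soup
        return False
    lines = text.splitlines()
    if lines and len(set(lines)) / len(lines) < 0.5:  # drop repeated boilerplate
        return False
    if re.search(r"(.)\1{20,}", text):       # drop long runs of one character
        return False
    return True
```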
China Telecom trains with principled data-mixing and curriculum-learning methods. Small-parameter models are first fit on data under various mixing ratios to obtain prior estimates of each dataset's difficulty. During training, the model automatically evaluates the loss on all datasets and the generation quality on an evaluation set, dynamically increasing the sampling weight of datasets that are harder to learn.
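The article does not specify the exact reweighting rule, so the following is a minimal sketch under one common assumption: sampling weights are set by a softmax over each dataset's current evaluation loss, so harder sources are drawn from more often.

```python
import math
import random

def reweight(losses: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Assumed scheme: softmax over per-dataset losses, so higher-loss
    (harder) datasets receive larger sampling weights."""
    exps = {name: math.exp(loss / temperature) for name, loss in losses.items()}
    total = sum(exps.values())
    return {name: v / total for name, v in exps.items()}

def sample_dataset(weights: dict[str, float]) -> str:
    """Pick the dataset to draw the next batch from, proportional to weight."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

# Hypothetical losses: code is currently hardest, so it is sampled most often.
weights = reweight({"books": 1.8, "news": 1.5, "code": 2.4})
next_source = sample_dataset(weights)
```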
The open-source release provides both base models and dialogue (chat) models for each version. It supports traditional full-parameter updates as well as parameter-efficient fine-tuning methods such as LoRA, which update only a small subset of parameters. It also supports DeepSpeed fine-tuning, int8 and int4 quantization, and domestic (Chinese-made) chips, advancing the localization of large models.
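As a sketch of what LoRA fine-tuning looks like in practice, here is a minimal setup using the Hugging Face transformers and peft libraries. The model ID and target module names are placeholders; the official TeleChat repository documents the actual values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Tele-AI/TeleChat-12B"  # assumed placeholder; verify against the release

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# LoRA freezes the base weights and trains small low-rank adapter matrices.
lora_config = LoraConfig(
    r=8,                 # rank of the low-rank update
    lora_alpha=16,       # scaling factor
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumed module names; verify
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because only the adapters are trained, a 12B model can be fine-tuned on far less GPU memory than full-parameter updates require, which is the practical appeal of the methods the release supports.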
With the unveiling of the TeleChat-12B, China Telecom is poised to make a significant impact in the field of semantic large models, offering enhanced performance and applications that promise to revolutionize the industry.