04387ntm a22005897i 4500 000716613 CZ-PrVSE 20240824093620.0 m d cr n|||||||||| 240824s2024 xr fsbm 000 0 eng d UNPROCESSED IMPORT ABA006 cze ABA006 ABA006 rda
Balek, Vojtěch ISIS:155574 dis
Velké jazykové modely jako nástroj pro extrakci rysů z textu eng Large Language Models as a tool for generating high-level features for text documents / Vojtěch Balek
2024
?? pages : digital, PDF file
Thesis supervisor: Tomáš Kliegr
Bachelor's thesis (Bc.), Vysoká škola ekonomická v Praze. Fakulta informatiky a statistiky, 2024
Includes bibliography
Text (university qualification thesis)
Year of defense: 2024
This bachelor thesis investigates the usability of large language models (LLMs) for feature generation from text, evaluating whether LLMs can produce interpretable and usable features for machine-learning tasks. The study uses two labeled datasets: the CORD-19 corpus, consisting of coronavirus research articles with binary labels for high and low citation count, and a dataset of scientific articles from Czech research institutions, with article scores assigned according to the M17+ methodology (ranging from 1 to 5). Seven categorical features were generated for each dataset using the Llama 2 language model. These features were used to train models for binary and ordinal classification tasks. Performance was compared to naive baseline models and to models trained on term frequency-inverse document frequency (TF-IDF) features and sentence embeddings. On the CORD-19 dataset, models using LLM-generated features achieved an accuracy of 59.8%, outperforming the baseline dummy classifier (50.2%) but falling short of TF-IDF (62.5%) and sentence embeddings (62.5%). Combining LLM-generated features with article abstract and title texts using the AutoGluon platform achieved the highest accuracy (66.5%), followed by combining TF-IDF terms and LLM-generated features (65.3%).
For the M17+ dataset, the model using LLM-generated features attained an accuracy of 37%, surpassing the naive classifier (18%) and TF-IDF (34.3%). Sentence embeddings achieved the highest accuracy (40.8%), while the AutoGluon model trained on abstract and title text achieved 39.5%. LLM-generated features improved the predictive performance of the models and were more interpretable than traditional bibliometric features. A notable limitation is the computational cost: generating features for small datasets (2,000 to 3,000 samples) takes tens of hours on high-end hardware.
Mode of access: Internet
data analytics [bachelor's thesis field of study]
bakalářské práce fd132403 czenas
bachelor's theses eczenas
classification; feature importance; feature extraction; interpretability; large language models
Kliegr, Tomáš ISIS:8484 ths
Svátek, Vojtěch, 1967 December 1- mzk2004217940 opn
Vysoká škola ekonomická v Praze. Fakulta informatiky a statistiky kn20010709399 dgg
https://insis.vse.cz/zp/86858/podrobnosti Thesis record in InSIS
https://insis.vse.cz/zp/86858 Main thesis
https://insis.vse.cz/zp/86858/posudek/vedouci Supervisor's evaluation
https://insis.vse.cz/zp/86858/posudek/oponent/83889 Opponent's review
https://insis.vse.cz/zp/86858/priloha/29367 Thesis attachment
https://insis.vse.cz/zp/86858/priloha/29368 Thesis attachment
https://insis.vse.cz/zp/86858/priloha/29369 Thesis attachment
https://insis.vse.cz/zp/86858/priloha/29370 Thesis attachment
https://insis.vse.cz/zp/86858/priloha/29371 Thesis attachment
https://insis.vse.cz/zp/86858/priloha/29372 Thesis attachment
https://insis.vse.cz/zp/86858/podrobnosti dc:identifier
DO NOT SEND VSKP vse86858 240823 86858