Embedding-based retrieval: measures of threshold recall and precision to evaluate product search

Fedor V. Krasnov
Research Center of WB SK LLC, Moscow, Russia
https://orcid.org/0000-0002-9881-7371

DOI: https://doi.org/10.17323/2587-814X.2024.2.22.34

Keywords: information retrieval, embedding-based retrieval, threshold metrics, semantic product search

Abstract

Modern product retrieval systems are becoming increasingly complex because they draw on additional product representations, such as user behavior, language semantics and product images. However, adding new information and making machine learning models more complex does not necessarily improve online and business search metrics, because the retrieved product list is subsequently ranked, and ranking introduces its own bias. Nevertheless, ranking an incomplete list of products yields worse business performance than ranking a complete one, and even perfect sorting cannot make search results relevant when the retrieved products do not match the query. Therefore, Recall and Precision at threshold k remain the main quality indicators for the product retrieval phase. This paper compares several architectures of product retrieval systems for e-commerce product search. To this end, the concepts of threshold Recall and Precision for information retrieval are examined, and the dependence of these measures on the order of the retrieved results is shown. An automatic procedure for calculating R@k and P@k has been developed, which makes it possible to compare the effectiveness of information retrieval systems. The procedure has been tested on the public WANDS dataset for several key architectures. The obtained values, R@1000 = 84% ± 9% and P@10 = 67% ± 17%, are at the level of state-of-the-art (SOTA) models.
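
To make the threshold measures concrete, the following is a minimal sketch of R@k and P@k under their common textbook definitions; it is not the paper's automatic procedure. The function names and the toy product identifiers are hypothetical, and dividing P@k by k (rather than by the number of items actually returned) is an assumption. The example also illustrates why the measures depend on result order: two lists containing the same items give different values at a small k.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the first k retrieved items that are relevant (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant items found among the first k retrieved items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen here for queries with no relevant items
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Hypothetical toy data: the product ids are invented for illustration.
relevant = {"p1", "p2", "p3", "p4"}
good_order = ["p1", "p2", "p9", "p3", "p8", "p4"]
bad_order = ["p9", "p8", "p1", "p2", "p3", "p4"]  # same items, different order

for name, ranked in [("good", good_order), ("bad", bad_order)]:
    for k in (2, 6):
        p = precision_at_k(ranked, relevant, k)
        r = recall_at_k(ranked, relevant, k)
        print(f"{name} order, k={k}: P@k={p:.2f}, R@k={r:.2f}")
```

At k = 2 the two orderings differ in both measures, while at k = 6, where the cutoff covers the whole list, they coincide. This is the order dependence that threshold measures introduce.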
