• gay_king_prince_charles [she/her, he/him]@hexbear.net · 3 days ago

    Firstly, it is not fully up to date with the Internet (and training on data past 2022 risks poisoning the dataset).

    Where on earth did you get that from? Sonnet-4.5 has a pre-training cutoff of January 2025, and GPT-5 has a pre-training cutoff of October 2024. Any vaguely modern interface can pull data past those dates into context via RAG and MCP. The cutoffs aren’t months behind because of model collapse or anything; it’s just that fine-tuning is a hugely labor-intensive process that takes months. Model collapse is greatly mitigated by human feedback and fine-tuning, which makes it safe to train models on LLM-generated data. DeepSeek, for example, is trained directly on GPT and Claude output.
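    To make the RAG point concrete, here’s a minimal sketch of the pattern: retrieve fresh documents at query time and put them in the prompt, so the model can answer about events after its pre-training cutoff. `web_search` and `call_llm` are hypothetical stand-ins for whatever search backend (or MCP tool) and model API you actually use.

    ```python
    # Minimal RAG sketch: freshness comes from retrieval at query time,
    # not from retraining the model's weights.

    def web_search(query: str) -> list[str]:
        # Placeholder: in practice this hits a search index or an MCP tool.
        return ["2025-10-01: example headline relevant to the query"]

    def call_llm(prompt: str) -> str:
        # Placeholder: in practice this calls a model provider's API.
        return "(model answer grounded in the retrieved snippets)"

    def answer_with_rag(question: str) -> str:
        snippets = web_search(question)
        context = "\n".join(f"- {s}" for s in snippets)
        prompt = (
            "Answer using ONLY the sources below; "
            "they may postdate your training cutoff.\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        return call_llm(prompt)

    print(answer_with_rag("What happened after the model's cutoff?"))
    ```

    The weights never change here; the model just reads the retrieved text out of its context window.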

    • sodium_nitride [she/her, any]@hexbear.net · 3 days ago

      I am aware that LLMs do train on datasets past 2022. But there is a risk of poisoning the dataset, and it will grow over time as the use of LLMs becomes more widespread. It is not a risk that can be easily mitigated by human feedback and fine-tuning, since getting rid of workers is exactly why business owners are hyped about LLMs in the first place.

      And yes, I did not know about MCP, so I was wrong about that part, but you can still fit far less data into context than into training.
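      For a rough sense of scale (the numbers below are illustrative assumptions, not vendor figures): even a large context window is tiny next to a pre-training corpus.

      ```python
      # Back-of-envelope comparison; both figures are assumed,
      # order-of-magnitude values, not published specs.
      context_window_tokens = 200_000                  # a frontier-class context window
      pretraining_corpus_tokens = 15_000_000_000_000   # ~15T tokens, frontier-scale corpus

      ratio = pretraining_corpus_tokens / context_window_tokens
      print(f"corpus is ~{ratio:,.0f}x one context window")
      # -> corpus is ~75,000,000x one context window
      ```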