Brief aside: A bunch of shuffling has occurred in the forward-looking topics as we approach two years of The Lindahl Letter. Reworking the content for weeks 89 to 104 had to happen after the syllabus project. My focus and interest shifted a bit, and because of that it made sense to rework the path toward that extra special two-year anniversary of writing posts on Substack.
That brief aside is now complete. Some congratulations are in order: you made it to the post where I write about academic paper mills and the future of synthetic papers flooding the academy. Using GPT-2 and a million-word corpus of my own words, I trained a model to mimic my writing style. Unlike some of the newer iterations of these models, you could tell its output was not ready to pass as human generated. Initially, I thought this post was going to be about how the peer review system of academic gatekeeping is failing, rather than about the problematic possibility of synthetic writing getting good enough to pass that gatekeeping. During the research, though, I realized that the flooding problem of endless content creation is far worse than the breakdown of the academy-based gatekeeping system. Academic gatekeeping is a function of the quality of the gatekeepers and the rules they apply. That is inherent within the academy system, but it is being tested in a way it has not had to endure before. Extreme oversupply of content is not going to slow down any time soon. To that end, I have spent a lot of time wondering about the future of publishing.
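For readers curious what that experiment looks like in practice, here is a minimal sketch of fine-tuning GPT-2 on a personal corpus using the Hugging Face transformers and datasets libraries. The file name my_writing.txt and the training settings are illustrative assumptions, not my exact setup.

```python
# Minimal sketch: fine-tuning GPT-2 on a personal writing corpus.
# Assumes the corpus lives in a plain text file ("my_writing.txt" is a
# placeholder name) and that transformers + datasets are installed.
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Load the corpus as one example per line and tokenize it.
dataset = load_dataset("text", data_files={"train": "my_writing.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop blank lines

# mlm=False gives standard causal language modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-my-style",
    num_train_epochs=3,            # illustrative; tune to the corpus size
    per_device_train_batch_size=2,
    save_steps=500,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```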
Large language models have created a scenario where a bit of prompt engineering can generate blocks of prose. Previously I discussed some of the automation occurring within the instant news and financial reporting sections of the media. Using some type of model-based generation, these outlets take a bit of news, generate a story related to it, and push it out almost immediately. I have wondered how many papers in the academic space get created this way [1]. You can find examples of academics submitting papers to see if they can fool reviewers into accepting them into journals [2]. Some scholars have taken this maybe a step too far and intentionally tried to publish fake papers [3]. I’m worried that flooding might occur within the world of academic publishing, with fake journals and fake papers creating chaos.
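To make the flooding concern concrete, consider how little code it takes to go from prompt to prose. This is a sketch using the transformers text-generation pipeline; the model choice and the prompt are placeholders for illustration, not what any newsroom actually runs.

```python
# Sketch of prompt-driven prose generation with the transformers pipeline.
# The model name and prompt are illustrative; any causal LM would do.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Quarterly earnings for the company exceeded expectations as"
result = generator(prompt, max_new_tokens=120, do_sample=True, temperature=0.9)

print(result[0]["generated_text"])
```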
In any field of academic study where a key journal exists and the academics within that field have a strong network focused on the work in that journal, or maybe a handful of key journals, the system of academic publishing is probably still working well enough to unify the field. Within the field of machine learning, things have broken down to the point where a lot of the content I read is not from peer reviewed academic journals or prestigious conferences. I read a lot of preprints and things people have shared. You could go through my entire independent introduction to machine learning syllabus and only really consume open access academic works [4]. The number of academic journals focused on machine learning is really (really) large and appears to be growing. That is one of the reasons I focused on citations to spot trends and papers that are bubbling up to the top of active consideration. While you cannot totally trust citation counts as a metric of the authority of ideas, they are a solid way to gain signal out of the noise that a paper might be worth reading.
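As a sketch of that citation-based filtering, the snippet below queries the public Semantic Scholar Graph API and ranks results by citation count. The API choice and the query string are assumptions for illustration; this is not my exact workflow.

```python
# Sketch: using citation counts as a rough signal for which papers to read.
# Queries the public Semantic Scholar Graph API; the query string is a
# placeholder and the API is an assumed tool, not my actual pipeline.
import requests

URL = "https://api.semanticscholar.org/graph/v1/paper/search"
params = {
    "query": "machine learning transformers",
    "fields": "title,year,citationCount",
    "limit": 20,
}

papers = requests.get(URL, params=params, timeout=30).json().get("data", [])

# Sort by citation count to surface the papers bubbling to the top.
for paper in sorted(papers, key=lambda p: p.get("citationCount") or 0, reverse=True)[:10]:
    print(f'{paper.get("citationCount", 0):>6}  {paper.get("year")}  {paper.get("title")}')
```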
If I were given a vote on things, I would convene a regular conference cadence and associate a conference journal with it, where submitted papers are aggregated and the conference attendees serve as the peer review gatekeeping system. That conference-to-journal system is probably my preferred method of journal aggregation because it is community-standards based. The people who want to be a part of it and read the journal are working together to uphold standards on the work they contribute to the academy. Right now, the opposite is occurring: people are defaulting back to reading preprints of papers, and sometimes those preprints have more citations than the final location where the work is published. I’m pretty sure that, based on the paywalls for some of the journals, it’s entirely possible the preprint reading rate is an order of magnitude larger. My preference here is keyed to building community vs. the totality of the contribution to the academy. I believe both elements are important and should be considered.
Links and thoughts:
“Everyone knows what YouTube is. Few know how it really works.”
“GTA leaks, TikTok search, and Apple reviews hotline”
Top 5 Tweets of the week:
Footnotes:
[1] https://www.nature.com/articles/d41586-021-00733-5
[2] https://undark.org/2020/11/26/fake-paper-predatory-journal/
[3] https://www.theatlantic.com/ideas/archive/2018/10/new-sokal-hoax/572212/
[4] https://github.com/nelslindahlx/Introduction-to-machine-learning-syllabus-2022
What’s next for The Lindahl Letter?
Week 89: That ML model is not an AGI
Week 90: What is probabilistic machine learning?
Week 91: What are ensemble ML models?
Week 92: National AI strategies revisited
Week 93: Papers critical of ML
I’ll try to keep the what’s next list forward looking with at least five weeks of posts in planning or review. If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.