I'm not sure why we're so focused on filtering what gets into arxiv (which is an uphill battle and DOA at this point) vs fixing the indexing, i.e. the page rank of academia.
Google "sorted out" a messy web with PageRank. Academic papers link to each other. What prevents us from building a ranking from there?
I'm conscious I might be over-simplifying things, but curious to see what I am missing.
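To make the question concrete, here is a rough sketch of what "PageRank over citations" could look like. It is a plain power-iteration PageRank on a toy graph, not anything arXiv or Google actually runs. The function name `citation_pagerank`, the paper ids, and the damping value of 0.85 are all assumptions for illustration.

```python
def citation_pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a citation graph.

    `citations` maps a paper id to the list of paper ids it cites.
    Returns a dict of paper id -> score; scores sum to 1.
    """
    # Collect every paper that cites or is cited.
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)

    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Every paper gets the baseline "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in papers}
        dangling = 0.0
        for p in papers:
            outgoing = citations.get(p, [])
            if outgoing:
                # A paper splits its score evenly among the papers it cites.
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new_rank[q] += share
            else:
                # Papers with no references spread their score over everyone.
                dangling += damping * rank[p]
        for p in papers:
            new_rank[p] += dangling / n
        rank = new_rank
    return rank


# Toy graph with hypothetical paper ids: A and B both cite C, C cites D.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = citation_pagerank(toy_citations)
for paper, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```

The sketch only shows that the mechanics are easy. It says nothing about whether such a ranking would survive citation rings or self-citation, and the hard part is presumably there.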
I am of the same opinion, and ultimately ArXiv becoming a journal that can prevent one from publishing a paper, no matter how junk it is, would pretty much kill its purpose. But I suppose that now, when flooding the internet with LLM-generated garbage is almost endorsed by some satanic people, it is pretty much a security issue to have some sort of filter on uploads.
Now, honestly, I have no idea why one would spend resources on uploading terabytes of LLM garbage to arXiv, but they sure can. Even if some crazy person is publishing like 2 nonsense papers daily, it is no harm and, if anything, valid data for psychology research. But if somebody actually floods it with non-human-generated content, well, I suppose it isn't even that expensive to make ArXiv totally unusable (and perhaps even unfeasible to host). So there has to be some filtering. But only to prevent the abuse.
Otherwise, I indeed think that proper ranking, linking and user-driven moderation (again, not to prevent anybody from posting anything, but to label papers as more interesting for the specific community) is the only right way to go.
PageRank was inspired by bibliometrics and evaluation of science publications. It's messed up now because of the rankings. Further fiddling with ranking will not fix the problem.
Going independent makes sense for arXiv. But the more interesting part is what it tells us about how we fund the stuff that actually keeps research moving.
arXiv runs on about seven million dollars a year and handles hundreds of thousands of papers. That's roughly twenty bucks a paper. This is the backbone of how physicists, computer scientists, and mathematicians share work. Traditional publishers charge thousands per article. The math is almost laughable. arXiv has never had an efficiency problem. The problem is that we've just accepted that something this important should survive on voluntary contributions and the occasional donation saving the day.
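For anyone who wants to check the "twenty bucks a paper" figure, here is the arithmetic as a tiny sketch. The seven million dollar budget is the number quoted above. The yearly paper counts are assumptions picked to span "hundreds of thousands" and are not official arXiv statistics.

```python
def cost_per_paper(annual_budget_usd, papers_per_year):
    """Average cost per handled paper for a given yearly budget."""
    return annual_budget_usd / papers_per_year


budget = 7_000_000  # figure quoted in the comment above
# Assumed yearly volumes; not official numbers.
for papers in (200_000, 350_000, 500_000):
    print(papers, round(cost_per_paper(budget, papers), 2))
```

At an assumed 350,000 papers a year this comes out to 20 dollars per paper, and anywhere in the assumed range it stays in the tens of dollars rather than the thousands.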
Look at what happened with bioRxiv and medRxiv when they spun off into openRxiv. That only happened about a year ago. Nobody knows yet if it actually works long-term or if it just kicks the money problems down the road. But both platforms, totally separately, came to the same conclusion. We need to leave the university. That says something. Universities aren't built to fund outside infrastructure forever. Their budgets follow enrollment, grants, and endowment performance. That doesn't line up with the steady, predictable funding arXiv needs to keep the lights on.
Ginsparg calling it a "Perils of Pauline" situation is probably the most honest thing anyone said about this. Everyone treats arXiv like it will always be there. But it's been one bad year away from serious trouble for most of its life. The real test for the nonprofit won't be the first few years. Cornell and Simons have that covered. It'll be five or ten years from now when the excitement fades and they're competing for donor money against whatever the next crisis in academic publishing turns out to be.
The worry about AI-generated junk is actually where independence could help. A university-hosted arXiv could only spend so much on moderation tools. An independent org with a focused mission can make that a real budget priority. Whether they can keep up with the flood of low-quality submissions is a different question entirely.
It's not that hard to make a mirror of arXiv. Basically, anybody who can pay for hosting can do it (which, I suppose, isn't very cheap now that the whole world uses it). The problem is making users switch, because academia seems to have this weird tradition of resisting all practices that, god forbid, might improve global research capabilities and move forward the scientific progress. But then, if arXiv actually becomes unusable, I suppose they won't really have much choice but to switch?
And, FWIW, I do think that arXiv truly has a vast potential to be improved. It is currently in the position to change the whole process of how the research results are shared, yet it is still, as others have said, only a PDF hosting service. And since the universities couldn't break out of the whole Elsevier & co. scam despite the internet existing for 30 years, to me, breaking free from the university affiliation sounds like a good thing.
But, of course, I am talking only about the possibilities being out there. I know nothing about the people in charge of the whole endeavor, and ultimately it depends only on them whether it sails or sinks.
I am sure it’s a dumb idea but why is there a problem for say the National Science Foundation or something to run a website that replicates ArXiv - if you are from an accredited university or whatever you can publish papers, fulfilling the “pdf store” function.
Then getting peer reviewed is a harder process, but one can see some form of credit on the site coming from doing a decent reviewer's job.
From my limited experience, arXiv appears to include many low-quality, unreproducible papers, and some are straight-up self-marketing rather than serious scientific work.
This sounds terrible. Of course there's a huge risk of it being made for-profit. It almost makes you wonder if the academic publishers are behind this push somehow.
Could they not have made it into some legal structure that puts universities at the top? Say, with a bunch of universities owning shares that comprise the entirety of the ownership of arXiv, but that would allow arXiv to independently raise funds?
> Of course there's a huge risk of it being made for-profit.
The article says that "it will become an independent nonprofit corporation", and as OpenAI's failed attempt showed, converting a non-profit to a for-profit organization is either really hard or impossible.
> Could they not have made it into some legal structure that puts universities at the top?
As a corporation (even a non-profit one), it will have a board of directors. I have no idea what their charter will look like, but I would be surprised if at least one seat wasn't reserved for a university representative, and more than that seems quite likely as well.
ArXiv provides such an easy interface to navigate scientific papers; most are from computer science, of course. Hope they can grow bigger and solve the paywall pain in open research. Any implications for bioRxiv?
What is worrisome about this development, and corollary actions like the hiring of a CEO with a $300,000/year salary, is that the essentially independent and community based platform will disappear. The ArXiv exists because mathematicians and physicists, and later computer scientists and engineers, posted there, freely, their work, with minimal attention to licensing and other commercial aspects. It has thrived because it required no peer review and made interesting things accessible quickly to whomever cared to read them.
A setup as a US-based "non-profit" is worrisome, if only because 300K is an obscene salary even in a for-profit setting. That the US-based posters can't see this is evidence of the basic problem which is that the US, both left and right, has been taken over by a neoliberal feudal antidemocratic nativist mindset that is anathema to the sort of free interchange of ideas that underlay the ArXiv's development in the hands of mathematicians and physicists now swept aside and ignored by machine learning grifters and technicians who program computers.
As a US-based academic, I have to say that when I saw the salary I immediately gawked. I think it's not Americans but Silicon Valley-ites and tech bros on here, who have lived with inflated salaries and net worth, that think it's just a middle-of-the-road salary. I regularly interact with friends in engineering who make like $200k + benefits ($), and I wonder why I don't jump ship to that weird land.
I fear their Mozilla-ification and Wikipedia-ification. Scope creep, various outreach feel-good programs, ballooning costs, lost focus etc. And other types of enshittification.
Any change to the basic premise will be a negative step.
They should just be boring, quiet, unopinionated, neutral background infrastructure.
All the Mozilla executives have done for the last 15+ years is
* lay off developers
* spend lots of money on stupid side projects nobody asked for or wants
* increase their own salaries
and all that with the backdrop of falling quality, market share, and relevance.
I would happily donate to Firefox, but this fucked up organization will never see a single cent from me. They will spend it on anything but Firefox, which is the only thing anybody wants them to spend it on.
It might already be too late, and we will be left with a browser monopoly.
And they hired a LinkedIn business idiot to run the new organization - so the aim is for an infinite growth tech startup in terms of governance, despite the technical legal status of non-profit. It shows in the language they use in the announcement, too ("improved financial viability in the long run")
OpenAI shows exactly how well that works and what that kind of governance does to a company and to its support of science and the commons.
I've often thought that similar trust systems would work well in social media, web search, etc., but I've never seen it implemented in a meaningful way. I wonder what I'm missing.
Now the question is, will arxiv wage a decade long bloody war with Cornell, using heavy infantry (PhD students), archers (reviewers) and field artillery (AI slop papers), or will the independence be mostly peaceful? Only time can tell.
This is exactly what happened last time when scientific publishing got cornered. Journals run by departments and research groups were spun out or sold off to publishers and independent orgs. And they continued to slowly boil the frog over 50 years with fees and gate keeping.
It's especially problematic because while ArXiv loves to claim to be working for open science, it doesn't default to open licensing. Many of the publications it hosts are not Open Access and are read-access only. So there is definitely the potential to close things off at some point in the future, when some CEO needs to increase value.
arXiv is great. It's just a problem that there's so much slop. What if arXiv offered a subscription service that people in different fields could use to just see a curated selection of the top papers in their field each month. Established researchers in each field could then review some of the preprints for putting into the curated monthly list.
With 300K for the CEO, its enshittification will commence imminently. It will now serve to maximize revenue. Just wait and watch while they issue a premium membership, payment requirements for authors, and other revenue generators to please their investors.
they'll just turn into a shitty journal at this point, they just need to introduce peer review and they can start competing with the real journals on price point.
>Cornell, for example, had a limited capacity to pay software developers to maintain and upgrade the site, which still has a very no-frills look and feel.
I am not a software engineer, although I do write programs. What is it about digital infrastructure that requires maintenance? In the natural world, there is corrosion, thermal fluctuation, radiation, seismic activity, vandalism, whathaveyou. What are the issues facing the arxiv demanding the attention of multiple people 'round the clock?
"Recently arXiv’s growth has accelerated. Since 2022, it has expanded its staff to 27, in large part to deal with a 50% increase in submitted manuscripts."
I am wary of that. IMO the business model is damaged therein. You can say in 2022 we had 27; bankrupt in 2030.
The French government put a bit of money on the table to help researchers fulfil their open science requirements for government and EU grants, and funded the HAL repository ( https://hal.science/ ). It’s much smaller than arXiv, but it exists. In other countries like the UK there are clusters of smaller repositories as well, but it’s not as well centralised.
Frankly, the only beef I have with arXiv as is: its insistence on blocking AI access.
I had to tell my AI to set up an MCP for "fetch while bypassing arXiv's rate limit" so that it doesn't burn 40k tokens looking for workarounds every time it wants to look at a paper and gets hit with a "sorry, meatbags only" wall.
Very annoying, given how relevant arXiv papers are for ML specifically, and how many papers there are. Can't "human flesh search" through all of them to pick the relevant ones for your work, and they just had to insist on making it harder for AIs to do it too.
Very unrelated to the article, but I think 'arXiv' as a brand is bad, and really detrimental to what the institution aims to accomplish.
That is, it's not readily parseable, it really gives an insider term vibe - like this isn't for you if you don't already know what it means or how you should read or say it. It sort of reminds me of the overuse of Latin and Latinate terms generally in the old professions and, well, the academy.
Just always struck me as being somewhat at odds with the goal.
I wonder what makes you feel that. I've been publishing preprints close to a decade on arxiv now and never had any particular feelings about it.
To me it's just a way to get out your work fast, so that there is already a trace of it on the Internets - nothing more and nothing less.
> That is, it's not readily parseable, it really gives an insider term vibe...
Isn't that normal with highly specialized research fields? I agree many papers could benefit from clearer wording, but working in a niche means you sometimes don't reach a broader audience
It's a classic story of someone having to pick a name quickly, which then gets established long before anyone who cares about branding is aware of its existence.
The original service didn't even have a name, only a description, and it was amusingly hosted at xxx.lanl.gov. But LANL wasn't really interested in it, and the founder eventually left for Cornell. At that point, the service needed a domain name, but archive.org was already taken.
And besides, the name has Ancient Greek influences. A similar Latinate term might be something like "archive".
The recent announcement to reject review articles and position papers already smelled like a shift towards a more "opinionated" stance, and this move smells worse.
The vacuum that arXiv originally filled was one of a glorified PDF hosting service with just enough of a reputation to allow some preprints to be cited in a formally published paper, and with just enough moderation to not devolve into spam and chaos. It has also been instrumental in pushing publishers towards open access (i.e., to finally give up).
Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.
In my view, arXiv fulfills its function better the less power it has as an institution, and I thus have exactly zero trust that the split from Cornell is driven by that function. We've seen the kind of appeasement prose from their statement and FAQ [1] countless times before, and it's now time for the usual routine of snapshotting the site to watch the inevitable amendments to the mission statement.
"What positive changes should users expect to see?" - I guess the negative ones we'll have to see for ourselves.
[1] https://tech.cornell.edu/arxiv/
> Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right.
This has been a common practice in physics, especially the more theoretical branches, since the inception of arXiv. Senior researchers write a paper draft, and then send copies to some of their peers, get and incorporate feedback, and just submit to arxiv.
I came here to say something similar. As someone who works in a field that applies machine learning but is not purely focused on it, I interact with people who think that arXiv is the only relevant platform and that they don't need to submit their work to any journal, as well as people who still think that preprints don't count at all and that data isn't published until it's printed in an academic journal. It can feel like a clash of worlds.
I think both sides could learn from the other. In the case of ML, I understand the desire to move fast and that average time to publication of 250-300 days in some of the top-tier journals can feel like an unnecessary burden. But having been on both sides of peer review, there is value to the system and it has made for better work.
Not doing any of it follows the same spirit as not benchmarking your approach against more than maybe one alternative, and even that only as an afterthought. Or benchmaxxing but not exploring the actual real-world consequences, time and cost trade-offs, etc.
Now, is academic publishing perfect? Of course not, very very far from it. It desperately needs to be reformed to keep it economically accessible and time efficient for authors, editors and peer reviewers alike, to prevent the "hot topic of the day" from dominating journals, and to make sure that peer review aligns with the needs of the community and actually improves the quality of the work, rather than having "malicious peer review" to get some citations or pet peeves in.
Given the power that the ML field holds and the interesting experiments with open review, I would wish for the field to engage more with the scientific system at large and perhaps try to drive reforms and improve it, rather than completely abandoning it and treating a PDF hosting service as a journal (ofc, preprints would still be desirable and are important, but they can not carry the entire field alone).
> arXiv fulfills its function better the less power it has as an institution
It is an interesting instance of the rule of least power, https://en.wikipedia.org/wiki/Rule_of_least_power.
> Unfortunately, over the years, arXiv has become something like a "venue" in its own right, ...
In my experience as a publishing scientist, this is partly because publishing with "reputable" journals is an increasingly onerous process, with exorbitant fees, enshittified UIs, and useless reviews. The alternative is to upload to arXiv and move on with your life.
> and with just enough moderation to not devolve into spam and chaos
arXiv has become a target for grifters in other domains like health and supplements. I’ve seen several small scale health influencers who ChatGPT some “papers” and then upload them to arXiv, then cite arXiv as proof of their “published research”. It’s not fooling anyone who knows how research works, but it’s very convincing to an average person who thinks that they’re doing the right thing when they follow sources that have done academic research.
I’ve been surprised at how bad and obviously grifty some of the documents I’ve seen on arXiv have become lately. Is there any moderation, or is it a free for all as long as you can get an invite?
This is great news for anyone building tools on top of arXiv data. The API (export.arxiv.org/api/) is one of the best free academic data sources — structured Atom feed with full abstracts, authors, categories, and publication dates.
I've been using it as one of 9 data sources in a market research tool — arXiv papers are a strong leading indicator of where an industry is heading. Academic research today often becomes commercial products in 2-3 years.
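For anyone curious what working with that API looks like, here is a minimal sketch using only the Python standard library. The `query` path and the `search_query`, `start` and `max_results` parameters are what I understand the public API to accept; check the arXiv API user manual before relying on them. The helper names (`build_query_url`, `parse_feed`) and every value in the sample feed are made up for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10):
    """Build a query URL for the arXiv API (hypothetical helper)."""
    params = {"search_query": search_query, "start": start,
              "max_results": max_results}
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Turn an Atom feed string into a list of plain dicts."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            # Titles and abstracts may contain line breaks; collapse them.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "summary": " ".join(entry.findtext(ATOM + "summary", default="").split()),
            "published": entry.findtext(ATOM + "published", default="").strip(),
            "authors": [a.findtext(ATOM + "name", default="").strip()
                        for a in entry.findall(ATOM + "author")],
            "categories": [c.get("term")
                           for c in entry.findall(ATOM + "category")],
        })
    return entries


# Hand-written sample in the shape of an Atom feed; all values are placeholders.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2020-01-01T00:00:00Z</published>
    <title>A Placeholder
      Title</title>
    <summary>Placeholder abstract text.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <category term="cs.LG"/>
  </entry>
</feed>
""".strip()

url = build_query_url("cat:cs.LG", max_results=5)
papers = parse_feed(SAMPLE_FEED)
print(url)
print(papers[0]["title"], papers[0]["authors"], papers[0]["categories"])
```

The sketch deliberately makes no network call. In real use the URL would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_feed`, with requests spaced out to stay within the API's usage terms.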
Review papers are interesting.
Bibliometrics reveal that they are highly cited. Internal data we had at arXiv 20 years ago show they are highly read. Reading review papers is a big part of the way you go from a civilian to an expert with a PhD.
On the other hand, they fall through the cracks of the normal methods of academic evaluation.
They create a lot of value for people but they are not likely to advance your career that much as an academic, certainly not in proportion to the value they create, or at least the value they used to create.
One of the most fun things I did on the way to a PhD was writing a literature review on giant magnetoresistance for the experimentalist on my thesis committee. I went from knowing hardly anything about the topic to writing a summary that taught him a lot he didn't know. Given any random topic in any field you could task me with writing a review paper and I could go out and do a literature search and write up a summary. An expert would probably get some details right that I'd get wrong, might have some insights I'd miss, but it's actually a great job for a beginner, it will teach you the field much more effectively than reading a review paper!
How you regulate review papers is pretty tricky. If it is original research the criterion of "is it original research" is an important limit. There might already be 25 review papers on a topic, but maybe I think they all suck (they might) and I can write the 26th and explain it to people the way I wish it was explained to me.
Now you might say in the arXiv age there was not a limit on pages, but LLMs really do problematize things because they are pretty good at summarization. Send one off on the mission to write a review paper and in some ways they will do better than I do, in other ways will do worse. Plenty of people have no taste or sense of quality and they are going to miss the latter -- hypothetically people could do better as a centaur but I think usually they don't because of that.
One could make the case that LLMs make review papers obsolete since you can always ask one to write a review for you or just have conversations about the literature with them. I know I could have spent a very long time studying the literature on Heart Rate Variability and eventually made up my mind about which of the 20 or so metrics I want to build into my application. I did look at some review papers and can highlight sentences that support my decisions, but I made those decisions based on a few weekends of experiments and talking to LLMs. The funny thing is that if you went to a conference and met the guy who wrote the review paper and gave them the hard question of "I can only display one on my consumer-facing HRV app, which one do I show?" they would give you that clear answer that isn't in the review paper and maybe the odds are 70-80% that it will be my answer.
> Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.
This just isn't true. arXiv is not a venue. There's no place that gives you credit for arXiv papers. No one cares if you cite an arXiv paper or some random website. The vast vast majority of papers that have any kind of attention or citations are published in another venue.
My observation is that research, especially in AI, has left universities, which are now focusing their research to a lesser degree on STEM. It appears research is now done by companies like Meta, OpenAI, Anthropic, Tencent, and Alibaba, among many others.
> raised concerns about the proposed $300,000 salary for arXiv’s new CEO, saying it seemed high
Is a mid-to-high engineering salary outlandish for a CEO of what is likely to be a fairly major non-profit? Even non-profits have to be somewhat competitive when it comes to salary, and the ideal candidate is likely someone who would be balancing this against a tenured position at a major university
Salaries in the US are so bonkers. Everywhere else outside of the US, $300,000 is an outlandishly high salary. To call it "mid to high" is insane.
Considering the value and prominence of arXiv to the world, this seems low to me. Although, more importantly, the rest of the staff needs to be well paid too, and if that's the ceiling it's a bit concerning. It's crazy to me that people thought this was too high.
Yes, considering the workload and responsibility of the position.
Non-profits run into the problem of creating cushy jobs that just burn donor money.
arXiv is basically a giant folder in the cloud and shouldn't have such high-paying jobs. At least not if they want rational people to keep donating.
For anybody outside the SV, and especially outside the US, this seems high, yes.
arXiv does not need to and should not optimize for “shareholder value”, which is at least nominally the justification for outlandish CEO pay packages.
arXiv's CEO doesn't need to be a tenured professor equivalent; it is a preprint repository, ffs.
I'm not sure why we're so focused on filtering what gets into arxiv (which is an uphill battle and DOA at this point) vs fixing the indexing, i.e. the page rank of academia.
Google "sorted out" a messy web with PageRank. Academic papers link to each other. What prevents us from building a ranking from there?
I'm conscious I might be over-simplifying things, but curious to see what I am missing.
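To make the idea concrete, here is a minimal sketch of PageRank run over a citation graph, using plain power iteration. The four papers and their citation lists are made up for illustration, and the damping factor of 0.85 is just the commonly quoted default, not something anyone in this thread proposed. Papers that cite nothing are treated as if they cited everything equally, which is one common convention among several.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Rank papers by PageRank.

    citations maps each paper to the list of papers it cites.
    Returns a dict mapping each paper to a score; scores sum to 1.
    """
    # Collect every paper that appears as a citer or as a cited work.
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)

    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Every paper gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # Pass this paper's rank on to the papers it cites.
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper with no references spreads its rank evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B both cite C, and C cites D.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(toy_citations)
for paper, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```

In this toy graph the cited papers end up above the ones nobody cites. It also shows the obvious weakness: rank flows to whatever gets cited, so citation rings or self-citation can inflate it.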
I am of the same opinion, and ultimately ArXiv becoming a journal that can prevent one from publishing a paper, no matter how junk it is, would pretty much kill its purpose. But I suppose that now, when flooding the internet with LLM-generated garbage is almost endorsed by some satanic people, it is pretty much a security issue to have some sort of filter on uploads.
Now, honestly, I have no idea why one would spend resources on uploading terabytes of LLM garbage to arXiv, but they sure can. Even if some crazy person is publishing like 2 nonsense papers daily, it is no harm and, if anything, valid data for psychology research. But if somebody actually floods it with non-human-generated content, well, I suppose it isn't even that expensive to make ArXiv totally unusable (and perhaps even unfeasible to host). So there has to be some filtering. But only to prevent the abuse.
Otherwise, I indeed think that proper ranking, linking and user-driven moderation (again, not to prevent anybody from posting anything, but to label papers as more interesting for the specific community) is the only right way to go.
tangentially related: https://readabstracted.com/
Page rank was inspired by bibliometrics and evaluation of science publications. It's messed up now because of the rankings. Further fiddling with ranking will not fix the problem.
Statement by arXiv: https://tech.cornell.edu/arxiv/
Should be the main link. The original article is based on the CEO job posting.
Going independent makes sense for arXiv. But the more interesting part is what it tells us about how we fund the stuff that actually keeps research moving. arXiv runs on about seven million dollars a year and handles hundreds of thousands of papers. That's roughly twenty bucks a paper. This is the backbone of how physicists, computer scientists, and mathematicians share work. Traditional publishers charge thousands per article. The math is almost laughable. arXiv has never had an efficiency problem. The problem is that we've just accepted that something this important should survive on voluntary contributions and the occasional donation saving the day. Look at what happened with bioRxiv and medRxiv when they spun off into openRxiv. That only happened about a year ago. Nobody knows yet if it actually works long-term or if it just kicks the money problems down the road. But both platforms, totally separately, came to the same conclusion. We need to leave the university. That says something. Universities aren't built to fund outside infrastructure forever. Their budgets follow enrollment, grants, and endowment performance. That doesn't line up with the steady, predictable funding arXiv needs to keep the lights on. Ginsparg calling it a "Perils of Pauline" situation is probably the most honest thing anyone said about this. Everyone treats arXiv like it will always be there. But it's been one bad year away from serious trouble for most of its life. The real test for the nonprofit won't be the first few years. Cornell and Simons have that covered. It'll be five or ten years from now when the excitement fades and they're competing for donor money against whatever the next crisis in academic publishing turns out to be. The worry about AI-generated junk is actually where independence could help. A university-hosted arXiv could only spend so much on moderation tools. An independent org with a focused mission can make that a real budget priority. 
Whether they can keep up with the flood of low-quality submissions is a different question entirely.
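The per-paper figure above is easy to sanity-check. A small sketch, taking the comment's two numbers (about seven million dollars a year, roughly twenty dollars a paper) at face value; the comment only says "hundreds of thousands of papers", so the paper count below is derived from those two numbers and is not an official statistic.

```python
def cost_per_paper(annual_budget_usd, papers_per_year):
    """Average cost of handling one paper, in dollars."""
    return annual_budget_usd / papers_per_year


annual_budget_usd = 7_000_000      # "about seven million dollars a year"
quoted_cost_per_paper_usd = 20     # "roughly twenty bucks a paper"

# Paper count implied by the two quoted figures (an inference, not a cited number).
implied_papers_per_year = annual_budget_usd / quoted_cost_per_paper_usd

print(implied_papers_per_year)
print(cost_per_paper(annual_budget_usd, implied_papers_per_year))
```

The implied volume is 350,000 papers a year, which is consistent with "hundreds of thousands".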
It's not that hard to make a mirror of arXiv. Basically, anybody who can pay for hosting can do it (which, I suppose, isn't very cheap now that the whole world uses it). The problem is making users switch, because academia seems to have this weird tradition of resisting all practices that, god forbid, might improve global research capabilities and move scientific progress forward. But then, if arXiv actually becomes unusable, I suppose they won't really have much choice but to switch?
And, FWIW, I do think that arXiv truly has vast potential to be improved. It is currently in a position to change the whole process of how research results are shared, yet it is still, as others have said, only PDF hosting. And since the universities couldn't break out of the whole Elsevier & co. scam despite the internet existing for 30 years, to me, breaking free from the university affiliation sounds like a good thing.
But, of course, I am talking only about the possibilities being out there. I know nothing about the people in charge of the whole endeavor, and ultimately it depends only on them whether it sails or sinks.
I might be missing something, but I still don't get the why. I don't see any "problem" that needs to be solved.
The article lists the reasons quite clearly.
I think the problem described in the 6th paragraph needs to be solved.
I am sure it’s a dumb idea but why is there a problem for say the National Science Foundation or something to run a website that replicates ArXiv - if you are from an accredited university or whatever you can publish papers, fulfilling the “pdf store” function.
Then getting peer reviewed is a harder process, but one can see some form of credit on the site coming from doing a decent reviewer's job.
I suspect I am missing a lot of nuance …
The moderation is difficult but not unprecedented.
I think NIST hosts the CVE repo (through a contract to MITRE)
https://youtu.be/4P5xSntVWQE
Given that Cornell charges what, $50k a year as an Ivy League, $300k feels like almost nothing.
This is going to be in NYC where $300k does not go as far as it does in Ithaca.
Heh, you might want to look up what they’re charging young people now.
From my limited experience, arXiv appears to include many low-quality, unreproducible papers, and some are straight-up self-marketing rather than serious scientific work.
This sounds terrible. Of course there's a huge risk of it becoming made for-profit. It almost makes you wonder if the academic publishers are behind this push somehow.
Could they not have made it into some legal structure that puts universities at the top? Say, with a bunch of universities owning shares that comprise the entirety of the ownership of arXiv, but that would allow arXiv to independently raise funds?
> Of course there's a huge risk of it becoming made for-profit.
The article says that "it will become an independent nonprofit corporation", and as OpenAI's failed attempt showed, converting a non-profit to a for-profit organization is either really hard or impossible.
> Could they not have made it into some legal structure that puts universities at the top?
As a corporation (even a non-profit one), it will have a board of directors. I have no idea what their charter will look like, but I would be surprised if at least one seat wasn't reserved for a university representative, and more than that seems quite likely as well.
I wonder if there are plans to licence the content for AI training
It's been available all along: https://info.arxiv.org/help/bulk_data.html
I'd guess OAI & co have already copied without asking?
ArXiv provides such an easy interface to navigate scientific papers, most of them from computer science of course. Hope they can grow bigger and solve the paywall pain in open research. Any implications for bioRxiv?
What is worrisome about this development, and corollary actions like the hiring of a CEO with a $300,000/year salary, is that the essentially independent and community based platform will disappear. The ArXiv exists because mathematicians and physicists, and later computer scientists and engineers, posted there, freely, their work, with minimal attention to licensing and other commercial aspects. It has thrived because it required no peer review and made interesting things accessible quickly to whomever cared to read them.
A setup as a US-based "non-profit" is worrisome, if only because 300K is an obscene salary even in a for-profit setting. That the US-based posters can't see this is evidence of the basic problem which is that the US, both left and right, has been taken over by a neoliberal feudal antidemocratic nativist mindset that is anathema to the sort of free interchange of ideas that underlay the ArXiv's development in the hands of mathematicians and physicists now swept aside and ignored by machine learning grifters and technicians who program computers.
As a US-based academic, I have to say when I saw the salary I immediately gawked. I think it's not Americans but Silicon Valley-ites and tech bros on here, who have lived with inflated salary/net worth, that think it's just a middle-of-the-road salary. I regularly interact with friends in engineering who make like $200k + benefits ($), and I wonder why I don't jump ship to that weird land.
I fear their Mozilla-ification and Wikipedia-ification. Scope creep, various outreach feel-good programs, ballooning costs, lost focus etc. And other types of enshittification.
Any change to the basic premise will be a negative step.
They should just be boring, quiet, unopinionated, neutral background infrastructure.
> Mozilla-ification
All the Mozilla executives have done for the last 15+ years is
* lay off developers
* spend lots of money on stupid side projects nobody asked for or wants
* increase their own salaries
and all that with the backdrop of falling quality, market share, and relevance.
I would happily donate to Firefox, but this fucked up organization will never see a single cent from me. They will spend it on anything but Firefox, which is the only thing anybody wants them to spend it on.
It might already be too late, and we will be left with a browser monopoly.
> They should just be quiet unopinionated neutral background infrastructure.
Exactly. It should be a utility. Not quite dumb pipe, but not too far either.
Do research papers published on Elsevier's sort of media remain more prestigious?
I read a dozen papers a month, typically on arxiv, never from paywalled journals. I find the quality on par. But maybe I'm missing something.
And they hired a LinkedIn business idiot to run the new organization - so the aim is for an infinite growth tech startup in terms of governance, despite the technical legal status of non-profit. It shows in the language they use in the announcement, too ("improved financial viability in the long run")
OpenAI shows exactly how well that works and what that kind of governance does to a company and to its support of science and the commons.
TL;DR, it's fucked.
Maybe they should implement a graph based trust system:
You need your favourite academic gatekeeper (= thesis advisor) to vouch for you in order to be allowed to upload.
Then AI slop gets flagged and the shame spreads through the graph. And flaggings need to have evidence attached that can again be flagged.
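A minimal sketch of what such a vouching graph could look like. Everything here is hypothetical: the `TrustGraph` class, the halving of the penalty at each step up the chain, and the upload threshold are illustrative choices, not a description of how arXiv works. Flagging the evidence itself, as the comment suggests, is left out to keep it short.

```python
class TrustGraph:
    """Toy vouching graph: a newcomer needs a voucher to be able to upload."""

    def __init__(self):
        self.voucher_of = {}  # author -> the person who vouched for them
        self.penalty = {}     # author -> accumulated "shame" score
        self.flags = []       # (author, evidence) pairs, kept for review

    def add_root(self, author):
        # Established researchers who need no voucher (an assumption).
        self.penalty.setdefault(author, 0.0)

    def vouch(self, voucher, newcomer):
        if voucher not in self.penalty:
            raise ValueError(f"{voucher} is not a trusted member")
        self.voucher_of[newcomer] = voucher
        self.penalty.setdefault(newcomer, 0.0)

    def flag(self, author, evidence, weight=1.0, decay=0.5):
        """Flag an author; the penalty spreads up the vouching chain."""
        if not evidence:
            raise ValueError("a flag needs evidence attached")
        self.flags.append((author, evidence))
        seen = set()
        current = author
        while current is not None and current not in seen:
            seen.add(current)  # guards against accidental cycles
            self.penalty[current] = self.penalty.get(current, 0.0) + weight
            weight *= decay    # each voucher up the chain gets half as much
            current = self.voucher_of.get(current)

    def may_upload(self, author, threshold=1.0):
        return author in self.penalty and self.penalty[author] < threshold


graph = TrustGraph()
graph.add_root("advisor")
graph.vouch("advisor", "student")
graph.vouch("student", "friend")
graph.flag("friend", evidence="link to the flagged submission")

print(graph.penalty)
print(graph.may_upload("friend"), graph.may_upload("student"))
```

With these numbers a single flag blocks the flagged author, while the people up the chain take a smaller hit and would need several bad vouchees before losing upload rights themselves.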
They have already had a basic form of this for a while [1]
> arXiv requires that users be endorsed before submitting their first paper to arXiv or a new category.
[1] https://info.arxiv.org/help/endorsement.html
The endorsement system already works along that line: https://info.arxiv.org/help/endorsement.html
It's probably not perfect but in practice, it seems to have been enough to get rid of the worst crackpotty spam.
You mean like endorsement? https://info.arxiv.org/help/endorsement.html
I've often thought that similar trust systems would work well in social media, web search, etc., but I've never seen it implemented in a meaningful way. I wonder what I'm missing.
Science reduced to people with a phd?
we got this before gta 6
Now the question is, will arxiv wage a decade long bloody war with Cornell, using heavy infantry (PhD students), archers (reviewers) and field artillery (AI slop papers), or will the independence be mostly peaceful? Only time can tell.
PhD students are levy infantry at best with Postdocs being the armoured levies.
This is exactly what happened last time when scientific publishing got cornered. Journals run by departments and research groups were spun out or sold off to publishers and independent orgs. And they continued to slowly boil the frog over 50 years with fees and gate keeping.
It's especially problematic because while ArXiv loves to claim to be working for open science, they don't default to open licensing. Many of the publications they host are not Open Access, and are only read access. So there is definitely the potential to close things off at some point in the future, when some CEO needs to increase value.
arXiv is great. It's just a problem that there's so much slop. What if arXiv offered a subscription service that people in different fields could use to just see a curated selection of the top papers in their field each month? Established researchers in each field could then review some of the preprints for putting into the curated monthly list.
Oh, wait.
With 300K for the CEO, its enshittification will commence imminently. It will now serve to maximize revenue. Just wait and watch while they issue a premium membership, payment requirements for authors, and other revenue generators to please their investors.
they'll just turn into a shitty journal at this point, they just need to introduce peer review and they can start competing with the real journals on price point.
another will need to rise to take its place.
.. and soon to be dependent on US military funding? Controlled by someone who has run-ins with universities? This'll end in tears.
>Cornell, for example, had a limited capacity to pay software developers to maintain and upgrade the site, which still has a very no-frills look and feel.
arXiv is doomed. It was nice while it lasted.
I am not a software engineer, although I do write programs. What is it about digital infrastructure that requires maintenance? In the natural world, there is corrosion, thermal fluctuation, radiation, seismic activity, vandalism, what have you. What are the issues facing the arXiv demanding the attention of multiple people 'round the clock?
"Recently arXiv’s growth has accelerated. Since 2022, it has expanded its staff to 27, in large part to deal with a 50% increase in submitted manuscripts."
I am wary of that. IMO the business model is damaged therein. You can say in 2022 we had 27; bankrupt in 2030.
Good call, ArXiv seems like one of the most important institutions out there right now.
The French government put a bit of money on the table to help researchers fulfil their open science requirements for government and EU grants, and funded the HAL repository ( https://hal.science/ ). It’s much smaller than arXiv, but it exists. In other countries like the UK there are clusters of smaller repositories as well, but it’s not as well centralised.
It’s so important, in fact, that there should be more than one such institution.
People keep falling into the same trap. They love monopolies, then are shocked when those monopolies jerk them around.
it just hosts pdfs, no?
ArXiv is dead. Expect a paywall within three years, or other enshittification and slop added.
Maybe they'll do something like what Anna’s Archive did
Frankly, the only beef I have with arXiv as is: its insistence on blocking AI access.
I had to tell my AI to set up an MCP for "fetch while bypassing arXiv's rate limit" so that it doesn't burn 40k tokens looking for workarounds every time it wants to look at a paper and gets hit with a "sorry, meatbags only" wall.
Very annoying, given how relevant arXiv papers are for ML specifically, and how many papers there are. Can't "human flesh search" through all of them to pick the relevant ones for your work, and they just had to insist on making it harder for AIs to do it too.
Very unrelated to the article, but I think 'arXiv' as a brand is bad, and really detrimental to what the institution aims to accomplish.
That is, it's not readily parseable, it really gives an insider term vibe - like this isn't for you if you don't already know what it means or how you should read or say it. It sort of reminds me of the overuse of latin and latinate terms generally in the old professions and, well, the academy.
Just always struck me as being somewhat at odds with the goal.
I wonder what makes you feel that. I've been publishing preprints on arXiv for close to a decade now and never had any particular feelings about it.
To me it's just a way to get out your work fast, so that there is already a trace of it on the Internets - nothing more and nothing less.
> That is, it's not readily parseable, it really gives an insider term vibe...
Isn't that normal with highly specialized research fields? I agree many papers could benefit from clearer wording, but working in a niche means you sometimes don't reach a broader audience
It's a classic story of someone having to pick a name quickly, which then gets established long before anyone who cares about branding is aware of its existence.
The original service didn't even have a name, only a description, and it was amusingly hosted at xxx.lanl.gov. But LANL wasn't really interested in it, and the founder eventually left for Cornell. At that point, the service needed a domain name, but archive.org was already taken.
And besides, the name has Ancient Greek influences. A similar Latinate term might be something like "archive".
> like this isn't for you if you don't already know what it means
Isn't that actually kind of a good brand signal for a repo of very specialized papers? "Fun with learning" in Comic Sans wouldn't help credibility.
By your criterion, Google, Apple, and Amazon are terrible names as well.
This is the type of guy that will suggest paper.ly as a better name with a straight face and then we wonder why the internet is turning to shit