Forced browsing / finding hidden resources is one of the crucial part of any black-box web application security assessment. There are great tools to accomplish this task, but our favorite is DirBuster. Simple, fast & smart.
DirBuster ships with several wordlists, these wordlists generated via one big crawler which visited tons of websites, collected links and created most common directory / file names on the Internet. This is a really nice approach and DirBuster’s wordlists worked much better than any other wordlists out there.
However there is one fundamental problem with these wordlists. Whilst the purpose of these wordlists is finding hidden and not linked resources, ironically they are generated only from known and linked resources. To address this problem we came up with the idea of generating wordlists from open source code repositories. This way it would be possible to see all file/directory names and create much more useful wordlists.
We have extracted the directory structure and file names of many projects from Google Code and SourceForge to prepare a good wordlist for discovering hidden files/folders on a targeted web application.
We have sorted the words according to the their frequency count and prepared some lists based on this data.
Initially we needed to find lots of public SVN/CSV. So far we only used Google Code and Sourceforge. We did filtered search such as “Only PHP” or “Only ASP” projects. After this we used FSF (Freakin’ Simple Fuzzer) to scrape, it was a one liner.
After we had the list of all open source projects, we wrote couple of simple batch files to start getting list of files via SVN and CVS clients.
When all finished, we coded a small client to analyse the all repository outputs and load them into an SQL Server database. Later on we applied many filters with yet another small script and generated all these different wordlists to use in different scenarios.
It’s licensed under GPL, feel free to share and use your own GPL-Compatible application.