Lies, Damned Lies, and Statistics
by Bob Patterson
The Phrase Finder website attributes the phrase to a speech made in New York in 1895 by Leonard H. Courtney (1832-1918), later Lord Courtney, in which he said:
"After all, facts are facts, and although we may quote one to another with a chuckle the words of the
Wise Statesman, "Lies - damn lies - and statistics," still there are some easy figures the simplest
must understand, and the astutest cannot wriggle out of."
The statistics presented here are drawn from data recorded by the server at Mississippi State University. All files (pages and photographs) dispensed by the server are logged. The log files are used by two server statistics programs that produce very different reports. I have adjusted the reported page counts to better reflect utilization of the website by persons such as you and me. Much additional data is presented and discussed below.
For April, 2012, the server statistics package used by Mississippi State (Analog) reports that 595,203 pages were requested. The Analog package does not distinguish between pages viewed by you and me as opposed to pages requested by "robots," search engines such as Google that parse the pages for keywords to store in massive indexes. Mike Boone installed a more modern statistics package (Advanced Web Statistics or awstats) that does make this distinction. For April it reported 301,253 viewed pages (you and me) and 307,793 not-viewed (robots). After adjusting the data to reflect additional not-viewed pages that went undetected by awstats I show in the table below a "final" count of 249,101 pages. Adjustments made by me included estimating page counts for several days when the server provided no data, and eliminating from consideration the infrequent but exceptional downloading of pages by a small number of individuals. These adjustments are detailed and discussed at the very end of this report.
Note: There were 29 days in February, 2012.
In the table above, growth columns measure change from the same month in the previous year. Beginning in November of 2011 there appears to be a significant increase in usage of MPG. This was probably brought about by the addition of distribution maps to species pages that began at that time.
A Typical Logfile Data Record
1. Requesting ISP | 76.21.150.170 |
2. Date and Time | [03/May/2012:00:23:18 -0500] |
3. File requested | GET /species.php?hodges=6241 HTTP/1.1 |
4. Status Code/Bytes | 200 6658 |
5. Referring page | http://mothphotographersgroup.msstate.edu/species.php?hodges=6240 |
6. Not used | |
7. Browser type info | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; GTB7.3; .NET CLR 1.1.4322) |
8. Destination ISP | mothphotographersgroup.msstate.edu |
That record is for a request that I made for MPG species page 6241, Euthyatira semicircularis. I made that request by clicking on the next button on species page 6240 (the referring page). The page was found and sent to me, earning a 200 status code for the transaction. The page size, or bandwidth consumed, was 6,658 bytes. I made the request at 01:23 (am, EDT) and this record was logged by the computer at Mississippi State at 00:23 (am, CDT), which is five hours behind Greenwich Mean Time (the -0500 in the record). I made this request through my ISP which is also known as c-76.21.150.170.hsd1.md.comcast.net. That is the ISP name under which this transaction is shown in the server statistics reports. I can look up any ISP address using IP-Address.com. This is how I track down undetected robots that are mentioned further on in this report.
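A record like the one above can be split into its fields with a short script. The following is a minimal sketch, assuming the server writes the widely used Apache "combined" log layout. The raw log line is not reproduced in this report, so the sample line is reconstructed from the table above, and the names `LOG_PATTERN` and `parse_record` are illustrative only.

```python
import re
from datetime import datetime, timezone

# Assumption: Apache "combined" log layout. The real MPG log layout may
# differ (the table above also lists a destination field, omitted here).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_record(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # The timestamp carries its own UTC offset (-0500 in the sample).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

# Sample line reconstructed from the table above (layout is assumed).
SAMPLE = (
    '76.21.150.170 - - [03/May/2012:00:23:18 -0500] '
    '"GET /species.php?hodges=6241 HTTP/1.1" 200 6658 '
    '"http://mothphotographersgroup.msstate.edu/species.php?hodges=6240" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; '
    'GTB7.3; .NET CLR 1.1.4322)"'
)

record = parse_record(SAMPLE)
print(record["ip"], record["status"], record["size"])
print(record["time"].astimezone(timezone.utc))  # 05:23:18 in GMT/UTC
```

Converting the timestamp to UTC makes the five-hour offset explicit: 00:23:18 at the server corresponds to 05:23:18 Greenwich Mean Time.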
How The Internet Works
Depending upon your computer system, browser type, and the browser options that you have selected, the interaction between you and MPG goes something like the example I describe here. Before your browser can display a page for you, all the elements of the page must reside on your computer in a folder that (in Windows operating systems) is usually called "Temporary Internet Files." If you have visited a page previously, your browser (depending on browser options) may simply show you the page from your PC's cache (Temporary Internet Files) without sending a new request to MPG. If you are a first-time visitor of MPG your browser has no other option than to request the page. If you are a regular user of MPG your browser options may result in the transactions I now describe.
Web pages are text files containing HTML instructions understood by all web browsers. A file may have an extension or filetype of .htm, .html, .shtml, .php or some other type. In this case I requested MPG species page 6241 and the .php file for that page was returned by the server at Mississippi State to my browser in Maryland.
My browser "parsed" the file and sent back to the server a "conditional request" for five files needed to complete the presentation of the page. The condition was to send those five files only if the last change date of a file was more recent than the date given by my browser. The server did not send two of those files (status code 304); the other three were sent. My browser then showed me the page after retrieving the two "304" files from my PC's cache. If my Internet access service plan were for metered service (such as wireless service with a 3 gigabyte monthly limit) I would have been "charged" about 96,000 bytes for receiving this page.
In this instance my browser did not request the seven standard files that appear at the top of every MPG species page (Google logo, three buttons for last and next plate and species list, two menu boxes and the page header artwork). They were supplied to me from my PC's cache. Files already stored on my PC do not have to be requested again unless my browser options insist on checking for more recent files every time I view a page.
The awstats report will look at the logfile records and conclude that one page was viewed, and it will also count four "hits," and tally about 96,000 bytes of bandwidth. It may also add this transaction to one or more counters that attempt to summarize the number of visitors to the website.
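The tally described above (one page, four hits, about 96,000 bytes) can be reproduced with a small sketch. The counting rules here are assumptions chosen to match that description, not a statement of how awstats is actually implemented: a 200 response counts as a hit and adds its bytes, a 200 response for a page-type file also counts as a page, and a 304 response adds nothing. Only the first request comes from the log record shown earlier; the other file names and byte sizes are hypothetical, picked so the total comes to 96,000.

```python
# File types treated as "pages" (taken from the list of extensions above).
PAGE_EXTENSIONS = (".htm", ".html", ".shtml", ".php")

def tally(requests):
    """Count pages, hits and bytes from (path, status, size) tuples."""
    pages = hits = bandwidth = 0
    for path, status, size in requests:
        if status != 200:
            continue  # e.g. 304 Not Modified: the browser used its cache
        hits += 1
        bandwidth += size
        # Ignore any query string when checking the file type.
        if path.split("?", 1)[0].endswith(PAGE_EXTENSIONS):
            pages += 1
    return pages, hits, bandwidth

requests = [
    ("/species.php?hodges=6241", 200, 6658),  # from the log record above
    ("/files/photo_a.jpg", 200, 41000),       # hypothetical file and size
    ("/files/photo_b.jpg", 200, 38000),       # hypothetical file and size
    ("/files/map_6241.gif", 200, 10342),      # hypothetical file and size
    ("/files/thumb_a.jpg", 304, 0),           # hypothetical, not re-sent
    ("/files/thumb_b.jpg", 304, 0),           # hypothetical, not re-sent
]

print(tally(requests))  # (1, 4, 96000)
```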
Double Counting of Some Pages?
As shown above, my ISP is Comcast.net in Maryland. For much of my work online I am also logged on to America Online (AOL). I can access the MPG website by either ISP, singly or in combination. In the example shown here I brought up the MPG plates menu from within AOL and issued a request for the books page. Comcast is shown by the log entry for 76.21.150.170. All of the entries beginning with 64.12 are for AOL. The first AOL entry is 64.12.116.12 and is also rendered as the ISP address cache-mtc-aa08.proxy.aol.com.
My browser parsed the books page and returned a request to transmit 12 image files including 10 book covers. These 12 files were requested to be returned to 11 different ISP identities. This is an example of "dynamic" web addressing. When I use AOL I might be listed under a different address each time I make a page request. This implies that I may be counted as several visitors.
Also notice that the request for the Books.shtml file seems to have been sent by both AOL and Comcast with both receiving normal status codes and byte counts. This implies to me that, at least under some conditions, a page might be counted twice. This does not appear to happen all the time in my case, and I don't know for certain that double-counting is taking place in such instances.
Robots - Search Engines - Referrers
Robots, also known as web bots, web crawlers and spiders, visit websites to collect information for Internet search services such as Google, Yahoo and MSN. There are many of them and they are operated from all over the world. The table below lists 37 of the 88 robots reported by awstats that visited MPG during 2011. The number is considerably larger than 88 because there are many robots included in the grouped entries seen at the top of the list. MPG was visited by more than 35 of these robots during all 12 months of 2011. Daily robot traffic is large, and there are more pages "not viewed" by robots than there are pages viewed by real people. Many photographs are also downloaded by robots and used by services such as Google Images.
Robot Name or Identifier | Months | Hits |
Unknown robot (identified by 'bot*') | 12 | 32,222,210 |
Unknown robot (identified by '*bot') | 12 | 859,289 |
Unknown robot (identified by 'discovery') | 12 | 693,497 |
Unknown robot (identified by 'spider') | 12 | 620,523 |
Unknown robot (identified by 'crawl') | 12 | 593,539 |
Unknown robot (identified by 'robot') | 12 | 167,093 |
Unknown robot (identified by empty user agent string) | 12 | 75,820 |
Unknown robot (identified by hit on 'robots.txt') | 12 | 261 |
Unknown robot (identified by 'checker') | 9 | 20 |
Unknown robot (identified by 'hunter') | 1 | 4 |
Alexa (IA Archiver) Amazon | 12 | 834 |
BaiDuSpider | 12 | 297,278 |
bender focused_crawler | 12 | 10,192 |
BSpider | 12 | 586 |
CFNetwork | 12 | 575 |
Exabot | 12 | 10,405 |
FaceBook bot | 12 | 9,245 |
Feedfetcher-Google | 12 | 4,995 |
GigaBot | 12 | 138 |
Googlebot | 12 | 861,427 |
ichiro | 12 | 2,776 |
Java (Often spam bot) | 12 | 4,728 |
larbin | 12 | 159 |
MJ12bot | 12 | 174,537 |
MSNBot-media | 12 | 15,318,466 |
Netcraft | 12 | 49 |
Nutch | 12 | 5,414 |
OutfoxBot/YodaoBot | 12 | 3,233 |
Perl tool | 12 | 325 |
Sogou Spider | 12 | 48,956 |
Speedy Spider | 12 | 180,698 |
Voila | 12 | 95 |
WebCollage | 12 | 12,494 |
WGet tools | 12 | 1,930 |
WordPress | 12 | 1,069 |
Yahoo Slurp | 12 | 3,373,500 |
Yandex bot | 12 | 2,303,324 |
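The "Unknown robot" rows at the top of the table suggest that many robots are recognized by keywords in the browser-type (user agent) field of the log record. The sketch below is a simplified illustration of that idea, not the actual awstats matching rules: it treats an empty user agent, or one containing any of the keywords listed in the table, as a robot. The names `ROBOT_MARKERS` and `looks_like_robot` are illustrative.

```python
# Keywords taken from the "Unknown robot (identified by ...)" rows above.
# "robot" needs no separate entry because it contains "bot".
ROBOT_MARKERS = ("bot", "spider", "crawl", "discovery", "checker", "hunter")

def looks_like_robot(user_agent):
    """Guess whether a user agent string belongs to a robot."""
    agent = user_agent.strip().lower()
    if not agent or agent == "-":
        return True  # an empty user agent string is treated as a robot
    return any(marker in agent for marker in ROBOT_MARKERS)

browser = ("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
           "Trident/4.0; GTB7.3; .NET CLR 1.1.4322)")
crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(looks_like_robot(browser))  # False
print(looks_like_robot(crawler))  # True
```

A robot that presents an ordinary browser string passes this kind of test unnoticed, which is why some not-viewed pages have to be tracked down by address instead.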
Referrers
In April of 2012 thousands of visits to MPG pages were the result of people clicking on links provided by more than 28 search engines. The number of visits generated by Google (24,787 in awstats report) may be misleading. I have no doubt that the vast majority of those came about by people, who were already visiting MPG, making use of the search window at the top of MPG pages. However, this serves to illustrate the enormous value of a search engine from within MPG in addition to the importance in making people, anywhere in the world, aware of MPG.
This list shows some of the individuals and organizations who referred people to MPG in 2011. It is particularly gratifying to me to see the names of John Snyder and Markku Savela on that list. In addition to his pages for South Carolina moths, John has maintained a listing of links to all North American species since before I got involved with moths in 2003. Markku has likewise maintained a very important list covering the entire world of Lepidoptera that I have depended upon for years.
It would take considerable effort to produce a list of all external referrers during the course of a year. Links to MPG from other websites number in the thousands. This is one of the amazing benefits of Internet inter-connectivity. The entire world is within microseconds of our computers.
Monthly Activity Including Adjustments to Data
In the table below the numbers of visitors and visits have been computed by the awstats reports. They are based on that report's estimate of "viewed pages" which I have found necessary to adjust downward by up to 35%. I do not know if the counts for visitors and visits should be reduced similarly. Since many visitors are "lumped" by their ISPs, they cannot be counted accurately. The numbers shown here should be considered very rough estimates.
The first "Adjusted" column takes the number in Viewed PP and subtracts from it Adjust and Bump and adds to it Missed. The second Adjusted column takes the number in No-view PP and adds to it the number in Adjust. The two columns of adjusted data are used in the table presented at the top of this report.
Month | Missed | BUMP | Adjust | Visitors | Visits | Viewed PP | Adjusted | No-view PP | Adjusted |
2010-01 | | | 33,324 | 8,016 | 19,005 | 134,144 | 100,820 | 291,845 | 325,169 |
2010-02 | | 15,371 | 92,739 | 7,398 | 15,472 | 166,454 | 58,344 | 259,360 | 352,099 |
2010-03 | 3,500 | | 54,112 | 8,423 | 17,839 | 153,205 | 102,593 | 323,689 | 377,801 |
2010-04 | | | 56,134 | 11,144 | 20,936 | 177,095 | 120,961 | 333,494 | 389,628 |
2010-05 | | | 68,068 | 15,962 | 29,251 | 241,356 | 173,288 | 623,657 | 691,725 |
2010-06 | | | 52,114 | 17,021 | 29,296 | 243,592 | 191,478 | 532,535 | 584,649 |
2010-07 | | | 91,840 | 17,301 | 28,830 | 274,387 | 182,547 | 631,600 | 723,440 |
2010-08 | | 21,000 | 52,467 | 18,335 | 33,052 | 251,548 | 178,081 | 835,631 | 888,098 |
2010-09 | | 11,453 | 5,943 | 16,683 | 27,221 | 186,114 | 168,718 | 818,906 | 824,849 |
2010-10 | | | 7,788 | 11,710 | 21,392 | 138,491 | 130,703 | 515,669 | 523,457 |
2010-11 | 16,000 | | 35,062 | 8,813 | 15,544 | 120,705 | 101,643 | 336,380 | 371,442 |
2010-12 | 12,000 | | 30,721 | 6,917 | 14,198 | 113,603 | 94,882 | 309,406 | 340,127 |
2010 | 31,500 | 47,824 | 580,312 | | 272,036 | 2,200,694 | 1,604,058 | 5,812,172 | 6,392,484 |
|
Month | Missed | BUMP | Adjust | Visitors | Visits | Viewed PP | Adjusted | No-view PP | Adjusted |
2011-01 | | 25,000 | 27,801 | 7,511 | 17,021 | 169,806 | 117,005 | 338,196 | 365,997 |
2011-02 | | | 47,738 | 7,687 | 14,123 | 145,685 | 97,947 | 367,445 | 415,183 |
2011-03 | 4,000 | 24,366 | 27,068 | 8,637 | 18,486 | 175,636 | 128,202 | 394,300 | 421,368 |
2011-04 | | | 17,159 | 9,810 | 19,163 | 158,607 | 141,448 | 427,728 | 444,887 |
2011-05 | | | 16,141 | 12,193 | 23,680 | 199,329 | 183,188 | 385,423 | 401,564 |
2011-06 | | | 8,064 | 13,964 | 25,116 | 207,445 | 199,381 | 410,818 | 418,882 |
2011-07 | | | 13,665 | 16,063 | 28,246 | 260,058 | 246,393 | 251,588 | 265,253 |
2011-08 | | | 11,329 | 15,705 | 29,433 | 234,258 | 222,929 | 219,487 | 230,816 |
2011-09 | | | 10,959 | 14,086 | 27,322 | 227,772 | 216,813 | 236,093 | 247,052 |
2011-10 | | | 14,355 | 11,241 | 24,423 | 183,738 | 169,383 | 216,737 | 231,092 |
2011-11 | | | 7,199 | 10,457 | 22,641 | 174,249 | 167,050 | 221,945 | 229,144 |
2011-12 | | | 5,170 | 8,670 | 19,436 | 150,774 | 145,604 | 215,077 | 220,247 |
2011 | 4,000 | 49,366 | 206,648 | | 269,090 | 2,287,357 | 2,035,343 | 3,684,837 | 3,891,485 |
|
Month | Missed | BUMP | Adjust | Visitors | Visits | Viewed PP | Adjusted | No-view PP | Adjusted |
2012-01 | | | 10,006 | 9,380 | 22,049 | 163,676 | 153,670 | 279,583 | 289,589 |
2012-02 | | 29,786 | 25,047 | 9,110 | 23,396 | 224,637 | 169,804 | 273,531 | 298,578 |
2012-03 | 6,500 | | 17,246 | 10,923 | 26,054 | 222,528 | 211,782 | 290,509 | 307,755 |
2012-04 | | 38,278 | 13,874 | 14,174 | 29,772 | 301,253 | 249,101 | 307,793 | 321,667 |
2012-05 | | | 23,307 | 17,969 | 36,013 | 324,285 | 300,978 | 381,112 | 404,419 |
2012-06 | | 23,000 | 28,960 | 17,656 | 34,991 | 362,367 | 310,407 | 371,632 | 400,592 |
2012-07 | | 20,381 | 34,167 | 18,322 | 38,754 | 408,430 | 353,882 | 431,887 | 466,054 |
2012-08 | | 16,000 | 136,012 | 17,920 | 36,065 | 484,547 | 332,535 | 390,911 | 526,923 |
2012-09 | | | 22,146 | 18,554 | 43,337 | 304,552 | 282,406 | 429,481 | 451,627 |
2012-10 | | | 20,471 | 17,384 | 37,281 | 273,621 | 253,150 | 532,984 | 553,455 |
2012-11 | | | 19,065 | 13,611 | 30,854 | 224,655 | 205,590 | 464,490 | 483,555 |
2012-12 | | 77,000 | 28,163 | 11,757 | 30,580 | 299,703 | 194,540 | 467,695 | 572,858 |
2012 | 6,500 | 204,445 | 378,464 | | 389,146 | 3,594,254 | 3,017,845 | 4,621,608 | 5,077,601 |
|
Month | Missed | BUMP | Adjust | Visitors | Visits | Viewed PP | Adjusted | No-view PP | Adjusted |
2013-01 | | 70,000 | 10,125 | 11,595 | 35,782 | 300,681 | 220,555 | 710,875 | 791,000 |
2013-02 | | 40,707 | 21,772 | 10,542 | 28,984 | 251,955 | 189,476 | 1,259,281 | 1,321,760 |
2013-03 | 7,200 | 15,201 | 20,241 | 13,235 | 31,778 | 251,087 | 222,845 | 2,280,210 | 2,308,452 |
2013-04 | | 3,844 | 19,681 | 19,513 | 40,850 | 316,601 | 293,076 | 3,455,526 | 3,479,051 |
2013-05 | | 35,509 | 4,381 | 20,686 | 44,036 | 352,148 | 312,258 | 2,520,061 | 2,559,951 |
2013-06 | | | 19,615 | 21,190 | 45,054 | 377,832 | 355,217 | 2,359,677 | 2,379,292 |
2013-07 | | | 12,308 | 26,034 | 52,528 | 469,745 | 457,437 | 1,557,221 | 1,569,529 |
2013-08 | | | 39,908 | 27,257 | 59,201 | 512,563 | 472,655 | 2,058,024 | 2,097,932 |
2013-09 | | | 44,706 | 28,752 | 60,347 | 451,900 | 407,494 | 1,916,322 | 1,961,728 |
2013-10 | | | 37,114 | 22,055 | 50,614 | 402,351 | 365,237 | 2,361,702 | 2,398,816 |
2013-11 | | | 25,585 | 16,901 | 39,409 | 303,498 | 277,913 | 3,157,638 | 3,183,223 |
2013-12 | 7,000 | | 10,730 | 14,965 | 32,930 | 287,329 | 276,599 | 3,938,480 | 3,949,210 |
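The rule stated before these tables for the two "Adjusted" columns can be written out as a short calculation. This is a sketch of the rule as described in the text (the function and parameter names are illustrative), checked here against the April 2012 and March 2010 rows.

```python
def adjusted_counts(viewed, no_view, missed=0, bump=0, adjust_amount=0):
    """Apply the adjustment rule described in the text.

    Adjusted viewed   = Viewed PP - Adjust - BUMP + Missed
    Adjusted no-view  = No-view PP + Adjust
    """
    adjusted_viewed = viewed - adjust_amount - bump + missed
    adjusted_no_view = no_view + adjust_amount
    return adjusted_viewed, adjusted_no_view

# April 2012 row: BUMP 38,278 and Adjust 13,874.
print(adjusted_counts(viewed=301_253, no_view=307_793,
                      bump=38_278, adjust_amount=13_874))
# (249101, 321667)

# March 2010 row: Missed 3,500 and Adjust 54,112.
print(adjusted_counts(viewed=153_205, no_view=323_689,
                      missed=3_500, adjust_amount=54_112))
# (102593, 377801)
```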
Exceptional Downloading of Large Blocks of Pages
Large numbers of pages are downloaded from time to time with much of this activity taking place from outside North America. In the table below are entries for such activity from Austria, Great Britain (twice), Iran, Netherlands and Russia. This has also happened several times from within North America. In these cases the page counts are disregarded.
It is reasonable to think that the North American activity represented that of heavy users of MPG who will, in the future, be able to severely reduce their visits. Should this "lost business" go unaccounted for? It might be thought acceptable to use an accrual accounting method for these large blocks of pages. The block of 30,000 pages downloaded in early April of 2012 could be counted as 2,500 pages per month over a twelve-month period.
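The accrual idea amounts to spreading a one-time block of pages evenly over a number of months. A minimal sketch follows; the function name `accrue` is illustrative, and handing any remainder to the earliest months is an arbitrary choice made only so the monthly figures always add back up to the original block.

```python
def accrue(pages, months=12):
    """Spread a one-time block of pages evenly across a number of months."""
    base, remainder = divmod(pages, months)
    # Hand out any remainder one page at a time to the earliest months.
    return [base + (1 if i < remainder else 0) for i in range(months)]

schedule = accrue(30_000)
print(schedule[0], sum(schedule))  # 2500 30000
```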
| Major Abnormalities Noted in Log Files -- Adjusted in Previous Table |
2010-01 | 26,000 page views by robots from Korea affect many days. |
2010-02 | 5,288 pages crawled by Scoutjet robot (also in other months). |
2010-02 | 8,225 pages were linked from another website. |
2010-02 | 52,000 page views by robots from Korea affect many days. |
2010-02 | 15,371 pages were viewed/downloaded by a Russian address on 18-19 Feb. |
2010-03 | 39,000 page views by robots from Korea affect many days. |
2010-03 | No statistics have been made available by the server for 13-Mar. 3,500 pages are estimated for that date. |
2010-04 | 46,000 page views by robots from Korea affect many days. |
2010-05 | 60,000 page views by robots from Korea affect many days. |
2010-06 | 38,000 page views by robots from Korea affect many days. |
2010-07 | 80,000 page views by robots from Korea affect many days. |
2010-08 | 10,000 page views via cable provider in southwestern Canada remain unexplained. |
2010-08 | Spike of 13,000 pages over several dates was caused by a robot from Korea. |
2010-08 | 11,000 pages were viewed/downloaded by an address in Iran on 14-Aug. |
2010-09 | 11,453 pages were viewed/downloaded by a UK address on 18-Sep. |
2010-11 | Spikes totalling 16,000 pages were caused by several robots. |
2010-11 | No statistics have been made available by the server for four days. 16,000 pages are estimated for those days. |
2010-12 | Spike of 13,000 pages on 6-Dec was probably caused by a robot from Korea. |
2010-12 | No statistics have been made available by the server for four days. 12,000 pages are estimated for those days. |
2011-01 | Spike of 25,000 pages on 21-Jan was caused by a visitor from Netherlands. |
2011-01 | Spike of 9,000 pages on 22-Jan was probably caused by a robot from Korea. |
2011-01 | Spike of 9,000 pages on 28-Jan was probably caused by a robot from Korea. |
2011-02 | Spike of 13,000 pages on 5-Feb was probably caused by a robot from Korea. |
2011-02 | Spike of 17,000 pages on 14-Feb was probably caused by a robot from Korea. |
2011-03 | 5,336 pages downloaded by an Austrian address may account for a spike in usage on 2-Mar. |
2011-03 | A spike of about 5,000 pages on 3-Mar was probably caused by a robot from Korea. |
2011-03 | A Bellsouth customer downloaded about 19,000 pages on 24-Mar. |
2011-03 | No statistics have been made available by the server for 12-Mar. 4,000 pages are estimated for that date. |
2011-05 | A spike of about 6,000 pages on 21-May was probably caused by a robot from Korea. |
2012-01 | 2,560 pages recorded under 149.168.27.243 were viewed from a State of North Carolina address. |
2012-02 | 29,786 pages viewed/downloaded by a UK address on 4-Feb, nearly all species pages and maps. |
2012-03 | No statistics have been made available by the server for 10-Mar. 6,500 pages are estimated for that date. |
2012-04 | About 30,000 pages are recorded by a single user on 2-Apr & 3-Apr. All species pages and maps. |
2012-06 | About 23,000 pages recorded on 13-Jun., possibly by same user noted in April. |
2012-07 | 20,381 pages were viewed/downloaded by an address in Iran on 14/15-Jul. See August, 2010 also. |
2012-08 | About 16,000 pages were viewed/downloaded from an ISP in southeastern Ohio (Ironton). |
2012-08 | About 111,000 pages were downloaded from Ukraine and 21,000 from several countries in Asia. |
2012-12 | About 77,000 pages were downloaded by three users from ISPs in Michigan, New Jersey and Ohio. |
2013-01 | About 70,000 pages were downloaded by two users from ISPs in Ohio and Texas. |
2013-02 | About 41,000 pages were downloaded by a single user in Ohio. |
2013-03 | No data were available from the server for 10 Mar. 7,200 pages were estimated for that date. |
2013-03 | About 15,200 pages were downloaded by a single user in Great Britain. |
2013-04 | About 3,844 pages were downloaded by a single user in Great Britain. |
2013-04 | 1,327,452 hits (combined pages and photographs) were downloaded during 11 sessions by users of HTTrack Off-line Browser software. Much or all of this activity was by Ken Childs and Bob Patterson who now have the entire website on their personal computers. Instructions for doing this (by PC users, not Macs) will be presented below. |
2013-05 | About 35,509 pages were downloaded by a user at a Bellsouth ISP. |
2013-12 | 7,000 pages were estimated for December 21 when the server gathered no statistics. |
Downloading for Off-line Use of MPG
In 2009 I spent many days at the U. S. National Museum selecting specimens of tortricid moths to be photographed. I needed to be able to view the MPG pinned specimen plates to determine "needed" species, those for which we needed better photographs, and cases where additional variants should be shown. Because the USNM is not equipped with Wi-Fi Internet access I needed to bring the MPG tortricid plates with me.
I downloaded those plates to my laptop (about 23 pages). During 20 visits to the museum I probably looked at 10 pages per visit, or 200 pages. The MPG website statistics were credited for 23 pages viewed. It did not get credit for the 200.
The little table at the right shows the number of pages you would need to download to have a complete series on your computer. If you wanted to be really complete in species pages you would need to download another 13,086 pages to have all the large maps and monthly distribution charts. Some people have actually done this. To make the pages work well on your computer you would need to create your own menu page. The buttons for last and next plates will not work off-line without changing the html code inside every page. While this may be a lot of work it could result in great convenience.
Andrei Sourakov is the collection coordinator at the McGuire Center in Gainesville, Florida. He uses MPG extensively when working in the collections. Because the museum's Wi-Fi connection is sometimes slow he decided a few years ago to download all the pinned specimen plates. He converted the downloaded plates into .pdf files, printed them, and made a large notebook that can be taken to any work table in the collection area where it helps curators to sort through material. He also has the .pdf pages on a laptop. Other workers at McGuire use the MPG website online.
Greg Raterman lives in a remote area of Ohio where he must subscribe to a metered wireless service in order to have high-speed Internet access. It can become expensive to view several thousand pages at MPG if it causes you to exceed the limits of your wireless plan. Last October, Greg downloaded the pinned specimen plates and enjoyed using them off-line at home. He also took them with him, on a laptop, on a camping and moth photography trip to Florida. He later downloaded one of the series of living moth pages.
Andrei, Greg and I did not download a very large number of pages. There is no way for me to detect when someone downloads 25-150 pages. It looks like normal daily activity in the server logfiles. A bump of 5,000+ pages usually stands out in the awstats daily usage reports.
Perhaps we will one day figure out how to place a workable copy of MPG on a DVD and offer it to people who would like to be able to use MPG while off-line. Off-line usage contributes nothing to our monthly statistics, and the current extent of such usage is unknowable. But knowing that it is being done makes us aware that the number of pages viewed is understated in the statistics presented here.
How Many People Use MPG?
This is, for a variety of reasons, a very difficult question to answer. While it is possible to measure website usage with some precision in terms of pages viewed, it is not possible to measure people/users with equal precision. For one thing, opinions may vary as to what constitutes a "user." The awstats reports tell us that in September of 2012 there were 18,554 visitors to the website, the highest visitor count during the most recent three-year period. I think that is an unrealistically large number bearing little relationship to the number of "real users." Let me use an analogy to suggest a universe of "user groups."
The Washington Nationals and Atlanta Braves (two baseball teams) are playing a game at Nationals Stadium in Washington D.C. The primary interest group consists of the two teams (about 50 players, combined). To these we might also add all the coaches, umpires, stadium groundskeepers, front office staffs of the two teams and their minor league farm clubs. There might be a total of 1,000-2,000 persons of primary interest. Of secondary interest we could count the fans actually present in the stands as well as all season ticket holders in the Washington area and in Georgia (perhaps 100,000 persons?). Finally, we might include in a wider universe a very large group that follows these teams on radio, television and in newspapers and online blogs (500,000-1,000,000 persons?). How do you define the Nationals/Braves user group or the MPG user group?
The only way to accurately determine the number of users of a website is to require each person who wants to access the site to register as a user with name and password. It would then be possible to count the number of visits by each person and the number of pages viewed. Doing this would dissuade some, perhaps many, people from using the website. Some would use false names, while others would forget passwords and register multiple times. We would probably get much better user statistics than is now possible, but at a cost of reduced use of the website and the annoyance of some users.
In the real world, the present system discussed more fully above, a visitor is represented in log files under two possible identifiers: IP Address or Hostname. These can be "static," the same every time a person visits, or "dynamic," assigned randomly by an Internet Service Provider at the time someone logs in to the service. In my case, in normal daily computer usage from my home, I use two computers side-by-side. One of them is always logged in to a static address at Comcast Cable and never changes (unless Comcast changes their operating system). My second computer is always logged in to AOL where dynamic addresses are used. When I log in to AOL I am assigned addresses from a pool of addresses, and I will keep the assigned address for up to 24 hours. During the course of a year I might be assigned 100 or more unique addresses. Therefore, at MPG, I may be counted as 100 or more visitors. Also, when we travel and I log in daily from motels or museums and other stopping points, I will appear in the MPG log files under additional identifiers.
This problem is by no means unique to me. Many people access MPG from home and from work using different ISPs in both places. They may also use wireless devices while traveling or at home where dynamic addressing is almost always used. The next example is from Mississippi State University where workers in the Entomological Museum using desktop computers are randomly assigned "clay-lyle dynamic" addresses and the same workers using laptop computers are assigned "wireless dynamic" addresses. None of these addresses identifies any individual. Addresses are assigned for up to 24 hours as in my case with AOL. During a twelve month period MPG recorded 151 different addresses from Mississippi State, including 41 from the Clay-Lyle Entomology Building. It is quite possible that another 41 addresses were used from laptops in Clay-Lyle, resulting in about 80 addresses used by a maximum of about seven staff persons (some of whom rarely access MPG). One of these addresses was used in eight different months. But we know that Richard Brown (and probably several others), visited MPG in all 12 months of the year. There is simply no way for us to know how many persons at Mississippi State made use of MPG in the 12 months under study.
It would be nice if we were able to supply data regarding the use of the website by a class of persons such as researchers or lepidopteran scientists. The next chart shows the number of pages viewed by persons affiliated with schools, other organizations and government agencies. It does not tell us how many individuals are represented nor does it distinguish between scientists, teachers or students. Nor does it tell the full story because many of these users also access the website from home or from mobile devices not hosted by their institutions: a lot of additional access by these individuals is recorded under .com and .net.
Most visitors to the website visit only fleetingly. Call them drop-ins or drive-by customers, they see only one or a very few pages and are present only for a few seconds. They represent 60%-75% of all visits to the website. I submit that real usage of the website begins with the 30 sec. to 2 min. group of users. This phenomenon of brief website visits is not unique to MPG. Very similar statistics are shown by the awstats reports for BugGuide and other websites.
In addition to viewing usage by "time cohorts" we can profitably do so by page counts. Just as we saw above that most website visits last only a few seconds, we can see in the chart below that a large majority of pages are dispensed to visitors who view just a very few pages (75% of users view fewer than 5 pages in a month). Here I submit that only the visitors recorded in the 10+ pages columns constitute the "real users" of the website. The cohort viewing 5-9 pages in a month is discussed below.
Visitors who view 5-9 pages in a month are an interesting group. More than 90% of them visited the website during only one month in a twelve month period. They cannot be considered regular users of MPG. None of these visitors visited the site during 9-12 months. Of the relatively few that were detected during 2-8 months the great majority were represented by dynamic or proxy IP addresses. They may very well be persons already counted in the "real users" cohort in the following Table 3.
The last table shows the high-usage cohort of website visitors. Even in this cohort the data are heavily skewed toward persons who made use of the website in only one month out of the 12-month study period. Keep in mind that a person who began use of the website in July could at maximum have been counted during four months (November and December are from the preceding year). Even the single-month visitors did a serious amount of poking around and probably derived real benefit from visiting the website. But I would not count all of them within my concept of "real users." One may argue from these data that the number of serious users of MPG ranges from 2,477 to 18,624. My personal opinion based upon years of emails and anecdotal evidence is that the core user group of MPG may amount to 4,000-5,000 user/visitors.
When you consider that membership in the Lepidopterists' Society is not much more than 1,000 (half of whom are "butterfly people" or not particularly interested in the North American fauna), 5,000 amateur photographers, collectors and interested professionals using MPG in a serious way strikes me as an amazing statistic.
Lies, Damned Lies, and Statistics
It should be evident by now that website statistics are in part determined by the quality and quantity of available data, limitations imposed by data collection systems, interpretations placed upon those data and, it might be hoped, the elimination of biases by interpreters. Caveat emptor.
The awstats report for August 2012, in the section titled "Pages-URL," tells us that 446,854 pages were viewed and that these included 32,998 different pages-urls. This statement is factually correct but rather meaningless. There are potentially 13,000+ species pages and 13,000+ large map and distribution chart pages. These represent just two types of pages at MPG. More than 90% of all activity at MPG involves the use of just a handful of page formats as shown in the following table that summarizes a year's activity as a typical or average month. A minor portion of overall website usage is not shown in order to keep the presentation simple.
Entry Pages (there are three of them) include the Main Menu and Plates Menu. There is also an early version of the Main Menu, the "default page," that is seen when the simple URL "href=mothphotographersgroup.msstate.edu" is used. Many visitors store the URLs for these pages on their personal computers and use them as a quick method for getting to the MPG website. Visitors then make menu selections to view living moth or pinned specimen plates, species pages or other website content. In a number of cases (items 7-14) an item in the table is composed of a set of pages. There is usually a menu or index page which contains some text or introductory content and several subsidiary pages or plates. Page counts for the tutorial "A Walk Through the Moth Families" are shown here:
Item 12 in the previous table shows that, averaged over a twelve-month period, 2,296 "Walk Through" pages were viewed per month. This does not mean that 2,296 persons viewed those pages. In the table showing the Walk Through Series you will see that there were 3,589 pages viewed (in August 2012). That is the total for all the pages in the series. There were "only" 1,249 views of the index page, the entry point to the series of pages. But each of the subsidiary pages was viewed only 414-525 times. Thus, it is not appropriate to claim that 3,589 or 1,249 persons took the walk through the series of pages. The maximum number of individuals who walked through the entire series could be no more than 414, and the actual number of such persons was almost certainly much lower.
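The reasoning here is that nobody can have completed the series without viewing its least-viewed page, so the smallest page count in the series is a ceiling on the number of complete walk-throughs. A minimal sketch follows. The index count (1,249) and the range of 414-525 come from the text; the number of subsidiary pages, their names, and their individual counts are hypothetical, chosen only so the total matches the reported 3,589.

```python
# Views per page in the Walk Through series for one month.
# Only the index count and the 414-525 range are from the report;
# the individual subsidiary pages and counts below are hypothetical.
walk_through_views = {
    "index": 1249,
    "page_1": 525,  # hypothetical
    "page_2": 480,  # hypothetical
    "page_3": 461,  # hypothetical
    "page_4": 460,  # hypothetical
    "page_5": 414,  # hypothetical
}

total_views = sum(walk_through_views.values())
# Nobody can finish the series without viewing its least-viewed page,
# so the smallest count is an upper bound on complete walk-throughs.
max_complete_walks = min(walk_through_views.values())

print(total_views, max_complete_walks)  # 3589 414
```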
Detecting Abnormalities and Errors
In the Pages-URL section of the awstats report for October of 2012 there are found two startling entries:
/larva.php?plate=6&page=2&sort=h (14,752 pages viewed)
/?-n+-s+-d+default_mimetype%3dh21tmis+ (2,013 pages viewed)
I have no idea what the second item is about. It should be counted, if at all, under not-viewed pages.
The series of larvae plates that month included 294 views of the index page and 44-161 views of each of the subsidiary pages. My guess is that the 14,752 count for plate=6&page=2 resulted from a robot gone haywire. I reduced that count substantially. Abnormalities in these amounts are fairly easy to detect. But they might be easy to overlook if the number of pages mis-counted were less than 500.
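One simple way to catch a count like this is to compare each page in a series with the typical count for that series. The sketch below flags any page viewed more than ten times the series median. This is an illustration only, not the procedure used for the adjustments in this report, and the factor of ten is an arbitrary assumption. The index count (294) and the 14,752 figure come from the text; the other plate names and counts are hypothetical values within the reported 44-161 range.

```python
from statistics import median

def flag_outliers(counts, factor=10):
    """Return pages whose view count exceeds `factor` times the series median."""
    typical = median(counts.values())
    return {page: n for page, n in counts.items() if n > factor * typical}

larva_views = {
    "/larva.php (index)": 294,                  # from the report
    "/larva.php?plate=6&page=2&sort=h": 14752,  # from the report
    "/larva.php?plate=1": 161,                  # hypothetical page and count
    "/larva.php?plate=2": 98,                   # hypothetical page and count
    "/larva.php?plate=3": 44,                   # hypothetical page and count
}

print(flag_outliers(larva_views))
# {'/larva.php?plate=6&page=2&sort=h': 14752}
```

A mis-count of a few hundred pages would fall under the threshold and pass unflagged, which matches the caution above about smaller abnormalities.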
I am fairly confident that the numbers for adjusted data given in tables at the top of this paper are accurate to within 1%-2%. I wish that it were possible to be that accurate about visitor/user data, but that is not the case.