A Simple Spider for Researcher Ranking Project

An important issue in ranking researchers is constructing the citation network between researchers. To do so, I need to crawl the citation database on the HistCite website, whose URL is


I downloaded a Python spider from the web and revised it to make it usable. It is not perfect, but it is good enough for this project. It is really cool to watch the command window scroll while it downloads thousands of papers. Wow, is this the so-called geek behaviour?
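To give a rough idea of what such a spider does, here is a minimal sketch. It is not the script I downloaded and revised, and every name in it (LinkParser, extract_links, fetch, crawl) is made up for illustration. It assumes the pages to collect all live under the start URL, and it knows nothing about how the HistCite pages are actually laid out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def fetch(url):
    """Download one page and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl of the pages whose URL starts with start_url."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            # Network errors (URLError is an OSError): skip this page.
            continue
        pages[url] = html
        for link in extract_links(url, html):
            # Assumption: only follow links that stay under the start URL.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The fetch_page parameter is there so that the crawl logic can be tried on a fake site held in a dictionary before it is pointed at a real server.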


Source Code

The original version comes from this website


Draft of my Report about ReRank Project.


This is the draft of my report on the ReRank Project. It is sketchy but includes most of the points.

My basic idea is to transform the ranking problem into a network optimization problem. Others have proposed an entropy maximization scheme to rank web pages on the WWW, which is fairer than the PageRank that Google uses. I borrowed this framework and analyzed the citation network formed by papers. Then I modeled the relation between researchers and papers with a bipartite graph. By maximizing entropy again, I obtained the most general indicators of researchers’ popularity and influence.
The attachment is a brief summary of my idea. I would highly appreciate it if you could look through it briefly before the discussion.
This is only the basic idea of the theoretical framework; there is a long way to go before getting any meaningful result.
I will update it if I revise it later.

ReRank Algorithm Progress Mar 17

Do you remember the idea of ranking researchers? I spent the whole spring break exploring it.
I read some articles about the h-index, a research ability indicator used by Science and Nature. The h-index can be summarized as: “the h-index of a researcher is h if h of his papers have at least h citations each, and his remaining papers have no more than h citations each.” It actually makes use of very limited information. Although ignoring so much information makes it robust and, in some senses, hard to manipulate, it is disputable because the information filter is arbitrarily defined by humans. In fact, we can define many h-index-like indicators that produce different rank orders, and it is impossible to justify which one is the fairest.
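To make the definition concrete, here is a small sketch that computes the h-index from a list of per-paper citation counts. The function name and the sample numbers are made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Two very different citation records get the same score, which
# shows how much information the indicator throws away.
print(h_index([100, 100, 100]))
print(h_index([3, 3, 3]))
```

Both calls print 3: once a paper has passed the threshold, any further citations it collects no longer matter.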
Another method of ranking researchers is citation analysis. Similar methods have been widely used in ranking webpages; probably the most famous one is the “PageRank” algorithm that powers Google. The essential idea of the PageRank algorithm is to calculate the stationary probability distribution of a random walk in which the surfer follows each out-link with equal probability.
PageRank works very well in ranking webpages. Yet it also relies on an assumption: “The surfer follows each out-link with equal probability”. The fairest ranking system should be based on facts only and should not rely on any human-defined assumption.
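The random-walk idea can be sketched with a simple power iteration. This is only an illustration, not Google's implementation. The damping factor of 0.85 (the chance that the surfer follows a link instead of jumping to a random page) and the way pages without out-links are handled are my assumptions, and the toy graph is made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate the stationary distribution of the random surfer.

    links maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Random jump: every page receives the same base amount.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # The equal-probability assumption: rank is split
                # evenly among the out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its
                # rank evenly over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(toy_web).items()):
    print(page, round(score, 3))
```

In the toy graph, page c is linked from both a and b, so it ends up with the highest score. The even split is the human-defined assumption in question.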
After removing this assumption from the PageRank algorithm, we get the TrafficRank algorithm.
The key idea of the TrafficRank algorithm is to derive the most general (most uncertain) conclusion about the ranking order based on the existing information. Because the uncertainty of a system is characterized by entropy, this is actually an optimization problem: maximizing the entropy. You can refer to the paper “A New Paradigm for Ranking Pages on the World Wide Web” by John A. Tomlin, or the report I will post later, for details.
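As I understand it, the optimization problem can be written roughly as follows. The notation is my own and may differ from the paper: E is the set of links, p_ij is the fraction of all traffic that goes through the link from page i to page j, and H_i is the resulting score of page i.

```latex
\begin{aligned}
\max_{p}\quad & -\sum_{(i,j)\in E} p_{ij}\,\log p_{ij} \\
\text{subject to}\quad
& \sum_{j:(i,j)\in E} p_{ij} \;-\; \sum_{j:(j,i)\in E} p_{ji} = 0
  \quad \text{for every page } i, \\
& \sum_{(i,j)\in E} p_{ij} = 1, \qquad p_{ij} \ge 0, \\[4pt]
\text{ranking score:}\quad & H_i = \sum_{j:(j,i)\in E} p_{ji}.
\end{aligned}
```

The first constraint says that the traffic into each page equals the traffic out of it, and the second normalizes the total traffic to one. Nothing else is assumed about how a surfer chooses among the out-links.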
Both PageRank and TrafficRank target webpages. Ranking researchers and ranking webpages share some characteristics, but they are different.
Webpages are connected only by links, while the relationships between researchers are much more complicated. My goal in the following days will be to characterize the relations between researchers with a suitable network model.



The Long Tail of Labor: Influence of Crowd Sourcing on the Labor Market

Author: Jing Conan Wang

Email: hbhzwj


Abstract: This article describes the influence of Crowd Sourcing and Human Computation on the labor market. Crowd Sourcing and Human Computation help to break a large job, which was previously finished by one employee, into small tasks that are distributed to and accomplished by a crowd of people in a pleasant and rewarding way. As a result, they will extend the tail of the labor market to a great extent and generate an extremely large commercial opportunity.

The long tail dominates the Internet. Many interesting things happen when the entry threshold of a specific field approaches zero, thus generating an extremely long tail. For example, the success of Google AdWords and AdSense is owed to the long tail of advertising. The essence of the long tail is to generate large profits from small needs.

Usually, if you want to have a job finished, you need to hire an employee, sign a contract with him, provide a workplace, train him, describe the job to him and eventually pay him. It is very complicated. One benefit of such complexity is that an employer is less willing to fire a person he has hired, because doing so incurs the extra cost of hiring a new person. As a definite “employee”, I must thank the person who created this process, as it protects me better than federal labor law.

Example of Cleaning your office

But at the same time, it also means that employers are less willing to hire a person for small tasks. You may think your office is dirty, but unless the office is dirty enough (the degree of “enough” depends on personality), you would not want to hire a new cleaner. In this case, the threshold, namely the cost of hiring a cleaner yourself to clean the room, is very high. The head of this market consists of the big companies that really need to recruit cleaners themselves.

A cleaning company reduces the cost by maintaining the employee-employer relationships itself, and you outsource your tasks to it. In this way, the cost of cleaning your office is reduced and the tail is thus extended.

What if the job is strange? The tail of jobs

But that’s not enough. Cleaning a room is a common task, and the need for it is large enough to support a company. What if you want to find someone to “steal vegetables” for you in the Happy Farm game at 4 AM every day? I think fewer than 1 in 10,000 people have a need similar to yours, and perhaps another 1 in 10,000 people are willing to do that for pay.

I don’t think there is a company you can call to help you “steal vegetables” at 4 AM every day. But in the Internet age, there is still a way to meet your need: you can post the task on a website (this kind of website is not popular yet), and people who are interested can contact you. Maybe a boy who lives 3 blocks away from your home will send you an email saying that he is willing to do it for 9 dollars an hour. The tail is extended once more.

I just said “maybe”; personally, I believe it is still highly possible that you get no response, because few people can bear this work for more than a month. So why not break the task into smaller ones and hire a person to “steal vegetables” for only one day? It will be much easier to find a person who happens to be awake at 2 AM on a specific day (I often stay up on Fridays, so it is not bad to earn some bucks by clicking the mouse a few times that day).

The general trend is that a job is broken into many tiny parts that can be outsourced to and finished by many people. In this way, the gap between demand and supply is greatly reduced. Here come the questions: which kinds of jobs can be broken up, how should these jobs be broken up, and how can we find people who are willing to do the parts and assign each part to them in an optimal way?

Unfortunately, the rule of “breaking” is not applicable to many jobs. For example, it is impossible to change your babysitter every day; you would not like that, and neither would your baby.

But there are still some applications. For example, Wikipedia has proved that multi-person cooperation can generate high-quality reports. Yahoo Answers has proved that the Internet can take the place of consulting, at least partially. Actually, most office work can be finished by a crowd of people connected to the Internet, the so-called “Crowd Sourcing”. That is to say, Crowd Sourcing can eliminate the jobs of most white-collar workers.

People don’t need to sit formally in an office building. Instead, they only need to do what they want on the Internet, like playing games, listening to music and so on. In doing so, they will indirectly finish other people’s tasks and get paid. THAT SOUNDS FANTASTIC!!

Why is it a revolution? How large is the opportunity?

The world of the long tail is a world of monopoly; that is to say, the long tail of each specific field is always dominated by a small number of companies. For example, Google dominates the long tail of online ads, eBay dominates the long tail of e-commerce, and so on and so forth.

Although the price of each task is tiny, the long tail can account for 40% or more of the total value because of the massive number of tasks. Each year, hundreds of billions of dollars are invested in office-related jobs. That is to say, the dominant company would have a sales volume of 40% of these hundreds of billions of dollars, which amounts to roughly $100,000,000,000 (40% of $250 billion, for example)!! It would be a giant like Google and Apple.

When will it become a reality? What problems make it impractical nowadays?

People do the things they like, and they get enough payment to support themselves. It seems to be a utopia! However, this scenario is still impractical now, mainly because of technical limitations.
Both Crowd Sourcing and Human Computation are still in their early stages. Luis von Ahn, the creator of the human computation concept and the inventor of many games with a purpose, has also admitted that human computation games can only be used in a small range of fields like image labelling, and that he doesn’t know how to develop a general framework, either.
I think the development of human computation and crowd sourcing is closely related to Artificial Intelligence and Data Mining. Maybe it will still take another 5-10 years for this field to have a significant breakthrough.


New Technology and New Company

The history of the IT industry has witnessed the rise and fall of many famous companies, most of which are labeled with a certain kind of technology. The development of these companies is tightly associated with the corresponding technology.

For example, prior to the Microsoft age, software was rarely considered profitable, let alone the center of the whole IT industry. It was Microsoft, led by Bill Gates, that opened the gate to the age of proprietary software. Microsoft taught people all around the world (maybe except China) that software development should be well respected as the other essential part of the PC besides hardware.
In the mid-90s, few people realized the power of the Internet. Yahoo was one of the first companies that really treated the Internet as a paramount new medium which could totally change the way information is distributed. Since then, online ads have gradually become the mainstream method of ad distribution. Jerry Yang’s efforts greatly damaged traditional media like TV and newspapers and reshuffled the whole media industry.
Similarly, another Internet giant, Google, has totally changed the way people get information since its founding in 1998. With its novel PageRank technology, Google shows web users organized information indexed by millions of keywords instead of unordered web pages.
Now Microsoft, Yahoo and Google have become synonyms for software, online ads and search engines. It is new technology that created the prosperity of these IT giants. However, this is not always the case: new technology doesn’t always mean a new successful company.
Take Netscape as an example. Netscape Navigator represented the advent of the Internet age with rich multimedia content. With this revolutionary product, Netscape became the favorite of Wall Street, and its market value surpassed 5 billion dollars in a short time. This situation changed totally when Microsoft entered the web browser field. In just 3 years, Netscape suffered severe defeats in both the web browser and intranet markets. Under the giant financial pressure from Wall Street, Netscape ended its brilliant and short life and was sold to AOL.
Compared with the tragedy of Netscape, the ending of YouTube is more pleasant. The creators sold YouTube to Google at a high price, at a time when YouTube was burning money and didn’t have a clear profit model. The two creators gained 300 million dollars each. Many other companies have had similar happy endings, like Hotmail and DoubleClick.
All of these companies held a new technology that totally redefined a market; however, some of them succeeded in the end and others were sold. We must admit that the creators’ characters play an important role in the process. But the creators’ choices are not the only reason. The characteristics of the technology and the market conditions have pre-decided the destiny of each company.
To produce a new IT giant, a new technology must have the following characteristics:

1.  Foundational Application with huge market demand
Not every technology is enough to support a huge IT giant. The OS, online ads and the search engine all share a common point: each is a necessity and a foundation of the whole industry. Every PC needs an operating system, and all other software must be built on the API of the operating system. Online ads are the base of the Internet economy; no Internet company can survive without the support of online ads. The search engine is the portal of the Internet, and it directly determines other websites’ traffic, which is essential to generating profit.
Although an application like Google Maps is awesome, it is not enough to support an independent company, because a map service is not a basic application of the Internet industry. It sits in a layer above online ads and search engines. Map services rely on online ads to make profits and on search engines to bring in users. However, no other service is based on map services. Mail service is similar. That’s the reason why Google Maps, Gmail and Hotmail cannot be operated by an independent company, but only by a search engine company or an online ad provider.
2. Easy to understand.
No investor wants to invest in a project he doesn’t understand. No matter how good the technology is, you cannot even start without the initial investment.
3. Gap with Existing Technology or Ignorance of Existing Giants.
The fall of Netscape is owed to its violation of this condition. Despite its importance, a web browser is indeed a simple piece of software which is easy to replicate. At first, Microsoft didn’t realize the importance of the web browser and didn’t pay much attention to this emerging market. However, the improper speech of Netscape’s CEO and the alliance between Netscape and Sun, the main competitor of Microsoft at that time, offended the godfather of the software industry. A crazy revenge started when Microsoft bundled its own web browser with its OS. In fact, it took Microsoft only several months to develop its own web browser, and the performance of IE caught up with Netscape Navigator within one year. With the profit generated by the Windows OS and the Office suite, Microsoft had no intention of earning money from its browser before Netscape was closed down.
4. Low Monetary Investment at the Beginning Stage and Foreseeable Profit Model.
This can be used to explain why YouTube was sold to Google. It is well known that online video is a promising mainstream application of the future, and YouTube had consolidated its leading position in the video sharing market. If YouTube’s cash had been enough to support itself for 3 years or longer, there would have been no need for Steve Chen to sell YouTube. However, online video uses so much network bandwidth that few investors have the patience and enough money to wait for such a long time. Besides, if YouTube had had a foreseeable profit model to help persuade investors, it could have walked further along the path of independent development.

Unfortunately, YouTube lacked both of these essential conditions. As a result, it was reasonable to accept the price when Google bid 1.65 billion dollars.


Memory of Music

I was often jeered at by others as a Desafinado in my childhood because of my voice. Since my parents are average workers with little musical literacy, I never received a musical education. In fact, my only contact with music was an electric organ bought by my grandparents, which I later broke myself.

This situation continued until my sophomore year at university. At that time, SECRET, Jay Chou’s first movie, was very popular among students. The melancholy tunes of the movie greatly impressed me, awakening my inner desire for music. Another reason why I liked this movie is that I was in love with a girl similar to XIAOYU, the heroine of SECRET. I had a presentiment that the story between that girl and me would have the same ending as the movie. Although I tried my best, the predetermined thing still happened in the end. To escape the agony of parting, I plunged myself into Mathematical Modeling and Music. The sound of music was like a shield, protecting me from any sad sentiment.

I also chose piano as an elective class, which was a casual but lucky decision for me. I still remember the first day I stepped into the art building of my university. It was a small yet elegant building located in a corner of the campus, a building that could be easily recognized by the music drifting from its windows. I once introduced the art building to my foreign teacher as follows: "when you close your eyes and hold your breath, your heart will bring you to a place, that’s the art building." Whenever you walk down the narrow corridor decorated with portraits of famous musicians, you can meet one or two students with a score in their hands.

Mrs. Jiang, the teacher of the piano class, is a very beautiful and enthusiastic woman. She has devoted all her efforts in the past 10 years, in her words, to spreading the beauty of music to average students at HUST. Thanks to her efforts, I had the chance to encounter classical music. Since the number of students greatly exceeded the capacity of Mrs. Jiang’s class, some students were transferred to members of the keyboard team to learn piano skills in detail. I learned piano first from Song Hongming, then from Liu Haokun. Song is a very lovely boy who has studied piano for 10 years. One of his habits is collecting data about buses. It is said that he can remember all the bus numbers in Wuhan and Huangshi. Liu Haokun, the captain of the keyboard team, is a charming person with great technique not only in classical suites but also in pop pieces.

I have grown a lot in the past year accompanied by the piano. Although my technique is terribly bad in some senses, I like listening to piano suites, which can bring me to a holy state. Chopin and Beethoven are my favourite composers. Chopin’s nocturnes give me a sense of serenity and elegance, while his polonaises show the Spartan side of the Piano Poet. Beethoven’s works have a distinctive mark of romance. Although his Pathetique was created before his deafness, the sonata still shows the composer’s determined desire to take Fate by the throat. My favourite pianists are Li Yundi and Horowitz. Yundi, the champion of the 14th International Frederic Chopin Piano Competition, has a unique temperament similar to Chopin’s, making him suitable for expressing Chopin’s melancholic elegance. Horowitz is undoubtedly the most renowned classical pianist of the 20th century. His performances always resonate in my heart.

The greatest lesson I have learned from piano, or classical music in general, is that the maturity of the heart is as important to a person as the accumulation of knowledge. The meaning of life is not just about the promotion of social status, but also about the consummation of the soul.

(Suggestions on grammar or style are welcome)


Some beautiful sentences from “The Gadfly”

The Gadfly, written by Ethel Lilian Voynich, is a very important book to me, which I’ve read several times in the past 4 years.

These are some beautiful sentences from the book:

He pointed to the valley below them. Arthur knelt down and bent over the sheer edge of the precipice. The great pine trees, dusky in the gathering shades of evening, stood like sentinels along the narrow banks confining the river. Presently the sun, red as a glowing coal, dipped behind a jagged mountain peak, and all the life and light deserted the face of nature. Straightway there came upon the valley something dark and threatening--sullen, terrible, full of spectral weapons. The perpendicular cliffs of the barren western mountains seemed like the teeth of a monster lurking to snatch a victim and drag him down into the maw of the deep valley, black with its moaning forests. The pine trees were rows of knife-blades whispering: “Fall upon us!” and in the gathering darkness the torrent roared and howled, beating against its rocky prison walls with the frenzy of an everlasting despair.


my first blog written by ScribeFire Blog Editor 3.4

This is my first blog post written with ScribeFire Blog Editor 3.4. It is a very useful add-on for the Firefox browser. You can download it here:


About Research

I’ve done research under the supervision of Prof. Jiaqing Huang for almost a year. But in my view, Prof. Huang is indeed not a good supervisor. The reason I hold this opinion is that he always pushes me to publish papers.

It is very hard for an undergraduate student to publish papers. Without a deep understanding of a field and the support of solid results, it is almost impossible for a researcher to find a research direction and write an excellent paper. Both solid understanding and solid results require devotion of time.

The deadline of GlobeCom (Mar. 15) is approaching. Prof. Huang requires me to submit a paper to this conference. However, I couldn’t spend enough time on research because of my application. Although I’ve gotten some preliminary results, I still need a large amount of literature reading to support my research.

But there is not enough time. To make matters worse, Prof. Huang has little knowledge of the research field I am involved in. I could not find a professor familiar with my field, either.

For these reasons, I hold little hope for this paper.


This is my first live blog

Since Google Sites has been blocked in China, I have lost a place to record my experience. From now on, I’ll record my research diary in Live Blog.