What follows draws on what I know, and on what I can posit from my own experience, to help you achieve your goal of mining data so as to “hear the customers’ voice more clearly”.
My background lies partially in Internet marketing, Internet research and freelance writing, which these days means little more than that I am a webmaster and write converting sales copy for goods and services that appeal to visitors who are seeking what I offer. I say this not to identify myself, but to give you a few points of reference for the content of this document. I’ve written this informally, without following a particular style guide, because I felt that a guide might inhibit my expression of these concepts and because I am not used to seeing them in those formats; I use these and related concepts to make a self-employed living.
You may have inferred, or already know, some of what I will mention, given that you mention social networks and public forums. Established communities there provide a niche “home on the Internet”, and these platforms also let individual customers share their feelings, thoughts, observations and experiences with the world (anyone), their peers, their friends and their families.
Informal channels like social networks, forums and website comments not only allow a glimpse into the life of a patient, but also let interested entities see the truths, opinions and statements that patients and others publish in a more natural environment. There is less reason for bias there, and more reason for individuals to share their statements completely, whatever motivation may be driving each of them to share. This also happens in real time.
There is also a global truth that not many will point out: sometimes people make statements about pharmaceuticals based upon second-hand or third-hand information which may have been misinterpreted. Rumor is a constant on the Internet. What’s more, people exploit rumor daily, or may wish to discredit or defame a company, product or individual and thus provide false information. The person or persons providing this false information might be related to a company competing, or hoping to compete, with the product being discussed, or might be a disgruntled employee. Through methods I will describe, it is possible to shed some light on the identity of individuals who make statements about a pharmaceutical product or company. This helps speed the process of obtaining factual information, whether through the mining methods that you, the seeker, have quite evidently already obtained, those that I will describe, or any that can be developed from the information in this document.
Heterogeneity is a valid concept when applied to data mining multiple sources. However, the term “easily accessed” doesn’t apply when information is gathered by an “intelligent agent” computer program and then mined by a computer while simultaneously being interpreted by a workforce of one or more people. With a little script-writing or programming done by a computer programmer or Information Sciences specialist, this information can be gathered automatically and formatted in any way and in any file type that you desire. The only limits are your script or program, your server hardware and the target website’s server hardware.
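As a minimal sketch of the “any file type” point, the following assumes the gathered statements have already been collected into a list of records. The field names (source, text) and the sample records are hypothetical and only there for illustration; the same records are rendered as CSV and as JSON using Python’s standard library.

```python
import csv
import io
import json

def to_csv(records, fields):
    """Render a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def to_json(records):
    """Render the same records as indented JSON text."""
    return json.dumps(records, indent=2)

# Hypothetical harvested records; the field names are illustrative only.
records = [
    {"source": "blog comment", "text": "It helped with my anxiety"},
    {"source": "forum post", "text": "I felt restless on it"},
]

print(to_csv(records, ["source", "text"]))
print(to_json(records))
```

Writing either string to a file is then a one-line step, and other formats can be added in the same way.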
Fragmented information can be an issue. However, using phrases mined from statements relevant to the subject matter (the pharmaceutical), along with variations of these phrases, allows a much clearer picture to be developed. This is not limited to what one might expect to be straightforward. For example, suppose one is searching for patient satisfaction and, while mining blog comments, finds a relevant statement from a patient that states, “Lilly’s Zyprexa allowed me to have a the piece of mind and freedom from the anxiety that relying only upon myself and talking to my therapist didn’t give me”. Processing this language allows similar, equally relevant statements to be found, and the ways of doing so can be developed into a methodology for generating similar language in searches for statements relevant to patient satisfaction.
Simply put, generating synonyms of particular terms and similar statements, and using these as search criteria, makes for far better results than searching (using Google and similar search engine syntax) for “Zyprexa” + “Feelings” or “Zyprexa” + “My Feelings”, though these search terms will likely find a blog entry with this in its title or text. This may be done in many ways, the best of which I feel is done by people and not programs. People can be plugged into a program (this is not quite so occult as it might at first seem) via an API such as Amazon’s Mechanical Turk ( http://www.mturk.com ), by paying users of MTurk to perform a task. In this case the task is generating language that makes sense as alternate language to the statement “Lilly’s Zyprexa allowed me to have a the piece of mind and freedom from the anxiety that relying only upon myself and talking to my therapist didn’t give me”.
A better way to state this might be that you need to generate a “common sense” database which helps find the descriptions and statements people use when their object is the pharmaceutical product being investigated. This can be done with Amazon MTurk, and the best example that I can cite is the MIT Media Lab’s Open Mind Common Sense project ( http://commons.media.mit.edu/en/ ) ( http://openmind.media.mit.edu ). There are several databases here. You may wish to start in a more generalized manner, limited to the languages in which you wish to search for patient statements, or use this resource as a template for a language reference on pharmaceuticals and medicine. From that you will have a database from which to generate other databases relating to particular pharmaceutical products. Commonalities can also be found by mining all statements for word density and phrase density, which yields additional terms to use in search queries.
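To illustrate the phrase density idea, here is a small sketch that counts the most common word sequences across a set of statements. The sample statements are invented for the example, and the tokenizing rule (lowercase letters and apostrophes only) is an assumption; a real pipeline would likely need something more careful.

```python
import re
from collections import Counter

def phrase_density(statements, n=2, top=5):
    """Return the most frequent n-word phrases across all statements."""
    counts = Counter()
    for text in statements:
        # Naive tokenizer: lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        # Slide a window of n words across the statement.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top)

# Invented sample statements, for illustration only.
statements = [
    "Zyprexa gave me peace of mind",
    "I finally have peace of mind on Zyprexa",
    "No peace of mind until my doctor changed the dose",
]

print(phrase_density(statements, n=3, top=1))
```

Phrases that rise to the top of such a count are candidates for new search queries, which people can then review for whether they “make sense”.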
Even more so, you can use the current Open Mind project to generate statements that might be more ambiguous and do not mention specific drug names. There is no exact URL for the page of search results, so I rely upon you to run the MIT Open Mind search with the query “drug” to verify that related terms can be generated, much in the way that Google uses common sense and previously searched terms to generate suggestions when you type a search term in the http://www.google.com search box.
A sampling of the information generated is as follows:
Knowledge about drug
Similar concepts: drug rational think medicine vitamin c scissor glasses text book eraser physics praise
Below this, a list of statements follows. The Open Mind program asks whether each statement is correct, in order to decide how much weight to give it.
Example: You are likely to find drugs in the bathroom. The options for choosing whether it is correct are: 1) Yes, 2) No and 3) Sort of.
In this instance and also in general, user accounts or token systems that differentiate the users of this interactive program play a fundamental role in tracking submitted data. They also guard against errors such as a user accidentally clicking twice, by giving each user only one vote, which further cleans up the database and helps weigh its information properly. This helps ensure that meaningful data is given priority and shown to be relevant to the entry that it is under. An interactive list with up and down arrows, which lets each user rate whether each statement in the list is more or less accurate, refines the list further. Automated methods often don’t provide the same level of discrimination when sorting data as to which items “make sense”, so this is definitely a boon to building a very accurate database of statements.
Therefore, for data harvested with the terms “Zyprexa” + “My Feelings”, sourcing humans to interact with the database would ensure that “My feelings on zyprex is that it’s great and it can be bought at this online Canadian pharmacy for $80 per bottle” is given less weight or is discarded from the list, while “Zyprexa produced akathisia and my feelings, I can’t describe how restless I felt, it drove me crazy” would be given more weight and kept for further use in searches regarding the pharmaceutical product (in this example again, Zyprexa).
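The one-vote-per-user weighting can be sketched roughly as follows. The numeric scores given to “Yes”, “No” and “Sort of” are my own assumption, as are the user IDs and the shortened statements; the point is only that a repeated click from the same user does not count twice.

```python
from collections import defaultdict

# Assumed scores for the three answers; the values are illustrative.
ANSWER_SCORES = {"yes": 1.0, "sort of": 0.5, "no": 0.0}

def weigh_statements(votes):
    """Average the answers per statement, counting one vote per user.

    votes is an iterable of (user_id, statement, answer) tuples. If a
    user votes on the same statement more than once, the latest wins.
    """
    latest = {}
    for user_id, statement, answer in votes:
        latest[(user_id, statement)] = ANSWER_SCORES[answer.lower()]
    scores = defaultdict(list)
    for (_, statement), score in latest.items():
        scores[statement].append(score)
    return {s: sum(v) / len(v) for s, v in scores.items()}

# Hypothetical votes; "u1" clicks twice on the first statement.
votes = [
    ("u1", "Zyprexa produced akathisia", "Yes"),
    ("u1", "Zyprexa produced akathisia", "Yes"),
    ("u2", "Zyprexa produced akathisia", "Sort of"),
    ("u1", "can be bought at this online pharmacy", "No"),
    ("u2", "can be bought at this online pharmacy", "No"),
]

weights = weigh_statements(votes)
for statement, weight in weights.items():
    print(f"{weight:.2f}  {statement}")
```

Statements whose weight falls below a chosen threshold could then be discarded, and the rest kept as seeds for further searches.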
What I have described so far will provide more relevant data and also a resource from which future searches can be generated, based upon the information that users weighed most heavily. The goal of your search is to obtain what patients and others say and think about Zyprexa, not to fact-check it, so this is in fact much more valuable than a method that might discard a factually inaccurate statement that still “makes sense”, such as “Zyprexa did wonders to relieve my back pain”. It allows more broadly searched data to be categorized, and the most common misinterpretations about a drug can be identified and then addressed (as couldn’t have been done before this) in informational materials, on the drug’s web site, in pamphlets and in information disseminated to doctors. All of this comes from polling users’ statements rather than using a questionnaire to ask users what they know, feel and think about a drug.
Lastly, I wish to cover means of identifying sources of information and which of them are cited most and given weight. This can be done with simple Google searches as well, but not to the same degree and not on a more “human” level. For instance, suppose you wish to find the places that “might” have a large number of community members who use a particular product, especially those sites which might not be so obvious as a place for discussion about the drug. A Google search can be done for links to the drug’s website or to entries on medical websites that cite information about the drug. The first step is a Google search to find the websites most highly ranked for the drug. Again using Zyprexa as an example, one searches for “Zyprexa” (with or without quotes) on the search engine. The top two results shown on Google as of now (9/14/2010, 10:37 CST) are http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000161 and http://www.zyprexa.com/
I feel it is important to search for the top results because assuming that an official website is the most relevant would be a mistake, though it is often the case that it will be linked back to. Google is still the number one site in the Alexa rankings, which means it has the most traffic of any site on the Internet.
A search on Google for (without the quotes) “link:http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000161” yields three results that I feel are irrelevant. Even so, it is important to take into account what people see first on the search engine, which is the nih.gov link, in order to mine the communities and users who cite a resource for information about the drug. Searching for (without the quotes) “link:http://www.zyprexa.com”, as well as for the URL itself, “http://www.zyprexa.com/”, will yield sources of information about the drug, some of which people can comment on, and also communities where people have cited the website as a reference.
There are, and long have been, Internet-marketing methods that apply universally to mining web data. One is generating results from particular “footprints”, which can be done manually or in an automated manner (for example with tools like Scrapebox ( http://www.scrapebox.com/ ), a URL harvesting and spam tool). For instance, to find WordPress blogs that mention Zyprexa, so as to get blog posts and the comments on them, one might search on Google for “Zyprexa” “powered by wordpress” and “Zyprexa” “powered by wordpress” “leave a comment”.
This also works when searching for forums. Essentially, adding a term that appears on pages generated by popular blog and forum software gives you more accurate results when looking for “real people”, actual patients, rather than the drug information sites, medical sites and article directories which are often highly ranked in search engine results.
Here are more examples of “footprints” that allow the searching party to find blogs and forums:
“Powered by BlogEngine.NET 184.108.40.206”
Inurl:.gov “Powered by BlogEngine.NET 220.127.116.11”
Inurl:.ac.uk “Powered by BlogEngine.NET 18.104.22.168”
“Powered by phpBB”
“Powered by vBulletin”
“Powered by SMF”
Due diligence may be done on the type of platform, forum, blog or content management software used by any resource found to be rich in relevant information, so that the software’s footprint can be used as a search query for communities or resources like it.
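Combining search terms with footprints is easy to automate. The sketch below only builds the query strings; it does not contact any search engine. The function name is my own, and the footprint list is a sample of the ones mentioned above.

```python
from itertools import product

def build_footprint_queries(terms, footprints):
    """Pair every search term with every footprint as quoted phrases."""
    return [f'"{term}" "{footprint}"' for term, footprint in product(terms, footprints)]

# Sample footprints taken from the examples in this document.
footprints = [
    "powered by wordpress",
    "Powered by phpBB",
    "Powered by vBulletin",
    "Powered by SMF",
]

for query in build_footprint_queries(["Zyprexa"], footprints):
    print(query)
```

Adding the drug’s other commercial names to the list of terms multiplies the queries accordingly, and each new platform found through due diligence adds one more footprint.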
Furthermore, there are many other types of “footprints” for websites. Conversely, if you wished to search for information on government websites and web resources rather than patient feedback, Inurl:.gov narrows search results to them, and Inurl:.edu does the same for educational institution websites. Inurl:.com, Inurl:.net and Inurl:.org help find results that are more likely to hold statements from patients, as .info domains and some other top-level domain extensions are often used for spam sites. If you are searching within a particular demographic or language, such as Japanese users, beginning your search with (without quotes) “Inurl:.jp” is of great assistance, and searching within second-level domains such as .ac.jp with “Inurl:.ac.jp” will yield educational institution and academic results. Inurl also works for common subdomains such as forums, by searching for (without quotes) “Inurl:forum” plus whatever your search query may be (the drug name and its commercial names). To further the search, I suggest doing additional queries with those terms plus “patient”, “patients” and other search terms. I hope that with the above I have illustrated how you might be more accurate in finding websites that are more relevant to patients than to commercial interests or to interests which likely won’t be useful to your search for “the patients’ voice”.
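The same domain hints can be applied after the fact, to sort a list of harvested URLs. This is a rough sketch under my own assumptions: the category names are invented, the rules are deliberately simple, and the example.* hosts are placeholders rather than real sites.

```python
from urllib.parse import urlparse

def classify_url(url):
    """Guess the kind of site from its domain suffix and URL text."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    labels = host.split(".")
    if labels[-1] == "gov":
        return "government"
    # .edu, or an academic second-level domain such as .ac.uk or .ac.jp
    if labels[-1] == "edu" or (len(labels) >= 3 and labels[-2] == "ac"):
        return "academic"
    if labels[-1] == "info":
        return "possible spam"
    if "forum" in host or "forum" in parsed.path.lower():
        return "forum"
    return "general"

# The first two URLs appear in this document; the rest are placeholders.
urls = [
    "http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000161",
    "http://www.zyprexa.com/",
    "http://forums.example.com/thread/1",
    "http://www.example.ac.jp/paper",
    "http://example.info/buy",
]

for url in urls:
    print(classify_url(url), url)
```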
Now, because Facebook isn’t indexed as well as some other sites and information is not as readily available on it, I will mention it first. Facebook has an internal search feature, its directory. This directory has been mined in the past: recently the names, interests and any other information that 100 million users had publicly listed on their Facebook profiles were collected by a security researcher and posted to be shared via peer-to-peer file sharing. ( http://www.pcworld.com/article/202167/the_facebook_data_torrent_debacle_qanda.html )
Personally, I don’t believe this was a breach of privacy; I believe it falls under data mining and ethical hacking, but that is for each person, group and organization to decide and approve or disapprove as they will. Finding a patient’s voice will mean being close to them, which might be construed as spying on a patient. But in marketing, and when getting feedback on a product, what is publicly available, legal and personal is the best type of resource, as it can be used without censure or legal repercussion.
One of the best resources has been showing up in Google search results: the Twitter ( http://www.twitter.com ) live feed. Often search terms will show results from Twitter which are in near-real time. These can be accessed via Twitter’s main page and also searched from within a program with the Twitter API ( http://apiwiki.twitter.com/Twitter-Search-API-Method%3A-search ). People have automated this to such a high degree that every tweet on a particular keyword or key phrase can be mined and logged by custom-built programs, or by one particular program running on a single Internet-connected computer, on many computers or on many virtual machines, all toward the task of searching for what people are stating in their “tweets” about a particular thing. It is also possible to automate responses in the form of “tweet replies” and direct messages to these users, should one wish to contact them to ask a question about their tweet or express a particular sentiment toward them.
Another search result that is commonly seen on Google is “video results”. These days many large corporations pay a great deal of attention to the video results that show up, and the management of some often search youtube.com and metacafe.com for disgruntled customers, as a single person’s voice can be heard by many, up to millions of people, when they publish an online video. YouTube is ranked the number three site in the world according to Alexa.com and MetaCafe.com is ranked number 211 worldwide. This means that they receive hundreds of thousands to millions of users watching videos each and every day. There have been many statements about how powerful sites like YouTube, Twitter, Facebook, Myspace and Friendster can be, and they are 100% true. People know this and submit their appreciations, opinions and grievances on these sites, and most of these services are easily searched with search engine queries or offer an API that can be built into programs used for data mining.
A simple search for site:facebook.com +zyprexa illustrates how people’s opinions are expressed on facebook.com, the world’s second most frequented website. This is the view of an outsider, with a “bigger picture” than the perspective of those “inside” facebook.com, the users. This “big picture” does not negate the personal nature of what is shared, and it is very important because ideas that are made into fan pages, or entities that have many facebook.com friends or twitter.com followers, can generate “virally spreading” concepts, ideas and statements in the same way that “viral marketing” campaigns are generated ( http://en.wikipedia.org/wiki/Viral_marketing ) ( http://www.wired.com/techbiz/media/news/2005/03/66960 ) & ( http://www.usatoday.com/money/advertising/2005-06-22-viral-usat_x.htm ).
Harnessing the power that generates these viral concepts, and locating and reacting to the statements and concepts that generate negative and positive situations, can be done in real time. This can amplify the effect so that the power is controlled and can be contained. If a competitor is making unfounded statements, a cease and desist order may be issued. If a patient being treated with one of your company’s drugs is unhappy, you can contact them directly, very easily and in an automated manner, to find out more. All the videos on YouTube regarding a drug can be indexed and even downloaded with programs. With a competent programmer and server (datacenter) resources, all of the social polling and response can be automated. Issues can even be escalated to the awareness of management when they meet specific search criteria such as “death”, “made me sick” or “is the worst drug”, whereupon they can be quickly reviewed and escalated to another department or further up the chain of management, to be reviewed again until they reach the correct department or person, where the proper course of action can be decided and enacted.
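The escalation criteria can be sketched as a simple phrase check on each mined statement. The three phrases come from the examples above; the function name and sample statements are my own, and plain substring matching is a simplification that will over-match (for instance “death” inside “deathly”).

```python
# Escalation phrases taken from the examples in this document.
ESCALATION_PHRASES = ("death", "made me sick", "is the worst drug")

def escalation_matches(statement, phrases=ESCALATION_PHRASES):
    """Return the escalation phrases found in a statement, if any."""
    text = statement.lower()
    return [phrase for phrase in phrases if phrase in text]

# Invented sample statements, for illustration only.
statements = [
    "This made me sick for days",
    "Works fine for me so far",
]

for statement in statements:
    matched = escalation_matches(statement)
    if matched:
        print("ESCALATE:", statement, matched)
```

Anything flagged this way would still go to a person for review, as described above, rather than triggering an automatic response.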