Text mining on a large database (data mining)

Posted by yox on Stack Overflow
Published on 2010-04-13T22:16:15Z Indexed on 2010/04/13 22:23 UTC

Filed under:

database

Hello,

I have a large database of resumes (CVs), with a table skills that groups all users' skills.

Inside that table there is a field skill_text that describes the skill in full text.

I'm looking for an algorithm, software, or method to extract significant terms and phrases from that table in order to build a new table of standardized skills.

Here are some example skills extracted from the DB:

Sectoral and competitive analysis
Business Development (incl. in international settings)
Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
Creative work (Photoshop, In-Design, Illustrator)
checking and reporting back on campaign progress
organising and attending events and exhibitions
Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.

The output should be something like:

Sectoral and competitive analysis
Business Development
Specific structure and road design software -
Macao
AutoCAD
Photoshop
In-Design
Illustrator
organising events
Development
Aptana Studio
PHP
HTML
CSS
JavaScript
SQL
AJAX
Mix marketing
Viral Marketing
Social network marketing
emailing
SEO
One to one marketing

As you can see, only the skills remain, with none of the surrounding descriptive text.
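
To show the kind of splitting I mean, here is a minimal Python sketch. It assumes that commas, semicolons, colons, parentheses, and a spaced dash are the separators between skills, which is only a guess based on the sample rows above. The name split_candidates is made up for illustration.

```python
import re

# Assumed separators between skills: , ; : ( ) and a dash with spaces around it.
# A hyphen inside a word (as in "In-Design") is deliberately left alone.
_SPLIT = re.compile(r"[,;:()]|\s-\s")

def split_candidates(skill_text):
    """Split one skill_text value into trimmed, non-empty candidate terms."""
    parts = (p.strip(" .") for p in _SPLIT.split(skill_text))
    return [p for p in parts if p]

rows = [
    "Creative work (Photoshop, In-Design, Illustrator)",
    "Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX",
]
for row in rows:
    print(split_candidates(row))
```

This only produces candidates. Fragments such as "basic knowledge" or "incl. in international settings" still come through and would need to be filtered out some other way.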

I know this is possible using text mining techniques, but how do I do it? The database is really large, which is a good thing because we can calculate term frequencies and use them to decide whether a candidate is a real skill or just meaningless text. The big problem is: how do I determine that "blablabla" is a skill?
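
To make the frequency idea concrete, here is a minimal Python sketch. It assumes each row has already been broken into candidate terms, and it keeps only the terms that appear in at least min_rows different rows. The function name, the threshold, and the sample data are all made up for illustration.

```python
from collections import Counter

def frequent_terms(candidates_per_row, min_rows=2):
    """Return {term: number of rows containing it} for terms in >= min_rows rows."""
    counts = Counter()
    display = {}  # case-folded key -> first spelling seen
    for candidates in candidates_per_row:
        seen = set()
        for term in candidates:
            key = term.casefold()
            display.setdefault(key, term)
            seen.add(key)
        counts.update(seen)  # count each term at most once per row
    return {display[k]: n for k, n in counts.items() if n >= min_rows}

# Hypothetical sample: three rows, already split into candidates.
sample = [
    ["Development", "PHP", "HTML", "CSS"],
    ["PHP", "SQL", "basic knowledge"],
    ["Creative work", "Photoshop", "php"],
]
print(frequent_terms(sample))  # {'PHP': 3}
```

Frequency alone probably cannot answer the question, though: a filler phrase like "basic knowledge" may be just as frequent as a real skill, so some extra signal (a stop-phrase list or a dictionary of known skills, for example) would likely still be needed.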

Thanks
