473,379 Members | 1,355 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,379 software developers and data experts.

Using STL map and set on large amount of data

I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Thank you

Jul 23 '05 #1
13 12185
Eric wrote:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Thank you


See if your compiler has an implementation of hash_set. It is
nonstandard, but many compilers have it as an extension. For example,
in g++ you access it in the namespace __gnu_cxx:

#include <ext/hash_set>
#include <string>

int main()
{
__gnu_cxx::hash_set<std::string> h ;
// Do interesting things here.
return 0 ;
}
This should have the same interface as std::set, but uses a hash table
instead of a balanced binary tree (the usual implementation of
std::set), therefore giving you expected O(1) insertion and lookup,
instead of O(lg n).

Here is some good documentation for hash_set:
http://www.sgi.com/tech/stl/hash_set.html

-Alan
Jul 23 '05 #2
Eric wrote:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?


Can you be more specific about any other constraints on your strings?
If the sole requirement is that they be unique, then you can just list
the first 10^7 in alphabetical order, starting with aaa...aaa.

If they need to be random then you can do various sneaky things that
won't affect the randomness too much (for most purposes). For example,
you can fix the first character of the string and generate 15 subsequent
random characters. Do this 10^7 / (number of distinct characters) times
for each distinct character. That will let you work with substantially
smaller data sets individually.

Other possibilities exist too, it just depends upon how much of your
system's built-in randomness you're willing to sacrifice.
Jul 23 '05 #3
Eric wrote:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}

Thank you


In addition to the ideas others have mentioned (e.g. generating strings
deterministicaly using some one or another listing algorithm, or using a
random string generator that does not generate duplicates over a period
large enough for your purposes) you might want to try a trie. This is a
data structure for storing sets of strings. It has the following
properties: insertion of a new string takes time proportional to the length
of the string (and not the number of strings in the set); testing whether
or not a given string is already a member of the set takes time
proportional to the length of the string (and not the number of strings in
the set).

So what's a "trie"? The name supposedly derives from "reTRIEval". Suppose
you have the set of strings {"HIS", "HERS", "HIM", "HE"} Then the trie
would look like (ascii art coming, hope you have a monospaced font):

+
|
+-H--+
|
+-E
| |
| +-$
| |
| +-R-S-$
|
+-I
|
+-M-$
|
+-S-$

Here "$" denotes end of string. Any path from the root to a leaf spells out
a unique string in the stored set. See how it works? See how you could use
it?

- Michael

Jul 23 '05 #4
Eric wrote:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer
Which compiler (version), Standard library, operating system?
Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}


Maybe your string template uses dynamic memory allocation. Try a key
class like this:

class key {
public:
key (const char* s) { ... }
inline friend bool operator< (const key& left, const key& right){...}
private:
char buf[17];
//...
};

Jul 23 '05 #5
The strings are randomly generated, so I can't start with a...a, then
a...b, etc.

Jul 23 '05 #6
Eric wrote:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}


If the template library provided with your compiler has a hash_set or a
hash_map you might want to use that.

If not you might want to ditch the map if you are only using it to
ensure uniqueness of the inserted strings and go for a vector.
The risk that a few of the values already exist is small (but is
there).
You want only 1.0E+7 values, the total number of available values is
2.23007E+43 (or 4.52313E+74 if case sensitive). The chance of getting
the same string (if you use a decent pRNG) is only 1 divided by
2.23E+36 (or 1 divided 4.52E+67 if case sensitive).

If this is to risky since you must have a guarantee for uniqueness
another trick is to generate 36 (or 62) different maps and assign 10
million divided by 36 (or 62) values to each map, where each map stores
only string starting with one specific alpha numerical value. Then do a
random access to one of the submaps when you need a value. You'll want
to generate more values for each map to decrease the chance that one of
the submaps will run out of values when you access it.
This trick reduces the number of values you need to insert into one map
thereby replacing the time it takes to go through the tree for several
million values by a constant time during creation of the strings and a
constant time increase when retrieving a value.

Jul 23 '05 #7
Eric wrote:
The strings are randomly generated, so I can't start with a...a, then
a...b, etc.


Well, how about the other idea I suggested then? In fact, you can do
even better on it and not sacrifice any randomness. Begin by assigning
10^7 balls into 26 bins (or however many chars you choose from). You
can make an array count[26] and increment a random element 10^7 times.
Then for each bin i, construct count[i] random, distinct 15 char strings
and prepend to each char[i]. This makes the individual problems
substantially smaller and still retains randomness.

More generally you can use on the order of sqrt(10^7) bins and compute
for each on the order of sqrt(10^7) strings. For example, a bin for
every possible first two characters.
Jul 23 '05 #8
ve*********@hotmail.com wrote:
You want only 1.0E+7 values, the total number of available values is
2.23007E+43 (or 4.52313E+74 if case sensitive). The chance of getting
the same string (if you use a decent pRNG) is only 1 divided by
2.23E+36 (or 1 divided 4.52E+67 if case sensitive).


No, it's more than that though still very small. Look up the birthday
problem.
Jul 23 '05 #9
True, I should have used P(k)= 1- n!/(n-k)!*n^k for accuracy.
Now to just to write a program that can handle the numbers involved :)

Jul 23 '05 #10
Me
Eric wrote:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

I used STL set and map but after about 5 000 000 entries, it becomes
very slow, even if I still have enough RAM available on my computer

Do you have any advice?

Here a code sample that I used to ensure uniqueness.

typedef pair<string, bool> StringBool_pair;
map<string, bool> MapValues;
pair<map<string,bool>::iterator,bool> pr;

for (int iii=0; iii<10000000; iii++) {
strCode = random16CharacterCode();
pr = MapValues.insert(StringBool_pair(strCode, true));
if(pr.second == true) {
// Accepted...
} else {
// Rejected...
}
}


You're thinking way too hard on this one. All you need to do is make a
function to generate a non-repeating random number in the range [0,
some number in the range 10**7-1 and 16**6-1] and then convert it to a
fixed length hex string of length 6 and call that function 10**7 times.
"Pseudo-Random Traversal of a Set" on
http://www.developer.com/tech/articl...0923_3496381_3 shows one
easy way of doing this.

Jul 23 '05 #11
On 2005-07-07 22:13:41 +0100, "Eric" <er*********@loto-quebec.com> said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.


Not too sure why people should help you generate 10 millions fake
credit card numbers ...
--
JFB

Jul 23 '05 #12
verec wrote:
On 2005-07-07 22:13:41 +0100, "Eric" <er*********@loto-quebec.com> said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.

Not too sure why people should help you generate 10 millions fake
credit card numbers ...
--
JFB


Alphanumeric credit-card numbers?
Looks more like TAN codes to me, but whatever :-)
Jul 23 '05 #13
On 2005-07-13 00:28:57 +0100, Paul Groke
<sw**********@der-ball-ist-rund.net> said:
I need an effective way (time is my main concern here) to generate 10
000 000 unique alphanumeric strings of 16 characters each.


Not too sure why people should help you generate 10 millions fake
credit card numbers ...
--
JFB


Alphanumeric credit-card numbers?
Looks more like TAN codes to me, but whatever :-)


Hmmm. How would _you_ go about asking for help without
revealing too much of your private agenda?

In case it didn't occur to you, alphanumeric is a proper
superset of numeric ... and thus, any help he can get on
the first, immediately applies to the second.

Call me paranoid :)
--
JFB

Jul 23 '05 #14

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: steve | last post by:
Hi, I have researched but have not found a good solution to this problem. I am importing large amounts of data (over 50 Meg) into a new mysql db that I set up. I use >mysql dbname <...
5
by: Mike | last post by:
This is a general question on the best way to import a large amount of data to a MS-SQL DB. I can have the data in just about any format I need to, I just don't know how to import the data. I...
121
by: typingcat | last post by:
First of all, I'm an Asian and I need to input Japanese, Korean and so on. I've tried many PHP IDEs today, but almost non of them supported Unicode (UTF-8) file. I've found that the only Unicode...
1
by: Bart | last post by:
Dear all, I would like to encrypt a large amount of data by using public/private keys, but I read on MSDN: "Symmetric encryption is performed on streams and is therefore useful to encrypt large...
11
by: Macca | last post by:
Hi, I'm writing an application that will pass a large amount of data between classes/functions. In C++ it was more efficient to send a pointer to the object, e.g structure rather than passing...
3
by: kamran | last post by:
Hi, I have a web service that may return a very large amount of data. I want that data to return in chunks, like first return 10% of data than return the next 10% and so on, until all is...
2
by: ShawnD | last post by:
I'm having some issues when trying to read input off of a pipe using a python script. I'm trying to process packet data from tcpdump in real-time, so it's a filter that needs to read data while the...
7
by: =?Utf-8?B?TW9iaWxlTWFu?= | last post by:
Hello everyone: I am looking for everyone's thoughts on moving large amounts (actually, not very large, but large enough that I'm throwing exceptions using the default configurations). We're...
16
by: Jack | last post by:
I need to process large amount of data. The data structure fits well in a dictionary but the amount is large - close to or more than the size of physical memory. I wonder what will happen if I try...
13
by: trpost | last post by:
I am looking for a way to send data from one page to another as POST data without using forms or cURL. I have a php script that is passing a list of cases from on page to another when a link is...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.